This project forked from rohan-bhautoo/python-web-scraper

A Python web scraper to extract content and data from a website.

Web-Scraping

Python Web Scraper is a simple web scraping tool built with Python. It lets you scrape data from web pages, extract information from HTML elements, save data to a text file, download all images, and store table data in a CSV file. The tool provides a graphical user interface built with the Tkinter library.

Prerequisites

Python 3.6 or later (the code examples use f-strings, which Python 2.x does not support)

python --version

Library

Requests

Requests allows you to send HTTP/1.1 requests extremely easily.

pip install requests

BeautifulSoup

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

pip install beautifulsoup4

Installation

Clone the repository

git clone https://github.com/rohan-bhautoo/Python-Web-Scraper.git

Usage

To run the Python Web Scraper, execute the following command:

python main.py

The application opens a GUI window where you can enter the URL of the web page you want to scrape. You can choose what to extract: links, headings, images, paragraphs, metadata, CSS files, and scripts. You can also choose to download images and store table data in a CSV file.

Code Examples

Scrape Data from Web Page

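import requests
from bs4 import BeautifulSoup

# URL of the page to scrape (placeholder; the GUI reads it from the entry field)
url = 'https://example.com'

# Make request to website and fail early on HTTP errors
response = requests.get(url)
response.raise_for_status()
html_content = response.content

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find elements and extract data
# ...

# Store data in text file
# ...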
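
The two elided steps above can be filled in many ways. The following is a minimal sketch, not the project's actual code: the helper names `extract_page_data`, `format_page_data`, and `save_text`, and the sample HTML that stands in for `response.content`, are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_page_data(html_content):
    """Return links, headings, and paragraphs found in an HTML string.

    Hypothetical helper for illustration; not part of the original project.
    """
    soup = BeautifulSoup(html_content, "html.parser")
    return {
        # href=True skips <a> tags that have no href attribute
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3"])],
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
    }


def format_page_data(data):
    """Render the extracted data as plain text, one section per key."""
    lines = []
    for section, items in data.items():
        lines.append(f"[{section}]")
        lines.extend(items)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def save_text(path, text):
    """Write the rendered text to a file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Sample markup stands in for response.content (assumption).
sample_html = """
<html><body>
  <h1>Title</h1>
  <p>First paragraph.</p>
  <a href="https://example.com/a">A</a>
  <a>No link</a>
</body></html>
"""

data = extract_page_data(sample_html)
```

Calling `save_text("scraped.txt", format_page_data(data))` would then write the result to disk; the file name is only an example.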

Download Images from URL

import os
import requests

# image, directory, and count come from the surrounding loop over <img> tags
url = image.get('src')

# send a GET request to the URL to download the image
response = requests.get(url)

# construct the file name to save the image as
filename = os.path.join(directory, 'image{}'.format(count))

# use os.path.splitext to split the URL into base name and extension
_, extension = os.path.splitext(url)

print(filename)

# save the image to the chosen file path
with open(f'{filename}{extension}', 'wb') as f:
    f.write(response.content)

# advance the counter so the next image gets a new file name
count += 1
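
In practice, `src` attributes are often relative paths and may carry a query string, which `os.path.splitext` on the raw URL would fold into the extension. One possible way to handle both cases is sketched below, using only the standard library. The helper name `resolve_image`, its parameters, and the `.jpg` fallback extension are assumptions, not part of the project.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_image(page_url, src, directory, count, default_ext=".jpg"):
    """Return (absolute_url, file_path) for an <img> src attribute.

    Hypothetical helper; default_ext is an assumed fallback for URLs
    whose path has no file extension.
    """
    # Resolve relative src values such as "../img/logo.png" against the page URL
    absolute_url = urljoin(page_url, src)
    # Take the extension from the URL path only, ignoring ?query and #fragment
    _, extension = os.path.splitext(urlparse(absolute_url).path)
    file_path = os.path.join(directory, f"image{count}{extension or default_ext}")
    return absolute_url, file_path


url, path = resolve_image(
    "https://example.com/gallery/index.html", "../img/logo.png?v=2", "images", 0
)
```

The returned `url` can be passed to `requests.get`, and `path` to `open(..., 'wb')`, in place of the raw `src` value and the concatenated file name.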

Extract Table Data from Web Page

import requests
from bs4 import BeautifulSoup

# get URL from entry field (self is the Tkinter application object)
url = self.url_entry.get()

# make request to website
response = requests.get(url)
html_content = response.content

# parse HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# find the first table element (None if the page has no table)
table = soup.find('table')

# create table header
table_header = []
for th in table.find_all('th'):
    table_header.append(th.text.strip())

# create table rows
table_rows = []
for tr in table.find_all('tr'):
    table_row = []
    for td in tr.find_all('td'):
        table_row.append(td.text.strip())
    # header rows contain only <th> cells, so skip the empty rows they produce
    if table_row:
        table_rows.append(table_row)

Save Table Data in CSV file

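import csv
from datetime import datetime

# table_header and includeHeader are set elsewhere in the application
# build a UTC timestamp for the file name; the csv/ directory must already exist
now = datetime.utcnow()
timestamp = now.strftime("%Y%m%d%H%M")

# newline="" stops the csv module from writing blank lines between rows on Windows
with open(f"csv/csv_{timestamp}.csv", "w", newline="") as f:
    csvwriter = csv.writer(f, delimiter=",")

    if includeHeader == 1:
        print("save header:", table_header)
        csvwriter.writerow(table_header)

    # write every row currently shown in the Tkinter treeview
    for row_id in self.treeview.get_children():
        row = self.treeview.item(row_id)["values"]
        if row != "":
            print("save row:", row)
            csvwriter.writerow(row)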
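
The same writing logic can be separated from the Tkinter treeview so that it works on plain lists. This is a sketch under assumptions: the function name `save_table_csv` and its parameters are hypothetical, and it creates the output directory if it is missing, which the snippet above does not do.

```python
import csv
import os
from datetime import datetime, timezone


def save_table_csv(directory, table_header, table_rows, include_header=True):
    """Write rows to <directory>/csv_<UTC timestamp>.csv and return the path."""
    os.makedirs(directory, exist_ok=True)  # open() fails if the folder is missing
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    path = os.path.join(directory, f"csv_{timestamp}.csv")
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",")
        if include_header:
            writer.writerow(table_header)
        # Skip empty rows, e.g. a header-only <tr> that produced no <td> cells
        writer.writerows(row for row in table_rows if row)
    return path
```

For example, `save_table_csv("csv", table_header, table_rows)` would write the lists built in the table extraction example. Because the timestamp has minute resolution, two saves within the same minute overwrite each other.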

Limitations

  • The Python Web Scraper may not work on pages that build their content with JavaScript, because Requests downloads only the initial HTML and does not execute scripts.
  • Some websites may have terms of service or robots.txt that prohibit scraping. Make sure to comply with any legal and ethical requirements.
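
A robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below is not part of the project: the sample rules and the user-agent name `python-web-scraper` are assumptions, and the rules are parsed from an inline string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Sample rules stand in for a site's real robots.txt (assumption).
sample_robots = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_robots)

# "python-web-scraper" is a hypothetical user-agent name
allowed = parser.can_fetch("python-web-scraper", "https://example.com/index.html")
blocked = parser.can_fetch("python-web-scraper", "https://example.com/private/data.html")
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` would replace the `parse` call. A passing check covers robots.txt only; a site's terms of service still need to be read separately.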

Author

👤 Rohan Bhautoo

Show your support

Give a โญ๏ธ if this project helped you!
