Web Scraping and Data Parsing Using Beautiful Soup

This project provides a clear and concise example of how to fetch content from a website using the Requests module and then parse it using BeautifulSoup.

Setting Up

To run this example, you will need Python 3. We recommend setting up a virtual environment, for example as shown below.
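
One common way to set up a virtual environment is with Python's built-in venv module (the repository does not require a specific tool, so this is just one option):

$ python3 -m venv venv
$ source venv/bin/activate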

Install dependencies by running

$ pip install requests
$ pip install BeautifulSoup4
$ pip install pandas

Note: You can also install them by using the requirements.txt file included in this repository.

$ pip install -r src/requirements.txt
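
The requirements.txt file itself is not reproduced here; based on the commands above, it presumably lists the same three packages:

requests
beautifulsoup4
pandas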

Web Scraping

A mock bookstore website called https://books.toscrape.com is our scraping target.

Use the requests module to fetch a page from it

import requests
response = requests.get('https://books.toscrape.com')

Once the response is retrieved, check whether the request was successful by inspecting the status_code property

if response.status_code != 200:
    print('Page not found')
    exit(1)

print('Successfully fetched the page')

Save the script as src/scrape.py and run it.

$ python3 src/scrape.py

Successfully fetched the page

The requests module has successfully retrieved the HTML content from the website; all that's left now is to parse it.

A working example can be found here

Parse HTML

Take a look at the structure of the HTML that you're trying to scrape.

<article class="product_pod">
    ...
    <h3>
        <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
    </h3>
    ...
</article>

The book info is neatly wrapped in an article tag. Inside the article, there's a heading (h3) that contains an anchor (a). The visible link text is truncated, so the full title of the book is stored in the anchor's title attribute.

<a ... title="A Light in the Attic">...</a>

To parse this HTML content, use the BeautifulSoup4 library.

Firstly, import BeautifulSoup

from bs4 import BeautifulSoup

Then, create an instance of the BeautifulSoup class and load the HTML content that has been retrieved from the web page previously.

soup = BeautifulSoup(response.content, 'html.parser')

Retrieve all the article tags

articles = soup.find_all('article')
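
On this particular page every book card is an article with the product_pod class (as seen in the HTML snippet above), so you could also narrow the search by class. This is an optional refinement, not something the original script requires:

articles = soup.find_all('article', class_='product_pod')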

Define a titles list that will hold all the book titles extracted from the HTML

titles = []

Iterate through every article to extract the title attribute of the anchor tag. You may want to print the title as well, just to see whether the script works as expected

for article in articles:
    title = article.h3.a.attrs['title']
    titles.append(title)
    print(title)

Save the script as src/parse.py and run it

$ python3 src/parse.py                                              
Successfully fetched the page         
A Light in the Attic       
Tipping the Velvet
Soumission
...

All the book titles have been parsed successfully!

A working example can be found here

Save to CSV

Printing everything to standard output can become messy at times. Instead, it is a good idea to save the results into a CSV file.

Start by deleting the print call.

print(title) # delete this!

Next, create a data frame object using the pandas library. In the constructor, pass a dictionary that maps the column name ("Title") to the list of titles parsed previously.

import pandas
data_frame = pandas.DataFrame({'Title': titles})

Finally, save the data frame to a file by using the to_csv method

data_frame.to_csv('books.csv', index=False, encoding='utf-8')

Save the script as src/save.py and execute it.

$ cd src
$ python3 save.py
Successfully fetched the page

Use the cat Unix utility to print the CSV file.

$ cat books.csv
Title
A Light in the Attic       
Tipping the Velvet
Soumission
...

The newly created file now contains all the book titles from the web page.
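
For readers who want to see all three steps together, here is a rough end-to-end sketch assembled from the snippets above. It illustrates the flow (fetch, parse, save) and is not necessarily identical to the file in the repository:

import requests
import pandas
from bs4 import BeautifulSoup

# Fetch the page and verify the response.
response = requests.get('https://books.toscrape.com')
if response.status_code != 200:
    print('Page not found')
    exit(1)
print('Successfully fetched the page')

# Parse the HTML and collect every book title from the article tags.
soup = BeautifulSoup(response.content, 'html.parser')
articles = soup.find_all('article')
titles = []
for article in articles:
    titles.append(article.h3.a.attrs['title'])

# Save the titles to a CSV file.
data_frame = pandas.DataFrame({'Title': titles})
data_frame.to_csv('books.csv', index=False, encoding='utf-8')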

The final version of the script can be found here
