
Introduction

This repo contains two Scrapy Spiders used to scrape the joboutlook website for job-related information.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. I am using Scrapy for this purpose. Scrapy is an open source and collaborative framework for extracting the data you need from websites.

If you are thinking of scraping a website, this page will help you understand how Scrapy works, from a beginner to an advanced level. You won't need any other source after this - promise!

Here is what we will learn:

  • Installing Anaconda on your local machine - it's not mandatory, but if you're afraid of configuring environment variables, then this is highly recommended.
  • Installing Scrapy
  • Creating a Scrapy project
  • Running a Scrapy project - in Scrapy-specific terminology, we will learn how to crawl the website with a Spider.
  • Storing the output in files (I will store it in a JSON file; the same method can easily store the output in XML, CSV and a few other formats)
  • Scraping this website: https://joboutlook.gov.au/Industry.aspx. We will not only read information from this page, but also follow links on it and read information from the other pages in the same program.

Step-by-step guide

First things first - download Anaconda for Windows with Python 3.* from here: https://repo.anaconda.com/archive/Anaconda3-5.2.0-Windows-x86_64.exe. Install Anaconda - just double-click the .exe and follow the steps; it's a very simple, standard installation. This guide might also be helpful: https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444. Anaconda provides Python and R along with several libraries, RStudio, the Spyder IDE, Jupyter Notebook and several other tools out of the box.
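
As an optional sanity check, you can open the Anaconda Prompt and print the versions to confirm the installation worked:

conda --version
python --version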

Install Scrapy - Open the Anaconda Prompt by going to the Windows Start button > All Programs > Anaconda > Anaconda Prompt and run the following command:

conda install -c conda-forge scrapy

Detailed installation documentation is available in the Scrapy docs.
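
Another optional check: you can confirm Scrapy installed correctly by printing its version:

scrapy version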

Scrapy Shell - Scrapy comes with an interactive shell where you can try out and debug code. Open the Anaconda Prompt and type the following command:

scrapy shell https://scrapy.org

I would highly recommend getting used to the shell before going any further. This one-page document is all you need to understand how the Scrapy shell works: https://doc.scrapy.org/en/latest/topics/shell.html
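
As a small illustration, a first shell session might look like this (the selector below is just a generic example, not one used by the spiders later on):

scrapy shell https://joboutlook.gov.au/Industry.aspx

# inside the shell, `response` holds the downloaded page
response.status                                # 200 if the request succeeded
response.css('title::text').extract_first()    # text of the page <title>
view(response)                                 # open the downloaded page in your browser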

Build a new project - On the Anaconda Prompt, run the following command:

scrapy startproject myproject

This will create a "myproject" directory with the following content:

myproject/
    scrapy.cfg            # deploy configuration file

    myproject/            # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Before going any further, I recommend at least reading the section "Our first Spider" of this tutorial. Let's create a "Spider" that "crawls" the web. Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

Summary of what the code does

The spider lands on https://joboutlook.gov.au/Industry.aspx (the start_urls attribute in the code). This is the first request made by the Spider. The spider method parse is called automatically by Scrapy and is in charge of processing the response and returning scraped data and/or more URLs to follow. In the following code, the parse method reads the href attribute of HTML anchor tags that sit inside an h2 heading (see the image below). Each HTML element can be accessed with two methods of the response object, css or xpath; I prefer css.
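
As a rough illustration of the two styles, both lines below should return the same list of href values in the shell (the xpath version is my approximate equivalent of the css selector used in the code, so treat it as a sketch):

response.css('article > div.career-title > h2 > a::attr(href)').extract()
response.xpath('//article/div[contains(@class, "career-title")]/h2/a/@href').extract()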

You can see the code behind the web page by right-clicking the page and selecting Inspect in Chrome.

[image: the page structure shown in Chrome's Inspect tool]

The Code

Below is the code for our first Spider. Save it in a file named JobInfo.py under the myproject/spiders directory in your project.

The parse method loops over each heading (see image 1 above), gets the URL and creates a new request to that page. A separate method, parse_industry_profiles, parses the response returned from this new request. Similarly, this parse function also loops over each section on its page (see image 2 above), gets the URL and makes a request to that page. The response of this call is handled by another parse function, parse_profile_details, which returns the information shown in image 3 above.

import scrapy


class JobinfoSpider(scrapy.Spider):

    name = 'jobdetails'
    allowed_domains = ['joboutlook.gov.au']
    start_urls = ['https://joboutlook.gov.au/Industry.aspx']

    def parse(self, response):

        # Links to each industry's detail page (anchors inside the h2 headings)
        industry_details_page_urls = \
            response.css('article > div.career-title > h2 > a::attr(href)'
                         ).extract()

        # Follow each link; the response is handled by parse_industry_profiles
        for url in industry_details_page_urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url,
                                 callback=self.parse_industry_profiles)

    def parse_industry_profiles(self, response):
        profile_details_page_urls = \
            response.css('article > div.career-title > h3 > a::attr(href)'
                         ).extract()

        # Job profile details page

        for url in profile_details_page_urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_profile_details)

    def parse_profile_details(self, response):
        # The "fast facts" snapshot appears as a fixed-order list on each
        # profile page, so the values below are read by position
        fast_facts = \
            response.css('ul.snapshot > li > span.snapshot-data::text')

        yield {
            'industry': response.css('div.breadcrumb-col > ul > li:nth-of-type(3) > a::text').extract_first().strip(),
            'job_profile': response.css('div.page-title-col > h1::text').extract_first().strip(),
            'anzsco_code': response.css('div.anzsco > abbr::text').extract_first().replace('ANZSCO ID ', '').strip(),
            'avg_weekly_pay': fast_facts[0].extract().strip(),
            'future_growth': fast_facts[1].extract().strip(),
            'skill_level': fast_facts[2].extract().strip(),
            'employment_size': fast_facts[3].extract().strip(),
            'unemployment': fast_facts[4].extract().strip(),
            'male_share': fast_facts[5].extract().strip(),
            'female_share': fast_facts[6].extract().strip(),
            'full_time': fast_facts[7].extract().strip(),
        }

The following command executes the project and stores the returned data in JSON format in a single file:

scrapy crawl jobdetails -o jobdetails.json
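
Scrapy infers the output format from the file extension, so the same spider can write CSV or XML with no code changes, for example:

scrapy crawl jobdetails -o jobdetails.csv
scrapy crawl jobdetails -o jobdetails.xml

Note that -o appends to an existing output file, so delete the old file before re-running the spider.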

More Generic Approach

The code below does exactly the same thing as the code explained previously; the only difference is that we use an Item class to store the information. This is a more generic and preferable approach. Read more about the Item class in the Scrapy documentation.

Let's create an Item class in the existing project file items.py.

import scrapy
from scrapy.item import Item, Field

class JoboutlookItem(scrapy.Item):
    main_page_url = Field()
    profile_page_url = Field()
    profile_detail_page_url = Field()
    industry = Field()
    job_profile = Field()
    anzsco_code = Field()
    avg_weekly_pay = Field()
    future_growth = Field()
    skill_level = Field()
    employment_size = Field()
    unemployment = Field()
    male_share = Field()
    female_share = Field()
    full_time = Field()
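
An Item behaves much like a dictionary whose keys are limited to the declared fields. Here is a minimal sketch of using it on its own (assuming the myproject layout created earlier; the values are just placeholders):

from myproject.items import JoboutlookItem

item = JoboutlookItem()
item['industry'] = 'Some industry'   # assigning a declared field works like a dict
item['job_profile'] = 'Some job'
print(dict(item))                    # convert to a plain dict for inspection
# item['salary'] = '...'             # would raise KeyError: 'salary' is not a declared field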

Let's create a Spider that uses the JoboutlookItem class to store and transfer data.

import scrapy
from myproject.items import JoboutlookItem

class JobinfoSpider(scrapy.Spider):
    name = 'jobinfo'
    allowed_domains = ['joboutlook.gov.au']
    start_urls = ['https://joboutlook.gov.au/Industry.aspx']

    def parse(self, response):
        urls = response.css(
            'article > div.career-title > h2 > a::attr(href)').extract()

        for url in urls:
            # Create a fresh item for every request so concurrent callbacks
            # do not overwrite each other's values
            item = JoboutlookItem()
            item['main_page_url'] = response.url
            url = response.urljoin(url)
            request = scrapy.Request(url=url,
                                     callback=self.parse_industry_profiles)
            request.meta['item'] = item
            yield request

    def parse_industry_profiles(self, response):
        self.log('__fun parse_industry_profiles__')
        item = response.meta['item']
        item['profile_page_url'] = response.url

        profile_details_page_urls = response.css(
            'article > div.career-title > h3 > a::attr(href)').extract()

        for url in profile_details_page_urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url,
                                 callback=self.parse_profile_details, meta={'item': item})

    def parse_profile_details(self, response):
        # Copy the item carried in meta so every yielded item is a separate object
        item = JoboutlookItem(response.meta['item'])
        item['profile_detail_page_url'] = response.url

        fast_facts = response.css(
            'ul.snapshot > li > span.snapshot-data::text')

        item['industry'] = response.css(
            'div.breadcrumb-col > ul > li:nth-of-type(3) > a::text').extract_first().strip()
        item['anzsco_code'] = response.css(
            'div.anzsco > abbr::text').extract_first().replace('ANZSCO ID ', '').strip()
        item['job_profile'] = response.css(
            'div.page-title-col > h1::text').extract_first().strip()
        item['avg_weekly_pay'] = fast_facts[0].extract().strip()
        item['future_growth'] = fast_facts[1].extract().strip()
        item['skill_level'] = fast_facts[2].extract().strip()
        item['employment_size'] = fast_facts[3].extract().strip()
        item['unemployment'] = fast_facts[4].extract().strip()
        item['male_share'] = fast_facts[5].extract().strip()
        item['female_share'] = fast_facts[6].extract().strip()
        item['full_time'] = fast_facts[7].extract().strip()

        yield item

Execute the project

scrapy crawl jobinfo -o jobinfo.json
