
Web Scraping Example - ASP Form

This is an example of scraping a website that sits behind a form. In this case, the target is the employee directory of the U.S. Department of Health and Human Services (https://directory.psc.gov/employee.htm).


Approach

Since we need to interact with a form to retrieve the desired data, this site can't be crawled like a typical website using something like the Elastic web crawler. Instead, a Python script submits different form values, and the BeautifulSoup library parses the HTML pages that come back.
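A minimal sketch of that approach, assuming the requests library for the POST and the form field names used later in this README (the URL constant here is an assumption; the real endpoint may differ):

```python
import requests
from bs4 import BeautifulSoup

# Endpoint the directory form posts to (assumed for illustration)
POST_URL = "https://directory.psc.gov/employee.htm"

def build_form_data(letter, agency, max_rows=500):
    """Fill in the directory form's fields for one last-name search."""
    return {
        'LastNameOp': 'begins with',
        'LastName': letter,
        'AgencyOp': 'equal to',
        'Agency': agency,
        'maxRows': max_rows,
    }

def fetch_results_page(letter, agency):
    """POST the form values and hand the resulting HTML to BeautifulSoup."""
    response = requests.post(POST_URL, data=build_form_data(letter, agency))
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

The key idea is that the scraper talks to the same POST endpoint the browser form does, just with programmatically generated field values.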

Initial Algorithm

You'll notice that the search is limited to 500 results per page, so in the first iteration of the algorithm I looped through all the last-name initials from A-Z for each Agency.

def get_employee_by_agency(agency):
    person_list = []

    # One search per last-name initial (alphabet is A-Z, defined elsewhere)
    for letter in alphabet:
        print(letter)
        current_count = len(person_list)
        form_data = {
            'LastNameOp': 'begins with',
            'LastName': letter,
            'AgencyOp': 'equal to',
            'Agency': agency,
            'maxRows': 500,
        }
        person_list = scrape_search_result(POST_URL, form_data, Person, person_list, agency)
        delta = len(person_list) - current_count
        print('Person count to add: ' + str(delta))

    write_to_elasticsearch(person_list)

    return person_list

From the search results returned, the scraper makes a separate request to retrieve each employee's details.
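That detail lookup can be sketched as follows. The result-page markup below is invented for illustration, since the real table structure isn't shown in this README:

```python
from bs4 import BeautifulSoup

# Hypothetical results-page fragment; the real page's markup may differ
SAMPLE_RESULTS = """
<table>
  <tr><td><a href="employeeDetail.htm?id=123">DOE, JANE</a></td></tr>
  <tr><td><a href="employeeDetail.htm?id=456">ROE, RICHARD</a></td></tr>
</table>
"""

def extract_detail_links(html):
    """Collect (name, detail URL) pairs from a search results page."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]
```

Each extracted URL would then be fetched with a second GET request to build the full employee record.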


It turns out that some agencies return more than 500 results even when using Last Name and Agency as the criteria, so this approach needed an update to make sure no records were missed.

Updated Algorithm

To get around that, I added a check: whenever the maximum result limit is reached for a particular last name, the search criteria also iterate across the First Name field from A-Z. Fortunately, that was enough to keep every result set below 500, so all the records could be retrieved.

max_limit_reached = soup.find_all(text="maximum")
max_bool = len(max_limit_reached)
print("Max limit reached: " + str(max_bool))

if max_bool > 0:
    # Narrow the search further by first-name initial
    for letter in alphabet:
        form_data['FirstNameOp'] = 'begins with'
        form_data['FirstName'] = letter
        print('First name iteration: ' + letter)
        scrape_search_result(POST_URL, form_data, Person, person_list, agency)
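The overall refinement strategy (search by last-name initial, then split by first-name initial only when the cap is hit) can be sketched independently of the site. The `run_search` callable here is a hypothetical stand-in for the real form POST:

```python
import string

def search_with_refinement(run_search, agency):
    """Search one last-name initial at a time; when a search reports that it
    hit the result cap, rerun it split across first-name initials A-Z.
    `run_search(agency, last_initial, first_initial=None)` is assumed to
    return (hit_cap, results)."""
    people = []
    for last in string.ascii_uppercase:
        hit_cap, results = run_search(agency, last_initial=last)
        if not hit_cap:
            people.extend(results)
            continue
        # Capped: discard the truncated batch and narrow by first initial
        for first in string.ascii_uppercase:
            _, narrowed = run_search(agency, last_initial=last, first_initial=first)
            people.extend(narrowed)
    return people
```

This two-level split works as long as no (last initial, first initial) pair alone exceeds the cap, which matched what the directory returned in practice.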

How to Run

1 - Create an Elastic deployment. I used Elastic Cloud to get going quickly (https://www.elastic.co/cloud/).

2 - Get the Elastic Cloud ID

3 - Create an API Key

4 - Create a Python virtual environment

$ python3 -m venv env

5 - Activate the virtual environment, then install the Python libraries from requirements.txt

$ source env/bin/activate
$ pip install -r requirements.txt

6 - Set environment variables in terminal so the script can get the Cloud ID and API Key

export CLOUD_ID=<CLOUD_ID>
export CLOUD_API_KEY=<CLOUD_API_KEY>
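Inside the script, those variables would be read along these lines (the helper name is illustrative; the repo's own code presumably passes the values to the elasticsearch-py client via Elasticsearch(cloud_id=..., api_key=...)):

```python
import os

def load_elastic_credentials():
    """Read the Cloud ID and API key exported in the terminal above,
    failing with a clear message if either is missing."""
    try:
        return os.environ["CLOUD_ID"], os.environ["CLOUD_API_KEY"]
    except KeyError as missing:
        raise SystemExit(f"Missing required environment variable: {missing}")
```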

Note: The script is currently run on one Agency at a time, as each Agency takes quite a bit of time to scrape. A future improvement would be to set this up for parallel processing across agencies.

Suggestions/Improvements

Feel free to make suggestions or improvements; I always love to learn better ways, and there is surely room for optimization.
