
ProIndustry scraper - a project I made to show how I would approach web scraping.

Home Page: https://github.com/FBorowiec/pro-industry_scraper


ProIndustry scraper

A project I made to show how I would approach web scraping. The biggest issue here was that the developers behind the pro-industry website render the table of 15 job offers with JavaScript instead of serving it as plain HTML. Well, I guess a world without JavaScript would make a lot of RAM producers unhappy.

Description

The vacancy list is created dynamically with JavaScript, which rules out fast scraping because we have to wait for the table to render. Moreover, downloading the raw HTML and scraping it with BeautifulSoup does not work, as the response would not contain the table. Therefore, I used Selenium with a Firefox webdriver to let the page render its JavaScript-generated elements and then fed the resulting HTML to BS4.
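The render-then-parse approach described above could look roughly like the following. This is a minimal sketch, not the code from this repository: the function names, the "tr" wait selector and the 15-second timeout are assumptions, and the real page may need a more specific selector.

```python
from typing import List

from bs4 import BeautifulSoup


def fetch_rendered_html(url: str, timeout: float = 15.0) -> str:
    """Load url in headless Firefox and return the HTML once a table row exists."""
    # Selenium is imported here so parse_table_rows can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Assumed selector: wait until JavaScript has inserted at least one row.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def parse_table_rows(html: str) -> List[List[str]]:
    """Return the text of every cell, one list per table row that has cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows
```

Keeping the parsing step separate from the browser step means the BeautifulSoup part can be tried on a saved HTML file without starting Firefox.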

The output is a CSV file containing all the open positions.

Notes:

  • User parameters can be changed inside config/parameters.yaml
  • The attached results.csv contains the results of the first 10 pages
  • The pyproject.toml with mypy, pylint and more configurations is not included
  • The application can easily be Docker-ized
  • Before being converted to a DataFrame and then to a CSV, the output is a list of pydantic models. It would be easy to feed it to an appropriately configured PostgreSQL database
  • Each new page needs to be properly rendered, which significantly slows down the scraping of the site
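To illustrate the note about pydantic models, here is a minimal sketch of turning a list of models into CSV rows. The JobOffer fields are hypothetical (the real models in the repository may differ), and the standard csv module is used here instead of a DataFrame to keep the example small.

```python
import csv
import io
from typing import List, TextIO

from pydantic import BaseModel


class JobOffer(BaseModel):
    # Hypothetical fields; the real models in the repository may differ.
    title: str
    location: str
    url: str


def write_offers_csv(offers: List[JobOffer], out: TextIO) -> None:
    """Write a header line followed by one CSV row per offer."""
    fieldnames = ["title", "location", "url"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for offer in offers:
        # Iterating a pydantic model yields (field name, value) pairs.
        writer.writerow(dict(offer))


if __name__ == "__main__":
    buffer = io.StringIO()
    example = JobOffer(title="Welder", location="Essen", url="https://example.com/1")
    write_offers_csv([example], buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv module documentation recommends. The same list of models could equally be mapped to database rows.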

Setup with pipenv

This project uses a virtual environment. Please make sure to enable it by running:

pip install pipenv
pipenv --python /usr/bin/python3
pipenv install -r requirements.txt
pipenv shell

User inputs

The user settings can be changed inside the config/parameters.yaml file:

  • Logging of each new page is enabled by default.
  • The page_limit is set to 10
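As a rough illustration of how these settings might be read, the sketch below parses YAML text with PyYAML and falls back to the defaults listed above. Only page_limit is named in this README; the log_pages key and the Parameters class are assumptions made for the example.

```python
from dataclasses import dataclass

import yaml  # PyYAML


@dataclass(frozen=True)
class Parameters:
    page_limit: int = 10
    log_pages: bool = True  # hypothetical key name for the per-page logging switch


def parse_parameters(text: str) -> Parameters:
    """Parse YAML text, using the documented defaults for missing keys."""
    # safe_load returns None for an empty document, so fall back to an empty dict.
    raw = yaml.safe_load(text) or {}
    params = Parameters(
        page_limit=int(raw.get("page_limit", 10)),
        log_pages=bool(raw.get("log_pages", True)),
    )
    if params.page_limit < 1:
        raise ValueError("page_limit must be at least 1")
    return params
```

With a real file this would be called as parse_parameters(Path("config/parameters.yaml").read_text()), using pathlib.Path.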

How to run

Please make sure you are in the directory that contains main.py.

python3 main.py

This will generate a results.csv file inside your current directory.
