
Web scraping, machine learning and AWS cloud storage

Home Page: https://github.com/caeltarifa/big_data_web_scraping

License: MIT License

Languages: Jupyter Notebook 97.04%, Python 1.18%, Shell 1.33%, Dockerfile 0.46%

Topics: data-analysis, machine-learning, nifi, nosql-data-storage, python, scrapy, tableau


Big data by web scraping with scrapy

This project contains Python modules organized to collect data at large scale and present it as charts using current technologies.

As a Data Science project, it covers many topics in data handling and is, by nature, multi-disciplinary. Key updates are posted as the work progresses, and every step follows the proposed DataPipeline.

Not every technique is developed here; the focus is on the most useful ones for illustrating the concepts and tools that Biosoft works with.

1 Requirements

Once the libraries needed to extract data are installed, the project can be built. The required packages are:

Scrapy
selenium
boto3
botocore

requests
urllib3
bs4

pandas
numpy
scipy
matplotlib
scikit-

jupyter-server
jupyter_client
jupyter_core
jupyterlab
jupyterlab-pygments
jupyterlab_server

2 How to install

Build the image from the Dockerfile and run the container to set up the work environment:

docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME

How should the container be tuned for high performance when collecting huge datasets?

3 Toolkit

A diagram showing the proposed development environment and toolkit.

Scrapy · VS Code · Google Colab · Jupyter · Copilot · draw.io

4 DataPipeline for scraping

The picture below illustrates the process by which files are collected and stored for each provided URL.

(Figure: DataPipeline)
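A minimal sketch of the first pipeline stage, assuming the page has already been fetched: parse the HTML, extract the links, and keep only those pointing to downloadable documents so later stages know what to store. The class and function names here are illustrative, not taken from the repository.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# File formats named in this README as scraping targets.
DOCUMENT_EXTS = {".pdf", ".csv", ".xls", ".xlsx", ".dta"}

class FileLinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_file_links(base_url, html):
    """Return absolute URLs on the page that point to downloadable documents."""
    parser = FileLinkExtractor()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [url for url in absolute
            if any(url.lower().endswith(ext) for ext in DOCUMENT_EXTS)]
```

In the real pipeline a Scrapy spider would yield these URLs as download requests; here the extraction logic is shown standalone on stdlib only.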

4.1 Technology Stack

This technology stack declares the components that communicate across the data-flow design, from scanning the website and mining data, through collection and ingestion, to processing, storage, and plotting, so that the pipeline stays reliable and workable.

(Figure: Technology stack)
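One way to read that flow is as a chain of stages, each consuming the previous stage's output. The sketch below wires hypothetical stage functions together; the stage names mirror the description above and stand in for the real Scrapy/NiFi/AWS components.

```python
def run_pipeline(url, stages):
    """Pass `url` through each stage in order: scan -> mine -> store."""
    data = url
    for stage in stages:
        data = stage(data)
    return data

# Toy stages standing in for the real components.
scan = lambda url: {"url": url, "pages": ["p1", "p2"]}          # crawl the site
mine = lambda site: {**site, "files": ["a.pdf", "b.csv"]}       # find document links
store = lambda site: f"stored {len(site['files'])} files from {site['url']}"

result = run_pipeline("https://example.cl", [scan, mine, store])
```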

5 Architecture of component design

Three components are presented here across the data-collection software cycle with Scrapy. Given a target URL, the crawler finds common and media files and stores them in AWS services: DynamoDB for structured (key-value) data and an S3 bucket for plain documents. Two helper components serve specific purposes: once the data is retrieved, it is plotted in Tableau, while the Selenium component provides tools for clicking dynamic JS events in order to obtain valid download links for files.

(Figure: Application architecture)
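A hedged sketch of the storage component described above: build the key-value item for DynamoDB and, only when boto3 is importable, upload the document to S3 and register the record. The bucket and table names are placeholders, not the repository's actual configuration.

```python
try:
    import boto3  # optional: only needed for the actual AWS calls
except ImportError:
    boto3 = None

def build_item(url, s3_key, fmt):
    """Key-value record for DynamoDB: the source URL is the key, the rest is metadata."""
    return {"url": url, "s3_key": s3_key, "format": fmt}

def store_file(local_path, url, fmt, bucket="my-scrape-bucket", table="scraped_files"):
    """Upload the document to S3 and register its metadata in DynamoDB."""
    key = f"{fmt}/{local_path.rsplit('/', 1)[-1]}"
    item = build_item(url, key, fmt)
    if boto3 is not None:
        boto3.client("s3").upload_file(local_path, bucket, key)
        boto3.resource("dynamodb").Table(table).put_item(Item=item)
    return item
```

`upload_file` and `Table(...).put_item(Item=...)` are the standard boto3 calls for these two services; credentials are assumed to come from the environment as usual.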

6 Storing

A. Amazon S3 buckets

The picture below shows how the files populate Amazon's distributed cloud storage (S3 buckets). Built by web scraping several Chilean websites, this store is a pure document database in which each file was retrieved in one of a variety of formats: PDF, CSV, XLS, Stata, and more.

(Figure: S3 buckets on AWS)
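Under this layout, a natural convention (an assumption for illustration, not necessarily the repository's actual scheme) is to derive each S3 key from the source URL, with a per-format prefix so every document type lands in its own folder:

```python
from urllib.parse import urlparse
import posixpath

# Map file extensions to S3 key prefixes (one folder per format).
FORMAT_PREFIXES = {".pdf": "pdf", ".csv": "csv", ".xls": "xls", ".xlsx": "xls", ".dta": "stata"}

def s3_key_for(url):
    """Map a file URL to an S3 key like 'pdf/report.pdf', or None for unrecognized formats."""
    path = urlparse(url).path
    name = posixpath.basename(path)
    ext = posixpath.splitext(name)[1].lower()
    prefix = FORMAT_PREFIXES.get(ext)
    return f"{prefix}/{name}" if prefix else None
```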

References

BeautifulSoup: interfaces for reliable connections to a target URL.

Scrapy

Selenium

Xpath

Regular expressions (RegEx)

Storing data in the cloud (AWS)

Tableau - Data Visualization

DynamoDB and its purposes (AWS)

Big Data and Data flow design: Apache NiFi

NoSQL Databases and Graph queries

