
Web scraping, machine learning and AWS cloud storage

Home Page: https://github.com/caeltarifa/big_data_web_scraping

License: MIT License

Languages: Jupyter Notebook 97.04%, Python 1.18%, Shell 1.33%, Dockerfile 0.46%

Topics: data-analysis, machine-learning, nifi, nosql-data-storage, python, scrapy, tableau


Big data by web scraping with scrapy

This project contains Python modules organized to collect data at large scale and present it as charts using current technologies.

As a Data Science project, it covers many topics in data handling and is, by nature, multi-disciplinary. Key updates are posted as the work progresses, and every step follows the proposed DataPipeline.

Not every technique is developed here; the focus is on the most useful ones for illustrating the concepts and tools that Biosoft works with.

1 Requirements

Once the libraries needed to extract data are installed, the project can be built. The required packages are:

Scrapy
selenium
boto3
botocore

requests
urllib3
bs4

pandas
numpy
scipy
matplotlib
scikit-

jupyter-server
jupyter_client
jupyter_core
jupyterlab
jupyterlab-pygments
jupyterlab_server

2 How to install

Build the image from the Dockerfile and run the container to set up the work environment:

docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME

How should the container be tuned for high performance when collecting huge datasets?

3 Toolkit

A diagram showing the proposed development environment and toolkit.

Scrapy · VS Code · Google Colab · Jupyter · Copilot · draw.io

4 DataPipeline for scraping

The picture below illustrates the process by which files are collected and stored for each provided URL.

(Figure: DataPipeline)
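A minimal sketch of the first pipeline stage, assuming the page has already been fetched: parse the HTML, extract the links, and keep only those pointing to downloadable documents so later stages know what to store. The class and function names here are illustrative, not taken from the repository.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# File formats named in this README as scraping targets.
DOCUMENT_EXTS = {".pdf", ".csv", ".xls", ".xlsx", ".dta"}

class FileLinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_file_links(base_url, html):
    """Return absolute URLs on the page that point to downloadable documents."""
    parser = FileLinkExtractor()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [url for url in absolute
            if any(url.lower().endswith(ext) for ext in DOCUMENT_EXTS)]
```

In the real pipeline a Scrapy spider would yield these URLs as download requests; here the extraction logic is shown standalone on stdlib only.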

4.1 Technology Stack

This technology stack declares the components that communicate across the data-flow design, from scanning the website and mining data, through collection and ingestion, to processing, storage, and plotting, so that the pipeline stays reliable and workable.

(Figure: Technology stack)
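One way to read that flow is as a chain of stages, each consuming the previous stage's output. The sketch below wires hypothetical stage functions together; the stage names mirror the description above and stand in for the real Scrapy/NiFi/AWS components.

```python
def run_pipeline(url, stages):
    """Pass `url` through each stage in order: scan -> mine -> store."""
    data = url
    for stage in stages:
        data = stage(data)
    return data

# Toy stages standing in for the real components.
scan = lambda url: {"url": url, "pages": ["p1", "p2"]}          # crawl the site
mine = lambda site: {**site, "files": ["a.pdf", "b.csv"]}       # find document links
store = lambda site: f"stored {len(site['files'])} files from {site['url']}"

result = run_pipeline("https://example.cl", [scan, mine, store])
```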

5 Architecture of component design

Three components are presented here across the data-collection software cycle with Scrapy. Given a target URL, the crawler finds common and media files and stores them in AWS services: DynamoDB for structured (key-value) data and an S3 bucket for plain documents. Two helper components serve specific purposes: once the data is retrieved, it is plotted in Tableau, while the Selenium component provides tools for clicking dynamic JS events in order to obtain valid download links for files.

(Figure: Application architecture)
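A hedged sketch of the storage component described above: build the key-value item for DynamoDB and, only when boto3 is importable, upload the document to S3 and register the record. The bucket and table names are placeholders, not the repository's actual configuration.

```python
try:
    import boto3  # optional: only needed for the actual AWS calls
except ImportError:
    boto3 = None

def build_item(url, s3_key, fmt):
    """Key-value record for DynamoDB: the source URL is the key, the rest is metadata."""
    return {"url": url, "s3_key": s3_key, "format": fmt}

def store_file(local_path, url, fmt, bucket="my-scrape-bucket", table="scraped_files"):
    """Upload the document to S3 and register its metadata in DynamoDB."""
    key = f"{fmt}/{local_path.rsplit('/', 1)[-1]}"
    item = build_item(url, key, fmt)
    if boto3 is not None:
        boto3.client("s3").upload_file(local_path, bucket, key)
        boto3.resource("dynamodb").Table(table).put_item(Item=item)
    return item
```

`upload_file` and `Table(...).put_item(Item=...)` are the standard boto3 calls for these two services; credentials are assumed to come from the environment as usual.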

6 Storing

A. Amazon S3 buckets

The picture below shows how the files populate Amazon's distributed cloud storage (S3 buckets). Built by web scraping several Chilean websites, this store is a pure document database in which each file was retrieved in one of a variety of formats: PDF, CSV, XLS, Stata, and more.

(Figure: S3 buckets on AWS)
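Under this layout, a natural convention (an assumption for illustration, not necessarily the repository's actual scheme) is to derive each S3 key from the source URL, with a per-format prefix so every document type lands in its own folder:

```python
from urllib.parse import urlparse
import posixpath

# Map file extensions to S3 key prefixes (one folder per format).
FORMAT_PREFIXES = {".pdf": "pdf", ".csv": "csv", ".xls": "xls", ".xlsx": "xls", ".dta": "stata"}

def s3_key_for(url):
    """Map a file URL to an S3 key like 'pdf/report.pdf', or None for unrecognized formats."""
    path = urlparse(url).path
    name = posixpath.basename(path)
    ext = posixpath.splitext(name)[1].lower()
    prefix = FORMAT_PREFIXES.get(ext)
    return f"{prefix}/{name}" if prefix else None
```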

References

BeautifulSoup: interfaces for reliable connections to a target URL.

Scrapy

Selenium

Xpath

Regular expressions (RegEx)

Storing data in the cloud (AWS)

Tableau - Data Visualization

DynamoDB and its purposes (AWS)

Big Data and Data flow design: Apache NiFi

NoSQL Databases and Graph queries

