
Swatch Bharat Urban Crawler

This crawler crawls the complete website https://sbmurban.org/rrr-centers and extracts all of the information on it. It is built with Scrapy and shows how to crawl ASP.NET websites that rely on the __VIEWSTATE field.

About

  • This crawler was built as a task for ATLAN.
  • The complete data from the website was crawled and stored in a single file.
  • This was a new kind of task, and it involved learning how to scrape ASP.NET websites that use __VIEWSTATE (following https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition as a tutorial; a minimal sketch follows this list).
  • At the end of a complete scrape, a POST request is made to the URL specified.
  • A setup.py file has also been added.
  • The extracted file contains the following columns:
    • State
    • District
    • ULB Name
    • Ward
    • No. of Applications Received
    • No. of Applications Not Verified
    • No. of Applications Verified
    • No. of Applications Approved
    • No. of Applications Approved having Aadhar No.
    • No. of Applications Rejected
    • No. of Applications Pullback
    • No. of Applications Closed
    • No. of Constructed Toilet Photo
    • No. of Commenced Toilet Photo
    • No. of Constructed Toilet Photo through Swachhalaya
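
As a rough illustration of the __VIEWSTATE handling, here is a minimal sketch of how such a spider can be structured with Scrapy's FormRequest.from_response, which copies the hidden ASP.NET state fields into the postback automatically. The form-data key and CSS selectors below are hypothetical placeholders, not the actual page's control names.

```python
import scrapy


class SbmUrbanSketchSpider(scrapy.Spider):
    """Minimal sketch of crawling an ASP.NET page that uses __VIEWSTATE."""

    name = "sbmurban_sketch"
    start_urls = ["https://sbmurban.org/rrr-centers"]

    def parse(self, response):
        # ASP.NET pages round-trip hidden state fields on every postback.
        # FormRequest.from_response copies hidden inputs such as
        # __VIEWSTATE and __EVENTVALIDATION into the POST body for us.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                # Hypothetical control name; the real page defines its own.
                "__EVENTTARGET": "ctl00$ContentPlaceHolder1$grid",
            },
            callback=self.parse_table,
        )

    def parse_table(self, response):
        # Hypothetical row/cell selectors, for illustration only.
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}
```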

Doubts/Assumptions

  1. DOUBT => How can we make a POST request every 5 minutes, when a full crawl of the data itself takes much longer than that?

ASSUMPTION => To make the POST request every 5 minutes, we can deploy the project to Scrapinghub and schedule it to crawl every 5 minutes. The crawler is built so that it makes a POST request once crawling completes, so the data is posted automatically (see the sketch after this section).

  2. DOUBT => How many output files are required? One file containing all the information, or four files covering the four different levels (State, District, ULB, and Ward)?

ASSUMPTION => I have produced only one CSV file, matching the table shown in the task and containing all the information, since the other views can easily be derived from it.
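
The post-on-completion behaviour can be implemented with Scrapy's spider closed() hook, which runs once the crawl finishes. The sketch below assumes the requests library and a placeholder POST_URL; the actual endpoint and payload format come from the task specification and are not shown here.

```python
import requests
import scrapy


class SwatchBharatUrbanSpider(scrapy.Spider):
    name = "swatchbharaturban_crawler"

    # ... parsing logic omitted ...

    def closed(self, reason):
        # Scrapy calls closed() automatically when the crawl ends,
        # so this is a natural place to push the finished CSV.
        # POST_URL is a hypothetical placeholder for the URL
        # specified in the task.
        POST_URL = "https://example.com/upload"
        path = "swatchbharaturban_crawler/data/swatchbharat_data.csv"
        with open(path, "rb") as f:
            requests.post(POST_URL, files={"file": f})
```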

How to Setup

  1. Clone the repository
git clone https://github.com/sagar-sehgal/SwatchBharaturban_Crawler
  2. Create a virtual environment
virtualenv venv --python=python3
  3. Activate the virtualenv
source venv/bin/activate
  4. Change into the repository directory
cd SwatchBharaturban_Crawler
  5. Install the dependencies
pip install -r requirements.txt
  6. Run the crawler
scrapy crawl swatchbharaturban_crawler

The crawled data will be stored in the swatchbharaturban_crawler/data/swatchbharat_data.csv file.
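
For reference, one way such an output path can be wired up is through Scrapy's feed exports in settings.py. This is a sketch under the assumption of Scrapy 2.1+; the actual project may instead write the CSV from an item pipeline.

```python
# settings.py (sketch): route scraped items into the CSV using
# Scrapy's feed exports (available since Scrapy 2.1). The actual
# project may produce the file differently.
FEEDS = {
    "swatchbharaturban_crawler/data/swatchbharat_data.csv": {
        "format": "csv",
    },
}
```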
