stl-public-meetings / city-scrapers-stl

Scrape, standardize and share public meetings from local government websites in St. Louis

License: MIT License

Languages: Python 99.89%, Shell 0.11%
Topics: city-scrapers, open-data, python, scrapy, web-scraping

city-scrapers-stl's People

Contributors

bchao99, ledaliang, pjsier, vishalvishw10

city-scrapers-stl's Issues

Archive scraped pages and documents on the Wayback Machine

Great work getting this up and running so quickly! One of the aspects of the City Scrapers project we haven't documented very well is our use of a Python package we created, scrapy-wayback-middleware.

The overall goal of the City Scrapers project is to improve transparency and create an archive not just of upcoming meetings, but also of past meetings, related documents, and how they change over time. An important part of that for us has been archiving (almost) every page and document we scrape on the Internet Archive's Wayback Machine as well as in our static output.

Having a second, more public and accessible location keeps the meeting information available regardless of how long the project lasts. We've even used it to track potential violations of open meetings laws, since it provides an external source for seeing what content was or was not on a website at a given time. Here's an example of snapshots of the Chicago Plan Commission's website over time.

The downside of this approach is that it can make cron builds take significantly longer, but we're currently well under the 6-hour GitHub Actions time limit with over 100 scrapers in the main City Scrapers repo.

If you're interested, you can add scrapy-wayback-middleware as a dependency, and then you'll likely want to subclass the middleware so it also archives any documents you find, as we've done in our main middleware.py. Then you can add it in your settings/prod.py, as we did in our settings.
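As a rough illustration of the "also scrape any documents you find" step, here is a minimal sketch of the URL-collection logic such a subclass might use. It assumes each meeting item carries a `links` list of dicts with an `href` key; the function name `collect_document_urls` is hypothetical, and in a real subclass this logic would sit in whichever hook the middleware provides for extra URLs.

```python
def collect_document_urls(item):
    """Return the unique, non-empty document URLs attached to a scraped item.

    Assumption: `item` is dict-like and may hold a "links" list of dicts,
    each with an "href" key (for example agendas or minutes).
    """
    urls = []
    for link in item.get("links") or []:
        href = (link or {}).get("href")
        # Skip empty hrefs and avoid requesting the same archive URL twice.
        if href and href not in urls:
            urls.append(href)
    return urls


# Illustrative item with a duplicate and an empty link.
example_item = {
    "title": "Board Meeting",
    "links": [
        {"href": "https://example.com/agenda.pdf", "title": "Agenda"},
        {"href": "", "title": "Broken link"},
        {"href": "https://example.com/agenda.pdf", "title": "Agenda (again)"},
    ],
}
print(collect_document_urls(example_item))
```

Deduplicating here keeps the number of archive requests down, which matters given the build-time cost mentioned above.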

We only activate it when the WAYBACK_ENABLED environment variable is set, and the template cron.yml file already sets this, so once it's added in your settings file you should be good to go!
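The environment-variable gating could look roughly like the sketch below for a settings/prod.py file. The dotted middleware path, the priority value 500, and the helper name `build_spider_middlewares` are assumptions for illustration, not taken from the repo; only `SPIDER_MIDDLEWARES` (a standard Scrapy setting) and `WAYBACK_ENABLED` come from the discussion above.

```python
import os

# Hypothetical dotted path to a project-specific subclass of the Wayback
# middleware; adjust it to wherever the subclass actually lives.
WAYBACK_MIDDLEWARE_PATH = "city_scrapers.middleware.CityScrapersWaybackMiddleware"


def build_spider_middlewares(environ=None, base=None):
    """Return a SPIDER_MIDDLEWARES dict, adding the Wayback middleware
    only when WAYBACK_ENABLED is set to a non-empty value."""
    environ = os.environ if environ is None else environ
    middlewares = dict(base or {})
    # Any non-empty value enables archiving (even "0" or "false"),
    # so leave the variable unset to disable it.
    if environ.get("WAYBACK_ENABLED"):
        middlewares[WAYBACK_MIDDLEWARE_PATH] = 500  # priority is an assumption
    return middlewares


# Scrapy reads this module-level name from the settings file.
SPIDER_MIDDLEWARES = build_spider_middlewares()
```

With this shape, local runs skip archiving by default, and only builds that export WAYBACK_ENABLED pay the extra time.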

Let me know if you have any questions, and I'm happy to put in a PR for this if it's helpful.
