public-law / open-gov-crawlers

Parse government documents into well-formed JSON

Python 99.58% Shell 0.10% Crystal 0.32%
scrapy opengov crawler a2j opendata rome-statute

open-gov-crawlers's Introduction

Open-gov spiders written in Python

Jurisdiction       Document                                 Source code              Dataset
Australia          Family, domestic and sexual violence...  parser | spider | tests  json
Australia          IP Glossary                              parser | spider | tests  json
Canada             Dept. of Justice Legal Glossaries        parser | spider | tests  json
Canada             Glossary of Parliamentary Terms for...   parser | spider | tests  json
Intergovernmental  Rome Statute                             parser | spider | tests  json
Ireland            Glossary of Legal Terms                  parser | spider | tests  json
New Zealand        Glossary                                 parser | spider | tests  json
USA                US Courts Glossary                       parser | spider | tests  json
USA                USCIS Glossary                           parser | spider | tests  json
USA / Georgia      Attorney General Opinions                parser | spider | tests
USA / Oregon       Oregon Administrative Rules              parser | spider | tests

The Ireland glossary parser is the best example of our coding style. See the wiki for a technical explanation of our parsing strategy.

Example: Oregon Administrative Rules Parser

The spiders retrieve HTML pages and output well-formed JSON that mirrors the structure of the source. First, we can see which spiders are available:

$ scrapy list

aus_ip_glossary
can_doj_glossaries
int_rome_statute
...

Then we can run one of the spiders:

$ scrapy crawl --overwrite-output tmp/output.json usa_or_regs

This produces:

{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. &amp; cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. &amp; cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00"
            }
          ]
        }
      ]
    }
  ]
}

(etc.)

The Wiki explains the JSON strategy.
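
As a rough illustration of how this nested output might be consumed, the sketch below walks the chapters, divisions, and rules of a trimmed copy of the sample. The helper name `iter_rules` is hypothetical and not part of the repository; in practice the JSON would be read from the file passed to `scrapy crawl`.

```python
import json

# A trimmed copy of the sample output; real data would come from
# tmp/output.json instead of an inline string.
SAMPLE = """
{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "divisions": [
        {
          "kind": "Division",
          "number": "1",
          "name": "Procedural Rules",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes"
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_rules(document: dict):
    """Yield (chapter number, division number, rule dict) for every rule.

    Hypothetical helper: it only assumes the nesting shown in the sample.
    """
    for chapter in document.get("chapters", []):
        for division in chapter.get("divisions", []):
            for rule in division.get("rules", []):
                yield chapter["number"], division["number"], rule


document = json.loads(SAMPLE)
rules = list(iter_rules(document))
for chapter_number, division_number, rule in rules:
    print(chapter_number, division_number, rule["number"], rule["name"])
```

Because each level keeps its children in a plain list, a flat listing of rules needs nothing more than three nested loops.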

Development Environment Notes

Python 3.10

I'm using asdf because the Homebrew distribution is more up-to-date than pyenv.

Poetry for dependency management

So before I start working, I go into the virtual environment:

poetry shell

Making sure I have the current deps installed is always good to do:

poetry install

Pytest for testing

The pytest tests run easily:

pytest

Other tools

  • Java is required by the Python Tika package.
  • Pylance/Pyright for type-checking
  • Black for formatting

Dependencies; helpful links

It has a small glitch, though: it usually runs all the tests twice when I save in VS Code.

open-gov-crawlers's People

Contributors

dependabot[bot], dogweather, lobst4r

open-gov-crawlers's Issues

Data publishing

Try different methods of publishing the scraping results (the JSON data). E.g.:

  • Kaggle.
  • A GitHub repo, dedicated to this purpose.
  • Zyte's thing, whatever it's called.

After trying each of the above, I've chosen the second option, a dedicated repo for publishing data:

https://github.com/public-law/datasets

  • Choose a license
  • Write the readme with a link to the Rome Statute English.
  • Add a Dublin Core Metadata object to the Rome Statute dataset: #93

Finish the Rome Statute — English

I've created coding-challenge-style tests for this issue. All of them are pending, so they can serve as a to-do list and be enabled one by one.

To resolve this issue, the articles() function should be completed:

def articles(html: str) -> list[Article]:
    """Given the html document, return a list of Articles."""

    # TODO: finish this function, making the tests pass.
    return []

Here's the Article class from rome_statute.py:

class Article(NamedTuple):
    """An 'Article' in the Rome Statute; an actual readable
    section of the statute. An Article belongs to one Part."""

    name: str
    number: str  # Is string because of numbers like "8 bis".
    text: str
    part_number: int
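
To show why `number` is a string, here is a small sketch that orders articles so that "8 bis" sorts directly after "8". The `sort_key` helper and the sample articles are hypothetical, with placeholder names and text, and the sketch assumes numbers look like "8" or "8 bis".

```python
from typing import NamedTuple


class Article(NamedTuple):
    """Same fields as the Article class above, repeated so this runs alone."""

    name: str
    number: str
    text: str
    part_number: int


def sort_key(article: Article) -> tuple[int, int, str]:
    """Order by Part, then numeric prefix, then suffix ("8" before "8 bis").

    Assumes numbers look like "8" or "8 bis"; other shapes would raise.
    """
    head, _, suffix = article.number.partition(" ")
    return (article.part_number, int(head), suffix)


# Placeholder names and text, for illustration only.
articles = [
    Article(name="Third", number="9", text="...", part_number=2),
    Article(name="Second", number="8 bis", text="...", part_number=2),
    Article(name="First", number="8", text="...", part_number=2),
]
ordered = sorted(articles, key=sort_key)
print([a.number for a in ordered])
```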

Rome Statute — Russian

  • Find the current 2021 text, in HTML if possible, otherwise PDF.
  • Write the tests.
  • Write the parser.

Parse the OARs

  • Can all the rule numbers be enumerated?

The Rules Search page loads with the list of chapters and names:

https://secure.sos.state.or.us/oard/ruleSearch.action

A chapter page can be retrieved via its internal id:

https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36

The returned chapter page lists the Divisions as well as all the rule numbers and names. Division URLs also use the internal id:

https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=4231

An individual rule can then be pulled up like this, using the public canonical identifier:

https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-002-0010
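
The URL patterns above could be captured in a few small builders. This is a sketch only: the function names are hypothetical, and the query parameters are copied from the example URLs rather than from any documented API.

```python
from urllib.parse import urlencode

BASE = "https://secure.sos.state.or.us/oard"


def chapter_url(db_id: str) -> str:
    """URL of a chapter page, addressed by its internal id."""
    return f"{BASE}/displayChapterRules.action?{urlencode({'selectedChapter': db_id})}"


def division_url(db_id: str) -> str:
    """URL of a division page, also addressed by its internal id."""
    return f"{BASE}/displayDivisionRules.action?{urlencode({'selectedDivision': db_id})}"


def rule_url(rule_number: str) -> str:
    """URL of a single rule, addressed by its public canonical number."""
    return f"{BASE}/view.action?{urlencode({'ruleNumber': rule_number})}"


print(chapter_url("36"))
print(division_url("4231"))
print(rule_url("101-002-0010"))
```

Chapters and divisions take the internal id (the `db_id` field in the JSON output), while rules take the public number.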

Create a data-driven test case framework

The Idea: enable non-programmers to write test cases.

Context: The English version has its tests in rome_statute_test.py. Each language would need its own test file. Eventually these could be refactored together. But regardless, a programmer would need to do this work.

But we can allow people with good language skills to contribute if we completely remove the expected output from the test file and store it separately, perhaps in a spreadsheet or YAML file that the Python test file reads. This separate "database" would be two-dimensional: one dimension is the tests, such as "document title", and the other is the human languages, such as English and Russian.
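
One possible shape for such a framework is sketched below. It uses JSON from the standard library instead of a spreadsheet or YAML so that it runs without extra dependencies. The table contents are placeholders, and `load_cases`, `run_cases`, and `fake_parser` are hypothetical names, not code from the repository.

```python
import json

# Hypothetical expected-output "database": test name -> language -> value.
# The values are placeholders, not verified document titles.
EXPECTED_JSON = """
{
  "document title": {
    "English": "<expected English title>",
    "Russian": "<expected Russian title>"
  }
}
"""


def load_cases(raw: str) -> list[tuple[str, str, str]]:
    """Flatten the 2-D table into (test_name, language, expected) rows."""
    table = json.loads(raw)
    return [
        (test_name, language, expected)
        for test_name, by_language in table.items()
        for language, expected in by_language.items()
    ]


def run_cases(cases, actual_value):
    """Return the rows whose actual value differs from the expected one."""
    failures = []
    for test_name, language, expected in cases:
        actual = actual_value(test_name, language)
        if actual != expected:
            failures.append((test_name, language, expected, actual))
    return failures


def fake_parser(test_name: str, language: str) -> str:
    """Stand-in for a real per-language parser call."""
    return f"<expected {language} title>"


cases = load_cases(EXPECTED_JSON)
failures = run_cases(cases, fake_parser)
print(f"{len(cases)} cases, {len(failures)} failures")
```

With pytest, the rows returned by `load_cases` could instead be passed to `pytest.mark.parametrize`, so that each (test, language) pair is reported as its own test.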


Add exception for some glossary names

  • https://laws-lois.justice.gc.ca/eng/glossary/ -> Glossary of technical terms
  • https://www.justice.govt.nz/about/glossary/ -> Ministry of Justice glossary
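
One way to read this issue is as a per-URL override of the glossary name. The sketch below assumes that interpretation; `NAME_OVERRIDES` and `glossary_name` are hypothetical names, and the repository may handle these exceptions differently.

```python
# Hypothetical override table: source URL -> name to use for the glossary.
NAME_OVERRIDES = {
    "https://laws-lois.justice.gc.ca/eng/glossary/": "Glossary of technical terms",
    "https://www.justice.govt.nz/about/glossary/": "Ministry of Justice glossary",
}


def glossary_name(url: str, parsed_name: str) -> str:
    """Return the override for this URL if one exists, else the parsed name."""
    return NAME_OVERRIDES.get(url, parsed_name)


print(glossary_name("https://www.justice.govt.nz/about/glossary/", "Glossary"))
print(glossary_name("https://example.org/glossary/", "Some Glossary"))
```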

Install msgpack-python

[HubstorageClient] Messagepack is not available, please ensure that msgpack-python library is properly installed.
