open-gov-crawlers's Introduction

Open-gov spiders written in Python

Jurisdiction	Document	Source code	Dataset
Australia	Family, domestic and sexual violence...	`parser` | `spider` | `tests`	`json`
Australia	IP Glossary	`parser` | `spider` | `tests`	`json`
Canada	Dept. of Justice Legal Glossaries	`parser` | `spider` | `tests`	`json`
Canada	Glossary of Parliamentary Terms for...	`parser` | `spider` | `tests`	`json`
Intergovernmental	Rome Statute	`parser` | `spider` | `tests`	`json`
Ireland	Glossary of Legal Terms	`parser` | `spider` | `tests`	`json`
New Zealand	Glossary	`parser` | `spider` | `tests`	`json`
USA	US Courts Glossary	`parser` | `spider` | `tests`	`json`
USA	USCIS Glossary	`parser` | `spider` | `tests`	`json`
USA / Georgia	Attorney General Opinions	`parser` | `spider` | `tests`	
USA / Oregon	Oregon Administrative Rules	`parser` | `spider` | `tests`	

The Ireland glossary parser is the best example of our coding style. See the wiki for a technical explanation of our parsing strategy.

Example: Oregon Administrative Rules Parser

The spiders retrieve HTML pages and output well-formed JSON that mirrors the structure of the source. First, we can see which spiders are available:

$ scrapy list

aus_ip_glossary
can_doj_glossaries
int_rome_statute
...

Then we can run one of the spiders:

$ scrapy crawl --overwrite-output tmp/output.json usa_or_regs

This produces:

{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. &amp; cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. &amp; cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00"
            }
          ]
        }
      ]
    }
  ]
}

(etc.)

The Wiki explains the JSON strategy.
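
Because the output nests chapters, divisions, and rules, a consumer can walk it with three loops. The following is only a minimal sketch, assuming the shape shown in the sample output; `SAMPLE` and `iter_rules` are hypothetical names for illustration and are not part of this repository.

```python
import json

# Trimmed sample shaped like the usa_or_regs output (an assumption for
# illustration; real output has more fields and many more entries).
SAMPLE = """
{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "number": "101",
      "divisions": [
        {
          "kind": "Division",
          "number": "1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes"
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_rules(data):
    """Yield (chapter number, division number, rule dict) for every rule."""
    for chapter in data.get("chapters", []):
        for division in chapter.get("divisions", []):
            for rule in division.get("rules", []):
                yield chapter["number"], division["number"], rule


data = json.loads(SAMPLE)
rules = list(iter_rules(data))
for chapter_no, division_no, rule in rules:
    print(chapter_no, division_no, rule["number"], rule["name"])
```

To read a real crawl instead of the inline sample, load the file written by the crawl command (for example `tmp/output.json`) with `json.load`.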

Development Environment Notes

Python 3.10

I install Python with asdf because its Homebrew distribution is more up to date than pyenv's.

Poetry for dependency management

Before I start working, I enter the virtual environment:

poetry shell

It's also good to make sure the current dependencies are installed:

poetry install

Pytest for testing

The pytest tests run easily:

pytest
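
For readers new to pytest, a test is just a function whose name starts with `test_` and which uses plain `assert` statements. The sketch below is illustrative only: `split_history` is a hypothetical helper, not a function from this repository, and its input is taken from the `history` field in the sample output.

```python
import html


def split_history(raw: str) -> list[str]:
    """Hypothetical helper: split a rule's history field into clean entries.

    Entries are separated by <br>, and HTML entities such as &amp; are decoded.
    """
    return [html.unescape(part).strip() for part in raw.split("<br>") if part.strip()]


def test_split_history():
    raw = (
        "PEBB 1-2009(Temp), f. &amp; cert. ef. 2-24-09 thru 8-22-09"
        "<br>PEBB 1-2004, f. &amp; cert. ef. 7-2-04"
    )
    assert split_history(raw) == [
        "PEBB 1-2009(Temp), f. & cert. ef. 2-24-09 thru 8-22-09",
        "PEBB 1-2004, f. & cert. ef. 7-2-04",
    ]
```

Saved in a file whose name starts with `test_`, this would be collected and run by the `pytest` command.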

Other tools

Java is required by the Python Tika package.
Pylance/Pyright for type-checking
Black for formatting

Dependencies; helpful links

It has a small glitch, though: it usually runs all the tests twice when I save in VS Code.

from typing import NamedTuple


class Article(NamedTuple):
    """An 'Article' in the Rome Statute; an actual readable section of the
    statute. An Article belongs to one Part."""

    name: str
    number: str  # Is string because of numbers like "8 bis".
    text: str
    part_number: int
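
A short sketch of how such a record could be built and turned into JSON. The field values are placeholders chosen for illustration, and the serialization step is an assumption rather than the project's actual pipeline. Note that passing a `NamedTuple` directly to `json.dumps` produces a JSON array, so the sketch converts it with `_asdict()` first to keep the field names.

```python
import json
from typing import NamedTuple


class Article(NamedTuple):
    """Same fields as the Article class above, repeated so this runs alone."""

    name: str
    number: str  # String because of numbers like "8 bis".
    text: str
    part_number: int


# Placeholder values for illustration only.
article = Article(
    name="Crime of aggression",
    number="8 bis",
    text="(text omitted)",
    part_number=2,
)

# _asdict() keeps the field names; json.dumps(article) would emit a bare list.
as_json = json.dumps(article._asdict())
print(as_json)
```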