
court-data-pipeline's Introduction

Court Data Pipeline

Purpose

The Pew Charitable Trusts' civil legal modernization project seeks to make civil court systems more open, efficient, and equitable by promoting policies, processes, and technologies that can improve outcomes for civil litigants. As part of this effort, a data standard and storage format are being developed in partnership with court administrators and information technology specialists to make courthouse information more accessible to the public via internet search engines.

Overview

The tool scrapes JSON-LD data from court websites, validates it against a SHACL schema, and stores the linked data as triples in a database. Web scraping is done using Scrapy, validation with pyshacl, and storage with oxrdflib. The standard vocabulary is provided by Schema.org and is supplemented with extensions developed for this project.

This version of the tool ingests URLs from a CSV file and scrapes the websites they identify for JSON-LD data. Any JSON-LD found is validated against a SHACL file before being stored in an RDF triplestore. Imported data can be retrieved in JSON format by running db_exporter.py, located in the scripts folder.
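
The validate-then-store step can be pictured with the short sketch below. This is only a sketch under assumed file names and paths (shapes.ttl, the database directory); it is not the pipeline's actual code, and the exact oxrdflib calls for on-disk persistence may differ by version.

from rdflib import Graph
from pyshacl import validate

# Parse a scraped JSON-LD file into an in-memory graph
data_graph = Graph()
data_graph.parse("los-angeles-county.json", format="json-ld")

# Load the SHACL shapes (file name and path assumed for illustration)
shacl_graph = Graph()
shacl_graph.parse("data/defs/shapes.ttl", format="turtle")

conforms, _, report_text = validate(data_graph, shacl_graph=shacl_graph)
if conforms:
    # Persist the triples in an Oxigraph-backed store via oxrdflib;
    # opening a directory path this way is an assumption and may vary by version
    store = Graph(store="Oxigraph")
    store.open("absolute/path/to/database")
    store += data_graph
    store.close()
else:
    print(report_text)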

Requirements

  • Python >= 3.10
  • pipenv >= 2022.10.11

Setting Up the Environment

  1. Create a virtual environment with pipenv and install dependencies by executing the following commands in a terminal or command prompt window.
pipenv sync
pipenv shell
  2. Create a .env file in the root directory and define a variable that indicates where the database will be stored. The file should contain the following:
DB_LOC = "absolute/path/to/database"
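
The script reads this variable at runtime. A minimal sketch of how it might be loaded, assuming python-dotenv is used (the pipeline's actual loading mechanism may differ):

import os
from dotenv import load_dotenv

load_dotenv()                  # read variables from the .env file in the project root
db_loc = os.getenv("DB_LOC")   # absolute path to the on-disk triplestore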

Running the Script

  1. Start a local server; it will serve the contents of the current directory on port 8000. The definition files (under data/defs) are fetched over HTTP during validation (e.g., from http://localhost:8000/data/defs/), so validation will fail if they cannot be reached.

Note: these resources are now available online, which should obviate the need to host them locally. However, the current version still requires local hosting, so this step remains necessary; it will be removed in a later release.

python -m http.server

  2. Webpages to be scraped are provided to the script by passing a CSV file as an argument when executing the .py file. The CSV can live anywhere, as long as the path passed in the argument is valid. Execute the following command to run the script with the sample data.

python app.py data/sites/websites.csv
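
The sample file lists one court website URL per row. The exact header expected by the script (if any) is not documented here, so the column name and URLs below are illustrative assumptions only:

url
https://www.courts.ca.gov/los-angeles-county.html
https://www.courts.ca.gov/san-diego-county.html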

  3. The pipeline terminates after storing the data ingested by the web scraper in an RDF data store. Queries against the database must be written in SPARQL. Examples of simple queries are included in the queries directory and can be run by executing the run_query.py script; by default, the script returns all records. To run the other examples, change the path in the open statement on Line 26 of run_query.py to queries/basic.sparql or queries/more_advanced.sparql. Results are saved to queries/results/ by default; a custom destination can be specified with the --loc argument (e.g., python run_query.py --loc your/path/here/).
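
The bundled query files are not reproduced here, but a minimal SPARQL query that returns every stored triple (the default "all records" behavior) looks like this:

SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object .
}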

court-data-pipeline's People

Contributors

bguayante, jdziurlaj, jungshadow

Watchers

James Cloos, Amanda Brown

court-data-pipeline's Issues

Wishlist for Pipeline

A few things I'd like to see:

  • Build application on argparse (see the sketch after this list)
  • Split the application code into logical sections/modules (e.g., db, scraper, exporter)
  • #3
  • #4

You can create individual Issues on any of these for discussion as you see fit.
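
A minimal argparse sketch of how those modules might be wired together as subcommands; the command names and the dispatch target mentioned in the comment are hypothetical, purely to illustrate the structure:

import argparse

def main():
    parser = argparse.ArgumentParser(description="Court data pipeline")
    subparsers = parser.add_subparsers(dest="command", required=True)

    scrape = subparsers.add_parser("scrape", help="scrape JSON-LD from the sites in a CSV")
    scrape.add_argument("csv_path", help="path to a CSV of court website URLs")

    subparsers.add_parser("export", help="export stored data as JSON")

    query = subparsers.add_parser("query", help="run a SPARQL query against the store")
    query.add_argument("--loc", default="queries/results/", help="directory for query results")

    args = parser.parse_args()
    # Dispatch to the relevant module here, e.g. scraper.run(args.csv_path)
    print(args)

if __name__ == "__main__":
    main()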

Task Tracking

  • Update ontology URI in validator and SHACL file when provided path by Margaret/Stanford
  • Update validator to fetch SHACL file each run to ensure parity with updates to ontology (need above URI)
  • Convert SPARQL query output to CSV
  • Add License file
  • Integrate pipeline modules into argparse
  • Merge dev and main, delete dev

Make SHACL file accessible remotely

As the definitions and the SHACL file have to be kept in parity, and because the definitions will be hosted remotely, I need to add logic to download the SHACL file from the same remote source, save it locally, and then use it with pyshacl in validator.py.
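
A minimal sketch of that fetch-then-validate flow, assuming the requests library and a placeholder URL (the real ontology URI is still to be provided):

import requests
from rdflib import Graph
from pyshacl import validate

SHACL_URL = "https://example.org/court-ontology/shapes.ttl"  # placeholder; real URI TBD

def fetch_shacl(dest="data/defs/shapes.ttl"):
    # Download the remote SHACL file and save it locally so validation
    # always runs against the latest published shapes
    response = requests.get(SHACL_URL, timeout=30)
    response.raise_for_status()
    with open(dest, "wb") as f:
        f.write(response.content)
    return dest

def validate_file(jsonld_path):
    shacl_path = fetch_shacl()
    data_graph = Graph().parse(jsonld_path, format="json-ld")
    shacl_graph = Graph().parse(shacl_path, format="turtle")
    return validate(data_graph, shacl_graph=shacl_graph)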

Naming Convention for JSON-LD Files

Currently, the JSON-LD files that are scraped are saved locally using the URL of the site from which they originated. For example, JSON-LD hosted at https://www.courts.ca.gov/los-angeles-county.html would be saved by the scraper as https://www.courts.ca.gov/los-angeles-county.json. This file name is used during the validation process. Validated files are renamed using string manipulation to extract just the jurisdiction name (los-angeles-county.json). That file is then used to import new data into the DB.

For the moment, this is fine, but I am using dev data with a standardized naming scheme. URLs encountered in the wild are unlikely to parse so cleanly, given the lack of standardization among court sites. At the moment, I can think of two solutions:

  1. Continue to use the URL as the filename of the JSON that is scraped but do not rename it after validation. It will be passed to the DB as {url}.json.
  2. As the URLs to scrape are passed to the script as a CSV provided as an argument when the script is executed, require an additional column that provides some identifier for the courthouse or its jurisdiction and use that as the filename throughout the script.

I think (1) is probably the way to go. There was a benefit to using simpler names earlier in development that is lost now that everything is automated. While more information is always good, (2) puts additional burden on the administrators running the script and I think a goal is to make this process as easy as possible.

Regardless of approach, there is one other issue: URLs make bad filenames because of the / character. Slashes get parsed as directory separators by the file system and throw errors when the files are accessed by the script. Is there a standard replacement character, or can we choose one? I'm currently replacing / with ., but that's also a symbol used by the file system and might have unintended consequences.
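
For illustration only, one standard and reversible option is percent-encoding the URL with urllib.parse.quote, which replaces / (and the other reserved characters) rather than relying on an ad-hoc substitute:

from urllib.parse import quote, unquote

def url_to_filename(url):
    # Percent-encode every reserved character, including "/", so the result
    # is a single safe path component; unquote() reverses it losslessly
    return quote(url, safe="") + ".json"

print(url_to_filename("https://www.courts.ca.gov/los-angeles-county.html"))
# https%3A%2F%2Fwww.courts.ca.gov%2Flos-angeles-county.html.json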

@jungshadow and @JDziurlaj, I'd appreciate your input.
