
court-data-pipeline's Introduction

Court Data Pipeline

Purpose

The Pew Charitable Trusts' civil legal modernization project seeks to make civil court systems more open, efficient, and equitable by promoting policies, processes, and technologies that can improve outcomes for civil litigants. As part of this effort, a data standard and storage format are being developed in partnership with court administrators and information technology specialists to make courthouse information more accessible to the public via internet search engines.

Overview

The tool scrapes JSON-LD data from court websites, validates it against a SHACL schema, and stores the linked data as triples in a database. Web scraping is done using Scrapy, validation with pyshacl, and storage with oxrdflib. The standard vocabulary is provided by Schema.org and is supplemented with extensions developed for this project.

This version of the tool ingests URLs from a CSV file and scrapes the websites they identify for JSON-LD data. Any JSON-LD found is validated against a SHACL file before being stored in an RDF triplestore. Imported data can be retrieved in JSON format by running db_exporter.py, located in the scripts folder.
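
The validate-then-store step can be pictured with the short sketch below. This is only a sketch under assumed file names and paths (shapes.ttl, the database directory); it is not the pipeline's actual code, and the exact oxrdflib calls for on-disk persistence may differ by version.

from rdflib import Graph
from pyshacl import validate

# Parse a scraped JSON-LD file into an in-memory graph
data_graph = Graph()
data_graph.parse("los-angeles-county.json", format="json-ld")

# Load the SHACL shapes (file name and path assumed for illustration)
shacl_graph = Graph()
shacl_graph.parse("data/defs/shapes.ttl", format="turtle")

conforms, _, report_text = validate(data_graph, shacl_graph=shacl_graph)
if conforms:
    # Persist the triples in an Oxigraph-backed store via oxrdflib;
    # opening a directory path this way is an assumption and may vary by version
    store = Graph(store="Oxigraph")
    store.open("absolute/path/to/database")
    store += data_graph
    store.close()
else:
    print(report_text)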

Requirements

  • Python >= 3.10
  • pipenv >= 2022.10.11

Setting Up the Environment

  1. Create a virtual environment with pipenv and install dependencies by executing the following commands in a terminal or command prompt window.
pipenv sync
pipenv shell
  2. Create a .env file in the root directory and define a variable that indicates where the database will be stored. The file should contain the following:
DB_LOC = "absolute/path/to/database"
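
The script reads this variable at runtime. A minimal sketch of how it might be loaded, assuming python-dotenv is used (the pipeline's actual loading mechanism may differ):

import os
from dotenv import load_dotenv

load_dotenv()                  # read variables from the .env file in the project root
db_loc = os.getenv("DB_LOC")   # absolute path to the on-disk triplestore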

Running the Script

  1. Start a local server; it will serve the contents of the current directory on port 8000. The definition files (under data/defs) are fetched over HTTP during validation (e.g., from http://localhost:8000/data/defs/), so validation will fail if they cannot be reached.

Note: these resources are now available online, which should obviate the need to host them locally. However, the current version still requires local hosting, so this step remains necessary; it will be removed in a later release.

python -m http.server

  2. Webpages to be scraped are provided to the script by passing a CSV file as an argument when executing the .py file. The CSV can live anywhere, as long as the path passed in the argument is valid. Execute the following command to run the script with the sample data.

python app.py data/sites/websites.csv
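
The sample file lists one court website URL per row. The exact header expected by the script (if any) is not documented here, so the column name and URLs below are illustrative assumptions only:

url
https://www.courts.ca.gov/los-angeles-county.html
https://www.courts.ca.gov/san-diego-county.html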

  3. The pipeline terminates after storing the data ingested by the web scraper in an RDF data store. Queries against the database must be written in SPARQL. Examples of simple queries are included in the queries directory and can be run by executing the run_query.py script; by default, the script returns all records. To run the other examples, change the path in the open statement on Line 26 of run_query.py to queries/basic.sparql or queries/more_advanced.sparql. Results are saved to queries/results/ by default; a custom destination can be specified with the --loc argument (e.g., python run_query.py --loc your/path/here/).
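
The bundled query files are not reproduced here, but a minimal SPARQL query that returns every stored triple (the default "all records" behavior) looks like this:

SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object .
}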

court-data-pipeline's People

Contributors

bguayante, jdziurlaj, jungshadow

Watchers

James Cloos, Amanda Brown

court-data-pipeline's Issues

Wishlist for Pipeline

A few things I'd like to see:

  • Build application on argparse (see the sketch after this list)
  • Split the application code into logical sections/modules (e.g., db, scraper, exporter)
  • #3
  • #4

You can create individual Issues on any of these for discussion as you see fit.
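
A minimal argparse sketch of how those modules might be wired together as subcommands; the command names and the dispatch target mentioned in the comment are hypothetical, purely to illustrate the structure:

import argparse

def main():
    parser = argparse.ArgumentParser(description="Court data pipeline")
    subparsers = parser.add_subparsers(dest="command", required=True)

    scrape = subparsers.add_parser("scrape", help="scrape JSON-LD from the sites in a CSV")
    scrape.add_argument("csv_path", help="path to a CSV of court website URLs")

    subparsers.add_parser("export", help="export stored data as JSON")

    query = subparsers.add_parser("query", help="run a SPARQL query against the store")
    query.add_argument("--loc", default="queries/results/", help="directory for query results")

    args = parser.parse_args()
    # Dispatch to the relevant module here, e.g. scraper.run(args.csv_path)
    print(args)

if __name__ == "__main__":
    main()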

Task Tracking

  • Update ontology URI in validator and SHACL file when provided path by Margaret/Stanford
  • Update validator to fetch SHACL file each run to ensure parity with updates to ontology (need above URI)
  • Convert SPARQL query output to CSV
  • Add License file
  • Integrate pipeline modules into argparse
  • Merge dev and main, delete dev

Make SHACL file accessible remotely

As the definitions and the SHACL file have to be kept in parity, and because the definitions will be hosted remotely, I need to add logic to download the SHACL file from the same remote source, save it locally, and then use it with pyshacl in validator.py.
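
A minimal sketch of that fetch-then-validate flow, assuming the requests library and a placeholder URL (the real ontology URI is still to be provided):

import requests
from rdflib import Graph
from pyshacl import validate

SHACL_URL = "https://example.org/court-ontology/shapes.ttl"  # placeholder; real URI TBD

def fetch_shacl(dest="data/defs/shapes.ttl"):
    # Download the remote SHACL file and save it locally so validation
    # always runs against the latest published shapes
    response = requests.get(SHACL_URL, timeout=30)
    response.raise_for_status()
    with open(dest, "wb") as f:
        f.write(response.content)
    return dest

def validate_file(jsonld_path):
    shacl_path = fetch_shacl()
    data_graph = Graph().parse(jsonld_path, format="json-ld")
    shacl_graph = Graph().parse(shacl_path, format="turtle")
    return validate(data_graph, shacl_graph=shacl_graph)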

Naming Convention for JSON-LD Files

Currently, the JSON-LD files that are scraped are saved locally using the URL of the site from which they originated. For example, JSON-LD hosted at https://www.courts.ca.gov/los-angeles-county.html would be saved by the scraper as https://www.courts.ca.gov/los-angeles-county.json. This file name is used during the validation process. Validated files are renamed using string manipulation to extract just the jurisdiction name (los-angeles-county.json). That file is then used to import new data into the DB.

For the moment, this is fine, but I am using dev data with a standardized naming scheme. URLs encountered in the wild are unlikely to parse so cleanly, given the lack of standardization among court sites. At the moment, I can think of two solutions:

  1. Continue to use the URL as the filename of the JSON that is scraped but do not rename it after validation. It will be passed to the DB as {url}.json.
  2. As the URLs to scrape are passed to the script as a CSV provided as an argument when the script is executed, require an additional column that provides some identifier for the courthouse or its jurisdiction and use that as the filename throughout the script.

I think (1) is probably the way to go. There was a benefit to using simpler names earlier in development that is lost now that everything is automated. While more information is always good, (2) puts additional burden on the administrators running the script and I think a goal is to make this process as easy as possible.

Regardless of approach, there is one other issue: URLs make bad filenames because of the / character. Slashes get parsed as directory separators by the file system and throw errors when the files are accessed by the script. Is there a standard replacement character, or can we choose one? I'm currently replacing / with ., but that's also a symbol used by the file system and might have unintended consequences.
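
For illustration only, one standard and reversible option is percent-encoding the URL with urllib.parse.quote, which replaces / (and the other reserved characters) rather than relying on an ad-hoc substitute:

from urllib.parse import quote, unquote

def url_to_filename(url):
    # Percent-encode every reserved character, including "/", so the result
    # is a single safe path component; unquote() reverses it losslessly
    return quote(url, safe="") + ".json"

print(url_to_filename("https://www.courts.ca.gov/los-angeles-county.html"))
# https%3A%2F%2Fwww.courts.ca.gov%2Flos-angeles-county.html.json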

@jungshadow and @JDziurlaj, I'd appreciate your input.
