This is a scraper that collects and processes public case records from the Tyler Technologies Odyssey county records system. The goal is to gain insight into the courts and to advocate for defendants' rights. A Google search for `"Copyright * Tyler Technologies" "Court Calendar"` will surface other candidate sites to scrape. In principle, you only need to supply the main page URL and a list of judicial officers (JOs) to scrape any Odyssey site.
Tested with:
- http://public.co.hays.tx.us/ (~4k cases from 2 months from 9 JOs)
- https://txhoododyprod.tylerhost.net/PublicAccess/ (110 cases from 1 JO)
- https://judicial.smith-county.com/PublicAccess/ (125 cases from 1 JO)
- Clone this repo.

  ```shell
  git clone https://github.com/derac/Odyssey-Court-Records-to-JSON.git
  ```

- Navigate to it (use a venv if desired).

  ```shell
  cd Odyssey-Court-Records-to-JSON
  ```

- Install the libraries.

  ```shell
  pip install -r requirements.txt
  ```
Use `--help` for command-line parameter information.

- Scrape calendar and case data by JO and day.

  ```shell
  python ./src/scraper.py
  ```

  - Output: `./data/case_html/{odyssey id}.html`

- Parse the case data into JSON files.

  ```shell
  python ./src/parser.py
  ```

  - Output: `./data/case_json/{odyssey id}.json`

- Print some stats from the JSON.

  ```shell
  python ./src/print_stats.py
  ```
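If you want to roll your own stats, a minimal sketch might look like the following. It assumes only that the parsed files are flat JSON objects; the `"charge"` key is hypothetical, and the real keys come from whatever `parser.py` emits:

```python
import json
import pathlib
from collections import Counter

def case_stats(json_dir: str) -> Counter:
    """Tally a (hypothetical) top-level "charge" field across every
    parsed case JSON file in json_dir."""
    counts = Counter()
    for path in pathlib.Path(json_dir).glob("*.json"):
        case = json.loads(path.read_text())
        counts[case.get("charge", "UNKNOWN")] += 1
    return counts
```

For example, `case_stats("./data/case_json").most_common(10)` would list the ten most frequent values.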
- The session must visit the main page before it can access the calendar search page, and to visit a case page you must first have visited a results page that links to it.
- The hidden form values are grabbed from the calendar page; `NodeID` and `NodeDesc` are grabbed from the main page's location field.
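That hand-off can be sketched as a small helper that pulls the hidden form fields out of a fetched page so they can be posted back with the calendar search. The field names in the sample markup below are illustrative only; real pages carry many more fields:

```python
import re

def extract_hidden_fields(page_html: str) -> dict:
    """Collect every <input type="hidden"> name/value pair from a page.

    Odyssey's search pages are ASP.NET-style forms, so these hidden
    fields have to be echoed back in the calendar-search POST body.
    """
    pattern = re.compile(
        r'<input[^>]*type="hidden"[^>]*name="([^"]+)"[^>]*value="([^"]*)"',
        re.IGNORECASE,
    )
    return dict(pattern.findall(page_html))

# Illustrative markup only -- a real calendar page has many more fields.
sample = """
<form id="frm">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="NodeID" value="100" />
</form>
"""
fields = extract_hidden_fields(sample)
```

The fields would then be merged into the search POST data sent through a persistent session (e.g. `requests.Session`), so the cookies set by the main page carry over.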
The command

```shell
python3.8 src/combine_parsed.py
```

runs a script that combines the parsed case files into a single .json in an S3 bucket. Currently this runs daily from a shell script, on only 1,000 files, as an example to work out a schema for Athena.
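The exact behavior of `combine_parsed.py` isn't shown here, but the general shape, minus the S3 upload, looks something like this stdlib-only sketch that merges the per-case JSON into one newline-delimited file (the layout Athena reads most easily):

```python
import json
import pathlib

def combine_to_jsonl(json_dir: str, out_path: str) -> int:
    """Merge the individual case JSON files into one newline-delimited
    JSON file and return how many cases were written.

    The combined file could then be pushed to the bucket with boto3,
    e.g. boto3.client("s3").upload_file(out_path, bucket, key).
    """
    files = sorted(pathlib.Path(json_dir).glob("*.json"))
    with open(out_path, "w") as out:
        for path in files:
            # One compact JSON object per line.
            out.write(json.dumps(json.loads(path.read_text())) + "\n")
    return len(files)
```

Newline-delimited JSON keeps each record self-contained, so Athena can infer or apply a schema without loading one giant array.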
- Some Odyssey sites have a CAPTCHA on the calendar page; this scraper can't get past that yet. Integrating 2Captcha or prompting the user to solve it manually are possible options.
- The only part that seems to break between sites is parsing party information. This needs to be recoded to work in a more layout-independent way.
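One possible direction for that recode is label-driven parsing: pair each party-role label with the adjacent cell rather than assuming fixed cell positions, which differ between sites. The role names and markup below are invented for illustration; real Odyssey party tables are messier:

```python
import re

def parse_parties(table_html: str) -> dict:
    """Map party-role labels (e.g. "Defendant") to the text of the
    next table cell, instead of indexing cells by position."""
    # Grab every <td>/<th> body, then strip any nested tags.
    cells = re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", table_html, re.S)
    cells = [re.sub(r"<[^>]+>", " ", c).strip() for c in cells]
    roles = {"Defendant", "State", "Plaintiff", "Attorney"}
    parties = {}
    for i, cell in enumerate(cells):
        if cell in roles and i + 1 < len(cells):
            parties.setdefault(cell, cells[i + 1])
    return parties

# Invented sample markup for illustration.
sample = (
    "<table>"
    "<tr><td>Defendant</td><td>DOE, JANE</td></tr>"
    "<tr><td>State</td><td>State of Texas</td></tr>"
    "</table>"
)
parties = parse_parties(sample)
```

Keying off labels means a site that reorders or pads its columns still parses, as long as the labels themselves are stable.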