This is a hodge-podge of bash and python scripting to

- Download a bunch of pdf files from the Illinois HFS website
- Convert the downloaded pdfs to xml format
- Parse the xml files for information we have determined we will use for modeling purposes
The bash scripts will require the command line tool `pdftohtml` (part of the `poppler-utils` package). Install it via your system's package manager, e.g. on Debian:

```bash
sudo apt-get install poppler-utils
```
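
To sanity-check the install, you can run the converter on any pdf by hand; `-xml` is the mode the conversion step relies on (the file names below are just placeholders):

```bash
# Quick check: -xml mode writes sample.xml next to the input pdf.
pdftohtml -xml sample.pdf sample
```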
The python scripts will require an environment with a few non-standard python libraries available. I recommend a conda or pip virtual environment; they can be created using `environment.yml` (conda) or `requirements.txt` (pip). Choose one of the following:
```bash
# conda
conda env create -f environment.yml

# pip
pip install -r requirements.txt
```
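
If you go the pip route, you will probably want to create and activate the virtual environment before installing; a minimal sketch (the `.venv` directory and the conda environment name are placeholders, the real conda name is the `name:` field in `environment.yml`):

```bash
# pip: create and activate a fresh virtual environment, then install
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# conda: activate the environment created above
# (name is a placeholder; check environment.yml)
conda activate long_term_cost_care
```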
The help text for the `long_term_cost_care.sh` script should be sufficient, but just to be clear:

```bash
bash long_term_cost_care.sh [SUBCMD]
```
`get_pdfs` will download (with `wget`) all of the pdfs we will parse. Under the hood, this will:

- call the python `long_term_cost_care.py` script to create a `wget_pdfs.sh` shell script, and
- execute the `wget_pdfs.sh` script, which actually does the downloading (a sketch of the generated script is shown below).
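
For reference, the generated `wget_pdfs.sh` is just a batch of download commands; a minimal sketch of its likely shape (the URLs here are hypothetical placeholders, the real ones are scraped by `long_term_cost_care.py`):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the generated wget_pdfs.sh: one wget call per pdf,
# saved into the download directory. Real URLs come from the HFS website.
mkdir -p data/long_term_cost_care
wget -P data/long_term_cost_care "https://hfs.illinois.gov/path/to/report1.pdf"
wget -P data/long_term_cost_care "https://hfs.illinois.gov/path/to/report2.pdf"
```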
`pdfs_to_xml` will convert all pdfs in the download directory (`data/long_term_cost_care`) into xml (see the loop sketched below).
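
The conversion step amounts to looping `pdftohtml` over the downloaded files; a minimal sketch of an equivalent loop (the `-i` flag, which skips images, is an assumption, not necessarily what the script passes):

```bash
# Convert every downloaded pdf to xml alongside it; pdftohtml's -xml mode
# writes <name>.xml. The -i (ignore images) flag is an assumption here.
for pdf in data/long_term_cost_care/*.pdf; do
    pdftohtml -xml -i "$pdf" "${pdf%.pdf}"
done
```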
`parse_xml` will read all the xml files in `data/long_term_cost_care`, looking for info and writing the output to `./parsed_pdf_info.csv`
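
Putting it all together, a typical end-to-end run looks like:

```bash
# Run the full pipeline: download, convert, then parse into a csv.
bash long_term_cost_care.sh get_pdfs
bash long_term_cost_care.sh pdfs_to_xml
bash long_term_cost_care.sh parse_xml
# Results land in ./parsed_pdf_info.csv
```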