This is a hodge-podge of bash and python scripting to

- Download a bunch of pdf files from the Illinois HFS website
- Convert the downloaded pdfs to xml format
- Parse the xml files for information we have determined we will use for modeling purposes
The bash scripts will require the command line tool `pdftohtml` (part of the `poppler-utils` package). Install it via your system's package manager, e.g. on Debian:

```bash
sudo apt-get install poppler-utils
```
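
To sanity-check the install, you can run the converter on any pdf by hand; `-xml` is the mode the conversion step relies on (the file names below are just placeholders):

```bash
# Quick check: -xml mode writes sample.xml next to the input pdf.
pdftohtml -xml sample.pdf sample
```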
The python scripts will require an environment with a few non-standard python libraries available. I recommend a conda or pip virtual environment; they can be created using `environment.yml` (conda) or `requirements.txt` (pip). Choose one of the following:
```bash
# conda
conda env create -f environment.yml

# pip
pip install -r requirements.txt
```
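
If you go the pip route, you will probably want to create and activate the virtual environment before installing; a minimal sketch (the `.venv` directory and the conda environment name are placeholders, the real conda name is the `name:` field in `environment.yml`):

```bash
# pip: create and activate a fresh virtual environment, then install
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# conda: activate the environment created above
# (name is a placeholder; check environment.yml)
conda activate long_term_cost_care
```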
The help text for the `long_term_cost_care.sh` script should be sufficient, but just to be clear:

```bash
bash long_term_cost_care.sh [SUBCMD]
```
`get_pdfs` will download (with `wget`) all of the pdfs we will parse. Under the hood, this will:

- call the python `long_term_cost_care.py` script to create a `wget_pdfs.sh` shell script, and
- execute the `wget_pdfs.sh` script, which actually does the downloading (a sketch of the generated script is shown below).
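
For reference, the generated `wget_pdfs.sh` is just a batch of download commands; a minimal sketch of its likely shape (the URLs here are hypothetical placeholders, the real ones are scraped by `long_term_cost_care.py`):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the generated wget_pdfs.sh: one wget call per pdf,
# saved into the download directory. Real URLs come from the HFS website.
mkdir -p data/long_term_cost_care
wget -P data/long_term_cost_care "https://hfs.illinois.gov/path/to/report1.pdf"
wget -P data/long_term_cost_care "https://hfs.illinois.gov/path/to/report2.pdf"
```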
`pdfs_to_xml` will convert all pdfs in the download directory (`data/long_term_cost_care`) into xml (see the loop sketched below).
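
The conversion step amounts to looping `pdftohtml` over the downloaded files; a minimal sketch of an equivalent loop (the `-i` flag, which skips images, is an assumption, not necessarily what the script passes):

```bash
# Convert every downloaded pdf to xml alongside it; pdftohtml's -xml mode
# writes <name>.xml. The -i (ignore images) flag is an assumption here.
for pdf in data/long_term_cost_care/*.pdf; do
    pdftohtml -xml -i "$pdf" "${pdf%.pdf}"
done
```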
`parse_xml` will read all the xml files in `data/long_term_cost_care`, looking for info and writing the output to `./parsed_pdf_info.csv`
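
Putting it all together, a typical end-to-end run looks like:

```bash
# Run the full pipeline: download, convert, then parse into a csv.
bash long_term_cost_care.sh get_pdfs
bash long_term_cost_care.sh pdfs_to_xml
bash long_term_cost_care.sh parse_xml
# Results land in ./parsed_pdf_info.csv
```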