
andyras / nursing_home_data_collection


This project forked from open-retirement/nursing_home_data_collection


data collection scripts for the nursing_home_app

Python 67.77% Shell 32.23%

nursing_home_data_collection's Introduction

Long Term Care Cost Report Data Download

This is a hodge-podge of bash and python scripting to

  1. Download a bunch of pdf files from the Illinois HFS website
  2. Convert the downloaded pdfs to xml format
  3. Parse the xml files for information we have determined we will use for modeling purposes.
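Step 3 can be sketched roughly as follows. This is a minimal illustration, not the parser used in this repo: it assumes the XML has the layout that pdftohtml typically emits with its -xml option (page elements containing text elements with top and left attributes), and the sample document, the labels, and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of `pdftohtml -xml` output.
# The labels and values are made up for illustration only.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="200" height="12" font="0">Facility Name</text>
    <text top="100" left="300" width="150" height="12" font="0">EXAMPLE CARE CENTER</text>
    <text top="130" left="50" width="200" height="12" font="0">Licensed Beds</text>
    <text top="130" left="300" width="30" height="12" font="0"><b>120</b></text>
  </page>
</pdf2xml>"""


def extract_text_boxes(xml_string):
    """Flatten every <text> element into a dict with its page, position and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                # itertext() also collects text nested in <b>/<i> children
                "text": "".join(text.itertext()).strip(),
            })
    return boxes


def value_right_of(boxes, label):
    """Return the text of the nearest box to the right of `label` on the same row."""
    for box in boxes:
        if box["text"] != label:
            continue
        same_row = [
            b for b in boxes
            if b["page"] == box["page"]
            and b["top"] == box["top"]
            and b["left"] > box["left"]
        ]
        if same_row:
            return min(same_row, key=lambda b: b["left"])["text"]
    return None
```

Looking values up by position relative to a label is only one possible approach; real cost reports may need tolerance on the top coordinate, since text on the same visual row does not always share an identical value.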

Setup

The bash scripts require the command line tool pdftohtml (part of the poppler-utils package). Install it via your system's package manager, e.g. on Debian:

sudo apt-get install poppler-utils
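To confirm the install worked before running anything, a quick check from Python might look like the sketch below. The function name tool_available is illustrative and not part of this repo.

```python
import shutil


def tool_available(name="pdftohtml"):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if tool_available():
        print("pdftohtml found")
    else:
        print("pdftohtml not found; install poppler-utils first")
```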

The python scripts require an environment with a few third-party python libraries available. I recommend a conda environment or a pip virtual environment; either can be created using environment.yml (conda) or requirements.txt (pip). Choose one of the following:

# conda
conda env create -f environment.yml

# pip
pip install -r requirements.txt

Usage

The help text for the long_term_cost_care.sh script should be sufficient, but just to be clear:

bash long_term_cost_care.sh [SUBCMD]
  1. get_pdfs will download (with wget) all of the pdfs we will parse. Under the hood, this will:
       1. call the python long_term_cost_care.py script to create a wget_pdfs.sh shell script, and
       2. execute the wget_pdfs.sh script (which actually does the downloading).
  2. pdfs_to_xml will convert all pdfs in the download directory (data/long_term_cost_care) into xml
  3. parse_xml will read all the xml files in data/long_term_cost_care looking for info and writing the output to ./parsed_pdf_info.csv
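To make the get_pdfs and pdfs_to_xml steps more concrete, here is a rough sketch of what generating a wget script and building a conversion command could look like. It is an illustration under assumptions, not the code in long_term_cost_care.py or long_term_cost_care.sh: the function names and the example URL are hypothetical, and the real scripts may pass different options to wget and pdftohtml.

```python
import shlex
from pathlib import Path

# Download directory named in the usage notes above.
DOWNLOAD_DIR = "data/long_term_cost_care"


def build_wget_script(urls, download_dir=DOWNLOAD_DIR):
    """Return the text of a shell script with one wget call per URL.

    `wget -P DIR` saves the file under DIR; shlex.quote guards against
    spaces or shell metacharacters in the URL.
    """
    lines = ["#!/usr/bin/env bash"]
    for url in urls:
        lines.append(f"wget -P {shlex.quote(download_dir)} {shlex.quote(url)}")
    return "\n".join(lines) + "\n"


def pdftohtml_command(pdf_path):
    """Build an argument list asking pdftohtml for XML output.

    With -xml, pdftohtml writes <output base>.xml, so the output base here is
    the PDF path without its extension (an assumed naming choice).
    """
    pdf_path = Path(pdf_path)
    return ["pdftohtml", "-xml", str(pdf_path), str(pdf_path.with_suffix(""))]


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    script = build_wget_script(["https://example.com/reports/a.pdf"])
    Path("wget_pdfs.sh").write_text(script)
    print(script)
    print(pdftohtml_command("report.pdf"))
```

The argument list returned by pdftohtml_command can be handed to subprocess.run once per PDF in the download directory.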

nursing_home_data_collection's People

Contributors

rzachlamberty
