Parse a regulation (plain text) into a well-formatted JSON tree (along with associated layers, such as links and definitions). This works hand-in-hand with regulations-site, a front-end for the data structures generated.
- Split regulation into paragraph-level chunks
- Create a tree which defines the hierarchical relationship between these chunks
- Layer for external citations -- links to Acts, Public Law, etc.
- Layer for graphics -- converting image references into federal register urls
- Layer for internal citations -- links between parts of this regulation
- Layer for interpretations -- connecting regulation text to the interpretations associated with it
- Layer for key terms -- pseudo headers for certain paragraphs
- Layer for meta info -- custom data (some pulled from federal notices)
- Layer for paragraph markers -- specifying where the initial paragraph marker begins and ends for each paragraph
- Layer for section-by-section analysis -- associated analyses (from FR notices) with the text they are analyzing
- Layer for table of contents -- a listing of headers
- Layer for terms -- defined terms, including their scope
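The paragraph-level chunks and the tree built from them can be pictured as nested JSON nodes. The sketch below is illustrative, not the authoritative schema -- the field names ("label", "text", "children") are assumptions; inspect the parser's actual output for the real structure.

```python
import json

# Hypothetical sketch of a paragraph-level node in the parsed tree.
# Each child's "label" extends its parent's, which is how the
# hierarchical relationship between chunks is encoded.
node = {
    "label": ["1005", "2", "a"],   # e.g. section 1005.2(a)
    "text": "(a) Example paragraph text ...",
    "children": [
        {
            "label": ["1005", "2", "a", "1"],
            "text": "(1) A nested sub-paragraph ...",
            "children": [],
        }
    ],
}

print(json.dumps(node, indent=2))
```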
- lxml (3.2.0) - Used to parse XML information from the Federal Register
- pyparsing (1.5.7) - Used to do generic parsing on the plain text
- inflection (0.1.2) - Helps determine pluralization (for terms layer)
- requests (1.2.3) - Client library for writing output to an API
Download the source code from GitHub (e.g. git clone [URL])
Make sure the libxml libraries are present. On Ubuntu/Debian, install them via
$ sudo apt-get install libxml2-dev libxslt-dev
$ sudo pip install virtualenvwrapper
$ mkvirtualenv parser
$ cd regulations-parser
$ pip install -r requirements.txt
At the moment, we parse from a plain-text version of the regulation, so such a version must exist. One of the easiest ways to create one is to find your full regulation on e-CFR. For example, CFPB's regulation E.
Once you have your regulation, copy-paste everything from "Part" to the "Back to Top" link at the bottom of the regulation. Next, we need to remove some unhelpful text that e-CFR inserts. Delete lines of the form
- ^Link to an amendment .*$
- Back to Top
Also, delete any table of contents which contains the section character.
Save that file as a text file (e.g. reg.txt).
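The cleanup steps above can also be scripted. The following is a small convenience sketch (the parser does not require it; the same edits can be made by hand in a text editor):

```python
import re

def clean_ecfr_text(raw):
    """Strip e-CFR boilerplate lines from a copy-pasted regulation.

    Removes lines matching "^Link to an amendment .*$" and
    "Back to Top" lines, as described above.
    """
    cleaned = []
    for line in raw.splitlines():
        if re.match(r"^Link to an amendment", line):
            continue
        if line.strip() == "Back to Top":
            continue
        cleaned.append(line)
    return "\n".join(cleaned)
```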
The syntax is
$ python build_from.py regulation.txt title doc_#/version act_title act_section
So, for the regulation we copy-pasted above, we could run
$ python build_from.py reg.txt 12 `date +"%Y%m%d"` 15 1693
This will generate three folders, regulation, notice, and layer in the OUTPUT_DIR (current directory by default).
All of the settings listed in settings.py can be overridden in a local_settings.py file. Current settings include:
- OUTPUT_DIR - a string with the path where the output files should be written. Only useful if the JSON files are to be written to disk.
- API_BASE - a string defining the url root of an API (if the output files are to be written to an API instead)
- META - a dictionary of extra info which will be included in the "meta" layer. Useful fields include "contact_info" (an html string), "effective" (a dictionary with "url":string, "title":string, "date":date-string), and "last_notice" (a dictionary with "url":string, "title":string, "action":string, "published":date-string, "effective":date-string)
- SUBPART_STARTS - a dictionary describing when subparts begin. See settings.py for an example.
- CFR_TITLE - array of CFR Title names (used in the meta layer)
- DEFAULT_IMAGE_URL - string format used in the graphics layer
- IMAGE_OVERRIDES - a dictionary between specific image ids and unique urls for them
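A minimal local_settings.py might look like the following. All values here are illustrative placeholders, not defaults:

```python
# local_settings.py -- overrides applied on top of settings.py
# (every value below is an example, not a recommended default)

OUTPUT_DIR = "/tmp/parser-output/"   # where JSON files are written to disk
API_BASE = ""                        # set to an API root url to write there instead

META = {
    # extra fields merged into the "meta" layer
    "contact_info": "<p>Contact us at ...</p>",
}

IMAGE_OVERRIDES = {
    # map a specific image id to a replacement url
    "ER01JA00.000": "http://example.com/images/ER01JA00.000.png",
}
```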
Unlike our other layers (at the moment), the Keyterms layer is built using XML from the Federal Register rather than plain text. Right now, this is a manual process: retrieve each notice's XML, generate a layer, and merge the results with the existing layer. This is not a problem if the regulation is completely re-issued.
In any event, to generate the layer based on a particular XML, first download that XML (found on federalregister.gov by selecting 'DEV', then 'XML' on a notice). Then, modify the build_tree.py file to point to the correct XML. Running this script will convert the XML into a JSON tree, maintaining some tags that the plain text version does not. Save this JSON to /tmp/xtree.json, then run generate_layers.py.
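To see why the XML version carries more information than plain text, consider pulling emphasized text out of a notice with the standard library's ElementTree. The tag names below follow Federal Register conventions as I understand them (emphasis marked with `<E>` elements) but are assumptions -- check an actual downloaded notice:

```python
import xml.etree.ElementTree as ET

# A toy fragment in the style of Federal Register XML. The <E T="03">
# element marks emphasized text, which is where keyterms typically live;
# plain text loses this markup entirely.
SAMPLE = """<SECTION>
  <P><E T="03">Definition of account.</E> The term means any demand deposit ...</P>
  <P>A plain paragraph with no keyterm.</P>
</SECTION>"""

root = ET.fromstring(SAMPLE)
# Collect the text of every emphasized element
keyterms = [e.text for e in root.iter("E")]
print(keyterms)
```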
The output should be a complete layer; to combine information from multiple rules, simply copy-paste the fields of the newly generated layer into the existing one.
For most tweaks, you will simply need to run the Sphinx documentation builder again.
$ pip install Sphinx
$ cd docs
$ make dirhtml
The output will be in ``docs/_build/dirhtml``.
If you are adding new modules, you may need to re-run the skeleton build script first:
$ pip install Sphinx
$ sphinx-apidoc -F -o docs regparser/
To run the unit tests, make sure you have added all of the testing requirements:
$ pip install -r requirements_test.txt
Then, run nose on all of the available unit tests:
$ nosetests tests/*.py
If you'd like a report of test coverage, use the nose-cov plugin:
$ nosetests --with-cov --cov-report term-missing --cov regparser tests/*.py