Deepform

An experiment to extract information from TV station political advertising disclosure forms using deep learning, and a challenging journalism-relevant dataset for NLP/AI researchers. Original data from ProPublica's Free The Files project.

This model achieves 90% accuracy extracting total spending from the PDFs in the (held-out) test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. I expect it could be made much more accurate through some feature engineering (see below).

For results and discussion, see this talk.

Many thanks to my collaborator Nicholas Bardy of Weights & Biases.

Why?

TV stations are required to disclose their sales of political advertising, but there is no requirement that these disclosures be machine readable. Every election, tens of thousands of PDFs in hundreds of different formats are posted to the FCC Public File, available at https://publicfiles.fcc.gov/.

In 2012, ProPublica ran the Free The Files project (you can read how it worked) and hundreds of volunteers hand-entered information for over 17,000 of these forms. That data drove a bunch of campaign finance coverage and is now available from their data store.

Can we replicate this data extraction using modern deep learning techniques? This project aimed to find out, and successfully extracted the easiest of the fields (total amount) at 90% accuracy using a relatively simple network.

How it works

I settled on a relatively simple design: a fully connected three-layer network trained on 20-token windows of the data. Each token is hashed to an integer mod 500, converted to a 1-hot representation, and embedded into 32 dimensions. This embedding is combined with geometry information (bounding box and page number) and also some hand-crafted "hint" features, such as whether the token matches a regular expression for dollar amounts. For details, see the talk.
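To make the architecture concrete, here is a minimal sketch in Keras of that featurization and network. It is illustrative only, not the project's actual code: the 500-way token hash, 32-dimension embedding, and 20-token window come from the description above, while the layer sizes, the stable hash function, and the single dollar-amount hint are my assumptions.

# Minimal sketch of the design described above (illustrative, not the
# project's actual code). Layer sizes and the hint feature are assumptions.
import re
import zlib

import numpy as np
import tensorflow as tf

VOCAB = 500    # tokens are hashed to an integer mod 500
EMBED = 32     # embedding dimensions
WINDOW = 20    # tokens per training window
EXTRAS = 6     # x0, y0, x1, y1, page, plus one "hint" feature

DOLLAR_RE = re.compile(r"\$?\d{1,3}(?:,\d{3})*\.\d\d")

def token_features(token, bbox, page):
    """Stable hash of the token text, plus geometry and hint features."""
    token_id = zlib.crc32(token.encode("utf8")) % VOCAB
    hint = 1.0 if DOLLAR_RE.fullmatch(token) else 0.0
    return token_id, np.array(list(bbox) + [page, hint], dtype="float32")

# A fully connected three-layer network over a 20-token window.
token_ids = tf.keras.Input(shape=(WINDOW,), dtype="int32")
extras = tf.keras.Input(shape=(WINDOW, EXTRAS))
embedded = tf.keras.layers.Embedding(VOCAB, EMBED)(token_ids)
x = tf.keras.layers.Concatenate()([embedded, extras])
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
# One score per token in the window: how much it resembles the total.
scores = tf.keras.layers.Dense(WINDOW, activation="sigmoid")(x)
model = tf.keras.Model([token_ids, extras], scores)
model.compile(optimizer="adam", loss="binary_crossentropy")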

Although 90% is a good result, it's probably not high enough for production use. However, I believe this approach has lots of room for improvement. The advantage of this type of system is that it can elegantly integrate multiple manual extraction methods, the "hint" features, each of which can be individually crappy. The network actually learns when to trust each method. In ML speak this is "boosting over weak learners."

So the next steps would be something like:

  • Add additional hand-crafted features that signal when a token is the total. These don't have to be individually very accurate; a few illustrative candidates are sketched after this list.
  • Extend the technique to the other fields we wish to extract (advertiser, etc.)
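For instance, hint features might look something like this. These are illustrative guesses, not the project's actual feature set; each is individually unreliable, and the network learns how much weight to give each one:

# Illustrative hint features (assumptions, not the project's actual set).
import re

def looks_like_dollar_amount(token):
    """Matches strings like $1,170.00; weak evidence on its own."""
    return 1.0 if re.fullmatch(r"\$?\d{1,3}(?:,\d{3})*\.\d\d", token) else 0.0

def follows_total_keyword(tokens, i):
    """1.0 if a word like 'total' appears in the few tokens before token i."""
    context = " ".join(tokens[max(0, i - 3):i]).lower()
    return 1.0 if "total" in context else 0.0

def is_largest_amount_on_page(value, page_amounts):
    """Totals are often the largest dollar figure on the page."""
    return 1.0 if page_amounts and value >= max(page_amounts) else 0.0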

How to run

If you wish to reproduce this result, there are multiple steps in the data preparation:

  • The raw data is in source/ftf-all-filings.tsv. This file contains the crowdsourced answers and the PDF url.
  • download-pdfs.py will read this file and download all the PDFs from DocumentCloud. It takes several days. Also, perhaps 10% of these PDFs are no longer on DocumentCloud. In theory they could be re-collected from the FCC.
  • tokenize-pdfs.py will read each PDF and output a list of tokens and their geometry (a sketch of this step follows the list). This also takes several days to run.
  • create-training-data.py reads the PDF tokens and matches them against the original data, outputting only documents where the training data is available. Edit this to control which extracted fields appear in the training data.
  • train.py loads this data, trains a network, and logs the results using Weights & Biases
  • baseline.py is a hand-coded total extractor for comparison, which achieves 61% accuracy.
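As a sketch of the tokenization step (assuming pdfplumber; the actual script may use a different PDF library), each word becomes one row of slug, page, bounding box, and text, matching the training data format described below:

# Sketch of per-word tokenization with pdfplumber (an assumption; the
# actual tokenize-pdfs.py may use a different PDF library).
import csv
import pdfplumber

def tokenize_pdf(pdf_path, slug, writer):
    """Write one CSV row per word: slug, page, x0, y0, x1, y1, token."""
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            for word in page.extract_words():
                writer.writerow([slug, page_number,
                                 word["x0"], word["top"],
                                 word["x1"], word["bottom"],
                                 word["text"]])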

Training data format

The main training data file is data/training.csv, but it's too large to post on GitHub, so you can download it here.

There is data from 9018 labelled documents. It's formatted as "tokens plus geometry" like this:

slug,page,x0,y0,x1,y1,token,gross_amount
473630-116252-0-13442821773323-_-pdf,0,272.613,438.395,301.525,438.439,$275.00,0.62
473630-116252-0-13442821773323-_-pdf,0,410.146,455.811,437.376,455.865,Totals,0.0
473630-116252-0-13442821773323-_-pdf,0,525.84,454.145,530.288,454.189,6,0.0
473630-116252-0-13442821773323-_-pdf,0,556.892,454.145,592.476,454.189,"$1,170.00",1.0
473630-116252-0-13442821773323-_-pdf,0,18.0,480.478,37.998,480.527,Time,0.0
473630-116252-0-13442821773323-_-pdf,0,40.5,480.478,66.51,480.527,Period,0.0

The slug is a unique document identifier, ultimately from the source TSV. The page number runs from 0 to 1, and the bounding box is in the original PDF coordinate system. The actual token text is reproduced as token. The gross_amount represents string similarity to the correct answer in the original gross_amount column, from 0 to 1. To add other ground-truth fields, edit create-training-data.py.
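For example, the 0.62 in the sample above is consistent with a plain difflib ratio after stripping currency punctuation. This is an assumption about how the label might be computed; the project's actual similarity measure may differ:

# Possible match-score computation (an assumption; the project's actual
# similarity measure may differ).
from difflib import SequenceMatcher

def _normalize(s):
    return s.strip("$").replace(",", "")

def match_score(token, answer):
    """String similarity between a token and the crowdsourced answer, 0 to 1."""
    return SequenceMatcher(None, _normalize(token), _normalize(answer)).ratio()

print(match_score("$1,170.00", "1,170.00"))  # 1.0
print(match_score("$275.00", "1,170.00"))    # ~0.62, like the sample row above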

Running with Docker

Note that the training script currently loads the entire training set into memory, and therefore has significant RAM requirements.

docker build -t projectdeepform/deepform .
docker run -m 7g projectdeepform/deepform:latest

A research data set

There is a great deal left to do! For example, we still need to try extracting the other fields, such as advertiser and TV station call sign. This will probably be harder than extracting totals, as it's more difficult to identify tokens that "look like" the correct answer.

There is still more data preparation work to do. We discovered that about 30% of the PDF documents still need OCR; processing them should increase our training data set from 9k to ~17k documents.

But even in its current form, this is a difficult data set that is very relevant to journalism, and improvements in technique will be immediately useful to campaign finance reporting.

The general problem is known as "knowledge base construction" in the research community, and the current state of the art is achieved by multimodal systems such as Fonduer.

I would love to hear from you! Contact me on Twitter or through my blog.


