GithubHelp home page GithubHelp logo

fec-scraper's Introduction

FEC Scraper:
------------
By Christopher Schnaars and Anthony DeBarros, USA TODAY
Developed with Python 2.7.2



Contents:
---------
1 - What is FEC Scraper?
2 - Requirements
3 - What does FEC Scraper do?
4 - What doesn't FEC Scraper do?
5 - How to incorporate a database with FEC Scraper
6 - How to specify committees that should be scraped
7 - Directories



1 - What is FEC Scraper?
------------------------
FEC Scraper is a Python script you can run to find and download campaign finance filings
from the Federal Election Commission website for specified committees, including candidates
for president and the House of Representatives as well as Super PACs and other organizations.

The scraper is most powerful when you take advantage of its ability to interact with a database
manager. The default behavior is for the script to retrieve a list of committees that already exist
in the database as well as all files housed in the database, then go to the FEC website to look for
and scrape new filings for those committees.

The scraper can be used in conjunction with FEC Parser (coming soon) to convert all of the FEC's 
data into neat, delimited text files for easy import into your database.



2 - Requirements
----------------
To run FEC Scraper, you will need to install the mechanize module (http://wwwsearch.sourceforge.net/mechanize/download.html).

If you want to enable functionality that integrates FEC Scraper with a database manager, you'll
also need a library to connect to a database manager of your choice, if you don't
use the SQLite functionality baked into Python. For example:
 * pyodbc (SQL Server, http://code.google.com/p/pyodbc/downloads/list)
 * mySQLdb (mySQL, http://mysql-python.sourceforge.net)
 * psycopg2 (PostgreSQL, http://initd.org/psycopg/download/)



3 - What does FEC Scraper do?
-----------------------------
Every candidate or organization that has to file reports with the FEC is assigned a committee ID
number. You can specify the committee IDs FEC Scraper should use when looking for
new reports. There are two ways to do this.

First, you have the option to specify a database the script can interact with. When you
do this, FEC Parser automatically will download all Form 3 filings for all committees in
the database, minus any filings previously downloaded or imported into the database. (You can
download FECParser.sql for all the scripts you need to create a SQL Server database with all
the necessary tables and stored procedures. You also can use these scripts to develop
a database for a different database manager, such as mySQL, PostgreSQL or Access.)

Second, you can specify a list of committees that you want scraped directly in the script. Even if
you are interacting with a database, you'll need to specify the IDs for any committees that never
have been downloaded previously.

FEC Parser only downloads versions of Form 3, including Forms 3A, 3N, 3PA,
3PN, 3XA and 3XN. This covers monthly and quarterly reports filed by PACs and other
political organizations as well as candidates for President and the House of Representatives
(but not the Senate, where members still engage in the archaic process of filing their reports
on paper to hide contributors and expenditures).

Form 3 includes contributions, expenditures and loans, among other types of information, so it
probably contains everything you need.

As of this writing, the FEC offers data files in both comma and ASCII-28 delimited formats.
Because of inherent problems with .csv files (namely, commas in the data itself), FEC Scraper
uses the ASCII-28 delimited format. This ASCII code is obscure and unlikely to show up in your
data. (The FEC Parser sister script will convert these files to standard tab-delimited files
for easier import into a database manager.)



4 - What doesn't FEC Scraper do?
--------------------------------
FEC Scraper does not download data for committees you don't specify. Additionally, it does
not download data files for any form types other than Form 3.

Additionally, FEC Scraper's use is solely to scrape the FEC site and download the data files.
The script does not do any data processing.



5 - How to incorporate a database with FEC Scraper
---------------------------------------------------
Near the top of the script, in the User Variables section, you can set the usedatabaseflag
variable to a value other than 1 if you don't want the script to integrate with a database.

As written, FEC Parser uses the pyodbc module (http://code.google.com/p/pyodbc/downloads/list)
to communicate with a SQL Server database. If you want to use a different database manager, you
should be able to modify the code pretty easily. Here are a few suggestions:
 * mySQL: MySQLdb module (http://mysql-python.sourceforge.net)
 * PostgreSQL: psycopg2 module (http://initd.org/psycopg/download/) 
 
 
For the default SQL Server code, you will need to input the server and the database (if you've named
it something other than FEC). If necessary, you also can specify a username (UID) and password (PWD).

When the script runs, it fires two stored procedures that create a list of Committee IDs
and data files that already exist in the database. FEC Scraper uses these lists to determine
what committees to scrape and what files do not need to be downloaded.

The FECScraper.sql file contains all the SQL Server scripts you need to create the necessary
tables and stored procedures. You can use this code to create similar objects for another
database manager.



6 - How to specify committees that should be scraped
----------------------------------------------------
Even if you link FEC Scraper to a database, you'll still need to explicitly tell
the script to download files for committees that don't already exist in the database.

You'll find commented out examples of how to explicitly add a committee to the script.
For example: commidlist.append('C00431445')



7 - Directories
---------------
You can specify up to three directories for various files:
 * savedir: This is where scraped files will be saved.
 * reviewdir: This is a directory where you can house files that you have not
imported into your database because the data requires cleanup before it
can be successfully imported. This script uses this directory only to look
for files that previously have been downloaded. No files in this directory
will be altered, and no new files will be saved here.
 * processeddir: Similarly, you can put files you've already imported
into a database in this directory. As with reviewdir, the script will use
this directory only to look for files that previously have been downloaded.
It will not alter files in this directory or save new files here.


fec-scraper's People

Contributors

cschnaars avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.