openoakland / opendisclosure

THIS PROJECT IS UNMAINTAINED - SEE: https://github.com/caciviclab/odca-jekyll AND https://github.com/caciviclab/disclosure-backend-static

Home Page: http://opendisclosure.io/



THIS VERSION IS DEPRECATED. SEE: https://github.com/caciviclab/odca-jekyll

opendisclosure

Overview

The goal of the project is to produce useful visualizations and statistics for Oakland's campaign finance data, starting with the November 2014 mayoral race.

Meeting notes can be found in this Google Doc.

To install the backend in a Vagrant virtual box, follow the instructions here:

Instructions for installing backend in Vagrant

Running Locally

To start, you'll need Ruby installed. On macOS with Homebrew:

brew install rbenv
brew install ruby-build
rbenv install 2.1.2

Then install bundler and foreman:

gem install bundler
gem install foreman

Install postgres:

brew install postgres

# choose one:
# A) to start postgres on startup:
ln -sfv /usr/local/opt/postgresql/*.plist ~/Library/LaunchAgents
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

# B) or, to run postgres in a terminal (you will have to leave it running):
postgres -D /usr/local/var/postgres

Then install the pg gem:

ARCHFLAGS="-arch x86_64" gem install pg

Now you can install the other dependencies with:

bundle install

Create your postgresql user (this may be unnecessary, depending on how postgres is installed):

sudo -u postgres createuser $USER -P
# enter a password you're not terribly worried about sharing
echo "DATABASE_URL=postgres://$USER:[your password]@localhost/postgres" > .env

You should be all set. Run the app like this:

foreman start

Then, to get a local copy of all the data:

bundle exec ruby backend/load_data.rb

Data Source

The raw, original, separated-by-year data can be found on Oakland's "NetFile" site here: http://ssl.netfile.com/pub2/Default.aspx?aid=COAK

We process that data with a nightly ETL job, so this dataset is updated every day (or so) with the latest version of the data. There is a data dictionary explaining what all the columns mean here.

Name mapping

When we aggregate to find top contributors by company and employee, we use a mapping table to correct for spelling errors and different ways of representing the same entity. This is stored in backend/map.csv and gets loaded into the maps table during the data load process.
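For illustration, here is a minimal sketch (not the project's actual code) of how the mapping table folds raw spellings into canonical names during aggregation. It assumes the pg gem and the schema implied by the query in the steps below; the amount column is a guess:

# Sketch: top employers by employee contributions, with raw employer
# strings corrected through the maps table (Emp2 = raw spelling,
# Emp1 = canonical name). "amount" is an assumed column.
require 'pg'

conn = PG.connect(ENV['DATABASE_URL'])
rows = conn.exec(<<-SQL)
  SELECT COALESCE(m.Emp1, c.employer) AS employer,
         SUM(co.amount)               AS total
  FROM contributions co
  JOIN parties c ON co.contributor_id = c.id
  LEFT JOIN maps m ON m.Emp2 = c.employer
  WHERE c.type = 'Party::Individual'
  GROUP BY 1
  ORDER BY total DESC
  LIMIT 5
SQL
rows.each { |r| puts "#{r['employer']}: #{r['total']}" }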

Since there is no easy way to calculate when two entities are the same, updating the maps table requires human intervention. Here are the steps to update the data:

  1. Load the most recent data (see above).
  2. In your favorite Postgres interface, run this query and export the result:

SELECT DISTINCT * FROM (
  SELECT 0, name, name
  FROM parties c, contributions
  WHERE contributor_id = c.id
    AND c.type <> 'Party::Individual'
    AND NOT name = ANY (SELECT Emp2 FROM maps)
  UNION ALL
  SELECT 0, employer, employer
  FROM parties c, contributions
  WHERE contributor_id = c.id
    AND c.type = 'Party::Individual'
    AND NOT employer = ANY (SELECT Emp2 FROM maps)
) s

  3. Load map.csv and this new data into your favorite column-oriented data processing tool, e.g. Excel.
  4. Sort on the Emp1 column.
  5. Search for rows that have 0 in the first column and see if they are equivalent to any nearby entity. If they are, copy the value of Emp1 from that row to this one. If the entity is a union, put "Union" in the type column. In some cases an equivalent entity might not sort nearby, e.g.:
     San Francisco Bay Area Rapid Transit District : BART
     City of Oakland : Oakland, city of
     California Senate : State of CA Senate
  6. Renumber the first column so all the ids are unique. In Excel or equivalent, set the first row to 1 and the second row to =A1+1, then copy that formula to all the other rows (or use the sketch after this list).
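A hypothetical Ruby helper for step 6 (it assumes backend/map.csv has no header row; adjust if it does):

# Renumber the first column of backend/map.csv so every row gets a
# unique id (1, 2, 3, ...).
require 'csv'

rows = CSV.read('backend/map.csv')
rows.each_with_index { |row, i| row[0] = (i + 1).to_s }
CSV.open('backend/map.csv', 'w') { |csv| rows.each { |row| csv << row } }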

Deploying

In order to deploy to production (opendisclosure.io) you will need a couple of things:

  1. A public-private SSH keypair (use the ssh-keygen command to make one)
  2. A Heroku account. Make sure to associate it with your public key (~/.ssh/id_rsa.pub)
  3. Permission for your Heroku account to deploy. You can get this from the current OpenDisclosure maintainers.

Then, you can deploy via git:

# first time setup:
git remote add heroku git@heroku.com:opendisclosure.git

# to deploy:
git checkout master
# highly recommended: run `git log` so you know what will be deployed. When
# ready to deploy, run:
git push heroku master

Make sure to push changes back to this repository as well, so that Heroku and this repository's master branch stay in sync!

Contributors

angelalvarado, bayreporta, ckingbailey, cleishm, daguar, eddietejeda, elinaru, endenizen, evanwolf, ianaroot, jwrobes, kleinlieu, kylew, magshi, mikeubell, polkapolka, spjika, stochastictreat, tdooner, vbrown608, waffle-iron


opendisclosure's Issues

Switch Development to Postgres

When adding queries it will be much easier if we just switch development to match production. The quorum at the 5/27 meeting did not see a problem with this.
I will do the work and update the README as needed.

Integrate a dot graph which shows contributions by corporation, location, and recipient

Hi, I was at the OpenOakland hackathon last night and put together a simple graph that allows visualization of the campaign finance data across a couple dimensions. Each dot represents a donation and has a size and a color. Options include

For size: 1) YTD Donation, and 2) Quarterly donation
For color: 1) City, 2) Recipient, and 3) Company of donor

It was a pretty shameless copy of BART employees in D3, but the attribution is there (since the repo is forked), so I think the author should be cool with it. Unfortunately, that makes the integration into your project a bit trickier because I cannot simply create a pull request.

My repository is here: https://github.com/ted27/oakland-candidates-in-d3
You can see the graph here: http://ted27.github.io/oakland-candidates-in-d3/

Let me know if there is anything I can do to help.

Decide criteria to exclude some candidates from primary display

This person just filed her papers to run for mayor.

If she's serious about the whole thing, cool, but we probably should decide a threshold for who to display in the header on the site.

I propose the rule: "We only show candidates who have filed contribution or expense data. All other candidates will be listed on a separate page."

Get Netfile ETL scripts running on Heroku

Either @ted27 or I will do this -- whoever gets to it first.

Comes from discussion here: #24

Remaining to-do's here are:

  • Either replace unzip with Python code (DG's preference) or get the unzip buildpack working
  • Combine @ted27's Heroku setup work with @daguar's updates
  • Confirm it works, and set up the job
  • Add code to upload to an S3 bucket (pending setup, in @daguar's court; see issue #31)

Data processing backend

Neat project!

It seems like a core need here is a script that can routinely download the data from the web site ( http://ssl.netfile.com/pub2/Default.aspx?aid=COAK&AspxAutoDetectCookieSupport=1 ) and put it in a backend store (any web database will do).

Ideally, this should do an "upsert": running the script (1) updates any changed rows and (2) adds any new rows from the Excel file.

Having a web-accessible database/API that serves the aggregate data wanted here means that any visualizations that request the data from here will always be up-to-date.
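A minimal sketch of that upsert using the pg gem, assuming Postgres 9.5+ (for ON CONFLICT) and an illustrative table keyed by a transaction id; none of these names come from the real schema:

# Sketch: insert a row, or update it if the transaction id already exists.
require 'pg'

def upsert_contribution(conn, tran_id, amount)
  conn.exec_params(<<-SQL, [tran_id, amount])
    INSERT INTO contributions (tran_id, amount)
    VALUES ($1, $2)
    ON CONFLICT (tran_id) DO UPDATE SET amount = EXCLUDED.amount
  SQL
end

conn = PG.connect(ENV['DATABASE_URL'])
upsert_contribution(conn, 'ABC123', 100.00)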

I'll be around tonight, so happy to chat more IRL!

Framework for Independent Expenditures

This is the bread and butter of campaign finance. Although we won't get to see whether there has been any IE spending until the summer, it would be useful to look at old IE spending disclosures to figure out a good framework for visualizing it.

This is the aspect of campaign finance that often gets ignored, yet it is where interest groups spend hundreds of thousands of dollars.

I'll start looking into a framework for these.

Convert Netfile spreadsheets into per-sheet CSVs across all years

This essentially replicates SF's process with a script:

  1. Get zipped files for each year (for Oakland, that's 2011-2013, soon to also have 2014) from the Netfile site
  2. Unzip the files
  3. Take each "schedule" sheet (example: "A-Contributions") from each year's Excel workbook and compile a single CSV per tab
    • Example: "A-Contributions.csv" will include all the rows from the "A-Contributions" sheets in the 2011, 2012, and 2013 Excel files
  4. [EDIT 022814: Changing from S3 to local disk] Store those CSVs on local disk

Then, we can work with City of Oakland IT to get these onto a server within the City where DataSync can nightly upload these to Socrata.
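Under stated assumptions (the roo gem, .xlsx workbooks already unzipped into downloads/, and a header row in each sheet), step 3 might look like this; file, folder, and sheet names are illustrative:

# Compile the "A-Contributions" sheet from every yearly workbook into
# one CSV, keeping the header row from the first workbook only.
require 'roo'
require 'csv'

CSV.open('A-Contributions.csv', 'w') do |out|
  Dir['downloads/*.xlsx'].sort.each_with_index do |path, file_idx|
    sheet = Roo::Excelx.new(path).sheet('A-Contributions')
    sheet.each_row_streaming(pad_cells: true).each_with_index do |row, row_idx|
      next if row_idx.zero? && file_idx.positive? # skip repeated headers
      out << row.map { |cell| cell && cell.value }
    end
  end
end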

Clicking on a contribution from the candidate page errors

The errors below are raised when clicking on a contribution. After that, clicking on the referenced committee leads to a badly formed page. Clicking on the candidate sidebar goes back to the root, not to the candidate page.

Uncaught TypeError: Cannot read property 'replace' of undefined underscore-min.js:1
w.template underscore-min.js:1
(anonymous function) candidateTable.js:3
Uncaught TypeError: undefined is not a function app.js:15
OpenDisclosure.App.Backbone.Router.extend.home app.js:15
OpenDisclosure.App.Backbone.Router.extend.initialize app.js:11
e.Router backbone.js:1221
s backbone.js:1566
(anonymous function) app.js:42
x.Callbacks.c jquery.js:3048
x.Callbacks.p.fireWith jquery.js:3160
x.extend.ready jquery.js:433
q jquery.js:104
Error in event handler for (unknown): Cannot read property 'state' of null
Stack trace: TypeError: Cannot read property 'state' of null
at CSRecorder.onQueryStateCompleted (chrome-extension://cplklnmnlbnpmjogncfgfijoopmnlemp/content_scripts/recorder.js:43:13)
at messageListener (extensions::messaging:340:9)
at Function.propertyNames.forEach.target.(anonymous function) (extensions::SafeBuiltins:19:14)
at EventImpl.dispatchToListener (extensions::event_bindings:395:22)
at Function.propertyNames.forEach.target.(anonymous function) (extensions::SafeBuiltins:19:14)
at Event.$Array.forEach.publicClass.(anonymous function) as dispatchToListener
at EventImpl.dispatch_ (extensions::event_bindings:378:35)
at EventImpl.dispatch (extensions::event_bindings:401:17)
at Function.propertyNames.forEach.target.(anonymous function) (extensions::SafeBuiltins:19:14)
at Event.$Array.forEach.publicClass.(anonymous function) as dispatch extensions::event_bindings:383

Open Data Survey for Mayoral Candidates - post on Open Disclosure?

Hey everybody,
I was thinking about this a few weeks ago, and I think something like this might be interesting to feature on candidate pages:
http://www.azavea.com/blogs/atlas/2014/05/which-candidates-for-governor-of-pennsylvania-support-open-data/

OpenOakland did a candidate open data survey in 2012 and maybe we could recruit more members to join our group to help us take that on as part of our project (since it's related to campaigning). When we or other OpenOakland members administer the survey we could post the results on the individual candidate pages. Take a look at the link above, think about it and we can discuss the idea next Tuesday.

Thanks! Have a great weekend!
Lauren

Jekyll?

What's the plan for breaking apart the header/footers into partials so you don't have to repeat code over and over?

I can easily Jekyll-fy this. Any reasons not to?

Employer name de-aliasing

I spent some time processing the current data to look for various spellings of the same entity in the Tran_Emp field. I then created a mapping from the various spellings to a single entity name. This was then queried to find the top 5 employers with the highest sums of donations from their employees, and compared with the same query over the unmapped data.
Observations:

  1. All 4 of the candidates (who have more than one donation) have blank or N/A in their top 5 donation sources. For two of them it is the top entry, at more than double the next employer. I'm not sure of the law, but this seems to circumvent disclosure.
  2. One or more of retired, unemployed, and self-employed appear in the top 5 for all candidates. Some donors are marked self-employed, others as self-. I have mapped all of these to a single entity for each candidate.
  3. Mapping has some effect on the totals for some entities, but other than self-employed no entity moves very far in the standings. For only two candidates did the top 5 change by one entity (not counting self-employed, retired, and N/A).

I think, given the variability in how the data is entered, mapping could be important for showing which companies' employees spend the most on the election. The current data is too limited to see much effect, however.

Wish List of Data

Whitney Barazoto, Chair of the Public Ethics Commission, asked me to put together a wish list of data we would want converted into machine-readable format. So far, only campaign finance forms are converted and lobbyist info is on the way, but here are ideas to get the ball rolling:

  1. Ask Netfile to provide a public web directory with the raw URLs for the data files (this is a relatively minor change; they did it for SF, as seen here: http://nf4.netfile.com/pub2/excel/SFOBrowsable/ ). This was identified as a high priority by the group.
  2. Ask Netfile to provide individual files for each tab in the spreadsheet that span all years of data (for example, Netfile would provide Form460.csv, which would have all the rows from each of the three "Form 460" tabs contained in the 2011-2013 spreadsheets). This was identified as a high priority by the group.
  3. Machine-readable data of 700 forms (conflict of interest) for city employees, elected/appointed officials and candidates for office.
  4. Machine-readable data of contracts between the city and contractors, including budgeted cost of services, actual cost of service, primary staff contact with contractor, council votes on contract, and relevant dates (such as date of contract vote).
  5. Machine-readable data of any and all expense forms submitted by public employees and elected/appointed officials.
  6. Convert older electoral race campaign finance data into machine-readable form (2014 first to require electronic submission), including candidate committees and independent expenditures.
  7. Machine-readable way to get data on how council members vote on resolutions, contracts, and other votable items.
  8. Machine-readable data on public campaign finances, specifically who received what and how that money was spent.
  9. Machine-readable data on active and terminated political committees going back at least 3 years.
  10. For active political committees, machine-readable data on whether said committees have filed all required forms and renewals (if certain forms need to be renewed) at the local and state level, with dates when those forms were filed.

Pct. Outside Oakland is off by a factor of 100

e.g., Quan shows 0.45%; the real number is 45%.

I believe the fix is:
--- a/assets/js/models.js
+++ b/assets/js/models.js
@@ -18,7 +18,7 @@ OpenDisclosure.Candidate = Backbone.Model.extend({
   },

   friendlyPct : function(float) {
-    return Math.round(float * 100) / 100 + "%";
+    return Math.round(float * 100) + "%";
   }
 });

Cross tab data: Whales, etc.

Calculate various factoids:

  • Whales: large donors to all Oakland campaigns
  • Donors to more than one candidate
  • Donors who are lobbyists
  • Per candidate: individual/corporate/union breakdown, and top company/employer
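As a sketch of one of these factoids, donors to more than one candidate could be pulled with a query along these lines (recipient_id is an assumed column name; the rest follows the maps-table query in the README):

# Sketch: donors who gave to more than one campaign.
require 'pg'

conn = PG.connect(ENV['DATABASE_URL'])
conn.exec(<<-SQL).each { |r| puts "#{r['name']}: #{r['n']} campaigns" }
  SELECT p.name, COUNT(DISTINCT co.recipient_id) AS n
  FROM contributions co
  JOIN parties p ON co.contributor_id = p.id
  GROUP BY p.name
  HAVING COUNT(DISTINCT co.recipient_id) > 1
  ORDER BY n DESC
SQL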

Selecting a license for the campaign finance data

@lla2105 brought this up in this comment on issue #33:

In licensing these datasets on Socrata, I've just been following SF's Creative Commons Universal license because I'm doing Port Jobs for each dataset. But I also notice that Public Domain is an option as well. Why use Creative Commons and not Public Domain?

Lauren, this is your call, but here's a resource from Socrata that might explain it:
http://support.socrata.com/entries/21434472-Which-licensing-option-should-I-use-

There are also a few relevant threads on the Open Data Stack Exchange; this one looks closest:
http://opendata.stackexchange.com/questions/335/categories-and-varieties-of-open-data-licensing

Does NetFile have any license requirements on the data? I imagine "public domain" is best.

Add tests

Anything which allows us to verify the functionality and correctness of the site before it is deployed will allow us to iterate more quickly.
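For example, a minimal smoke test, assuming RSpec and rack-test; the app constant and route are hypothetical:

# Sketch: verify the API serves candidate data before deploying.
require 'json'
require 'rack/test'
require 'rspec'

describe 'the app' do
  include Rack::Test::Methods

  def app
    OpenDisclosure::App # hypothetical constant for the Rack app
  end

  it 'serves the candidate list as JSON' do
    get '/api/candidates' # hypothetical route
    expect(last_response).to be_ok
    expect(JSON.parse(last_response.body)).to be_an(Array)
  end
end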

Try out DataSync with a single CSV

This is a good very first step in getting @lla2105's piece of the backend going.

Basically the idea is to take 1 of the ~20 CSV files output by the netfile-etl scripts, and set up a DataSync job for just that CSV.

The different pieces of this to me are:

  • Create the dataset for the CSV on Socrata
    • Can be a private dataset for this test drive
    • I think this is best done through the web site; from there it should give you a dataset ID that you put into DataSync
    • Probably best to base it off of the SF Ethics Commission's Socrata datasets
  • Create a DataSync scheduled job with the CSV
  • Make a slight change to the CSV file (e.g., delete a row) and run the DataSync job, then check that the dataset on Socrata was indeed updated

(PS, Lauren, you can check the above boxes inside the GitHub issue itself to mark them as done.)

Lauren, I'm sending you a zip file with the CSVs output by the ETL scripts, and you can choose which CSV you want to use for this test.

After this is successful, the next steps will be:

  1. Creating Socrata datasets and DataSync jobs for the other CSVs
  2. Writing a light script that runs on Lauren's PC to download the CSVs to a specified folder from the S3 bucket every night

Lobbyist Data Structure and Integration

Just starting this now for upcoming discussions on how to best format lobbyist data and integrate with campaign finance data. Currently, I'm looking over a draft structure of an Excel file @lla2105 sent me last night.

Hybrid (automated + manual) data cleaning?

Talking with @semerj last night after the meeting (he's also been working with and cleaning the data), his opinion was that because of:

(1) the relatively small number of contributions in each release, and
(2) the nuances of cleaning the data,

there will probably be a need for a manual data-cleaning step with each new release.

In that context, it might make sense to think of a workflow where the software pulls out the new data from a release and alerts a human to clean it; post-cleaning, the cleaner uploads it to a tool that stores it in a "clean" database.

Wanted to open an issue to discuss this -- thoughts?

Love your thoughts in particular, @bayreporta.

Basic data processor for ALL CA Netfile sites

This is something I've heard mentioned as a potential side goal: a simple data processor (downloads plus some very basic structuring and analysis) for California data contained in the proprietary Netfile system would probably be useful all across the state.

You can actually "fuzz" the URL's aid parameter to find other cities' sites; see the sketch below for one way to probe.

This might be an interesting way to extend the impact of this work.
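A hypothetical probe (the aid codes below are placeholders, not verified city codes):

# Sketch: check which Netfile jurisdiction codes respond.
require 'net/http'

%w[COAK SFO BERK].each do |aid| # placeholder codes
  uri = URI("http://ssl.netfile.com/pub2/Default.aspx?aid=#{aid}")
  puts "#{aid}: #{Net::HTTP.get_response(uri).code}"
end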

Summarize Data on the Server

Currently, the client pulls down a complete list of every contribution on initial page load. As the number of contributions grows, this is going to be unsustainable, so we need a better approach: roll up the data on the server and send down only the top-level numbers.

But we've decided to defer this task for a while, since sending the list of all contributions is manageable for now.
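A sketch of that rollup, assuming Sinatra and the pg gem; the route, schema, and column names are illustrative:

# Sketch: serve per-candidate totals instead of every contribution.
require 'json'
require 'pg'
require 'sinatra'

get '/api/summary' do
  conn = PG.connect(ENV['DATABASE_URL']) # one connection per request, for brevity
  rows = conn.exec(<<-SQL).to_a
    SELECT recipient_id, SUM(amount) AS total, COUNT(*) AS contributions
    FROM contributions
    GROUP BY recipient_id
  SQL
  conn.close
  content_type :json
  rows.to_json
end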

layout.haml uses record ids

layout.haml has hardwired record ids for the candidates. These ids could change when the data is reloaded. It should instead use committee_id and translate that to the current record id.
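The suggested fix could look something like this (table and column names are assumptions):

# Sketch: resolve the stable committee_id to the current record id at
# render time instead of hardwiring ids in layout.haml.
def current_candidate_id(conn, committee_id)
  row = conn.exec_params(
    'SELECT id FROM parties WHERE committee_id = $1 LIMIT 1',
    [committee_id]
  ).first
  row && row['id']
end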

Site layout and design -- Seeking Feedback

First draft:
[screenshot: opendisclosure-02c]

Feedback from Whitney Barazoto is in: she doesn't like the stamp look that says OpenDisclosure on the website. I thought she just didn't like the red, but she doesn't like the stamp because she thinks it looks bureaucratic and like the City is trying to hide something. She wants a softer design, possibly one that doesn't even need a tree but perhaps uses an image of a sun, and says OpenDisclosure in a softer (less bureaucratic, whatever that means) font. I'm sorry, I know you've put a lot of time and effort into this, and I apologize for requesting all of these edits!

A script to download Netfile Excel files

Steven (the SF Ethics Commission's data/IT lead) e-mailed to say that they had to make a custom request of Netfile, because the default site (i.e., http://ssl.netfile.com/pub2/Default.aspx?aid=COAK ) uses a JavaScript postback that makes simple scripted downloading (e.g., with curl) of the data difficult.

Here's what he said:

Another thing you'll need to work through is that Netfile links to the annual excel download files for each jurisdiction using a javascript postback. As a result, you can't get the download URLs (unless you know of some creative way of doing that – if so, let me know). So I talked to Netfile and had them expose the download URLs for San Francisco's files so that Jeff could easily grab them with a script.

http://nf4.netfile.com/pub2/excel/SFOBrowsable/

If you want something similar, you’ll need Oakland to request it from Netfile.

I think we might be able to get around this with a full browser simulator (like Selenium or PhantomJS). We could build a simple Heroku job (like a cron job) that pulls the data nightly and stores the files on Amazon S3.

IMHO, relying on Netfile customization should be a last resort, both because it creates a big bottlenecking dependency and because, if we solve this problem for Oakland, we solve it for any other CA city using Netfile (most cities).
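The storage half of that nightly job might look like the following, assuming the aws-sdk-s3 gem, that the zip has already been fetched to a local path, and a placeholder bucket name:

# Sketch: push a fetched Netfile zip to S3 under a dated key.
require 'aws-sdk-s3'
require 'date'

s3 = Aws::S3::Resource.new(region: 'us-west-2')
s3.bucket('opendisclosure-raw-data') # placeholder bucket
  .object("netfile/#{Date.today}.zip")
  .upload_file('/tmp/netfile.zip')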

About page - outline

Hi everyone! I put together a quick outline of the things we would possibly want to cover in the about section. Please take a look and let me know if there's anything else you think the about section should cover, if there's anything that should be removed, etc. Thanks!! :)

ABOUT

  • Group origin: a project of Open Oakland/CfA Brigade
    • Brief intro to CfA (external link)
    • Funding
  • Goals/Mission
  • Volunteers
    • List team members?
    • If you’d like to be involved (meet-up info)
  • Statement of nonpartisanship

METHODOLOGY

  • How are candidates defined/Why aren’t some candidates covered
  • How up-to-date is the data
  • Where does this data come from
  • Why is the data presented the way it is

FAQ

GLOSSARY

CONTACT

  • Contact info (for questions, report broken links or inaccurate data)

Create a graph showing which industries support each candidate

This is Question 2 of 5 from the Public Ethics Commission.

(top 5 industries, the aggregate amount given by each industry to each candidate, and the percentage that this contribution makes up of the committee's entire fundraising for this reporting period)
