openoakland / opendisclosure

THIS PROJECT IS UNMAINTAINED - SEE: https://github.com/caciviclab/odca-jekyll AND https://github.com/caciviclab/disclosure-backend-static

Home Page: http://opendisclosure.io/



THIS VERSION IS DEPRECATED. SEE: https://github.com/caciviclab/odca-jekyll

opendisclosure

Overview

The goal of the project is to produce useful visualizations and statistics for Oakland's campaign finance data, starting with the November 2014 mayoral race.

Meeting notes can be found in this Google Doc.

To install the backend in a Vagrant virtual box, follow the instructions here:

Instructions for installing backend in Vagrant

Running Locally

To start, you'll need Ruby installed. On macOS with Homebrew:

brew install rbenv
brew install ruby-build
rbenv install 2.1.2

Then install bundler and foreman:

gem install bundler
gem install foreman

Install postgres:

brew install postgres

# choose one:
# A) to start postgres on startup:
ln -sfv /usr/local/opt/postgresql/*.plist ~/Library/LaunchAgents
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

# B) or, to run postgres in a terminal (you will have to leave it running):
postgres -D /usr/local/var/postgres

Then install the pg gem:

ARCHFLAGS="-arch x86_64" gem install pg

Now you can install the other dependencies with:

bundle install

Create your postgresql user (this may be unnecessary, depending on how postgres is installed):

sudo -u postgres createuser $USER -P
# enter a password you're not terribly worried about sharing
echo "DATABASE_URL=postgres://$USER:[your password]@localhost/postgres" > .env

You should be all set. Run the app like this:

foreman start

Then, to get a local copy of all the data:

bundle exec ruby backend/load_data.rb

Data Source

The raw, original, separated-by-year data can be found on Oakland's "NetFile" site here: http://ssl.netfile.com/pub2/Default.aspx?aid=COAK

We process that data with a nightly ETL job, so this dataset is updated every day (or so) with the latest version of the data. There is a data dictionary explaining what all the columns mean here.

Name mapping

When we aggregate to find top contributors by company and employee, we use a mapping table to correct for spelling errors and different ways of representing the same entity. This is stored in backend/map.csv and gets loaded into the maps table during the data load process.
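For illustration, here is a minimal sketch (not the project's actual code) of how the mapping table folds raw spellings into canonical names during aggregation. It assumes the pg gem and the schema implied by the query in the steps below; the amount column is a guess:

# Sketch: top employers by employee contributions, with raw employer
# strings corrected through the maps table (Emp2 = raw spelling,
# Emp1 = canonical name). "amount" is an assumed column.
require 'pg'

conn = PG.connect(ENV['DATABASE_URL'])
rows = conn.exec(<<-SQL)
  SELECT COALESCE(m.Emp1, c.employer) AS employer,
         SUM(co.amount)               AS total
  FROM contributions co
  JOIN parties c ON co.contributor_id = c.id
  LEFT JOIN maps m ON m.Emp2 = c.employer
  WHERE c.type = 'Party::Individual'
  GROUP BY 1
  ORDER BY total DESC
  LIMIT 5
SQL
rows.each { |r| puts "#{r['employer']}: #{r['total']}" }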

Since there is no easy way to calculate when two entities are the same, updating the maps table requires human intervention. Here are the steps to update the data:

  1. Load the most recent data (see above).
  2. In your favorite Postgres interface, run this query and export the result:

SELECT DISTINCT * FROM (
  SELECT 0, name, name
  FROM parties c, contributions
  WHERE contributor_id = c.id
    AND c.type <> 'Party::Individual'
    AND NOT name = ANY (SELECT Emp2 FROM maps)
  UNION ALL
  SELECT 0, employer, employer
  FROM parties c, contributions
  WHERE contributor_id = c.id
    AND c.type = 'Party::Individual'
    AND NOT employer = ANY (SELECT Emp2 FROM maps)
) s

  3. Load map.csv and this new data into your favorite column-oriented data processing tool, e.g. Excel.
  4. Sort on the Emp1 column.
  5. Search for rows that have 0 in the first column and see if they are equivalent to any nearby entity. If they are, copy the value of Emp1 from that row to this one. If the entity is a union, put "Union" in the type column. In some cases an equivalent entity might not sort nearby, e.g.:
     San Francisco Bay Area Rapid Transit District : BART
     City of Oakland : Oakland, city of
     California Senate : State of CA Senate
  6. Renumber the first column so all the ids are unique. In Excel or equivalent, set the first row to 1 and the second row to =A1+1, then copy that formula to all the other rows (or use the sketch after this list).
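A hypothetical Ruby helper for step 6 (it assumes backend/map.csv has no header row; adjust if it does):

# Renumber the first column of backend/map.csv so every row gets a
# unique id (1, 2, 3, ...).
require 'csv'

rows = CSV.read('backend/map.csv')
rows.each_with_index { |row, i| row[0] = (i + 1).to_s }
CSV.open('backend/map.csv', 'w') { |csv| rows.each { |row| csv << row } }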

Deploying

In order to deploy to production (opendisclosure.io) you will need a couple of things:

  1. A public-private SSH keypair (use the ssh-keygen command to make one)
  2. A Heroku account. Make sure to associate it with your public key (~/.ssh/id_rsa.pub)
  3. Permission for your Heroku account to deploy. You can get this from the current OpenDisclosure maintainers.

Then, you can deploy via git:

# first time setup:
git remote add heroku git@heroku.com:opendisclosure.git

# to deploy:
git checkout master
# highly recommended: run `git log` so you know what will be deployed. When
# ready to deploy, run:
git push heroku master

Make sure to push changes back to this repository as well, so that Heroku and this repository's master branch stay in sync!

Contributors

angelalvarado, bayreporta, ckingbailey, cleishm, daguar, eddietejeda, elinaru, endenizen, evanwolf, ianaroot, jwrobes, kleinlieu, kylew, magshi, mikeubell, polkapolka, spjika, stochastictreat, tdooner, vbrown608, waffle-iron


opendisclosure's Issues

Switch Development to Postgres

When adding queries it will be much easier if we just switch development to match production. The quorum at the 5/27 meeting did not see a problem with this.
I will do the work and update the README as needed.

Integrate a dot graph which shows contributions by corporation, location, and recipient

Hi, I was at the OpenOakland hackathon last night and put together a simple graph that allows visualization of the campaign finance data across a couple dimensions. Each dot represents a donation and has a size and a color. Options include

For size: 1) YTD Donation, and 2) Quarterly donation
For color: 1) City, 2) Recipient, and 3) Company of donor

It was a pretty shameless copy of BART employees in D3, but the attribution is there (since the repo is forked), so I think the author should be cool with it. Unfortunately, that makes the integration into your project a bit trickier because I cannot simply create a pull request.

My repository is here: https://github.com/ted27/oakland-candidates-in-d3
You can see the graph here: http://ted27.github.io/oakland-candidates-in-d3/

Let me know if there is anything I can do to help.

Decide criteria to exclude some candidates from primary display

This person just filed her papers to run for mayor.

If she's serious about the whole thing, cool, but we probably should decide a threshold for who to display in the header on the site.

I propose the rule: "We only show candidates who have filed contribution or expense data. All other candidates will be listed on a separate page."

Get Netfile ETL scripts running on Heroku

Either @ted27 or I will do this -- whoever gets to it first.

Comes from discussion here: #24

Remaining to-do's here are:

  • Either replace unzip with Python code (DG's preference) or get the unzip buildpack working
  • Combine @ted27's Heroku setup work with @daguar's updates
  • Confirm it works, and set up the job
  • Add code to upload to an S3 bucket (pending setup, in @daguar's court; see issue #31)

Data processing backend

Neat project!

It seems like a core need here is a script that can routinely download the data from the web site ( http://ssl.netfile.com/pub2/Default.aspx?aid=COAK&AspxAutoDetectCookieSupport=1 ) and put it in a backend store (any web database will do).

Ideally, this should do an "upsert": running the script (1) updates any changed rows and (2) adds any new rows from the Excel file.

Having a web-accessible database/API that serves the aggregate data wanted here means that any visualizations that request the data from here will always be up-to-date.
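A minimal sketch of that upsert using the pg gem, assuming Postgres 9.5+ (for ON CONFLICT) and an illustrative table keyed by a transaction id; none of these names come from the real schema:

# Sketch: insert a row, or update it if the transaction id already exists.
require 'pg'

def upsert_contribution(conn, tran_id, amount)
  conn.exec_params(<<-SQL, [tran_id, amount])
    INSERT INTO contributions (tran_id, amount)
    VALUES ($1, $2)
    ON CONFLICT (tran_id) DO UPDATE SET amount = EXCLUDED.amount
  SQL
end

conn = PG.connect(ENV['DATABASE_URL'])
upsert_contribution(conn, 'ABC123', 100.00)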

I'll be around tonight, so happy to chat more IRL!

Framework for Independent Expenditures

This is the bread and butter of campaign finance. Although we won't get to see whether there has been any IE spending until the summer, it would be useful to look at old IE spending disclosures to figure out a good framework for visualizing it.

This is the aspect of campaign finance that often gets ignored, yet it is where interest groups spend hundreds of thousands of dollars.

I'll start looking into a framework for these.

Convert Netfile spreadsheets into per-sheet CSVs across all years

This essentially replicates SF's process with a script:

  1. Get zipped files for each year (for Oakland, that's 2011-2013, soon to also have 2014) from the Netfile site
  2. Unzip the files
  3. Take each "schedule" sheet (example: "A-Contributions") from each year's Excel workbook and compile a single CSV per tab
    • Example: "A-Contributions.csv" will include all the rows from the "A-Contributions" sheets in the 2011, 2012, and 2013 Excel files
  4. [EDIT 022814: Changing from S3 to local disk] Store those CSVs on local disk

Then, we can work with City of Oakland IT to get these onto a server within the City where DataSync can nightly upload these to Socrata.
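Under stated assumptions (the roo gem, .xlsx workbooks already unzipped into downloads/, and a header row in each sheet), step 3 might look like this; file, folder, and sheet names are illustrative:

# Compile the "A-Contributions" sheet from every yearly workbook into
# one CSV, keeping the header row from the first workbook only.
require 'roo'
require 'csv'

CSV.open('A-Contributions.csv', 'w') do |out|
  Dir['downloads/*.xlsx'].sort.each_with_index do |path, file_idx|
    sheet = Roo::Excelx.new(path).sheet('A-Contributions')
    sheet.each_row_streaming(pad_cells: true).each_with_index do |row, row_idx|
      next if row_idx.zero? && file_idx.positive? # skip repeated headers
      out << row.map { |cell| cell && cell.value }
    end
  end
end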

Clicking on a contribution from the candidate page errors

The errors below are raised when clicking on a contribution. After that, clicking on the referenced committee leads to a badly formed page. Clicking on the candidate sidebar goes back to the root, not to the candidate page.

Uncaught TypeError: Cannot read property 'replace' of undefined underscore-min.js:1
w.template underscore-min.js:1
(anonymous function) candidateTable.js:3
Uncaught TypeError: undefined is not a function app.js:15
OpenDisclosure.App.Backbone.Router.extend.home app.js:15
OpenDisclosure.App.Backbone.Router.extend.initialize app.js:11
e.Router backbone.js:1221
s backbone.js:1566
(anonymous function) app.js:42
x.Callbacks.c jquery.js:3048
x.Callbacks.p.fireWith jquery.js:3160
x.extend.ready jquery.js:433
q jquery.js:104
Error in event handler for (unknown): Cannot read property 'state' of null
Stack trace: TypeError: Cannot read property 'state' of null
at CSRecorder.onQueryStateCompleted (chrome-extension://cplklnmnlbnpmjogncfgfijoopmnlemp/content_scripts/recorder.js:43:13)
at messageListener (extensions::messaging:340:9)
at Function.propertyNames.forEach.target.(anonymous function) (extensions::SafeBuiltins:19:14)
at EventImpl.dispatchToListener (extensions::event_bindings:395:22)
at Function.propertyNames.forEach.target.(anonymous function) (extensions::SafeBuiltins:19:14)
at Event.$Array.forEach.publicClass.(anonymous function) as dispatchToListener
at EventImpl.dispatch_ (extensions::event_bindings:378:35)
at EventImpl.dispatch (extensions::event_bindings:401:17)
at Function.propertyNames.forEach.target.(anonymous function) (extensions::SafeBuiltins:19:14)
at Event.$Array.forEach.publicClass.(anonymous function) as dispatch extensions::event_bindings:383

Open Data Survey for Mayoral Candidates - post on Open Disclosure?

Hey everybody,
I was thinking about this a few weeks ago, and I think something like this might be interesting to feature on candidate pages:
http://www.azavea.com/blogs/atlas/2014/05/which-candidates-for-governor-of-pennsylvania-support-open-data/

OpenOakland did a candidate open data survey in 2012 and maybe we could recruit more members to join our group to help us take that on as part of our project (since it's related to campaigning). When we or other OpenOakland members administer the survey we could post the results on the individual candidate pages. Take a look at the link above, think about it and we can discuss the idea next Tuesday.

Thanks! Have a great weekend!
Lauren

Jekyll?

What's the plan for breaking apart the header/footers into partials so you don't have to repeat code over and over?

I can easily Jekyll-fy this. Any reasons not to?

Employer name de-aliasing

I spent some time processing the current data to look for various spellings of the same entity in the Tran_Emp field. I then created a mapping from the various spellings to a single entity name. This was then queried to find the top 5 employers with the highest sums of donations from their employees, and compared with the same query over the unmapped data.
Observations:

  1. All 4 of the candidates (who have more than one donation) have blank or N/A in their top 5 donation sources. For two of them it is the top entry, at more than double the next employer. I'm not sure of the law, but this seems to circumvent disclosure.
  2. One or more of retired, unemployed, and self-employed appear in the top 5 for all candidates. Some donors are marked self-employed, others as self-. I have mapped all of these to a single entity for each candidate.
  3. Mapping has some effect on the totals for some entities, but other than self-employed no entity moves very far in the standings. For only two candidates did the top 5 change by one entity (not counting self-employed, retired, and N/A).

I think, given the variability in how the data is entered, mapping could be important for showing which companies' employees spend the most on the election. The current data is too limited to see much effect, however.

Wish List of Data

Whitney Barazoto, Chair of the Public Ethics Commission, asked me to put together a wish list of data we would want converted into machine-readable format. So far, only campaign finance forms are converted and lobbyist info is on the way, but here are ideas to get the ball rolling:

  1. Ask Netfile to provide a public web directory with the raw URLs for the data files (this is a relatively minor change; they did it for SF, as seen here: http://nf4.netfile.com/pub2/excel/SFOBrowsable/ ). This was identified as a high priority by the group.
  2. Ask Netfile to provide individual files for each tab in the spreadsheet that span all years of data (for example, Netfile would provide Form460.csv, which would have all the rows from each of the three "Form 460" tabs contained in the 2011-2013 spreadsheets). This was identified as a high priority by the group.
  3. Machine-readable data of 700 forms (conflict of interest) for city employees, elected/appointed officials and candidates for office.
  4. Machine-readable data of contracts between the city and contractors, including budgeted cost of services, actual cost of service, primary staff contact with contractor, council votes on contract, and relevant dates (such as date of contract vote).
  5. Machine-readable data of any and all expense forms submitted by public employees and elected/appointed officials.
  6. Convert older electoral race campaign finance data into machine-readable form (2014 first to require electronic submission), including candidate committees and independent expenditures.
  7. Machine-readable way to get data on how council members vote on resolutions, contracts, and other votable items.
  8. Machine-readable data on public campaign finances, specifically who received what and how that money was spent.
  9. Machine-readable data on active and terminated political committees going back at least 3 years.
  10. For active political committees, machine-readable data on whether said committees have filed all required forms and renewals (if certain forms need to be renewed) at the local and state level, with dates when those forms were filed.

Pct. Outside Oakland is off by a factor of 100

e.g., Quan shows 0.45%; the real number is 45%.

I believe the fix is:
--- a/assets/js/models.js
+++ b/assets/js/models.js
@@ -18,7 +18,7 @@ OpenDisclosure.Candidate = Backbone.Model.extend({
   },

   friendlyPct : function(float) {
-    return Math.round(float * 100) / 100 + "%";
+    return Math.round(float * 100) + "%";
   }
 });

Cross tab data: Whales, etc.

Calculate various factoids:

  • Whales: large donors to all Oakland campaigns
  • Donors to more than one candidate
  • Donors who are lobbyists
  • Per candidate: individual/corporate/union breakdown, and top company/employer
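As a sketch of one of these factoids, donors to more than one candidate could be pulled with a query along these lines (recipient_id is an assumed column name; the rest follows the maps-table query in the README):

# Sketch: donors who gave to more than one campaign.
require 'pg'

conn = PG.connect(ENV['DATABASE_URL'])
conn.exec(<<-SQL).each { |r| puts "#{r['name']}: #{r['n']} campaigns" }
  SELECT p.name, COUNT(DISTINCT co.recipient_id) AS n
  FROM contributions co
  JOIN parties p ON co.contributor_id = p.id
  GROUP BY p.name
  HAVING COUNT(DISTINCT co.recipient_id) > 1
  ORDER BY n DESC
SQL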

Selecting a license for the campaign finance data

@lla2105 brought this up in this comment on issue #33:

In licensing these datasets on Socrata, I've just been following SF's Creative Commons Universal license because I'm doing Port Jobs for each dataset. But I also notice that Public Domain is an option as well. Why use Creative Commons and not Public Domain?

Lauren, this is your call, but here's a resource from Socrata that might explain it:
http://support.socrata.com/entries/21434472-Which-licensing-option-should-I-use-

There are also a few relevant threads on the Open Data Stack Exchange; this one looks closest:
http://opendata.stackexchange.com/questions/335/categories-and-varieties-of-open-data-licensing

Does NetFile have any license requirements on the data? I imagine "public domain" is best.

Add tests

Anything which allows us to verify the functionality and correctness of the site before it is deployed will allow us to iterate more quickly.
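For example, a minimal smoke test, assuming RSpec and rack-test; the app constant and route are hypothetical:

# Sketch: verify the API serves candidate data before deploying.
require 'json'
require 'rack/test'
require 'rspec'

describe 'the app' do
  include Rack::Test::Methods

  def app
    OpenDisclosure::App # hypothetical constant for the Rack app
  end

  it 'serves the candidate list as JSON' do
    get '/api/candidates' # hypothetical route
    expect(last_response).to be_ok
    expect(JSON.parse(last_response.body)).to be_an(Array)
  end
end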

Try out DataSync with a single CSV

This is a good very first step in getting @lla2105's piece of the backend going.

Basically the idea is to take 1 of the ~20 CSV files output by the netfile-etl scripts, and set up a DataSync job for just that CSV.

The different pieces of this to me are:

  • Create the dataset for the CSV on Socrata
    • Can be a private dataset for this test drive
    • I think this is best done through the web site; from there it should give you a dataset ID that you put into DataSync
    • Probably best to base it off of the SF Ethics Commission's Socrata datasets
  • Create a DataSync scheduled job with the CSV
  • Make a slight change to the CSV file (e.g., delete a row) and run the DataSync job, then check that the dataset on Socrata was indeed updated

(PS, Lauren, you can check the above boxes inside the GitHub issue itself to mark them as done.)

Lauren, I'm sending you a zip file with the CSVs output by the ETL scripts, and you can choose which CSV you want to use for this test.

After this is successful, the next steps will be:

  1. Creating Socrata datasets and DataSync jobs for the other CSVs
  2. Writing a light script that runs on Lauren's PC to download the CSVs to a specified folder from the S3 bucket every night

Lobbyist Data Structure and Integration

Just starting this now for upcoming discussions on how to best format lobbyist data and integrate with campaign finance data. Currently, I'm looking over a draft structure of an Excel file @lla2105 sent me last night.

Hybrid (automated + manual) data cleaning?

Talking with @semerj last night after the meeting (he's also been working with and cleaning the data), his opinion was that because of:

(1) the relatively small number of contributions in each release, and
(2) the nuances of cleaning the data,

there will probably be a need for a manual data-cleaning step with each new release.

In that context, it might make sense to think of a workflow where the software pulls out the new data from a release and alerts a human to clean it; post-cleaning, the cleaner uploads it to a tool that stores it in a "clean" database.

Wanted to open an issue to discuss this -- thoughts?

Love your thoughts in particular, @bayreporta.

Basic data processor for ALL CA Netfile sites

This is something I've heard mentioned as a potential side goal: a simple data processor (downloads plus some very basic structuring and analysis) for California data contained in the proprietary Netfile system would probably be useful all across the state.

You can actually "fuzz" the URL's aid parameter to find other cities' sites; see the sketch below for one way to probe.

This might be an interesting way to extend the impact of this work.
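A hypothetical probe (the aid codes below are placeholders, not verified city codes):

# Sketch: check which Netfile jurisdiction codes respond.
require 'net/http'

%w[COAK SFO BERK].each do |aid| # placeholder codes
  uri = URI("http://ssl.netfile.com/pub2/Default.aspx?aid=#{aid}")
  puts "#{aid}: #{Net::HTTP.get_response(uri).code}"
end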

Summarize Data on the Server

Currently, the client pulls down a complete list of every contribution on initial page load. As the number of contributions grows, this is going to be unsustainable, so we need a better approach: roll up the data on the server and send down only the top-level numbers.

But we've decided to defer this task for a while, since sending the list of all contributions is manageable for now.
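A sketch of that rollup, assuming Sinatra and the pg gem; the route, schema, and column names are illustrative:

# Sketch: serve per-candidate totals instead of every contribution.
require 'json'
require 'pg'
require 'sinatra'

get '/api/summary' do
  conn = PG.connect(ENV['DATABASE_URL']) # one connection per request, for brevity
  rows = conn.exec(<<-SQL).to_a
    SELECT recipient_id, SUM(amount) AS total, COUNT(*) AS contributions
    FROM contributions
    GROUP BY recipient_id
  SQL
  conn.close
  content_type :json
  rows.to_json
end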

layout.haml uses record ids

layout.haml has hardwired record ids for the candidates. These ids could change when the data is reloaded. It should instead use committee_id and translate that to the current record id.
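The suggested fix could look something like this (table and column names are assumptions):

# Sketch: resolve the stable committee_id to the current record id at
# render time instead of hardwiring ids in layout.haml.
def current_candidate_id(conn, committee_id)
  row = conn.exec_params(
    'SELECT id FROM parties WHERE committee_id = $1 LIMIT 1',
    [committee_id]
  ).first
  row && row['id']
end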

Site layout and design -- Seeking Feedback

First draft:
[screenshot: opendisclosure-02c]

Feedback from Whitney Barazoto is in: she doesn't like the stamp look that says OpenDisclosure on the website. I thought she just didn't like the red, but she doesn't like the stamp because she thinks it looks bureaucratic and like the City is trying to hide something. She wants a softer design, possibly one that doesn't even need a tree but perhaps uses an image of a sun, and says OpenDisclosure in a softer (less bureaucratic, whatever that means) font. I'm sorry, I know you've put a lot of time and effort into this, and I apologize for requesting all of these edits!

A script to download Netfile Excel files

Steven (the SF Ethics Commission's data/IT lead) e-mailed to say that they had to make a custom request of Netfile, because the default site (i.e., http://ssl.netfile.com/pub2/Default.aspx?aid=COAK ) uses a JavaScript postback that makes simple scripted downloading (e.g., with curl) of the data difficult.

Here's what he said:

Another thing you'll need to work through is that Netfile links to the annual excel download files for each jurisdiction using a javascript postback. As a result, you can't get the download URLs (unless you know of some creative way of doing that – if so, let me know). So I talked to Netfile and had them expose the download URLs for San Francisco's files so that Jeff could easily grab them with a script.

http://nf4.netfile.com/pub2/excel/SFOBrowsable/

If you want something similar, you’ll need Oakland to request it from Netfile.

I think we might be able to get around this with a full browser simulator (like Selenium or PhantomJS). We could build a simple Heroku job (like a cron job) that pulls the data nightly and stores the files on Amazon S3.

IMHO, relying on Netfile customization should be a last resort, both because it creates a big bottlenecking dependency and because, if we solve this problem for Oakland, we solve it for any other CA city using Netfile (most cities).
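The storage half of that nightly job might look like the following, assuming the aws-sdk-s3 gem, that the zip has already been fetched to a local path, and a placeholder bucket name:

# Sketch: push a fetched Netfile zip to S3 under a dated key.
require 'aws-sdk-s3'
require 'date'

s3 = Aws::S3::Resource.new(region: 'us-west-2')
s3.bucket('opendisclosure-raw-data') # placeholder bucket
  .object("netfile/#{Date.today}.zip")
  .upload_file('/tmp/netfile.zip')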

About page - outline

Hi everyone! I put together a quick outline of the things we would possibly want to cover in the about section. Please take a look and let me know if there's anything else you think the about section should cover, if there's anything that should be removed, etc. Thanks!! :)

ABOUT

  • Group origin: a project of Open Oakland/CfA Brigade
    • Brief intro to CfA (external link)
    • Funding
  • Goals/Mission
  • Volunteers
    • List team members?
    • If you’d like to be involved (meet-up info)
  • Statement of nonpartisanship

METHODOLOGY

  • How are candidates defined/Why aren’t some candidates covered
  • How up-to-date is the data
  • Where does this data come from
  • Why is the data presented the way it is

FAQ

GLOSSARY

CONTACT

  • Contact info (for questions, report broken links or inaccurate data)

Create a graph showing which industries support each candidate

This is Question 2 of 5 from the Public Ethics Commission.

(top 5 industries, the aggregate amount given by each industry to each candidate, and the percentage that this contribution makes up of the committee's entire fundraising for this reporting period)
