cfpb / qu

License: Creative Commons Zero v1.0 Universal

qu

⚠️ This project was archived on September 25th, 2020 and is no longer maintained ⚠️




qu is a data platform created to serve our public data sets. You can use it to serve your data sets, as well.

The goals of this platform are to:

Developing with Vagrant

If you are using Vagrant, life is clean and easy for you! Go to our Vagrant documentation to get started.

Getting started without Vagrant

Prerequisites

In order to work on qu, you need the following languages and tools installed:

Setup

Front-end assets

Once you have the prerequisites installed and the code downloaded and expanded into a directory (which we will call "qu"), run the following commands:

cd qu
lein deps
npm install -g grunt-cli bower
npm install && bower install
grunt

If you are editing the JavaScript or CSS, run the following to watch those files and make sure your changes are compiled:

grunt watch

You can run grunt to compile the files once.

Vagrant

Start a VM by running vagrant up. Provisioning will take a few minutes.

After a VM is started, you should be able to run vagrant ssh to SSH to the VM. Then run:

cd /vagrant

to change the working directory to the Qu codebase.

Clojure

To start a Clojure REPL to work with the software, run:

lein repl

In order to run the API as a web server, run:

lein run

Go to http://localhost:3000 (or http://localhost:3333 if using Vagrant) and you should see the app running.

Before starting the API, you will want to start MongoDB and load some data into it.

Configuration

All the settings below are shown via environment variables, but they can also be set via Java properties. See [the documentation for environ](https://github.com/weavejester/environ/blob/master/README.md) for more information on how to use Java properties if you prefer.

Configuration file

Besides using environment variables, you can also use a configuration file. This file must contain a Clojure map with your configuration set in it. Unlike with environment variables, where each setting is uppercased and SNAKE_CASED, these settings must be lowercase keywords with dashes, like so:

{ :http-port 8080
  :mongo-host "127.0.0.1" }

In order to use a configuration file, set QU_CONFIG to the file's location, like so:

QU_CONFIG=/etc/qu-conf.clj

Note that the configuration file overrides environment variables.
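Putting it together, a minimal sketch of using a configuration file (the `/tmp/qu-conf.clj` path is an arbitrary illustrative choice, not a required location):

```shell
# Write a minimal config file; /tmp/qu-conf.clj is an illustrative path.
cat > /tmp/qu-conf.clj <<'EOF'
{ :http-port 8080
  :mongo-host "127.0.0.1" }
EOF

# Start the server with that file (requires the qu codebase and Leiningen):
#   QU_CONFIG=/tmp/qu-conf.clj lein run
```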

HTTP server

By default, the server will come up on port 3000 and 4 threads will be allocated to handle requests. The server will be bound to localhost. You can change these settings via environment variables:

HTTP_IP=0.0.0.0
HTTP_PORT=3000
HTTP_THREADS=4

You can also do this in the QU_CONFIG config file:

{ :http-ip "0.0.0.0"
  :http-port 3000
  :http-threads 50 }

MongoDB

In development mode, the application will connect to your local MongoDB server. In production, or if you want to connect to a different Mongo server in dev, you will have to specify the Mongo host and port.

You can do this via setting environment variables:

MONGO_HOST=192.168.21.98
MONGO_PORT=27017

You can also do this in the QU_CONFIG config file:

{ :mongo-host "192.168.21.98"
  :mongo-port 27017 }

If you prefer to connect via a URI, use MONGO_URI.
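The value follows the standard MongoDB connection-string format; a sketch, in which every host, credential, and database name is a placeholder:

```shell
# All values below are placeholders; substitute your own host, port,
# credentials, and database name.
MONGO_URI='mongodb://db_user:s3cr3t@192.168.21.98:27017/metadata'
```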

If you need to connect to several servers to read from multiple replica sets, set specific Mongo options, or authenticate, you will have to set your configuration in a file as specified under QU_CONFIG. Your configuration should look like the following:

{
  ;; General settings
  :http-ip "0.0.0.0"
  :http-port 3000
  :http-threads 50

  ;; Set a vector of vectors, each made up of the IP address and port.
  :mongo-hosts [["127.0.0.1" 27017] ["192.168.1.1" 27017]]
  
  ;; Mongo options should be in a map.
  :mongo-options {:connections-per-host 20
                  :connect-timeout 60}
                  
  ;; Authentication should be a map of database names to vectors containing username and password.
  ;; If you have a user on the admin database with the roles "readWriteAnyDatabase", that user should
  ;; work for running the entire API. To load data, that user needs the roles "clusterAdmin" and
  ;; "dbAdminAnyDatabase" as well.
  ;; If you choose not to have a user on the admin database, you will need a user for every dataset
  ;; and for the "metadata" database.
  :mongo-auth {
    :admin ["admin-user" "s3cr3t"]
    :slicename ["admin-user" "s3cr3t"]
    :metadata ["admin-user" "s3cr3t"]
    :query_cache ["admin-user" "s3cr3t"]}
}

See the Monger documentation for all available Mongo connection options.

StatsD

The application can generate metrics about its execution and send them to StatsD.

Metrics publishing is disabled by default, however. To enable it, provide the StatsD hostname in the configuration file:

{
  :statsd-host "localhost"
  ;; Standard statsd port
  :statsd-port 8125
}

App URL

To control the HREF of the links that are created for data slices, you can set the APP_URL environment variable.

For example, given a slice at /data/a_resource/a_slice, setting the APP_URL variable like so

APP_URL=https://my.data.platform/data-api

will create links such as

"_links": [{"rel": "self", "href": "https://my.data.platform/data-api/data/a_resource/a_slice.json?...."}]

when emitted in JSON, JSONP, XML, and so on.

If the variable is not set, then relative HREFs such as /data/a_resource/a_slice.json are used. This variable is most useful in production hosting situations where an application server is behind a proxy, and you wish to granularly control the HREFs that are created independent of how the application server sees the request URI.

API Name

In order for your API to show a custom name (such as "Spiffy Lube API"), set the API_NAME environment variable. This is probably best set in an external config file.

Loading data

Make sure you have MongoDB started. To load some sample data, run lein repl and enter the following:

(go)
(load-dataset "census") ; Takes quite a while to run; can skip.
(stop)

Testing

To execute the project's tests, run:

lein test

We also have integration tests that run tests against a Mongo database. To run these tests:

lein with-profile integration embongo test

or, even more easily:

lein inttest

Nginx

We recommend serving Qu behind a proxy. Nginx works well for this, and there is a sample configuration file available.
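As a rough reverse-proxy sketch (the hostname, port, and header choices here are illustrative assumptions, not the project's original sample file):

```nginx
server {
    listen 80;
    server_name data.example.gov;  # placeholder hostname

    location / {
        # qu's default HTTP port is 3000
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```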

qu's People

Contributors

boylea avatar cmc333333 avatar cndreisbach avatar contolini avatar eriemeyer avatar exafox avatar jeffreymfarley avatar jroo avatar leahbannon avatar m3brown avatar marcesher avatar mehtadev17 avatar mushketyk avatar orlandosoto avatar scotchester avatar sleitner avatar


qu's Issues

Metrics recording: Only run request timer for valid requests

Currently, we wrap every request with a with-timing call based on the request path, so that we can capture per-slice metrics. However, this has a nasty side effect of also recording metrics for any invalid requests.

We should limit the with-timing calls to only valid, non-400 requests.

CSV Streaming results in incorrect number of rows

http://www.consumerfinance.gov/hmda/explore#!/as_of_year=2012,2011&action_taken=1&respondent_id=22-3887207&select=as_of_year,count&section=summary

Shows 8,966 records.

Clicking to download the full results as CSV would inconsistently produce a spreadsheet with 5 fewer rows. I observed this behavior over the course of an hour or so. Clicking on "labels and codes" would consistently get me a spreadsheet with the right number of rows, and clicking on "labels" would return 8991 probably 90% of the time. Then, eventually, it simply started working consistently as expected.

The only difference is that "labels" passes the full "$select" statement to include just the fields it needs, and "labels and codes" passes an empty $select. I doubt that matters.

@cndreisbach , can you think of anything in the CSV streaming bits that would cause this behavior? Nothing in the logs indicated a problem when I observed this behavior.

Create table definition documentation for downloads

Some people will use existing databases and tools with data downloads. It would be helpful to create a table definition for data downloaded through the HMDA UI. For example, some DDL like the following that people can use (the data types, obviously, are not ideal):

PostgreSQL

-- Table: hmda_lar

-- DROP TABLE hmda_lar;

CREATE TABLE hmda_lar
(
  tract_to_msamd_income character varying(128),
  rate_spread character varying(128),
  population character varying(128),
  minority_population character varying(128),
  number_of_owner_occupied_units character varying(128),
  number_of_1_to_4_family_units character varying(128),
  loan_amount_000s character varying(128),
  hud_median_family_income character varying(128),
  applicant_income_000s character varying(128),
  state_name character varying(128),
  state_abbr character varying(128),
  sequence_number character varying(128),
  respondent_id character varying(128),
  purchaser_type_name character varying(128),
  property_type_name character varying(128),
  preapproval_name character varying(128),
  owner_occupancy_name character varying(128),
  msamd_name character varying(128),
  loan_type_name character varying(128),
  loan_purpose_name character varying(128),
  lien_status_name character varying(128),
  hoepa_status_name character varying(128),
  edit_status_name character varying(128),
  denial_reason_name_3 character varying(128),
  denial_reason_name_2 character varying(128),
  denial_reason_name_1 character varying(128),
  county_name character varying(128),
  co_applicant_sex_name character varying(128),
  co_applicant_race_name_5 character varying(128),
  co_applicant_race_name_4 character varying(128),
  co_applicant_race_name_3 character varying(128),
  co_applicant_race_name_2 character varying(128),
  co_applicant_race_name_1 character varying(128),
  co_applicant_ethnicity_name character varying(128),
  census_tract_number character varying(128),
  as_of_year character varying(128),
  application_date_indicator character varying(128),
  applicant_sex_name character varying(128),
  applicant_race_name_5 character varying(128),
  applicant_race_name_4 character varying(128),
  applicant_race_name_3 character varying(128),
  applicant_race_name_2 character varying(128),
  applicant_race_name_1 character varying(128),
  applicant_ethnicity_name character varying(128),
  agency_name character varying(128),
  agency_abbr character varying(128),
  action_taken_name character varying(128)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE hmda_lar
  OWNER TO postgres;

"getting started" instructions not working

I tried with both cfpb's and Clinton's master branches; I can't get this to work:

(require 'cfpb.qu.loader)
(in-ns 'cfpb.qu.loader)
(mongo/connect!)
(load-dataset "county_taxes")
(load-dataset "census") ; Takes quite a while to run; can skip.
(mongo/disconnect!)

Actual error message is:

user=> (require 'cfpb.qu.loader)
FileNotFoundException Could not locate cfpb/qu/loader__init.class or cfpb/qu/loader.clj on classpath:   clojure.lang.RT.load (RT.java:443)

jre:

java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01-451, mixed mode)

CSV Streaming consumes CPU resources (with Feedback from Experienced Clojure Dev)

So I was hanging out on the geohashing IRC channel and got to chatting with a guy who, it turns out is a Clojure developer. I discussed some of the issues we were having, and he took a look at the code and then actually cloned the repo, installed it on a Vagrant box, ran a profiler, and gave me some feedback. This issue is an attempt to catalog his input.

Downloading the HMDA data

Is there a way to create a torrent to allow downloading the data more effectively for offline analysis? I am trying to download the entire data set available at The Home Mortgage Disclosure Act - Explore.

The data is about 72.3 GB, and there is absolutely no way of downloading it: the download times out, and it fails through a download manager too.

Please advise.

Why does this query return many duplicates?

I'm trying to find the number of home purchase loans originated in 2017 in Palm Beach County, Fla.

Here's a link to my query: https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.html?%24select=county_name%2C+state_code%2C+county_code%2C+census_tract_number%2C+population%2C+loan_purpose%2C+loan_purpose_name&%24where=loan_purpose%3D1+and+action_taken%3D1+and+state_code%3D12+and+county_code%3D99+and+census_tract_number%21%3D%27%27+and+census_tract_number+is+not+null+and+as_of_year%3D2017&%24group=&%24orderBy=census_tract_number&%24offset=0&%24format=html

Many rows have the same census_tract_number. Why's that?

And what query must I run to find the number of home purchase loans originated in 2017 in Palm Beach County, Fla.?

403 error when running `wget`

Hello, I received a 403 error when I ran wget "https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.csv?$select=state_code%2C+county_code%2C+census_tract_number%2C+loan_purpose_name%2C+SUM(population)%2C+COUNT()&$where=as_of_year%3D2016+and+loan_purpose%3D1&$group=state_code%2C+county_code%2C+census_tract_number%2C+loan_purpose_name&$orderBy=state_code%2C+county_code%2C+census_tract_number&$limit=100000&$offset=0".

But when I navigate to that address in my browser, the file downloads correctly. Is there a way to use wget to download CSVs or other files via HMDA API?
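A likely cause here is shell quoting rather than the API itself: inside double quotes, the shell expands `$select`, `$where`, `$group`, `$orderBy`, `$limit`, and `$offset` as (unset, hence empty) variables before wget ever sees the URL, mangling the query string. Single quotes pass the `$` parameters through literally. A minimal demonstration, using a placeholder host:

```shell
# Inside double quotes the shell expands $select and $limit (unset, so empty):
mangled=$(echo "https://api.example.gov/data.csv?$select=state_code&$limit=100")
echo "$mangled"   # → https://api.example.gov/data.csv?=state_code&=100

# Single quotes pass the $ parameters through untouched:
intact=$(echo 'https://api.example.gov/data.csv?$select=state_code&$limit=100')
echo "$intact"    # → https://api.example.gov/data.csv?$select=state_code&$limit=100
```

If the query survives quoting and the server still returns 403, the server may also be filtering on the client's User-Agent header; wget's `--user-agent` option lets you present a browser-like one.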

dead link to nginx config

At the bottom of the README, there is a link to an NGINX configuration file. The link is dead. Is this file elsewhere, or gone?

Problems loading 3 /5 datasets

Hi Team Qu,

This looks like a very interesting project however I am encountering errors when loading data. There are five datasets and I can only get one of them to load, the rest seem to trip ExceptionInfo Value does not match schema: errors:

(require 'cfpb.qu.loader)
(in-ns 'cfpb.qu.loader)
(mongo/connect!)
(load-dataset "county_taxes")         ; ExceptionInfo Value does not match schema: .....
(load-dataset "census")               ; ExceptionInfo Value does not match schema: .....
(load-dataset "consumer_complaints")  ; ExceptionInfo Value does not match schema: .....
(load-dataset "hmda")                 ; Loads
(load-dataset "integration_test")     ; Loads
(mongo/disconnect!)

Improve the API HTML interface

Currently, we have a very rudimentary HTML interface for the API based off Twitter Bootstrap. Ideally, this interface would:

  • Contain links to documentation and tooltips to help write queries
  • Have just a touch of color, customizable if possible
  • Have some semblance of design

Stand up demo server

This has some difficulty, because we don't want to pay for it.

We could use a t1.micro instance at Amazon for a year. Alternatively, we could use Heroku + the free 512 MB level at MongoHQ.

Front end bugs

Given an example query there are a few small pagination-related bugs.

  1. Greyed out buttons are still clickable
  2. Clicking Prev can produce a server error
  3. Setting a limit > 100 only displays 100 records per page
  • The pagination at the bottom is calculated based on the actual limit, not 100. If we decide 100 should be the cap for html, we need to update pagination accordingly.

Create a Packer script for Qu

We want to give people the ability to quickly stand up an instance of Qu on AWS, VirtualBox, Docker, DigitalOcean, or whatever. I recommend using Packer to define this.

Confused about HMDA API

Hey everyone,I've been reading the docs for the CFPB's Home Mortgage Disclosure Act's API, but I'm confused about how to send requests for certain data types. For example, what type of call do i send when requesting the number of home refi loans for each county? And do I need an API key?

CSV export fails for specific respondent_id and state_code

CSV export on api.consumerfinance.gov fails for specific where attributes, resulting in Failed - Network error (EOF). It appears to be just one respondent_id and state_code for 2015 data, but I've replicated the issue in both Explore the data (Include labels and codes) and the API html interface (Output Format CSV).

(as_of_year=2015 AND state_code=17 AND respondent_id="0000817824")

Failing API Call
https://api.consumerfinance.gov:443/data/hmda/slice/hmda_lar.csv?$where=(as_of_year%3D2015+AND+state_code%3D17+AND+respondent_id%3D%220000817824%22)&$limit=0&$offset=0

Respondent ID Master Listing

First, I should preface that I am not a programmer but use the CFPB HMDA LAR download tool regularly. Second, if this question has already been addressed, my apologies. Lastly, if I used the incorrect forum to post the question, again my apologies; this is the first time I have ever accessed this site.

Question: The HMDA download includes a column titled "Respondent ID" I'm aware of a tool to search individual Respondent IDs (https://www.ffiec.gov/hmdaadwebreport/diswelcome.aspx) but is anyone aware of a master listing available so I can do a VLookup in Excel and merge the data?

Offset limit; 400 error

Aggregation worker dies; only fixable by Qu restart

  1. Start an aggregation
  2. Kill mongos
  3. Restart mongos

Actual behavior: The worker dies, because its error handler itself throws an error when trying to reset the job, since it cannot communicate with Mongo.

When this happens, that Qu instance's aggregation worker is effectively dead and will not process additional aggregations. This results in a buildup of aggregations until that Qu instance is restarted.

Expected behavior:

  1. The error handler should not throw an error, or, if it does, the worker should be restarted
  2. The worker's currently processing job at the time of failure should be restarted when a) the worker is restarted and b) it is able to communicate with Mongo again

BUG: Trailing slash on explore page resulting in bare 404

http://www.consumerfinance.gov/hmda and http://www.consumerfinance.gov/hmda/ both resolve [the former redirects to the latter].

http://www.consumerfinance.gov/hmda/explore resolves, but http://www.consumerfinance.gov/hmda/explore/ resolves to a pretty bare 404 page.

We should be consistent in our URL resolution conventions. Pretty much the whole site works like the /hmda/ page, where the version without a trailing slash redirects to the version with one. It seems like the thing to do is to extend that to the /explore pages.

Querying for non-HTML download retrieves only 100 rows, even when I don't specify a limit

500 error on auto-generated bad URI

When going to the following URL:

http://localhost:3000/data/census/population_raw?$orderBy=&$select=&origin=&origin=%3C%3C%3C%3C%3C%3C%3C%3C%3C%3Cfoo\%22bar\'204%3E%3E%3E%3E%3E&race=&$callback=Lk47REf9&$limit=100&$group=&sex=&$where=&state=&$offset=0&$format=html

we get the error: java.lang.IllegalArgumentException: No implementation of method: :mutate-query of protocol: #'clojurewerkz.urly.core/Mutation found for class: nil. Obviously, this is a bad URL and we shouldn't respond well to it, but we should avoid giving up specific error messages from deep in our system as well.

Consider ToroDB

I haven't used it myself, but ToroDB looks like it'd solve some of the pain points I remember from trying to model data via MongoDB. In particular, it'd avoid the field-name hashing scheme, reduce disk use, and give you a path towards a tabular data store (which seems appropriate for the types of queries Qu performs).

api.consumerfinance.gov: JS error on IE8

url

https://api.consumerfinance.gov

Actual Behavior

User Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.3; MS-RTC LM 8; .NET4.0C; .NET4.0E)
Timestamp: Tue, 11 Feb 2014 20:46:55 UTC

Message: Object doesn't support this property or method
Line: 6
Char: 7715
Code: 0
URI: https://api.consumerfinance.gov/static/js/data-api.min.js

Expected Behavior

No JS error

Steps to Reproduce

Visit URL

/cc @cndreisbach @contolini
