After the print(" Ready to load") statement in the main() function of load_data.py, we want to send the data stored in the newly created clean.csv file to the database. Then we want to update or add the appropriate row of the Postgres 'manifest' table to reflect that file's status in the database.
Load clean.csv into table
This assumes all necessary tests have passed (e.g. do_fields_match), but double-checking deal-breaker items doesn't hurt.
Things the object will need passed to it:
'manifest_row', 'meta' (memory version of meta.json), and 'csv_filename' (the path to the clean.csv file created by Cleaner), 'overwrite' (a boolean indicating whether preexisting entries of the manifest_row['unique_data_id'] should be deleted if they exist).
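For illustration, the four inputs above might look like the following. All values here are hypothetical examples, not real project data, and the exact shape of meta.json is an assumption:

```python
# Hypothetical example values for the four inputs the object needs.
manifest_row = {
    'unique_data_id': 'example_data_2017',   # hypothetical id, one per source file
    'destination_table': 'example_table',    # which SQL table the data goes into
}
meta = {'example_table': {'fields': []}}     # in-memory copy of meta.json (shape assumed)
csv_filename = 'example_clean.csv'           # path to the clean.csv produced by Cleaner
overwrite = False                            # delete preexisting rows for this id?
```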
- Use the passed 'manifest_row' dictionary's 'destination_table' key to decide which SQL table to put the data in.
- Check if the table already exists. If not, create it using the meta.json expected field names and data types.
- If the table does exist, check whether there are any records corresponding to manifest_row['unique_data_id']. If not, do nothing. If so and overwrite is True, drop those rows (not all rows). If so and overwrite is False, raise an error for load_data to handle.
- Use a passed string of the file path to decide which .csv file should be loaded (this will be provided by load_data.py)
- The Cleaner should have already added a new column to the clean.csv file called 'unique_data_id' and every row should match the value of manifest_row['unique_data_id']. Verify these match, raise an error if not.
- Use the 'COPY FROM' command in postgres, using a sqlAlchemy connection object, to copy the whole file in one command into the corresponding database table. This should append to the existing table, leaving any already existing entries in place.
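The loading steps above could be sketched roughly as below. This is a hedged sketch, not a settled implementation: function names are placeholders, the raw-SQL style assumes SQLAlchemy 1.x with a psycopg2 driver, and error handling is simplified.

```python
import csv

def verify_unique_data_id(csv_filename, expected_id):
    """Check that every row of clean.csv carries the manifest's unique_data_id."""
    with open(csv_filename, newline='') as f:
        for row in csv.DictReader(f):
            if row['unique_data_id'] != expected_id:
                raise ValueError("unique_data_id mismatch: {} != {}".format(
                    row['unique_data_id'], expected_id))

def load_csv(engine, manifest_row, csv_filename, overwrite=False):
    """Sketch: append clean.csv into the destination table via COPY FROM."""
    table = manifest_row['destination_table']
    uid = manifest_row['unique_data_id']
    verify_unique_data_id(csv_filename, uid)
    with engine.begin() as conn:  # SQLAlchemy 1.x-style connection
        existing = conn.execute(
            "SELECT 1 FROM {} WHERE unique_data_id = %s LIMIT 1".format(table),
            (uid,)).fetchone()
        if existing and not overwrite:
            raise RuntimeError("{} already loaded; pass overwrite=True".format(uid))
        if existing:
            # Drop only this unique_data_id's rows, never the whole table
            conn.execute("DELETE FROM {} WHERE unique_data_id = %s".format(table),
                         (uid,))
        # COPY appends, leaving rows for other unique_data_ids in place
        raw = conn.connection  # underlying psycopg2 connection
        with open(csv_filename) as f, raw.cursor() as cur:
            cur.copy_expert("COPY {} FROM STDIN WITH CSV HEADER".format(table), f)
```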
After the load of the new file is complete, we want to update the SQL manifest table so that it accurately reflects which files are currently in the database.
- If the 'manifest' table does not exist in SQL, create it.
- Use the passed 'manifest_row' (a dictionary) to provide the data for a single row of the manifest corresponding to this table.
- If there is already a row corresponding to this unique_data_id, delete the row first (or update it to match the new manifest_row values).
- In SQL, the manifest table has exactly the same data as manifest.csv, plus one additional column called 'status', which reflects whether that file has been loaded into the database. Use the passed 'status' value to populate it.
This function is called immediately after trying to load the clean.csv file. If the loading process for clean.csv raises an error, the status provided to this function will be 'error'; if it succeeds, it will be 'loaded'.
There are a few options for how to structure this functionality:
- Object oriented, with a root 'HISql' class (i.e. HousingInsightsSql) that is extended twice into 'ManifestSql' and 'DataSql' classes. This approach is consistent with how we currently do DataReader.
- Object oriented, with one big object that handles all of this relatively automatically. Might be easier for an end user if we abstract away the updating of the manifest; conversely, most users of this code should learn how we handle the manifest, so it might be better to make those methods transparent.
- Functional. An OK approach if whoever handles this is not object-oriented savvy, but we are probably better off doing as much as we can object oriented.
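The first option could be skeletoned as below. Method names and the constructor signature are placeholders, not a decided API; the point is just the HISql root class extended into DataSql and ManifestSql, mirroring the DataReader pattern:

```python
class HISql:
    """Root class: shared Postgres helpers (connections, table-existence checks)."""
    def __init__(self, meta, manifest_row, engine):
        self.meta = meta                  # in-memory meta.json
        self.manifest_row = manifest_row  # one row of manifest.csv, as a dict
        self.engine = engine              # SQLAlchemy engine

    def table_exists(self, table_name):
        # e.g. query information_schema.tables; left unimplemented in this sketch
        raise NotImplementedError

class DataSql(HISql):
    """Loads clean.csv into manifest_row['destination_table'] via COPY FROM."""
    def load_csv(self, csv_filename, overwrite=False):
        raise NotImplementedError

class ManifestSql(HISql):
    """Keeps the SQL manifest table in sync, including the added status column."""
    def update_row(self, status):
        raise NotImplementedError
```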
Note: there is currently a get_sql_manifest_row function in load_data.py. If we use an object-oriented approach, this function should be folded into the appropriate class, and load_data.main() should call that version instead (the call is currently located at logging.info("Preparing to load row {} from the manifest")).