
benefice's Introduction

benefice

A database of the built environment in Chicago using open data, forked from Edifice

Installation and dependencies

  • PostgreSQL (9.0.x or later; 9.1.x+ preferred)
  • PostGIS (2.0.x or later)
  • Python (2.7.x or later)

Using pip:

pip install -r requirements.txt

Using easy_install:

easy_install wget
easy_install psycopg2
easy_install PyYAML

Copy over the example setup.cfg:

cp setup.cfg.example setup.cfg

Using setup_benefice.py

setup_benefice.py is used to recreate the benefice database on a system with a PostgreSQL database installed (with PostGIS 2.0.x+ support).

Drop and recreate the base_postgis template database from scratch, using the 'postgres' admin user.

python setup_benefice.py --create_template

Drop and recreate the benefice database structure from scratch, using the 'benefice' user.

python setup_benefice.py --create

Download (~165 MB), unzip, and import City of Chicago data into the benefice database. [NOTE: WORK IN PROGRESS]

python setup_benefice.py --data

Optional flags:

  • --bindir [DIRNAME]: specify the location of PostgreSQL binaries such as pg_config, psql, etc.
  • --user [USERNAME]: use a username other than 'benefice' as the owner of the main database.
  • --database [DBNAME]: use a name other than 'benefice' for the main database.
  • --delete_downloads: delete downloaded zip and csv files after import
  • --help: provide usage info
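
For example, to rebuild the main database under a different name and owner, pointing at a non-default set of PostgreSQL binaries (the path and names here are illustrative):

python setup_benefice.py --create --user myuser --database mybenefice --bindir /usr/lib/postgresql/9.1/bin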

Data Sources

Google Doc of data sources we are using

QGIS and TileMill

Once you are done setting up your Benefice database, you can use the following tools (including psql) to explore the datasets.
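
For a quick sanity check from psql (assuming the default 'benefice' database and user):

psql -U benefice -d benefice -c 'SELECT PostGIS_Full_Version();'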

QGIS is a free, open-source GIS application that can connect directly to a PostGIS database and display and analyze geographic data.

TileMill is a map-design studio that can also connect directly to a PostGIS datastore and create interactive web maps using OpenStreetMap as the base layer.

benefice's People

Contributors

cmollet, derekeder, fgregg, tplagge

Forkers

arbylee dhvanika

benefice's Issues

Address matching algorithm

Determine a quick, accurate way of matching addresses across multiple datasets.

This library seems like a potentially useful start: https://github.com/jjensenmike/python-streetaddress

From Forest:
From what I remember reading in this area, there is no better approach than using a gazetteer (if available). For Chicago, we know all the street names and their address ranges. https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx

Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address.

That would standardize the street name, and often the direction.

If we had a source of trusted addresses or finer-resolution address ranges (maybe the building footprints?), then matching against that gazetteer is the best way to go.

For comparing the similarity of a query address to a target address I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe.

This is more flexible and will tend to be much more accurate than regexp or similar tricks.
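
A minimal sketch of the gazetteer approach, using Python's standard difflib (SequenceMatcher similarity) as a stand-in for a true Levenshtein or affine-gap distance; the CSV filename and the 'Street' column name are guesses to be checked against the actual export:

import csv
import difflib

def load_gazetteer(path):
    # one standardized street name per row of the Chicago street names export
    with open(path) as f:
        return [row['Street'].strip().upper() for row in csv.DictReader(f)]

def standardize_street(query, gazetteer):
    # return the closest gazetteer entry, or None if nothing clears the cutoff
    matches = difflib.get_close_matches(query.strip().upper(), gazetteer,
                                        n=1, cutoff=0.8)
    return matches[0] if matches else None

# e.g. standardize_street('MICHIGN', load_gazetteer('Chicago_Street_Names.csv'))
# should come back as 'MICHIGAN'

Swapping in python-Levenshtein or dedupe's affine-gap distance for get_close_matches would be a drop-in change to standardize_street.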

Detecting modified files on the data portal (for setup_edifice.py)

This seems like a useful thing to do. While figuring out how to store and provide views on temporal change from non-temporal datasets in the edifice database is our own problem, for those users who merely want an up-to-date dataset, it would be nice to not have to re-download everything every night.

Any strategies for this? wget --spider will return the ultimately resolved URL and the file length without downloading the file. It seems plausible that a changed file on the data portal might also resolve to a URL with a new string — i.e. when I do:

wget --spider --no-check-certificate -O 'City Boundary.zip' http://data.cityofchicago.org/download/q38j-zgre/application/zip

That gets resolved to https://data.cityofchicago.org/api/file_data/9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs?filename=City%2520Boundary.zip .

I'm guessing that maybe when a new zip file gets put up there, that long string "9OVgki_a-MytpymEU2LRxpx0fsvbAE6MmYS8iDWm4xs" will be changed. Can anyone confirm this?

(The file length, 120943 bytes, is also displayed when you use wget --spider, but file length alone is obviously an insufficient criterion for detecting data modification.)

We verified last night that for CSV files the re-resolved long string mentioned above is not present, so we can't use that as a way to detect new versions.


One alternative approach may be to use the API to find the date/time when the file was updated, i.e. 'updated_at' in the Socrata SODA API. Has anyone used this successfully before?

The question then is how best to store locally the dates/times when the client last pulled down a given dataset. I could see an argument for making this a special table, but it may be better as a local flat file, since the main script is also proficient at completely dropping your database.
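
A rough sketch of the metadata idea (untested): Socrata's views endpoint exposes a rowsUpdatedAt epoch timestamp, which could be cached in a local YAML flat file (PyYAML is already a dependency). The field name and the cache filename here are assumptions to verify against a live response:

import json
import urllib2
import yaml

METADATA_URL = 'https://data.cityofchicago.org/api/views/%s.json'

def rows_updated_at(dataset_id):
    # dataset_id is the Socrata 4x4 id, e.g. 'q38j-zgre'
    meta = json.load(urllib2.urlopen(METADATA_URL % dataset_id))
    return meta.get('rowsUpdatedAt')  # Unix epoch seconds

def needs_download(dataset_id, cache_path='last_downloaded.yaml'):
    try:
        with open(cache_path) as f:
            cache = yaml.safe_load(f) or {}
    except IOError:
        cache = {}
    current = rows_updated_at(dataset_id)
    if cache.get(dataset_id) == current:
        return False  # unchanged since the last pull
    cache[dataset_id] = current
    with open(cache_path, 'w') as f:
        yaml.safe_dump(cache, f)
    return True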

setup script assumes a postgres user exists

I'm not sure what privileges are expected for the user, so I just created a superuser for now. I used the following steps:

psql -d postgres
create role postgres;
alter role postgres SUPERUSER;

Without the postgres user, you'll just see a bunch of 'psql: FATAL: role "postgres" does not exist' errors when running the script.
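
Equivalently, in one step from the shell with the createuser utility that ships with PostgreSQL (run it as whatever superuser your initdb created):

createuser --superuser postgres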

implement CSV import function

Implement a function (sketched after this list) that can:

  • point to a flat file on the data portal
  • download it as a csv
  • create a new table based on the data fields
  • import the raw data into PostgreSQL
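
A rough sketch of such a function (untested), assuming psycopg2, an all-TEXT first pass, and a plain CSV export URL; column typing and column-name collisions are left for later:

import csv
import io
import re
import urllib2
import psycopg2

def import_csv(url, table, conn):
    raw = urllib2.urlopen(url).read()           # download it as a csv
    header = next(csv.reader(io.BytesIO(raw)))
    # create a new table based on the data fields (all TEXT for a first pass)
    cols = [re.sub(r'\W+', '_', h.strip().lower()) for h in header]
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS %s' % table)
    cur.execute('CREATE TABLE %s (%s)' % (table,
                ', '.join(c + ' TEXT' for c in cols)))
    # bulk-load the rows into PostgreSQL with COPY
    cur.copy_expert('COPY %s FROM STDIN WITH CSV HEADER' % table,
                    io.BytesIO(raw))
    conn.commit()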

add friendly instructions for installing postGIS

Based on what environment people have, give friendly instructions for:

  • Mac (Homebrew or MacPorts or postgis?)
  • Linux (Ubuntu)
  • Windows

I ultimately wound up installing PostGIS 2.0.2 from source — is there a better way?

On my slightly decrepit MacBook (OS X 10.6.8) the whole path for getting Edifice-ready was:

$ sudo port install postgresql92-server

Then, to initialize a default database with Unicode (i.e. template0 and template1, I think):

$ sudo su postgres -c '/opt/local/lib/postgresql92/bin/initdb --locale=en_US.UTF-8 --encoding=UNICODE -D /opt/local/var/db/postgresql92/defaultdb'

Finally, to get the local Postgres database to start and stop on boot:

$ sudo launchctl load -w /Library/LaunchDaemons/org.macports.postgresql92-server.plist 

Then my successful compile of the PostGIS 2.0.2 source code looked like this:

~/src/postgis-2.0.2$ ./configure --with-geosconfig=/Library/Frameworks/GEOS.framework/unix/bin/geos-config  --with-projdir='/Library/Frameworks/PROJ.framework//Versions/4/unix' --with-pgconfig=/opt/local/lib/postgresql92/bin/pg_config
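
That configure step is then followed by the usual source-build steps (standard autotools, nothing PostGIS-specific):

~/src/postgis-2.0.2$ make && sudo make install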

This was after installing the GEOS and PROJ frameworks from the http://www.kyngchaos.com/software/frameworks page (very handy, especially if you want to get QGIS working).

If you have taken notes on other workflows, please post them and we can be as generic and/or specific as possible in the docs.
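
One data point for the Linux row above: on Ubuntu the stock packages may be all you need, though depending on the release PostGIS 2.0 may require the UbuntuGIS PPA (the package names below are from memory and vary by release):

sudo apt-get install postgresql-9.1 postgresql-9.1-postgis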
