This project forked from dnc1994/blightfight

Repo for Capstone Project of Data Science at Scale course offered by University of Washington on Coursera.

BlightFight

Final Report

Average Blight Risk Visualization

Task

Work with real data collected in Detroit to help urban planners predict blight (the deterioration and decay of buildings and older areas of large cities, due to neglect, crime, or lack of economic support).

Approach

Step 1: Establish a list of all the buildings with their spatial extents.

Done

  1. Filter NAs and invalid coordinates (outside the bounds of Detroit)
  2. Extract latitude/longitude pairs and addresses (as raw text) from 4 files
  3. Concatenate them into one data frame
  4. Clean up the address field (extract numbers, drop symbols, normalize spelling, expand abbreviations, etc.)
  5. Cluster geolocations by fuzzy matching on address field and incident proximities (eps = 0.000075).
  6. Represent each building with a rectangle centered at average coordinates.
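
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the bounding box for Detroit, the abbreviation table, the similarity threshold of 0.9, the rectangle half-width, and all function names are assumptions. Only eps = 0.000075 comes from the list above. Fuzzy matching uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever matcher the project actually used.

```python
import re
from difflib import SequenceMatcher

# Assumed rough bounding box for Detroit (illustrative values only).
LAT_MIN, LAT_MAX = 42.25, 42.46
LON_MIN, LON_MAX = -83.30, -82.90
EPS = 0.000075  # proximity threshold in degrees, taken from the README

# Hypothetical abbreviation table; a real one would be larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard"}


def in_detroit(lat, lon):
    """Return True for non-missing coordinates inside the assumed bounds."""
    if lat is None or lon is None:
        return False
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def clean_address(raw):
    """Lowercase, drop symbols, and expand common abbreviations."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def same_building(a, b, min_ratio=0.9):
    """Two incidents match if they are close AND their addresses are similar."""
    near = abs(a["lat"] - b["lat"]) <= EPS and abs(a["lon"] - b["lon"]) <= EPS
    ratio = SequenceMatcher(None, a["addr"], b["addr"]).ratio()
    return near and ratio >= min_ratio


def cluster_incidents(incidents):
    """Greedy clustering: compare each incident with the first member of
    every existing cluster and join the first cluster that matches."""
    clusters = []
    for inc in incidents:
        for cluster in clusters:
            if same_building(cluster[0], inc):
                cluster.append(inc)
                break
        else:
            clusters.append([inc])
    return clusters


def building_rectangle(cluster, half_width=EPS):
    """Rectangle (min_lat, min_lon, max_lat, max_lon) centered at the mean point."""
    lat = sum(i["lat"] for i in cluster) / len(cluster)
    lon = sum(i["lon"] for i in cluster) / len(cluster)
    return (lat - half_width, lon - half_width, lat + half_width, lon + half_width)


raw = [
    (42.3601, -83.0701, "123 Main St."),
    (42.36012, -83.07008, "123 MAIN STREET"),
    (42.4000, -83.1000, "9 Oak Ave"),
    (0.0, 0.0, "1 Nowhere Rd"),  # invalid coordinates, filtered out
]
incidents = [
    {"lat": lat, "lon": lon, "addr": clean_address(addr)}
    for lat, lon, addr in raw
    if in_detroit(lat, lon)
]
buildings = [building_rectangle(c) for c in cluster_incidents(incidents)]
```

The greedy pass is quadratic in the worst case; on the full data set some form of spatial indexing would probably be needed.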

Tried

  • DBSCAN based on coordinates alone: did not give good results.
  • DBSCAN based on a combination of coordinates and address fields: impossible without rewriting the algorithm, because of the way feature distances are computed.
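
To make the second bullet concrete: a distance that mixes coordinates with address text does not fit naturally into a numeric feature matrix, which is what a standard DBSCAN call expects. The sketch below builds such a mixed pairwise distance in plain Python. The weight `alpha`, the scaling by `eps`, and the function names are hypothetical. As a possible workaround that the project did not use, scikit-learn's `DBSCAN` accepts `metric="precomputed"`, so a matrix like this one could in principle be passed to it, at the cost of O(n²) memory.

```python
from difflib import SequenceMatcher


def mixed_distance(a, b, eps=0.000075, alpha=0.5):
    """Blend geographic and textual dissimilarity into one number.

    a, b: (lat, lon, address) tuples. alpha weights the address term.
    """
    # Chebyshev distance in degrees, scaled so that eps maps to 1.0.
    geo = max(abs(a[0] - b[0]), abs(a[1] - b[1])) / eps
    # 0.0 for identical strings, approaching 1.0 for unrelated ones.
    text = 1.0 - SequenceMatcher(None, a[2], b[2]).ratio()
    return (1 - alpha) * geo + alpha * text


def distance_matrix(points):
    """Full symmetric pairwise matrix (O(n^2) memory)."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = mixed_distance(points[i], points[j])
    return matrix


points = [
    (42.36010, -83.07010, "123 main street"),
    (42.36012, -83.07008, "123 main street"),
    (42.40000, -83.10000, "9 oak avenue"),
]
D = distance_matrix(points)
```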

Step 2: Generate a balanced data set for training and testing

Done

  1. Map demolition permits to buildings, derive positive labels.
  2. Randomly sample an equal number of buildings with negative labels.
  3. Concatenate them into a "training" set.
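
A minimal sketch of the balancing steps above, assuming the permits have already been matched to building IDs. The function name, the fixed seed, and the toy IDs are illustrative. `random.Random.sample` raises `ValueError` if there are fewer negative candidates than positives, which is assumed not to happen here.

```python
import random


def build_balanced_set(building_ids, demolished_ids, seed=0):
    """Return (building_id, label) pairs with equal positives and negatives."""
    demolished = set(demolished_ids)
    positives = [b for b in building_ids if b in demolished]
    negatives_pool = [b for b in building_ids if b not in demolished]
    rng = random.Random(seed)  # seeded for reproducibility
    negatives = rng.sample(negatives_pool, len(positives))
    labeled = [(b, 1) for b in positives] + [(b, 0) for b in negatives]
    rng.shuffle(labeled)
    return labeled


building_ids = list(range(100))
demolished_ids = [3, 17, 42, 58, 91]  # buildings matched to a demolition permit
balanced = build_balanced_set(building_ids, demolished_ids)
```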

Note

This "training" set will later be divided into a (real) training set and a validation set. In this task it does not make much sense to use the remaining data as a "testing" set (at least not in the traditional sense), because the remaining buildings are all ones that are not on the demolition list, and there is no way to determine their true labels. So this part is a bit like semi-supervised learning: I'll evaluate the model on the validation set and use the remaining data for visualization and drawing conclusions. This is also what the task requires.
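
The split described in this note could look like the sketch below, which keeps the class ratio the same in both parts. The 20% validation fraction, the seed, and the function name are assumptions; the README does not state how the split was made.

```python
import random


def stratified_split(labeled, validation_fraction=0.2, seed=0):
    """Split (item, label) pairs, preserving the class ratio in both parts."""
    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted({lab for _, lab in labeled}):
        group = [pair for pair in labeled if pair[1] == label]
        rng.shuffle(group)
        cut = int(round(len(group) * validation_fraction))
        validation.extend(group[:cut])
        train.extend(group[cut:])
    return train, validation


# Toy balanced set: 50 positive and 50 negative buildings.
labeled = [(i, 1) for i in range(50)] + [(i, 0) for i in range(50, 100)]
train, validation = stratified_split(labeled)
```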

Step 3: Develop a naive model and evaluate its performance.

I believe it's OK to jump right to Step 4.

Step 4: Feature engineering.

Done

  1. Derive features from violations.csv, calls.csv and crimes.csv. Basically counts of one-hot-encoded categorical variables.
  2. Examine feature importance using random forest. Got a ~0.83 AUC score on OOB data.
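
Summing one-hot-encoded categories per building is the same as counting incidents per (building, category) pair. The sketch below shows that idea in plain Python. The category names and the `(building_id, category)` input shape are made up for illustration, since the actual columns of violations.csv, calls.csv and crimes.csv are not shown here.

```python
from collections import Counter, defaultdict


def count_features(incidents, prefix):
    """Count incidents per (building, category); one column per category.

    incidents: iterable of (building_id, category) pairs.
    Returns {building_id: Counter({"<prefix>_<category>": count, ...})}.
    """
    counts = defaultdict(Counter)
    for building_id, category in incidents:
        counts[building_id][f"{prefix}_{category}"] += 1
    return counts


def to_matrix(per_building, building_ids):
    """Turn nested counts into a dense row-per-building matrix."""
    columns = sorted({c for feats in per_building.values() for c in feats})
    rows = [[per_building.get(b, {}).get(c, 0) for c in columns] for b in building_ids]
    return columns, rows


violations = [(1, "weeds"), (1, "weeds"), (1, "debris"), (2, "debris")]
crimes = [(2, "larceny"), (3, "assault")]

features = count_features(violations, "violation")
for building_id, feats in count_features(crimes, "crime").items():
    features[building_id].update(feats)  # Counter.update adds counts

columns, X = to_matrix(features, building_ids=[1, 2, 3])
```

A matrix like `X` is the kind of input a random forest or gradient-boosting model would then be trained on.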

Note

Counts of violations and crimes are the simplest yet most important features. I had not even included a decaying propagation effect of bad incidents.
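
The decaying propagation effect mentioned in this note was not implemented. One hypothetical form, shown only to make the idea concrete, weights each nearby incident by an exponential function of its distance to the building. The length scale and the use of plain Euclidean distance in degrees are assumptions.

```python
import math


def decayed_incident_score(building, incidents, scale=0.001):
    """Sum of exp(-distance / scale) over incidents; nearer incidents count more."""
    lat0, lon0 = building
    score = 0.0
    for lat, lon in incidents:
        distance = math.hypot(lat - lat0, lon - lon0)
        score += math.exp(-distance / scale)
    return score


building = (42.3600, -83.0700)
incidents = [
    (42.3600, -83.0700),  # at the building: weight 1
    (42.3610, -83.0700),  # one length scale away: weight about exp(-1)
    (42.4000, -83.1000),  # far away: weight close to 0
]
score = decayed_incident_score(building, incidents)
```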

Step 5: Develop a more advanced model.

  1. Trained an XGBoost model and got a ~0.85 AUC score on OOB data.
  2. Simplified the model and still got a ~0.849 AUC score.
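
The AUC figures above measure how often the model ranks a blighted building above a non-blighted one. A minimal pure-Python sketch of that metric follows, assuming binary labels and real-valued scores, with ties counted as half. In practice a library routine such as scikit-learn's `roc_auc_score` would be used; the labels and scores here are made up.

```python
def auc_score(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_score(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```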

Step 6: Evaluation and drawing conclusions.

Present a summary with some visualizations.

  1. Explain the model.
  2. Make a choropleth map of blight risks on out-of-sample data.
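
A choropleth needs one value per region. The sketch below averages predicted risk over square grid cells as a stand-in for whatever regions the final report used (the regions are not specified here). The cell size, the function name, and the toy predictions are assumptions.

```python
import math
from collections import defaultdict


def average_risk_by_cell(predictions, cell_size=0.01):
    """Average predicted risk per square grid cell.

    predictions: iterable of (lat, lon, risk) with risk in [0, 1].
    Returns {(row, col): mean risk}; each cell is cell_size degrees wide.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [risk sum, count]
    for lat, lon, risk in predictions:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        totals[cell][0] += risk
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


predictions = [
    (42.3601, -83.0701, 0.8),
    (42.3605, -83.0709, 0.6),  # same cell as the first point
    (42.4002, -83.1003, 0.1),
]
cell_risk = average_risk_by_cell(predictions)
```

Each cell's mean can then be mapped to a color by any plotting library.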

Author

Linghao Zhang

License

MIT license
