tsdataclinic / newerhoods

A Data Clinic project that aggregates NYC Open Data at the tract-level and uses Machine Learning techniques to re-imagine neighborhood boundaries.

License: Apache License 2.0




NewerHoods

New York City’s (NYC’s) neighborhoods are a driving force in the lives of New Yorkers—their identities are closely intertwined and a source of pride. However, the history and evolution of NYC’s neighborhoods don’t follow the rigid, cold lines of statistical and administrative boundaries. Instead, the neighborhoods we live and work in are the result of a more organic confluence of factors.

Data Clinic developed NewerHoods with the goal of helping individuals and organizations better advocate for their communities by enabling them to tailor insights to meet their specific needs. NewerHoods is an interactive web-app that uses open data to generate localized features at the census tract-level and machine learning to create homogeneous clusters. Users are able to select characteristics of interest (currently open data on housing, crime, and 311 complaints), visualize NewerHood clusters on an interactive map, find similar neighborhoods, and compare them against existing administrative boundaries. The tool is designed to enable users without in-depth data expertise to compare and incorporate these redefined neighborhoods into their work and life.

The application is live and available to use here.

Getting Started

The steps below will help you set up and run NewerHoods locally. Since this is an RShiny application, install RStudio on your machine from here if you haven't already.

Directory Structure

newerhoods/clean_data contains just the cleaned/transformed data sets used directly by the Shiny App.

/src contains all the code to merge and clean the data sets, extract features from it, and cluster the features.

/newerhoods contains the code for the RShiny WebApp.

Running the App

First, the R environment needs to be set up with all the necessary packages.

source("newerhoods/setup.R")

The project uses several APIs: those provided by the NYC Developer Portal for loading data, and Mapbox for the underlying map visualization in the Shiny App. All of these tokens are free to obtain by signing up here and here. Follow the instructions in the settings.R file, found in the newerhoods folder, and source the local version of the file to store all the tokens in the environment. You will need to source this settings file every time you start a new session.

Note: If you intend to run only the RShiny App, filling in just the MapBox API Token would suffice.

source("newerhoods/settings_local.R")
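A local settings file might look roughly like the sketch below. The variable name MAPBOX_ACCESS_TOKEN and the helper require_token are illustrative assumptions, not the names used by the project; follow settings.R for the actual names.

```r
# Hypothetical contents of a local settings file. The environment variable
# name below is an assumption for illustration; settings.R defines the real one.
Sys.setenv(MAPBOX_ACCESS_TOKEN = "your-mapbox-token")

# Hypothetical helper: fail early with a readable message if a token is missing.
require_token <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name,
         " is not set; source your local settings file first.")
  }
  invisible(value)
}

require_token("MAPBOX_ACCESS_TOKEN")
```

Checking for the token before calling runApp gives a clear error at startup instead of a blank map later.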

Run the App

library(shiny)
runApp("newerhoods")

Alternatively, you can run the application in Docker. To build the Docker image, run

docker build -t newerhoods .

Then, to run the Docker container, run

docker run -it --rm -p 3000:3000 -v $(pwd):/app newerhoods

Any changes you make in the code should trigger an application reload, so refreshing your browser should be all you need to do to see them.

Contributing to NewerHoods

We invite feedback on the tool and encourage users to contribute. Check out this page for some ways in which you can contribute. To contact Data Clinic about NewerHoods, please email us at [email protected].

Data Sources

References

License

This project is licensed under the Apache 2.0 License - see the LICENSE.md file for details

newerhoods's People


hdoshi2 kaushik12 lejit illesial ts-santind lilysu wq1973

newerhoods's Issues

Download-data: fix bookmarking link

Bookmark link doesn't work properly right now. Seems like the global setting isn't getting picked up properly.

Performance improvements

Look into shinytest to generate test cases for the app and automate testing.
Look into shinyloadtest to check how the app performs under different loads and see where improvements can be made.
Watch the Shiny webinar and resources on the golem package for Shiny app development.

UI/UX: Display Optimal clusters

Optimal clusters for the chosen data are computed internally but not yet displayed in the UI. Modifying the slider seems a bit hard. Need to explore alternative solutions.

Needs:
Display the optimal number of clusters (4 options for different ranges [0-50; 50-100; 100-150; 150-200])
Allow user to click on these to automatically set number of clusters and recompute

Ideal Solution:
Display little pointers above/below the slider that a user can click on to set the slider to the corresponding value.
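One way to derive the four pointer values is to pick the best-scoring number of clusters within each range. The sketch below assumes a quality score per candidate k is already available (higher is better); how the app computes that score is not stated here, so best_k_per_range and the sample scores are hypothetical.

```r
# Pick the best-scoring number of clusters within each range.
# `k` are candidate cluster counts and `score` their (hypothetical) quality
# scores, where higher is better.
best_k_per_range <- function(k, score, breaks = c(0, 50, 100, 150, 200)) {
  # Ranges are [0,50], (50,100], (100,150], (150,200].
  bins <- cut(k, breaks = breaks, include.lowest = TRUE)
  # Ranges with no candidates come back as NA.
  tapply(seq_along(k), bins, function(idx) k[idx][which.max(score[idx])])
}

# Made-up example scores for six candidate values of k.
candidate_k <- c(10, 40, 60, 90, 120, 180)
candidate_score <- c(0.2, 0.5, 0.7, 0.3, 0.4, 0.1)
picks <- best_k_per_range(candidate_k, candidate_score)
```

Each element of picks could then back one clickable pointer on the slider.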

Insights

What do insights from the clustering look like?
Can we provide users with succinct insights from the results?

Ideas:

Display where in the distribution the cluster statistic lies (i.e. percentiles in the labels)
Display trends (is this going up or down) [Harder to do with uploaded data]
Summary reports (% overlap with different boundaries)
Highest matching cluster-administrative unit pair
Reports for a Community District
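Two of the ideas above can be sketched in base R. The function names, the sample values, and the use of tract counts (rather than area) to measure overlap are all assumptions made for illustration.

```r
# Percentile of each cluster's statistic within the distribution of all
# cluster statistics, suitable for showing in a label.
percentile_rank <- function(values) {
  round(100 * ecdf(values)(values))
}

# Share (in %) of each cluster's tracts that fall in each administrative unit.
# Overlap is measured by tract counts here, which is a simplifying assumption.
overlap_share <- function(cluster, admin_unit) {
  counts <- table(cluster = cluster, admin_unit = admin_unit)
  round(100 * prop.table(counts, margin = 1), 1)
}

# Made-up example data.
cluster_means <- c(120, 300, 180, 240)
ranks <- percentile_rank(cluster_means)

tract_cluster <- c(1, 1, 1, 2)
tract_admin <- c("X", "X", "Y", "Y")
shares <- overlap_share(tract_cluster, tract_admin)
```

The highest value in each row of shares points to the best-matching cluster and administrative unit pair.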

Use Case: Finding your neighborhood

A simple use case of this tool could be to allow a user to find a region based on the data and their inputs. This would obviously get richer with more datasets integrated. For example, a user can select a set of criteria (range of housing prices, access to schools, parks etc.) and their importance (maybe?) that would result in "areas" (clusters) that best match the criteria.

Upload-Data: UI Modal Done and close button

These are two separate buttons that need to ideally be one.

Features from Facilities Database

Want to integrate features from the Facilities database shapefile on NYC Open Data. These are points or shapes which can't just be joined into a tract to get local features.

Modify 311 complaint features

Change 311 complaint features to be broad categories (housing-related, noise-related, etc.).
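A lookup table is one simple way to do this grouping. The complaint types and category names below are illustrative assumptions, not the project's actual mapping, and categorize_complaints is a hypothetical helper.

```r
# Hypothetical mapping from detailed 311 complaint types to broad categories.
category_lookup <- c(
  "HEAT/HOT WATER" = "Housing",
  "PLUMBING" = "Housing",
  "Noise - Residential" = "Noise",
  "Noise - Street/Sidewalk" = "Noise"
)

# Map each complaint type to its broad category; unmapped types become "Other".
categorize_complaints <- function(complaint_type, lookup = category_lookup) {
  category <- unname(lookup[complaint_type])
  category[is.na(category)] <- "Other"
  category
}

# Made-up complaints, then counts per tract and broad category.
complaints <- data.frame(
  tract = c("T1", "T1", "T2", "T2"),
  complaint_type = c("PLUMBING", "Noise - Residential", "HEAT/HOT WATER", "Graffiti"),
  stringsAsFactors = FALSE
)
complaints$category <- categorize_complaints(complaints$complaint_type)
category_counts <- table(tract = complaints$tract, category = complaints$category)
```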

Upload-data: Tooltips/how to for upload

Add tooltips for info on geographic identifier requirements.
Add a general "How to" with screenshots for an overview of the process.
Perhaps add sample CSVs for people to download.

Upload-data: Column selection

Ensure row numbers are excluded
Remove a column from feature selection if selected as geographic identifier
Basic version of autofill for geographic identifier

Fix debounce / cancel

Should only send one request at a time. Cancel previous fetch when starting a new one.

Real-estate price growth features

Currently, we only use mean and sd of real-estate prices. Growth in prices over time would be easy to add and could add some value.
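As a rough sketch of such a feature, the compound annual growth rate of the mean sale price can be computed per tract. The column names tract, year, and price and the function price_growth are assumptions; the real sales data may be shaped differently.

```r
# Compound annual growth rate of the mean sale price per tract, between the
# first and last year with data. Column names are hypothetical.
price_growth <- function(sales) {
  yearly <- aggregate(price ~ tract + year, data = sales, FUN = mean)
  by_tract <- split(yearly, yearly$tract)
  growth <- vapply(by_tract, function(d) {
    d <- d[order(d$year), ]
    n_years <- d$year[nrow(d)] - d$year[1]
    # Growth is undefined when a tract has sales in a single year only.
    if (n_years == 0) return(NA_real_)
    (d$price[nrow(d)] / d$price[1])^(1 / n_years) - 1
  }, numeric(1))
  data.frame(tract = names(growth), annual_growth = unname(growth),
             stringsAsFactors = FALSE)
}

# Made-up sales records.
sales <- data.frame(
  tract = c("A", "A", "A", "B"),
  year = c(2013, 2013, 2017, 2013),
  price = c(90, 110, 200, 100),
  stringsAsFactors = FALSE
)
growth_features <- price_growth(sales)
```

Tracts with too little data come back as NA, so they would need the same missing-value handling as the existing mean and sd features.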

Download cluster statistics

Allow user to download a csv of census tracts, cluster labels, and cluster averages/statistics.

Recommend number of clusters

Is your feature request related to a problem? Please describe.
Currently, the choice of number of clusters is entirely up to the user. Ideally, we would like to recommend a few options for the number of clusters based on the data.

Describe the solution you'd like
A set of choices for the number of clusters in different ranges (0-50, 50-100, etc.) available for the user to select from.

Download-data: UI

The button should ideally move to the top-right of the map. Not sure if this is possible. If not, @YuanyuanMaggie feel free to modify the design as appropriate.

Major Update: Migrate to React from RShiny

Not an immediate need, but the general plan is to try to migrate the project to React.

Why?

Greater control and reduced dependency on RShiny modules & shinyapps.io
Process user data on client side vs. server side
Align with overall Data Clinic plan for app development

Upload-data: Support for reading xls/xlsx files

Add support for reading in Excel files.

UI: Add social links in header

Links to Medium, GitHub and Twitter in the header, mirroring the Data Clinic website.

Upload-Data: allow no feature column when using lat-lon identifier

When uploading a file with lat-lon identifiers, users need not necessarily select a feature column, as we can aggregate counts and generate rates. Currently, the code won't work without a feature column.

Missing data?

There seem to be a number of areas with missing data where there shouldn't be any. Just checking that nothing is being dropped in the data pipeline. Or are these just areas with a lack of residential housing?

Upload-data: User feature inputs

This is currently a dropdown input which is different from the checkbox inputs. I think the code would work similarly if the input is simply changed to CheckboxInput.

Upload-data: User feature names

Change display names of features generated from user uploaded data

Refresh Housing data

Housing data currently covers 2013-2017. Annualized sales data for 2018 has been published since. Add it to the data and streamline the data pre-processing. Look into including Rolling Sales data, which updates more regularly.

Features based on Zoning classifications

Use the NYC zoning map to build features based on the zones. Could be used to find neighborhoods that are residential/commercial/balanced etc.

Download/Share map

Allow users to download/share the results of the clustering. shinyurl can be used to create custom-links.

Major Update: expand to other cities with Open Data

A good second city would be Chicago. Wealth of Open Data and interesting socio-economic structure of the city. Working on this would also be a good time to think through the best data structure to make the process of including more cities easier and possibly migrating the data to an actual database.

Some files don't upload, throwing a "stopped prematurely" error

Describe the bug
Upload stops prematurely when uploading certain files.

Additional context
RStudio Support confirmed that this is a known issue and has to do with file encoding. UTF-8 files are less likely to fail. Alternatively, uploading a zip file and unzipping it on the server side could be a solution.
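Since encoding is implicated, a file could be checked, and if needed converted, before upload. This is only a sketch: is_utf8_file and to_utf8 are hypothetical helpers, and the source encoding "latin1" is an assumption that will not hold for every file.

```r
# Check whether every line of a text file is valid UTF-8.
is_utf8_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Re-encode a file to UTF-8. The source encoding is an assumption and must
# match how the file was actually saved.
to_utf8 <- function(path, out_path, from = "latin1") {
  lines <- readLines(path, warn = FALSE)
  converted <- iconv(lines, from = from, to = "UTF-8")
  writeLines(converted, out_path, useBytes = TRUE)
  invisible(out_path)
}
```

This does not address the zip upload alternative mentioned above; it only reduces the chance of hitting the encoding problem.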

Improve Help/How tos

Is your feature request related to a problem? Please describe.
Currently, a user doesn't have clear info on how to use the App. We assume the UI is straightforward enough for users to figure out. This may not be the case as we add more features.

Describe the solution you'd like
A help flow which takes a user through each input/control and highlights the process and purpose of each. rintrojs allows us to do this for a shinyapp. Need to implement.

Finding similar neighborhoods

Want to show neighborhoods similar to each other. Things to address before that:

What definition of similarity?
How can we quantify this?
How do we present this information in a way that is easy to interpret?
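One candidate definition, offered only as a starting point, is Euclidean distance between standardized feature vectors: the smaller the distance, the more similar two neighborhoods are. The function most_similar and the sample features below are hypothetical.

```r
# Rank neighborhoods by Euclidean distance to a target, after standardizing
# each feature. `features` is a numeric matrix with one named row per
# neighborhood. A constant column would produce NaN after scaling and should
# be dropped beforehand.
most_similar <- function(features, target, n = 3) {
  scaled <- scale(features)
  distances <- as.matrix(dist(scaled))[target, ]
  distances <- distances[names(distances) != target]
  head(sort(distances), n)
}

# Made-up features for four neighborhoods.
features <- rbind(
  A = c(1.0, 10.0),
  B = c(1.1, 10.5),
  C = c(5.0, 50.0),
  D = c(5.2, 49.0)
)
closest_to_a <- most_similar(features, "A", n = 2)
```

The names of the returned vector give the closest neighborhoods in order, which could be highlighted on the map when a user selects one.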

Center map on user's location

Request the user's location from the browser and center the map on it (if the location is within NYC).

Upload-data: Upload button

The button to open the upload input modal is now just a link. A button like the 'Apply' button with an upload icon might be better, as discussed.

Temporal data & clustering analysis

The idea of observing clusters change and evolve over time is appealing. Rather than just a snapshot, ideally, a user can observe the patterns in data over time as well.

With some tweaks to the underlying data structure, we could ideally generate features on the fly through basic aggregations up to a certain point in time which is then clustered.

Errors & validations

Show validation error messages in the upload flow where applicable. Currently, the server crashes if there's an error.
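A minimal sketch of the idea is to catch read errors and return a message instead of letting the error propagate. The function validate_upload and its return shape are assumptions, not the app's existing code.

```r
# Read an uploaded CSV defensively and report problems as messages rather
# than errors. Returns a list with `ok`, `message`, and `data`.
validate_upload <- function(path, id_column) {
  result <- tryCatch(
    suppressWarnings(read.csv(path, stringsAsFactors = FALSE)),
    error = function(e) e
  )
  if (inherits(result, "error")) {
    return(list(ok = FALSE, data = NULL,
                message = paste("Could not read file:", conditionMessage(result))))
  }
  if (!id_column %in% names(result)) {
    return(list(ok = FALSE, data = NULL,
                message = paste0("Column '", id_column, "' not found in the uploaded file.")))
  }
  list(ok = TRUE, data = result, message = "")
}
```

In a Shiny server, the returned message could then be shown to the user, for example through shiny's validate() and need().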

Control spatial homogeneity

The clustering right now uses an alpha of 0.15, but this might not be optimal for user-uploaded data. Two possible solutions:

Have a slider, similar to the one for the number of clusters, that allows users to control this parameter, ranging from 0.1 to 0.9 maybe, to ensure they don't go to extremes in either direction.
Another solution is to quickly calculate the level of disjointedness in the clustering and incrementally increase alpha based on this until a predetermined maximum is reached.

Any other ideas or thoughts on this, @stuartlynn ?
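The second option could look roughly like the sketch below. It assumes alpha linearly mixes a normalized feature distance with a normalized spatial distance, that a 0/1 tract adjacency matrix is available, and that disjointedness is measured as the number of extra connected pieces across clusters. None of this is confirmed to match the app's clustering code, and all function names are hypothetical.

```r
# Number of connected components among the tracts in `members`, given a
# symmetric 0/1 adjacency matrix of neighboring tracts.
count_components <- function(adjacency, members) {
  unvisited <- members
  components <- 0
  while (length(unvisited) > 0) {
    components <- components + 1
    frontier <- unvisited[1]
    unvisited <- unvisited[-1]
    while (length(frontier) > 0) {
      current <- frontier[1]
      frontier <- frontier[-1]
      neighbors <- unvisited[adjacency[current, unvisited] == 1]
      frontier <- c(frontier, neighbors)
      unvisited <- setdiff(unvisited, neighbors)
    }
  }
  components
}

# Disjointedness: how many pieces exist beyond one per cluster.
extra_pieces <- function(adjacency, labels) {
  pieces <- vapply(unique(labels), function(cl) {
    count_components(adjacency, which(labels == cl))
  }, numeric(1))
  sum(pieces) - length(pieces)
}

# Raise alpha step by step until clusters are contiguous or the cap is hit.
cluster_with_alpha <- function(feature_dist, spatial_dist, adjacency, k,
                               alpha = 0.15, step = 0.05, max_alpha = 0.9) {
  repeat {
    mixed <- (1 - alpha) * feature_dist / max(feature_dist) +
      alpha * spatial_dist / max(spatial_dist)
    labels <- cutree(hclust(as.dist(mixed), method = "ward.D2"), k = k)
    if (extra_pieces(adjacency, labels) == 0 || alpha + step > max_alpha) break
    alpha <- alpha + step
  }
  list(alpha = alpha, labels = labels)
}

# Toy example: four tracts in a row, where tracts 1 and 3 (and 2 and 4)
# have near-identical features.
adjacency <- matrix(0, 4, 4)
adjacency[cbind(1:3, 2:4)] <- 1
adjacency <- adjacency + t(adjacency)
feature_dist <- as.matrix(dist(c(0, 10, 0.1, 10.1)))
spatial_dist <- as.matrix(dist(1:4))
result <- cluster_with_alpha(feature_dist, spatial_dist, adjacency, k = 2)
```

Re-clustering at every step could be slow on the full set of tracts, so a coarser step or a cap on iterations may be needed in practice.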

Features from reduction in rent-controlled housing

Reduction in rent-controlled housing is a major issue in the city. Adding this as a feature along with existing housing features would be interesting to find neighborhoods becoming less affordable/accessible.

Clustering within a sub-region

The current version only allows for all of New York City to be clustered. It might be interesting for someone to cluster just one of the boroughs, for example.

Download results as PNG for heatmap

Currently, when you download results as a PNG, you get just the cluster map. We might want to provide both the cluster map and the heat map, as individual files or zipped together. Alternatively, respond with the file based on the user's selected view (which might be the easier and more logical thing to do).