Your Project Desc: Using my company's real estate listings data/history for San Fr

Project Feedback about dat_sf_12 HOT 3 OPEN

sampathweb commented on August 24, 2024

Project Feedback

from dat_sf_12.

Comments (3)

selwyth commented on August 24, 2024

Ramesh,

I'm going to post the data here tomorrow. Is a sample OK? I think I read Alex mention that a sample CSV on GitHub is fine if the data set is large.

I'm breaking it down into a 4-part problem based on your feedback. If I succeed at the first two parts, I'd consider the project a success; the last two are extra credit in case I blow through the first two really quickly.

1. Define clusters -- I definitely agree with using PCA and other techniques to reduce the # of dimensions first.

The primary challenge for me is drawing these clusters in lat-long space while having them defined by a couple of other dimensions, like price per sq foot and # of bedrooms. Is this feasible? Some sort of projection of clusters that live in n-dimensional space onto a lat-long plane? I'm also curious about how best to draw these clusters... would GeoJSON work for a typical mapping project?
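On the GeoJSON question, here is a minimal sketch of what an export could look like, assuming each home ends up as a (lat, lon, cluster label) row. The function name `clusters_to_geojson`, the row shape, and the sample coordinates are all hypothetical, made up for illustration. It writes each home as a Point feature with its cluster label stored as a property, which most mapping tools can then use for coloring.

```python
import json


def clusters_to_geojson(rows):
    """Build a GeoJSON FeatureCollection of Point features.

    `rows` is an iterable of (lat, lon, cluster_label) tuples.
    That row shape is an assumption made for this sketch.
    """
    features = []
    for lat, lon, label in rows:
        features.append({
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude], in that order.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"cluster": label},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows, for illustration only.
sample = [(37.7925, -122.4160, 0), (37.4613, -122.1970, 1)]
collection = clusters_to_geojson(sample)
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Points are the simplest thing to draw. Drawing regions instead would mean deriving a polygon per cluster, and a cluster with geographically separate parts would probably need a MultiPolygon geometry rather than a single Polygon.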

Also, if I find that, say, the 'Atherton' (near Palo Alto) and 'Piedmont' (in Oakland) clusters are similar, would they be 'Cluster 1A' and 'Cluster 1B', or would they be part of the same 'Cluster 1' whose projection into lat-long space comes out as non-adjacent regions? I would like to find out whether k-Means can produce a cluster made up of several non-adjacent sub-clusters.

Basically, I'm considering either using lat and long as additional dimensions for the clustering (i.e. geographical proximity matters) or using home features only (# bedrooms, price per sq foot, lot size, etc.). In either case, the end result is drawing the clusters on a lat-long plane.
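To make the two options concrete: if lat and long are left out of the feature matrix, k-means never sees location, so nothing forces a cluster to be contiguous on the map. Below is a minimal sketch of the features-only option using a hand-rolled k-means on synthetic rows. The coordinates, prices, and bedroom counts are made up for illustration, and `zscore_columns` and `kmeans` are hypothetical helper names; in practice a library implementation would likely be used instead.

```python
import math

# Synthetic toy listings, made up for illustration only (not real data):
# (lat, lon, price_per_sqft, bedrooms)
LISTINGS = [
    (37.46, -122.20, 1500.0, 5),  # pricey pocket in the south
    (37.45, -122.19, 1450.0, 5),
    (37.82, -122.23, 1480.0, 5),  # pricey pocket in the east, far from the first
    (37.83, -122.24, 1520.0, 4),
    (37.77, -122.42, 800.0, 2),   # cheaper, smaller homes
    (37.78, -122.41, 780.0, 1),
    (37.76, -122.43, 820.0, 2),
]


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        scaled_cols.append([(v - mean) / std for v in col])
    return [tuple(r) for r in zip(*scaled_cols)]


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next start: the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Cluster on home features only; lat and lon are deliberately excluded.
features = zscore_columns([(price, beds) for _, _, price, beds in LISTINGS])
labels = kmeans(features, k=2)

# Project the result back onto the map by grouping coordinates per label.
clusters = {}
for (lat, lon, _, _), lab in zip(LISTINGS, labels):
    clusters.setdefault(lab, []).append((lat, lon))
for lab, coords in sorted(clusters.items()):
    print(lab, coords)
```

In this toy data the two pricey pockets share a label even though they sit far apart on the map. Passing lat and lon as extra columns to `zscore_columns` is the other option; how strongly that pulls the pockets apart depends on how location is weighted against the other features.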

Yes, the primary intent is an inference exercise: to identify clusters in the data. I could probably convert it to a supervised classification problem by training a model on user-defined neighborhoods/clusters (i.e. these two/three/n homes are in the same neighborhood because users tend to search for them together), if that sounds much more doable than the clustering problem?

2. Text-mine listing descriptions and/or geo-tagged social media posts to describe clusters -- I'm fascinated by how apps like Yelp, Glassdoor, Amazon, LinkedIn etc. are able to pull out the 'key point/insight' in customer reviews, and would like to replicate that.
3. Use restaurant & bar data as another dimension in determining the clusters and/or as the labels instead in a classification exercise -- I basically want opening hours and type of establishment (luxury, dive etc.). I could pull this from Yelp, Google Maps or OpenTable's API if available, or just use our company-purchased restaurant map layers if that becomes too gnarly.
4. Understand how the clusters evolve. If whatever I did for 1) didn't produce recommendations or a similarity distance metric, then also do this (i.e. 'if you like homes in the Nob Hill cluster, you'll also like this random neighborhood in San Jose').
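For the text-mining part, one simple starting point (an assumption on my side, not necessarily what those apps do) is TF-IDF: treat all descriptions in a cluster as one document and rank each term by how frequent it is in that cluster, discounted by how many clusters use it. A minimal sketch, where the descriptions, the cluster names, and the helpers `tokenize` and `top_terms` are all hypothetical:

```python
import math
import re
from collections import Counter

# Made-up listing descriptions grouped by a hypothetical cluster label.
DESCRIPTIONS = {
    "cluster_a": [
        "Victorian flat with bay windows near cable car line",
        "Sunny Victorian condo, bay windows, walk to cafes",
    ],
    "cluster_b": [
        "Gated estate with pool and large lot",
        "Private estate, pool, guest house, large lot",
    ],
}


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(docs_by_cluster, n=3):
    """Rank each cluster's terms by TF-IDF, one 'document' per cluster."""
    counts = {
        cluster: Counter(tok for doc in docs for tok in tokenize(doc))
        for cluster, docs in docs_by_cluster.items()
    }
    n_clusters = len(counts)
    # Number of clusters in which each term appears at least once.
    cluster_freq = Counter(term for tf in counts.values() for term in tf)
    result = {}
    for cluster, tf in counts.items():
        total = sum(tf.values())
        scores = {
            term: (freq / total) * math.log(n_clusters / cluster_freq[term])
            for term, freq in tf.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[cluster] = ranked[:n]
    return result


top = top_terms(DESCRIPTIONS)
for cluster, terms in top.items():
    print(cluster, terms)
```

Terms that appear in every cluster (here "with") score zero and drop out, which is the point of the inverse-document-frequency factor. A real run would probably also want stop-word removal and more than two clusters for the scores to be informative.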

I'm very comfortable with regression and classification techniques and have fair experience with them, so I'd prefer NOT to scope this project around those, since I can do a regression/classification project independently. I view this project as a challenge to explore what I DON'T dare to do: clustering and dimension reduction in 1), text-mining in 2), API pulls in 3). These were and still are my blind spots entering the class, and I hope to knock out at least 2 of the 3.

sampathweb commented on August 24, 2024

This all sounds good. Pull some sample data and explore it using plots. Thanks for the detailed notes on where you want to go with the project.

Thanks,
Ramesh Sampath

On Feb 10, 2015, at 12:22 AM, selwyth wrote:

[quoted copy of selwyth's comment above]

selwyth commented on August 24, 2024

Ramesh, did this count as my final proposal, or am I supposed to do something else? I didn't see anything else to submit. Created a new repo for the project: https://github.com/selwyth/neighborhood
