Project Feedback about dat_sf_12 (open)

sampathweb commented on August 24, 2024
Project Feedback

from dat_sf_12.

Comments (3)

selwyth commented on August 24, 2024

Ramesh,

Going to post the data here tomorrow. Is a sample OK? I think I read Alex mention that a sample CSV on GitHub is fine if the data set is large.

Breaking it down into a 4-part problem based on your feedback. If I can succeed in the first two, I'd consider it a success, with the latter two being extra credit in case I happen to blow through it really quickly.

  1. Define clusters -- definitely agree on using PCA and other techniques to reduce the # of dimensions first.

The primary challenge for me is drawing these clusters in lat-long space while having the clusters be defined by a couple of other dimensions, like price per sq foot and # of bedrooms. Is this feasible? Some sort of projection of clusters that live in n-dimensional space onto a lat-long plane? I'm also curious about how best to draw these clusters... would GeoJSON work for a typical mapping project?
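On the GeoJSON question, a minimal sketch of what the output could look like, assuming each home already carries a cluster label. The coordinates, the labels, and the `to_feature_collection` helper are all made up for illustration; only the `json` module from the standard library is used.

```python
import json

# Hypothetical sample: (longitude, latitude, cluster label) per home.
homes = [
    (-122.20, 37.46, 0),
    (-122.23, 37.82, 0),
    (-122.41, 37.79, 1),
]


def to_feature_collection(rows):
    """Build a GeoJSON FeatureCollection with one Point feature per home."""
    features = []
    for lon, lat, label in rows:
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # The cluster label rides along as a property for styling.
            "properties": {"cluster": label},
        })
    return {"type": "FeatureCollection", "features": features}


collection = to_feature_collection(homes)
geojson_text = json.dumps(collection, indent=2)
```

A mapping layer could then color each point by its `cluster` property; drawing cluster outlines instead of points would need Polygon geometries, which this sketch does not attempt.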

Also, if I find that, say, the 'Atherton' (near Palo Alto) and 'Piedmont' (in Oakland) clusters are similar, would they be 'Cluster 1A' and 'Cluster 1B', or would they be part of the same 'Cluster 1' whose projection into lat-long space results in non-adjacent regions? I would like to find out whether k-Means can produce a cluster made up of several non-adjacent sub-clusters.

Basically, I'm considering either using lat and long as additional dimensions for the clustering (i.e. geographical proximity is important) or using home features only (# bedrooms, price per sq foot, lot size, etc.) for the clustering. In either case, the end result is drawing the clusters in a lat-long plane.
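A rough sketch of the second option (home features only), assuming a tiny made-up data set and a hand-rolled k-means with fixed starting centroids so the run is repeatable. k-means only sees the columns it is given, so when lat and long are left out, nothing forces the members of one cluster to sit next to each other on the map. The home names, values, and helper functions below are hypothetical.

```python
import math

# Hypothetical homes: (name, lat, lon, price_per_sqft, bedrooms).
homes = [
    ("A", 37.46, -122.20, 1500.0, 5),
    ("B", 37.82, -122.23, 1450.0, 5),
    ("C", 37.79, -122.41, 900.0, 1),
    ("D", 37.34, -121.89, 880.0, 2),
]


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Cluster on price and bedrooms only; lat/lon are deliberately left out.
features = standardize([h[3:] for h in homes])
labels = kmeans(features, [list(features[0]), list(features[2])])

# Project back onto the map: group (name, lat, lon) by cluster label.
by_cluster = {}
for (name, lat, lon, *_), label in zip(homes, labels):
    by_cluster.setdefault(label, []).append((name, lat, lon))
```

Here homes A and B land in the same cluster although their latitudes differ by about a third of a degree, so one cluster can show up as several separate patches once drawn in lat-long space. Adding standardized lat and long as extra columns would pull the clusters toward geographic compactness instead.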

Yes, the primary intent is an inference exercise: to identify clusters in the data. I could probably convert it to a supervised classification problem by training a model on user-defined neighborhoods/clusters (i.e. these two/three/n homes are in the same neighborhood because users tend to search for those two/three/n together), if this sounds much more doable than the clustering problem.

  2. Text-mine listing descriptions and/or geo-tagged social media posts to describe clusters -- I'm fascinated by how apps like Yelp, Glassdoor, Amazon, LinkedIn etc are able to pull out the 'key point/insight' in customer reviews, and would like to replicate that.

  3. Use restaurant & bar data as another dimension in determining the clusters and/or as the labels instead in a classification exercise -- I basically want opening hours and type of establishment (luxury, dive etc). I could pull this from Yelp, Google Maps or OpenTable's API if available, or just our company-purchased restaurant map layers if that becomes too gnarly.

  4. Understand how the clusters evolve. If whatever I did for 1) didn't produce recommendations or a similarity distance metric, then also do this (i.e. 'if you like homes in Nob Hill cluster, you'll also like this random neighborhood in San Jose').
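One simple way to get the 'if you like this cluster, you'll also like that one' ranking is to compare cluster centroids directly. The sketch below assumes each cluster has already been summarized as a centroid in standardized feature space; the cluster names, the numbers, and the `most_similar` helper are invented for illustration, and Euclidean distance is just one possible choice of metric.

```python
import math

# Hypothetical cluster centroids in standardized feature space
# (price per sq foot, bedrooms, lot size); all values are made up.
centroids = {
    "Nob Hill": [1.2, -0.8, -1.0],
    "San Jose cluster": [1.0, -0.6, -0.9],
    "Atherton": [1.5, 1.4, 1.6],
}


def most_similar(name, centroids):
    """Rank the other clusters by Euclidean distance to the named one."""
    target = centroids[name]
    others = [(math.dist(target, vec), other)
              for other, vec in centroids.items() if other != name]
    # Smallest distance first, i.e. most similar first.
    return [other for _, other in sorted(others)]


ranking = most_similar("Nob Hill", centroids)
```

With these made-up centroids, the San Jose cluster ranks ahead of Atherton as the closest match to Nob Hill, even though geography plays no part in the comparison.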

I'm very comfortable with regression and classification techniques and have a fair amount of experience with them, so I'd prefer NOT to scope this project around those, since I can do a regression/classification project independently. I view this project as a challenge to explore what I DON'T dare to do: clustering and dimension-reduction in 1), text-mining in 2), API pulls in 3). I feel these were and are my blind spots entering the class, and I hope to knock out at least 2 of the 3.

sampathweb commented on August 24, 2024

This all sounds good. Pull some sample data and explore it using plots. Thanks for the detailed notes on where you want to go with the project.

Thanks,
Ramesh Sampath

On Feb 10, 2015, at 12:22 AM, selwyth wrote:


selwyth commented on August 24, 2024

Ramesh, did this count as my final proposal, or am I supposed to do something else? I didn't see anything else to submit. Created a new repo for the project: https://github.com/selwyth/neighborhood
