DAT SF 12

Instructor:	Alessandro Gagliardi
EiRs:	Ramesh Sampath
	Otto Stegmaier
	Alex Chao
Classes:	6:30pm-9:30pm, Tuesdays and Thursdays
	January 15 – March 31
Office Hours:	Alex Chao, 5:30 - 6:30 before class at GA
	Otto Stegmaier, 9:30 - 10:00 after class at GA
	Ramesh Sampath, 4:00 - 6:00 Saturdays remote
	Can also be set by appointment

Submit homework by posting it to your own GitHub repo. Then post the URL and folder where the homework lives here.

Tentative Course Outline

Intro to Data Science, Relational Databases & SQL
- Awesome Public Datasets
Getting started with IPython & Git
- How to Use Git and GitHub
- SourceTree
- What’s new in IPython
APIs and semi-structured data
- Mining Twitter
IPython.parallel & StarCluster
- Parallel Computing with IPython
- AWS Provisioning
Hadoop Distributed File System and Spark
- AMP Camp 5 Exercises
- Announcing Spark 1.3
Intro to ML: k-Nearest Neighbor Classification
- Nearest Neighbor Methods (PDF Slides)
- k-Nearest Neighbor Classification Algorithm (YouTube Video)
- K Nearest Neighbors (Coursera Video)
- KNN for Humans
- Intuitive Classification using KNN and Python
- Nearest Neighbors Classification (scikit-learn documentation)
- Should I normalize/standardize/rescale the data?
Clustering: Hierarchical and K-Means
- A Tutorial on Clustering Algorithms (web tutorial)
- Hierarchical Clustering in Action
- K-means (scikit-learn documentation)
- Clustering: k-means (PDF Slides)
- Clustering: Hierarchical Clustering (PDF Slides)
Probability, A/B Tests & Statistical Significance
- Probability and Statistics (Khan Academy Course)
- What’s a good value for R-squared?
- Visualizing Distributions of Data
Multiple Linear Regression and ANOVA
- Plotting Linear Models using Seaborn
- Why ANOVA and Linear Regression are the Same Analysis
- Linear Regression and ANOVA (Examples in R)
- Multiple (Linear) Regression (Examples in R)
Logistic Regression and Generalized Linear Models
- Logistic Regression in Python
- Explaining Odds Ratios
- GLM for Poisson Data
- Logistic Regression (scikit-learn documentation)
- sklearn logistic regression with unbalanced classes
Project Elevator Pitches
- See Student Project Repos below
Naïve Bayes, Cross Validation, ROC, AUC & Midterm Review - Part I
- Bayes' Theorem with Lego
- Probabilistic Programming and Bayesian Methods for Hackers
- Doing Naive Bayes Classification
- Receiver operating characteristic (wikipedia article)
- Receiver Operating Characteristic (ROC) (scikit-learn documentation)
Naïve Bayes, Cross Validation, ROC, AUC - Part II
Principal Components Analysis
- PCA and other Decomposition Examples in scikit-learn
- Principal Component Analysis Explained Visually
- Should you apply PCA to your data?
Decision Trees and Forests
- Decision Tree Learning (Wikipedia article)
- Decision Trees
- How to construct a tree
- Information gain
Support Vector Machines
- Support Vector Machines (scikit-learn documentation)
- A User's Guide to Support Vector Machines
Scaling Out
- numpy.memmap
- Feature Hashing
- HashingVectorizer
- Vectorizing a large text corpus with the hashing trick
- Pyrallel - Parallel Data Analytics in Python
- Online Passive-Aggressive Algorithms
Recommendation Systems
- Machine Learning & Recommender Systems @ Netflix Scale
Visualization
- Chart Suggestions—A Thought-Starter
- Seaborn: statistical data visualization
- Welcome to Bokeh
- ggplot for python
- plotly
Final Project Presentations (12 min. each)
Final Project Presentations (12 min. each)
Future Directions

Project Schedule

Date	Due	Returned
1/22	Preliminary Project Proposals Due (3-4 sentences)
1/27	Homework 1
1/29		EiR Feedback on Project Proposals
2/3		EiR Feedback on Homework 1
2/5	Formal Proposals (including data and methods chosen)
2/10	Homework 2 Assigned
2/12		EiR Feedback on Formal Proposals
2/17	Homework 2 Due
2/19	Project Elevator Pitch in class (4 minutes each)	Project Live on Github
2/24		Homework 3 Assigned
2/26	Peer Feedback of Projects	Peer Feedback on Project
3/3		Midterm Assessment Posted
3/12	Midterm Assessment Due
3/17	At least one working model
3/24-26	Final Presentations (12 minutes each)	Midterm Graded

Student Project Repos:

dat_sf_12's Issues

HW 2 Feedback

Nice use of plots to help you determine the number of neighbors for KNN
You can use a pivot table (http://pandas.pydata.org/pandas-docs/stable/reshaping.html) to compare predicted vs. actual values more compactly.
There are other ways to calculate the accuracy of your model. For instance, most models have a score function, e.g. "clf.score(X_test, y_test)" http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
But what you wrote essentially does that albeit in a more manual way. It's good practice to write out the entire calculation so that you know the internals behind functions.
Check out the solutions note about scaling. Turns out that if you scale your variables before running kmeans, you can get much better predictive accuracy!
And while the problem didn't specify it, I think a few basic histograms of the data's distribution can go a long way toward catching things like the differing scales and other peculiarities that should inform how you modify your model.
All in all, good work!
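
The points in this feedback (manual accuracy vs. the built-in score function, a predicted-vs-actual table, and scaling) can be sketched together as follows. This is only an illustration under stated assumptions: it uses scikit-learn's bundled wine dataset rather than the homework data, picks n_neighbors=5 arbitrarily, and uses pd.crosstab as a shortcut for the pivot table mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the wine dataset stands in for the homework data.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# KNN on raw features vs. KNN on standardized features.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

# Accuracy computed by hand and via the built-in score method.
y_pred = scaled_knn.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)
builtin_accuracy = scaled_knn.score(X_test, y_test)

# Compact predicted-vs-actual table.
confusion = pd.crosstab(
    y_test, y_pred, rownames=["actual"], colnames=["predicted"]
)

print("raw accuracy:   ", raw_knn.score(X_test, y_test))
print("scaled accuracy:", builtin_accuracy)
print(confusion)
```

Putting the scaler inside a pipeline means it is fit on the training split only, so the test data does not leak into the scaling step.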

Your Project Desc:
Using my company's real estate listings data/history for San Francisco and East Bay, attempt to define 'neighborhoods' in terms of similar-price clusters. Augment this with text mining based on listing descriptions or geo-tagged social media posts to describe these clusters and allow humans to make sense of them (e.g. "Oh what we think of as Nob Hill should have its western boundaries extended if we think in terms of home price affordability"). Then, for extra credit, attempt to predict future gentrification, trending neighborhoods and/or pockets of home price growth with the help of map layers (e.g. restaurants, crime, walkability, bike score, bars, demographics) from 3rd-party APIs or company-purchased data.

Initial Feedback:
David,
Looks like you have the data from work and the text data to learn more about the properties. Integrating that with third-party APIs is generally tricky and time-consuming, so I would focus primarily on the data you already have and combine it with text analysis of the property descriptions.

Unsupervised learning (clustering) is a hard problem in high dimensions. We can certainly try PCA and other dimension reduction techniques so that we can visualize the data points in a 2-D space. Being unsupervised also makes it difficult to test your model. Is there any supervised learning you want to do with the data, or is the intent primarily to identify clusters?
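
As a rough sketch of that suggestion, the snippet below clusters synthetic 10-dimensional data with k-means, projects it to 2-D with PCA for plotting, and computes a silhouette score as one label-free sanity check. Everything here is an assumption for illustration: the data comes from make_blobs, not from the listings, and the choice of 4 clusters is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic 10-feature data stands in for the listing features.
X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full feature space.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# Project to two principal components purely for visualization.
coords = PCA(n_components=2).fit_transform(X_scaled)

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
score = silhouette_score(X_scaled, labels)

print("2-D coordinates shape:", coords.shape)
print("silhouette score:", round(score, 3))
```

The coords array can be passed to any scatter plot with labels as the color. The silhouette score does not replace a labeled test set, but it gives a way to compare different cluster counts.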

HW 1 Feedback

This may not have been clearly communicated in the instructions, but we wanted you to create an app token so that you could interact with the API. What you have works, but using the app token would have allowed you to do data[0] and see results.
Use this URL to access the API with a working token: https://data.sfgov.org/resource/vw6y-z8j6.json?$$app_token=Iq3JEcmMzuOIvahirX7cR0GTI
This will change a few of the column names, notably media_url. So you’d only need to drop ‘point’
Good job though choosing to drop the missing values. Since all values in the row were missing, this is the right choice. Usually you wouldn’t want to drop rows if you have some data elements in them.
Nice use of adding a percentage column in your query results. You’re the first student I saw do this!
One suggestion: when using the IPython notebook, don’t show the full data. Use head(), or LIMIT in your SQL queries; otherwise the reader will have to scroll a lot to get to the next cell.
All in all, good work. In general, query results were what we were expecting and hopefully you were able to get more familiar with pandas.
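
A small pandas sketch of the habits mentioned in this feedback: dropping only rows where every value is missing, adding a percentage column, and showing head() instead of the full frame. The column names and values below are made up for illustration and are not the actual API fields.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely missing, row 4 is only partly missing.
df = pd.DataFrame({
    "category": ["Graffiti", "Noise", np.nan, "Graffiti", "Litter", "Graffiti"],
    "status": ["Open", "Closed", np.nan, "Closed", np.nan, "Open"],
})

# how="all" drops a row only when every column is missing,
# so the partly filled row 4 is kept.
cleaned = df.dropna(how="all")

# Count rows per category and add a percentage column.
counts = cleaned["category"].value_counts().rename("n").to_frame()
counts["pct"] = 100 * counts["n"] / counts["n"].sum()

# Show only the first rows to keep the notebook readable.
print(counts.head())
```

The default dropna() uses how="any", which would also discard row 4 even though it still carries a category.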
