davidrosenberg / mlcourse

Machine learning course materials.
Home Page: https://davidrosenberg.github.io/ml2018
Proof is very easy to understand if you have the right picture. It's just a matter of projections... Perhaps it could be disguised as an exercise for the review on projections that we already need for hard-margin SVM?
Looks like the variable i has conflicting meanings.
A possible lab topic is calibration. It seems potentially most useful for students to cover the sklearn CalibratedClassifierCV methods -- perhaps working off this paper or similar.
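As a starting point for such a lab, here is a minimal sketch of sklearn's CalibratedClassifierCV (the dataset, base model, and calibration method are arbitrary illustrative choices):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem (toy setup, not course data)
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Naive Bayes probabilities are often poorly calibrated; wrapping the base
# model in CalibratedClassifierCV learns a sigmoid (Platt) correction
# via 5-fold cross-validation.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]
print(probs[:5])  # calibrated estimates of P(y=1 | x) on held-out points
```

Students could then compare reliability diagrams for the raw and calibrated models.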
Brett -- I think you over-committed some that shouldn't be in the repos. Can you remove those, and also any cleanup you want to do before I make a 2017 archive?
Let f:R^2->R be given by f(x,y) = x if y = x^2 and 0 otherwise. Then f is continuous at (0,0) and all directional derivatives there are zero but f is not differentiable at (0,0). In this example, we have non-differentiability even though we have a vector (the zero vector) that "behaves" like a gradient with respect to giving the directional derivatives upon taking inner products with the direction.
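A quick numeric check of this counterexample (a sketch; the sample directions are arbitrary):

```python
import math

def f(x, y):
    # f(x, y) = x on the parabola y = x^2, and 0 everywhere else
    return x if y == x * x else 0.0

# Along any fixed direction (u, v), the ray (t*u, t*v) meets the parabola
# at most once for t != 0, so f vanishes near t = 0 along the ray and
# every directional derivative at the origin is 0.
for u, v in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -3.0)]:
    t = 1e-8
    print((f(t * u, t * v) - f(0.0, 0.0)) / t)  # 0.0 for each direction

# Yet f is not differentiable at (0, 0): along the parabola itself,
# |f(x, x^2) - 0| / ||(x, x^2)|| tends to 1 rather than 0.
x = 1e-8
ratio = f(x, x * x) / math.hypot(x, x * x)
print(ratio)  # approximately 1.0
```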
Write learning objectives for Lab #1. Learning objectives should be somewhat testable. Something more specific than "understand X". I currently have this in learning-objectives.org, but feel free to start a learning-objectives.md file instead. Seems more standard.
mlcourse/Labs/4-Subgradients-Notes.ltx
Line 123 in bb0d258
are they the underestimating functions associated with the subgradients?
Is vector quantization an important enough feature generation method to discuss?
Some nice references for L1 regularization and sparsity:
Following discussion with @brett1479, the following approach may be more clear:
Often ĥ is specified as the solution to an optimization problem. (The ERM is one such example... but so is almost every learning method we will discuss.) In most of these situations, we have the practical issue that we cannot solve the optimization problem exactly in a limited amount of time, with fixed computational resources. In this case, we end up with a function h̃, which will perform differently from ĥ. The gap in performance between ĥ and h̃ is called the "optimization error".
Can be based on Eric Kim's work, which has source code (linked at bottom of page). Would also be nice to make prettier versions of these figures, as well as 2-dimensional versions.
See Piazza page for original reference.
If we discuss the "margin" in the hard-margin SVM setting, it's important to distinguish this geometric margin from the usual margin (which is just score x label for 2-class).
May be helpful to add that "linear convergence" means linear in the number of digits of accuracy.
In Slide 7 we illustrate a subgradient of a function on reals, in which case the subgradient is a scalar -- the slope of the linear lower bound on the function. But we've written it as a vector. (Exactly analogous to the gradient of a function on reals just being the derivative, which is the slope of the tangent line at a point.)
On contour plot, probably. Sampling far away from minimum should show most minibatch gradients pointing in the right-ish direction. While closer to the minimum, not as good. Can do this with varying minibatch sizes... This feels like it would be a good interactive demo...
In the same vein, compare a full batch gradient step (say with line search) vs. a single epoch of stochastic or minibatch gradient descent. Show the paths for each. Could also show all the minibatch gradients evaluated at the initial point, and explain that the full batch gradient is just adding all those vectors together, end to end, while an epoch of minibatch gradient descent allows a recalculation of the direction after each step. This seems like it would do much better, so long as the minibatches are taking you in roughly the right direction.
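A rough sketch of that comparison on a toy least-squares problem (problem sizes, batch size, and step size are illustrative choices, not a finished demo; the line-search step has a closed form because the objective is quadratic):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy least-squares problem
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

def grad(w, idx):
    Xi, yi = X[idx], y[idx]
    return 2.0 * Xi.T @ (Xi @ w - yi) / len(idx)

w0 = np.zeros(d)

# One full-batch step with exact line search: minimize loss(w0 - t*g)
# over t, which is closed-form for a quadratic objective.
g = grad(w0, np.arange(n))
Ag = 2.0 * X.T @ (X @ g) / n          # Hessian-vector product
t = (g @ g) / (g @ Ag)
w_full = w0 - t * g

# One epoch of minibatch SGD (batch size 20, fixed step size 0.05),
# recomputing the direction after every step.
w_sgd = w0.copy()
perm = rng.permutation(n)
for start in range(0, n, 20):
    idx = perm[start:start + 20]
    w_sgd = w_sgd - 0.05 * grad(w_sgd, idx)

print(loss(w0), loss(w_full), loss(w_sgd))
```

Recording the intermediate iterates and overlaying them on a contour plot of `loss` would give the paths described above.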
For 1-dimensional linear regression, there are 2 parameters so we can view the empirical risk function for varying amounts of data and the risk function as 3D plots and as contour plots. Suppose true distribution is X uniform on [0,1], y = 2x + N(0,SD=2).
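A sketch of computing that empirical risk surface on a grid, ready to hand to plt.contour or a 3D surface plot (grid ranges and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data-generating process from the note: X ~ Uniform[0,1], y = 2x + N(0, sd=2)
n = 1000
x = rng.uniform(0.0, 1.0, size=n)
y = 2.0 * x + rng.normal(0.0, 2.0, size=n)

# Empirical risk (mean squared error) over a grid of (slope, intercept)
slopes = np.linspace(-2.0, 6.0, 81)
intercepts = np.linspace(-4.0, 4.0, 81)
S, B = np.meshgrid(slopes, intercepts)
preds = S[..., None] * x + B[..., None]     # shape (81, 81, n)
risk = ((preds - y) ** 2).mean(axis=-1)     # empirical risk surface

# The surface can be visualized with plt.contour(S, B, risk) or plot_surface.
i, j = np.unravel_index(risk.argmin(), risk.shape)
print(S[i, j], B[i, j])  # grid minimizer, near (2, 0) for large n
```

Regenerating the surface with smaller n shows the empirical risk surface fluctuating around the true risk surface.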
Concept check question: what's the Bayes risk for square loss?
Use the Cauchy-Schwarz inequality to show that an L2 constraint on the weight vector ensures the prediction function is Lipschitz:
|f(x) - f(x')| = |<w, x> - <w, x'>| = |<w, x - x'>| <= ||w||_2 ||x - x'||_2
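A numeric sanity check of the bound, with arbitrary random values (the bound is tight when x - x' is aligned with w):

```python
import numpy as np

rng = np.random.default_rng(0)

# For the linear predictor f(x) = <w, x>, Cauchy-Schwarz gives
#   |f(x) - f(x')| = |<w, x - x'>| <= ||w||_2 * ||x - x'||_2.
w = rng.normal(size=5)
worst = 0.0
for _ in range(1000):
    x, xp = rng.normal(size=5), rng.normal(size=5)
    lhs = abs(w @ x - w @ xp)
    rhs = np.linalg.norm(w) * np.linalg.norm(x - xp)
    worst = max(worst, lhs / rhs)
print(worst)  # ratio never exceeds 1
```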
Based on Brian Dalessandro's slides. Also a special case of the same idea for RKHSs.
Several people were confused about the variable splitting in lasso and the meaning of equivalent optimization problems. Let's create a prep sheet to ease people into this, along the lines of the svm prep sheet.
Could use some help generating new versions of images on slides 19-23 of L1/L2 regularization: https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/2b.L1L2-regularization.pdf#page=19
The light blue color is very difficult to read when projected.
Also, this image is very important: http://lear.inrialpes.fr/people/mairal/resources/pdf/denis.pdf#page=26
I usually just draw it on the board, but with the board not videoing well, we should probably have our own slide version. Slide 36 is cool too -- the non-convex regularizer... and slide 38 for elastic net.
Hastie et al's book Statistical Learning with Sparsity has a very nice toy example illustrating the properties of L1/L2 regularization in the case of highly correlated groups of variables. Unfortunately their pictures are broken. Let's make a Jupyter notebook that reproduces the experiment with the correct images. My slides on the experiment setup start here:
https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/2.Lab.elastic-net.pdf#page=7
Seems like code to do this in sklearn is readily available:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
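As a starting point for that notebook, a minimal sketch contrasting ridge and lasso on duplicated features (the data and regularization strengths are arbitrary toy choices, not Hastie et al's exact experiment):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Toy setup: columns 0 and 1 are identical copies of one informative
# feature; column 2 is pure noise.
n = 100
z = rng.normal(size=n)
X = np.column_stack([z, z, rng.normal(size=n)])
y = 3.0 * z + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # ridge splits the weight evenly across the two copies
print(lasso.coef_)  # lasso typically concentrates the weight on one copy
```

Replacing the exact duplicates with highly correlated columns reproduces the qualitative behavior from the book's example.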
HTF Figure 9.4 shows no overfitting as we increase the tree size. Would like to understand what's going on here better.
Limit should be equal to f(x*).
I don't know the correct answer to this question, or whether neural networks are best left to other courses. Given that deep networks are a particularly hot field in machine learning right now, I am wondering whether a student finishing this course should have had more exposure to the concepts (I agree they shouldn't have mastery).
In the lab you give a picture proof that L2 regularization sets the coefficients to be equal when you have a duplicated feature. I think it makes sense to continue the picture discussion and see what happens when you use L1 regularization.
Related to #47 is interrogating black box predictors (perhaps using LIME). It might be interesting to roll this into coverage of feature importance -- highlighting that globally important features might not be locally important. In fact, the LIME vignette for continuous and categorical features here actually uses the adult data, so this would be an easy extension -- we could easily tack LIME (from this tutorial) onto the catboost with/without one-hot encodings.
In the lecture video you said you had left out the slide with the SGD convergence rate.
For a differentiable function that is Lipschitz continuous with constant L, give a bound on the derivative (for functions mapping R to R). (Write down the limit form of the derivative -- each quotient is bounded by the Lipschitz constant, so the limit is too, modulo some absolute values.) For functions mapping R^d to R^d, give a bound on the determinant of the Jacobian (it's L^d). (Use the same strategy but with directional derivatives in coordinate directions -- see also https://math.stackexchange.com/questions/1195715/jacobian-determinant-of-lipschitz-function.)
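A sketch of the intended argument, with the d-dimensional bound obtained via Hadamard's inequality:

```latex
% 1-d case: every difference quotient is bounded by the Lipschitz
% constant, hence so is the limit:
|f'(x)| \;=\; \lim_{h \to 0} \frac{|f(x+h) - f(x)|}{|h|}
        \;\le\; \lim_{h \to 0} \frac{L\,|h|}{|h|} \;=\; L .

% d-dimensional case: the i-th column of the Jacobian is the coordinate
% directional derivative \partial_i f(x), which by the same argument
% satisfies \|\partial_i f(x)\|_2 \le L. Hadamard's inequality then gives
|\det J_f(x)| \;\le\; \prod_{i=1}^{d} \|\partial_i f(x)\|_2 \;\le\; L^d .
```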
Michael Nielsen has a nice reference on universality for neural nets
Nearest neighbors is "guaranteed to be no worse than twice the Bayes error rate on infinite data" [Cover and Hart 1967]. Unpack this statement -- is it useful in high dimensions (e.g. images)?
Would fit nicely before trees, as an easy first fully nonparametric model...
Why are gradient vectors orthogonal to contour lines? Also, what happens at non-differentiable points? What vectors represent subgradients? Students had a difficult time understanding this subgradient slide, understandably.
Shanshan's notes may have a small bug in them, for differentiating the quadratic form
Can we amp up the explanations and/or examples? Current explanation here. We do have Homework 4 problem 6 -- should we have more? Something in python that we can more quickly visualize differences?
Given a sample, we get an estimator in the hypothesis space. The performance gap between the estimator and the best in the space is the estimation error. The estimator is a random function, so if we repeat the procedure on a new training set, we will end up with a new estimator. Show a different point for each new batch of data, clustering around the optimal. If we take a larger training set, the variance of those points should decrease. I don't know of a precise measure of this "variance". But if I draw it this way, need to point out that this is just a cartoon, in which points in the space correspond to prediction functions, and closer points correspond to prediction functions that have more similar predictions (say in L2 norm for score functions, or probability of difference for classifications).
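One way to make the cartoon concrete is a small simulation (a sketch; the model and sample sizes are arbitrary): fit the same model on many fresh training sets and watch the spread of the fitted parameters shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(n):
    # Least-squares slope (no intercept) on a fresh sample from
    # y = 2x + noise; the fitted slope is our random estimator.
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)
    return (x @ y) / (x @ x)

# Refit on 200 independent training sets for each sample size; the
# estimators cluster around the optimum, more tightly for larger n.
spreads = {}
for n in [10, 100, 1000]:
    slopes = np.array([fit_slope(n) for _ in range(200)])
    spreads[n] = slopes.std()
    print(n, slopes.mean(), spreads[n])
```

Plotting the 200 fitted slopes for each n as points on a line gives exactly the "cloud shrinking around the optimum" picture.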
Probably of relevance here is Pedro Domingos's paper on generalizing bias-variance decompositions beyond the square loss: http://homes.cs.washington.edu/~pedrod/bvd.pdf
With separable data, we seek to maximize the geometric margin, and that gives hard-margin SVM. Without separable data, we need to maintain clarity on what "geometric margin" is and how it connects to slack. In particular, rather than just the slack penalty perspective, can we formulate soft-margin SVM as: maximize the geometric margin, subject to a maximum total slack? Or, for a given geometric margin, find the separator that minimizes slack. Christoph Lambert's Kernel Methods for Object Recognition gives a nice set of slides.
Does f_F exist in the decision tree example?
This isn't an issue, but just another idea. Suppose you have x1+x2+...+xn = c and you want to solve this while minimizing ||x||_2. Suppose xi > xj. Then -ei + ej is a descent direction that lies within the constraint hyperplane.
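A quick numeric check of that claim (the vector and step size are arbitrary):

```python
import numpy as np

# Minimize ||x||_2 subject to sum(x) = c. If x_i > x_j, the note says
# d = -e_i + e_j is a feasible descent direction.
x = np.array([3.0, 1.0, 2.0])        # sum = 6 = c, and x_0 > x_1
d = np.zeros(3)
d[0], d[1] = -1.0, 1.0               # d = -e_0 + e_1

# d lies in the constraint hyperplane: moving along d keeps sum(x) fixed.
print(d.sum())                        # 0.0

# Directional derivative of ||x||^2 along d is <2x, d> = 2(x_j - x_i) < 0.
print(2 * x @ d)                      # negative

eps = 0.1
print(np.linalg.norm(x + eps * d) < np.linalg.norm(x))  # True: norm drops
```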
Software routinely gives various measures of feature importance and even marginal dependency on features. Explanation of what such things are and pointing out how they fail seems like a potentially very useful Lab topic.
I don't think the fact that w is a linear combination of the xi vectors is a surprising fact that we obtained from the analysis. I believe that follows immediately from the statement of the primal problem. What makes the linear combination result a bit more interesting is that the coefficients on the xi have a fixed sign.
Slide 36: g_d -> w_d
Add to directional derivative note the discussion about whether the gradient is a row or a column vector. From Piazza discussion:
Is the gradient a row vector or a column vector? (and does it matter?)
This is indeed a confusing issue. There are standard conventions that I will explain below, and which we will follow. But if you understand the meaning of the objects in question, it won't really matter for this class.
When we talk about the derivative of f : R^d -> R, we're talking about the Jacobian matrix of f, which for a function mapping into R ends up as a matrix with a single row, which is a row vector. The gradient is then defined as the transpose of the Jacobian matrix, and thus a column vector.
In the course webpage we link to Barnes's Matrix Differentiation notes as a reference. You'll notice the notes never use the word "gradient". Indeed, everything he writes there is about the derivative (i.e. the Jacobian). This is fine, as the gradient is just going to be the transpose of the relevant Jacobian.
Now an annoying thing: the other document on the website, simply called Appendix F: Matrix Calculus, uses the reverse convention. They define the Jacobian as the transpose of the one I've defined above and which I've found to be the standard one. Once you realize the difference is just a transpose, it's not a big deal. But it can certainly be confusing at first...
I recently found this nice website that describes how to find derivatives, but it also mentions the gradient as an aside:
http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.php
So now -- does it matter? Well, to some people, of course it matters. But in this course, we have two primary uses for the gradient:
Find the directional derivative in a particular direction. To do this, we only need to take the inner product of the gradient with the direction. If you have a row vector (i.e. the Jacobian) instead of a column vector (the gradient), it's still pretty clear what you're supposed to do. In fact, when you're programming, row and column vectors are often just represented as "vectors" rather than matrices that happen to have only 1 column or 1 row. You then just keep track yourself of whether it's a row or a column vector.
Equating the gradient to zero to find the critical points. Again, here it doesn't matter at all if you have a row or column vector (i.e. if you're working with the gradient or the derivative).
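A small numeric illustration of use 1 (the quadratic form and direction are arbitrary): in code the gradient is just a flat array, so the row-vs-column question never arises.

```python
import numpy as np

# For f(w) = w^T A w with symmetric A, the gradient is 2 A w, and the
# directional derivative along u is the inner product <grad, u>.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
f = lambda w: w @ A @ w
w = np.array([1.0, -2.0])
u = np.array([3.0, 4.0]) / 5.0       # unit direction

grad = 2.0 * A @ w                   # a flat array: neither row nor column

h = 1e-6
fd = (f(w + h * u) - f(w)) / h       # finite-difference directional derivative
print(fd, grad @ u)                  # the two values agree closely
```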
Any suggestions on how to make figures like these: (would you use Asymptote?)
https://github.com/davidrosenberg/mlcourse/blob/gh-pages/Figures/features/emailFeatures.png
https://github.com/davidrosenberg/mlcourse/blob/gh-pages/Figures/features/feature-extraction.png