davidrosenberg / mlcourse
534 stars · 31 watchers · 259 forks · 395.47 MB

Machine learning course materials.

Home Page: https://davidrosenberg.github.io/ml2018

TeX 13.79% Python 0.40% Jupyter Notebook 60.89% MATLAB 0.08% R 1.85% Mathematica 0.56% Shell 0.06% Makefile 0.08% CSS 0.01% HTML 20.62% Emacs Lisp 0.02% Asymptote 1.64% Perl 0.01%
Topics: machine-learning, course-materials

mlcourse's People

Contributors

bjakubowski, brett1479, davidrosenberg, dmadeka, leventsagun, sdmcclain, sreyas-mohan, vakobzar, xintianhan

mlcourse's Issues

Figure for Representer Theorem Proof

The proof is very easy to understand if you have the right picture. It's just a matter of projections... Perhaps it could be disguised as an exercise for the review of projections that we already need for hard-margin SVM?
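
For reference, the projection argument the figure would illustrate (my phrasing, not taken from the slides):

```latex
w = w_\parallel + w_\perp, \qquad
w_\parallel \in \operatorname{span}\{x_1,\dots,x_n\}, \qquad
\langle w_\perp, x_i \rangle = 0 \;\; \forall i.
% Predictions are unchanged, since <w, x_i> = <w_parallel, x_i>,
% while the norm can only shrink: ||w||^2 = ||w_parallel||^2 + ||w_perp||^2 >= ||w_parallel||^2,
% so projecting any solution onto the span of the data can only improve the objective.
```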

Clean up mlcourse for archiving

Brett -- I think you over-committed some files that shouldn't be in the repo. Can you remove those, and also do any cleanup you want before I make a 2017 archive?

Directional Derivatives vs Differentiability

Let f : R^2 → R be given by f(x,y) = x if y = x^2, and 0 otherwise. Then f is continuous at (0,0) and all directional derivatives there are zero, but f is not differentiable at (0,0). In this example we have non-differentiability even though there is a vector (the zero vector) that "behaves" like a gradient, in the sense that its inner product with any direction gives the directional derivative in that direction.
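
A quick numerical sanity check of this counterexample (a throwaway sketch, not part of the course materials): difference quotients along straight lines through the origin vanish, while the quotient along the parabola y = x^2 does not.

```python
import math

def f(x, y):
    """f(x, y) = x on the parabola y = x^2, and 0 everywhere else."""
    return x if y == x * x else 0.0

t = 1e-6

# Difference quotients along straight lines through the origin:
# (f(tu, tv) - f(0, 0)) / t -> 0 for every direction (u, v),
# so all directional derivatives at the origin are 0.
for u, v in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -2.0)]:
    print((u, v), (f(t * u, t * v) - f(0.0, 0.0)) / t)  # all 0.0

# But along the curve (t, t^2), f(t, t^2) = t is not o(||(t, t^2)||),
# so f cannot be differentiable at the origin with gradient (0, 0).
print(f(t, t * t) / math.hypot(t, t * t))  # ~1.0, does not vanish as t -> 0
```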

Learning objectives for Lab #1

Write learning objectives for Lab #1. Learning objectives should be somewhat testable -- something more specific than "understand X". I currently have this in learning-objectives.org, but feel free to start a learning-objectives.md file instead; that seems more standard.

Feature Slides

Is vector quantization an important enough feature generation method to discuss?

Ideas for Course

  1. Add comprehensive notes for each lecture/lab
  2. Condense lecture/lab by referring students to notes for optional material
  3. Add Neural Networks early on as a non-linear method

Decouple ERM from estimation error / approximation error

Following a discussion with @brett1479, the following approach may be clearer:

  • For any prediction function ĥ ∈ 𝓗, we can decompose the excess risk into an estimation error and an approximation error.

Often ĥ is specified as the solution to an optimization problem. (The ERM is one such example... but so is almost every learning method we will discuss.) In most of these situations, we have the practical issue that we cannot solve the optimization problem exactly in a limited amount of time, with fixed computational resources. In this case, we end up with a function h̃, which will perform differently from ĥ. The gap in performance between ĥ and h̃ is called the "optimization error".
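
In symbols (notation is mine, not necessarily the slides'): writing R for the risk, f* for the Bayes prediction function, h_𝓗 for the best function in 𝓗, ĥ for the function chosen from 𝓗 (e.g. by ERM), and h̃ for what the optimizer actually returns,

```latex
R(\tilde h) - R(f^*)
= \underbrace{R(\tilde h) - R(\hat h)}_{\text{optimization error}}
+ \underbrace{R(\hat h) - R(h_{\mathcal H})}_{\text{estimation error}}
+ \underbrace{R(h_{\mathcal H}) - R(f^*)}_{\text{approximation error}}.
```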

Geometric margin vs... margin

If we discuss the "margin" in the hard-margin SVM setting, it's important to distinguish this geometric margin from the usual margin, which for binary classification is just the score times the label.
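
For concreteness, the two notions in the linear case (standard definitions; the score f(x) = wᵀx + b is my notation here):

```latex
\text{functional margin of } (x, y):\;\; y\,(w^\top x + b),
\qquad
\text{geometric margin of } (x, y):\;\; \frac{y\,(w^\top x + b)}{\|w\|_2},
```

the latter being the signed distance from x to the separating hyperplane.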

Confusing notation in Lecture 4b, Slide 7

On Slide 7 we illustrate a subgradient of a function on the reals, in which case the subgradient is a scalar: the slope of a linear lower bound on the function. But we've written it as a vector. (This is exactly analogous to the gradient of a function on the reals just being the derivative, i.e. the slope of the tangent line at a point.)

For risk function or empirical risk, show true gradient and minibatch gradient samples

On a contour plot, probably. Sampling far away from the minimum should show most minibatch gradients pointing in roughly the right direction, while closer to the minimum the agreement won't be as good. Can do this with varying minibatch sizes... This feels like it would be a good interactive demo...

In the same vein, compare a full-batch gradient step (say, with a line search?) against a single epoch of stochastic or minibatch gradient descent, and show the paths for each. Could also show all the minibatch gradients evaluated at the initial point, and explain that the full-batch gradient is just all those vectors added together, end to end, while an epoch of minibatch gradient descent recalculates the direction after each step -- which seems like it would do much better, so long as the minibatches are taking you in roughly the right direction.
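
A rough sketch of what such a demo could look like, using least-squares linear regression as the objective (a throwaway sketch; the data, minibatch size, arrow scaling, and plot ranges are all placeholders to be tuned):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])  # intercept + slope features
y = X @ np.array([0.0, 2.0]) + rng.normal(0, 2, n)

def risk(w):
    return np.mean((X @ w - y) ** 2)

def grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

# Contours of the empirical risk over the two parameters.
w0s, w1s = np.meshgrid(np.linspace(-4, 4, 80), np.linspace(-2, 6, 80))
Z = np.array([[risk(np.array([a, b])) for a, b in zip(r0, r1)]
              for r0, r1 in zip(w0s, w1s)])
plt.contour(w0s, w1s, Z, levels=20)

# Minibatch gradients (size 10) at a point far from the minimum vs. near it:
# far away they mostly agree with the full-batch direction, near the minimum they scatter.
for w, color in [(np.array([-3.0, -1.0]), "C0"), (np.array([0.0, 2.0]), "C1")]:
    for _ in range(20):
        g = grad(w, rng.choice(n, size=10, replace=False))
        plt.arrow(w[0], w[1], -0.05 * g[0], -0.05 * g[1], color=color, alpha=0.5)
    g_full = grad(w, np.arange(n))
    plt.arrow(w[0], w[1], -0.05 * g_full[0], -0.05 * g_full[1], color="k", width=0.02)

plt.xlabel("intercept"); plt.ylabel("slope"); plt.show()
```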

Show empirical risk function and true risk for 1-dim linear regression

For 1-dimensional linear regression there are 2 parameters, so we can view the empirical risk function (for varying amounts of data) and the true risk function as 3D plots and as contour plots. Suppose the true distribution is X uniform on [0,1] and y = 2x + ε with ε ~ N(0, σ = 2).

Concept check question: what's the Bayes risk for the square loss?
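
A possible starting point for the contour plots (sketch only; grids and sample sizes are placeholders). For this setup, a prediction function a + b·x has true risk E[(a + bX − Y)²] = a² + a(b − 2) + (b − 2)²/3 + 4, using E[X] = 1/2, E[X²] = 1/3, and noise variance 4, so the true risk can be plotted in closed form alongside empirical risks for various n:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def true_risk(a, b):
    # E[(a + bX - Y)^2] with X ~ Unif[0,1], Y = 2X + eps, eps ~ N(0, 2^2)
    return a**2 + a * (b - 2) + (b - 2)**2 / 3 + 4

def empirical_risk(a, b, x, y):
    # Mean over the sample of (a + b*x_i - y_i)^2, broadcast over the (a, b) grid.
    return np.mean((a + b * x[:, None, None] - y[:, None, None]) ** 2, axis=0)

A, B = np.meshgrid(np.linspace(-3, 3, 60), np.linspace(-1, 5, 60))

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
axes[0].contour(A, B, true_risk(A, B), levels=20)
axes[0].set_title("true risk")

for ax, n in zip(axes[1:], [10, 100, 1000]):
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.normal(0, 2, n)
    ax.contour(A, B, empirical_risk(A, B, x, y), levels=20)
    ax.set_title(f"empirical risk, n={n}")

plt.show()
```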

Replace l1/l2 contour image with more readable ones

Could use some help generating new versions of images on slides 19-23 of L1/L2 regularization: https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/2b.L1L2-regularization.pdf#page=19

The light blue color is very difficult to read when projected.

Also, this image is very important: http://lear.inrialpes.fr/people/mairal/resources/pdf/denis.pdf#page=26
I usually just draw it on the board, but since the board doesn't show up well on video, we should probably have our own slide version. Slide 36 is cool too -- the non-convex regularizer -- and slide 38 for the elastic net.
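
One possible starting point for regenerating the slide 19-23 style figure with higher-contrast colors (a sketch; the loss center, level sets, and colors are arbitrary placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))

# Contours of a quadratic loss centered at an (arbitrary) unregularized minimum.
w_star = np.array([1.2, 0.8])
loss = (w1 - w_star[0]) ** 2 + 0.5 * (w2 - w_star[1]) ** 2

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
for ax, ball, title in [(axes[0], np.abs(w1) + np.abs(w2), "L1 ball"),
                        (axes[1], w1**2 + w2**2, "L2 ball")]:
    ax.contour(w1, w2, loss, levels=10, colors="tab:blue")        # darker than the old light blue
    ax.contour(w1, w2, ball, levels=[1.0], colors="tab:red", linewidths=2)
    ax.axhline(0, color="gray", lw=0.5); ax.axvline(0, color="gray", lw=0.5)
    ax.set_title(title); ax.set_aspect("equal")

plt.show()
```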

Reproduce (and fix) Hastie et al's L1/L2 regularizations paths

Hastie et al.'s book Statistical Learning with Sparsity has a very nice toy example illustrating the properties of L1/L2 regularization in the case of highly correlated groups of variables. Unfortunately their pictures are broken. Let's make a Jupyter notebook that reproduces the experiment with the correct images. My slides on the experiment setup start here:
https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/2.Lab.elastic-net.pdf#page=7

Seems like code to do this in sklearn is readily available:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
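
A minimal sketch of what the notebook could do, loosely following the sklearn example linked above (the group structure, correlation level, and coefficients below are placeholders, not the actual Hastie et al. setup):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path

rng = np.random.default_rng(0)
n, n_groups, group_size = 100, 3, 5

# Groups of highly correlated features: each group is one latent variable plus small noise.
Z = rng.normal(size=(n, n_groups))
X = np.repeat(Z, group_size, axis=1) + 0.05 * rng.normal(size=(n, n_groups * group_size))
y = Z @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=n)

alphas_l, coefs_l, _ = lasso_path(X, y)
alphas_e, coefs_e, _ = enet_path(X, y, l1_ratio=0.5)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, alphas, coefs, title in [(axes[0], alphas_l, coefs_l, "lasso"),
                                 (axes[1], alphas_e, coefs_e, "elastic net (l1_ratio=0.5)")]:
    for j in range(coefs.shape[0]):
        ax.plot(-np.log10(alphas), coefs[j], color=f"C{j // group_size}")  # one color per group
    ax.set_xlabel("-log10(alpha)"); ax.set_title(title)
axes[0].set_ylabel("coefficient")
plt.show()
```

The intended takeaway is that the lasso path tends to pick one representative per correlated group, while the elastic net path spreads the weight across the group.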

Neural Networks Potentially Underemphasized

I don't know the correct answer to this question, or whether Neural networks are best left to other courses. Given that deep networks are a particularly hot field in Machine Learning right now, I am wondering whether a student finishing this course should have had more of an exposure to the concepts (I agree they shouldn't have mastery).

More on Elastic Net Picture

In the lab you give a picture proof that L2 regularization sets the coefficients to be equal when you have a duplicated feature. I think it makes sense to continue the picture discussion and see what happens when you use L1 regularization.
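
A quick numerical companion to the picture (a sketch; the data and regularization strengths are arbitrary): with an exactly duplicated feature, the ridge solution puts equal weight on the two copies, while the L1 solution is not unique -- any split with the same sum (and sign) is equally optimal, and the solver just returns one of them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3 * x + rng.normal(size=n)

# Two identical copies of the same feature.
X = np.column_stack([x, x])

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

print("ridge coefficients:", ridge.coef_)  # equal weights on the two copies
print("lasso coefficients:", lasso.coef_)  # some split of the total weight; any split
                                           # with the same (signed) sum is equally optimal
```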

LIME/Black box

Related to #47 is interrogating black box predictors (perhaps using LIME). It might be interesting to roll this into coverage of feature importance -- highlighting that globally important features might not be locally important. In fact, the LIME vignette for continuous and categorical features here actually uses the adult data, so this would be an easy extension -- we could easily tack LIME (from this tutorial) onto the catboost with/without one-hot encodings.
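
For illustration, a bare-bones LIME-style local surrogate done by hand -- not the lime package itself, and the black-box model and dataset below are placeholders (the real exercise would use the adult data and the catboost models): fit a model, then explain one prediction by fitting a proximity-weighted linear model to the black box's outputs on perturbed copies of that point.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge
from sklearn.datasets import make_classification

# Placeholder black box on synthetic data.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

def explain_locally(x0, n_samples=500, scale=0.5, kernel_width=1.0):
    """LIME-style explanation: weighted linear fit to the black box around x0."""
    rng = np.random.default_rng(0)
    Z = x0 + scale * rng.normal(size=(n_samples, x0.shape[0]))      # perturb x0
    p = black_box.predict_proba(Z)[:, 1]                            # black-box outputs
    w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / kernel_width ** 2)  # proximity weights
    surrogate = Ridge(alpha=1.0).fit(Z - x0, p, sample_weight=w)
    return surrogate.coef_                                          # local feature effects

print("local coefficients at one point:", explain_locally(X[0]))
# Comparing these across a few points (and against a global importance measure)
# would illustrate that globally important features need not be locally important.
```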

Possible concept check for SGD

For a differentiable function that is Lipschitz continuous with constant L, give a bound on the derivative (for functions mapping R to R). (Write down the limit form of the derivative: each difference quotient is bounded in absolute value by the Lipschitz constant, so the limit is too.) For functions mapping R^d to R^d, give a bound on the determinant of the Jacobian. (It's L^d; use the same strategy but with directional derivatives in the coordinate directions -- see also https://math.stackexchange.com/questions/1195715/jacobian-determinant-of-lipschitz-function.)
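
A possible worked answer for the one-dimensional part, plus the column-norm step for the Jacobian part (my phrasing):

```latex
|f'(x)| \;=\; \Big|\lim_{h \to 0} \frac{f(x+h) - f(x)}{h}\Big|
        \;=\; \lim_{h \to 0} \frac{|f(x+h) - f(x)|}{|h|}
        \;\le\; L.
% For f : R^d -> R^d, the j-th column of the Jacobian is the directional derivative
% in direction e_j, so by the same argument its 2-norm is at most L; Hadamard's
% inequality then gives |det J_f(x)| <= prod_j ||J_f(x) e_j||_2 <= L^d.
```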

Discuss k nearest neighbors

Nearest neighbors is "guaranteed to be no worse than twice the Bayes error rate on infinite data" [Cover and Hart 1967]. Unpack this statement -- is it useful in high dimensions (e.g. images)?
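
For reference, the statement being unpacked, in its standard binary-classification, infinite-data form (where R* is the Bayes risk and R_1NN is the asymptotic risk of the 1-nearest-neighbor rule):

```latex
R^{*} \;\le\; R_{\mathrm{1NN}} \;\le\; 2\,R^{*}\,(1 - R^{*}) \;\le\; 2\,R^{*}.
```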

Would fit nicely before trees, as an easy first fully nonparametric model...

Tie Estimation Error to Variance

Given a sample, we get an estimator in the hypothesis space. The performance gap between this estimator and the best function in the space is the estimation error. The estimator is a random function: if we repeat the procedure on a new training set, we end up with a new estimator. Show a different point for each new batch of data, clustering around the optimum. If we take a larger training set, the variance of those points should decrease. I don't know of a precise measure of this "variance", but if I draw it this way, I need to point out that this is just a cartoon, in which points in the space correspond to prediction functions, and closer points correspond to prediction functions with more similar predictions (say in L2 norm for score functions, or probability of disagreement for classifiers).

Probably of relevance here is Pedro Domingos's paper on generalizing bias-variance decompositions beyond the square loss: http://homes.cs.washington.edu/~pedrod/bvd.pdf
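
A quick simulation of the cartoon (everything here -- the data distribution, the two-parameter hypothesis space, and the sample sizes -- is made up for illustration): refit the model on many fresh training sets and scatter the fitted parameter vectors, so the cloud of estimators visibly tightens around the best-in-class parameters as n grows.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def fit_once(n):
    """Fit (intercept, slope) by least squares on a fresh sample of size n."""
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.normal(0, 2, n)
    A = np.column_stack([np.ones(n), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

for n, color in [(20, "C0"), (200, "C1"), (2000, "C2")]:
    estimates = np.array([fit_once(n) for _ in range(300)])
    plt.scatter(estimates[:, 0], estimates[:, 1], s=5, alpha=0.4,
                color=color, label=f"n={n}")

plt.scatter([0], [2], marker="*", s=200, color="k", label="best in class")
plt.xlabel("intercept"); plt.ylabel("slope"); plt.legend(); plt.show()
```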

More careful treatment of the link between hard and soft-margin SVM

With separable data, we seek to maximize the geometric margin, and that gives the hard-margin SVM. Without separable data, we need to stay clear on what the "geometric margin" is and how it connects to slack. In particular, rather than just the slack-penalty perspective, can we formulate the soft-margin SVM as: maximize the geometric margin, subject to a bound on the total slack? Or, for a given geometric margin, find the separator that minimizes the total slack? Christoph Lampert's Kernel Methods for Object Recognition gives a nice set of slides.
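
For reference, the standard slack-penalty formulation to state the alternatives against (with the canonical scaling, the geometric margin of the resulting separator is 1/‖w‖):

```latex
\min_{w, b, \xi}\;\; \frac{1}{2}\|w\|_2^2 \;+\; C \sum_{i=1}^n \xi_i
\quad\text{s.t.}\quad y_i (w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0.
```

The alternatives above would swap the roles of the two terms, e.g. minimize ‖w‖ (i.e. maximize the geometric margin) subject to Σᵢ ξᵢ ≤ s, which should be equivalent to the penalized form for appropriately matched C and s.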

Alternate Proof of Square Sum

This isn't an issue, just another idea. Suppose you have x_1 + x_2 + ... + x_n = c and you want to minimize ||x||_2 subject to that constraint. Suppose x_i > x_j. Then -e_i + e_j is a descent direction that lies within the constraint hyperplane.
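
Spelling out the descent-direction claim (my elaboration):

```latex
\frac{d}{dt}\,\big\|x + t(-e_i + e_j)\big\|_2^2 \,\Big|_{t=0}
  = \frac{d}{dt}\Big[(x_i - t)^2 + (x_j + t)^2\Big]_{t=0}
  = 2\,(x_j - x_i) < 0,
```

while the coordinate sum is unchanged along this direction, so at a minimizer no coordinate can exceed another, i.e. all coordinates equal c/n.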

Lecture 3c Slide 17 Comment

I don't think the fact that w is a linear combination of the xi vectors is a surprising fact that we obtained from the analysis. I believe that follows immediately from the statement of the primal problem. What makes the linear combination result a bit more interesting is that the coefficients on the xi have a fixed sign.

Make note on gradient being row vs column vector

Add to directional derivative note the discussion about whether the gradient is a row or a column vector. From Piazza discussion:

Is the gradient a row vector or a column vector? (and does it matter?)
This is indeed a confusing issue. There are standard conventions that I will explain below, and which we will follow. But if you understand the meaning of the objects in question, it won't really matter for this class.

When we talk about the derivative of f : R^d → R, we're talking about the Jacobian matrix of f, which, for a function mapping into R, is a matrix with a single row -- i.e. a row vector. The gradient is then defined as the transpose of the Jacobian matrix, and is thus a column vector.

In the course webpage we link to Barnes's Matrix Differentiation notes as a reference. You'll notice the notes never use the word "gradient". Indeed, everything he writes there is about the derivative (i.e. the Jacobian). This is fine, as the gradient is just going to be the transpose of the relevant Jacobian.

Now an annoying thing: the other document on the website, simply called Appendix F: Matrix Calculus, uses the reverse convention. It defines the Jacobian as the transpose of the one I've described above (which I've found to be the standard one). Once you realize the difference is just a transpose, it's not a big deal, but it can certainly be confusing at first...

I recently found this nice website that describes how to find derivatives, but it also mentions the gradient as an aside:
http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.php

So now -- does it matter? Well, to some people, of course it matters. But in this course, we have two primary uses for the gradient:

  1. Finding the directional derivative in a particular direction. To do this, we only need to take the inner product of the gradient with the direction. If you have a row vector (i.e. the Jacobian) instead of a column vector (the gradient), it's still pretty clear what you're supposed to do. In fact, when you're programming, row and column vectors are often just represented as "vectors" rather than matrices that happen to have only 1 column or 1 row. You then just keep track yourself of whether it's a row or a column vector.
  2. Equating the gradient to zero to find the critical points. Again, here it doesn't matter at all whether you have a row or a column vector (i.e. whether you're working with the gradient or the derivative).
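
A tiny numpy illustration of the programming point (my example, not from the Piazza thread): for f(x) = xᵀAx the gradient is stored as a plain 1-D array, and both uses work the same way regardless of the row-vs-column convention.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([1.0, -2.0])

# f(x) = x^T A x.  Its gradient is (A + A^T) x; as a numpy object this is just a
# 1-D array -- numpy doesn't distinguish row from column vectors here.
grad = (A + A.T) @ x

# Use 1: the directional derivative in direction v is just the inner product <grad, v>.
v = np.array([1.0, 1.0]) / np.sqrt(2)
print("directional derivative:", grad @ v)

# Sanity check against a finite difference.
f = lambda z: z @ A @ z
eps = 1e-6
print("finite difference:     ", (f(x + eps * v) - f(x)) / eps)

# Use 2: set the gradient to zero, (A + A^T) z = 0, to find critical points --
# again it makes no difference whether you call grad a row or a column vector.
print("critical point:", np.linalg.solve(A + A.T, np.zeros(2)))
```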
