davidrosenberg / mlcourse

Machine learning course materials.
Home Page: https://davidrosenberg.github.io/ml2018
Proof is very easy to understand if you have the right picture. It's just a matter of projections... Perhaps it could be disguised as an exercise for the review on projections that we already need for hard-margin SVM?
Looks like the variable i has conflicting meanings.
A possible lab topic is calibration. It seems potentially most useful for students to cover the sklearn CalibratedClassifierCV methods -- perhaps working off this paper or similar.
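As a starting point for such a lab, here is a minimal sketch of sklearn's CalibratedClassifierCV (the dataset, base model, and calibration method are arbitrary illustrative choices):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem (toy setup, not course data)
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Naive Bayes probabilities are often poorly calibrated; wrapping the base
# model in CalibratedClassifierCV learns a sigmoid (Platt) correction
# via 5-fold cross-validation.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]
print(probs[:5])  # calibrated estimates of P(y=1 | x) on held-out points
```

Students could then compare reliability diagrams for the raw and calibrated models.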
Brett -- I think you over-committed some that shouldn't be in the repos. Can you remove those, and also any cleanup you want to do before I make a 2017 archive?
Let f:R^2->R be given by f(x,y) = x if y = x^2 and 0 otherwise. Then f is continuous at (0,0) and all directional derivatives there are zero but f is not differentiable at (0,0). In this example, we have non-differentiability even though we have a vector (the zero vector) that "behaves" like a gradient with respect to giving the directional derivatives upon taking inner products with the direction.
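A quick numeric check of this counterexample (a sketch; the sample directions are arbitrary):

```python
import math

def f(x, y):
    # f(x, y) = x on the parabola y = x^2, and 0 everywhere else
    return x if y == x * x else 0.0

# Along any fixed direction (u, v), the ray (t*u, t*v) meets the parabola
# at most once for t != 0, so f vanishes near t = 0 along the ray and
# every directional derivative at the origin is 0.
for u, v in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -3.0)]:
    t = 1e-8
    print((f(t * u, t * v) - f(0.0, 0.0)) / t)  # 0.0 for each direction

# Yet f is not differentiable at (0, 0): along the parabola itself,
# |f(x, x^2) - 0| / ||(x, x^2)|| tends to 1 rather than 0.
x = 1e-8
ratio = f(x, x * x) / math.hypot(x, x * x)
print(ratio)  # approximately 1.0
```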
Write learning objectives for Lab #1. Learning objectives should be somewhat testable. Something more specific than "understand X". I currently have this in learning-objectives.org, but feel free to start a learning-objectives.md file instead. Seems more standard.
mlcourse/Labs/4-Subgradients-Notes.ltx
Line 123 in bb0d258
are they the underestimating functions associated with the subgradients?
Is vector quantization an important enough feature generation method to discuss?
Some nice references for L1 regularization and sparsity:
Following discussion with @brett1479, the following approach may be more clear:
Often ĥ is specified as the solution to an optimization problem. (The ERM is one such example... but so is almost every learning method we will discuss.) In most of these situations, we have the practical issue that we cannot solve the optimization problem exactly in a limited amount of time, with fixed computational resources. In this case, we end up with a function h̃, which will perform differently from ĥ. The gap in performance between ĥ and h̃ is called the "optimization error".
Can be based on Eric Kim's work, which has source code (linked at bottom of page). Would also be nice to make prettier versions of these figures, as well as 2-dimensional versions.
See Piazza page for original reference.
If we discuss the "margin" in the hard-margin SVM setting, it's important to distinguish this geometric margin from the usual margin (which is just score x label for 2-class).
May be helpful to add that "linear convergence" means linear in the number of digits of accuracy.
In Slide 7 we illustrate a subgradient of a function on reals, in which case the subgradient is a scalar -- the slope of the linear lower bound on the function. But we've written it as a vector. (Exactly analogous to the gradient of a function on reals just being the derivative, which is the slope of the tangent line at a point.)
On contour plot, probably. Sampling far away from minimum should show most minibatch gradients pointing in the right-ish direction. While closer to the minimum, not as good. Can do this with varying minibatch sizes... This feels like it would be a good interactive demo...
In the same vein, compare a full batch gradient step (say with line search) vs. a single epoch of stochastic or minibatch gradient descent. Show the paths for each. Could also show all the minibatch gradients evaluated at the initial point, and explain that the full batch gradient is just adding all those vectors together, end to end, while an epoch of minibatch gradient descent allows a recalculation of the direction after each step. This seems like it would do much better, so long as the minibatches are taking you in roughly the right direction.
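A rough sketch of that comparison on a toy least-squares problem (problem sizes, batch size, and step size are illustrative choices, not a finished demo; the line-search step has a closed form because the objective is quadratic):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy least-squares problem
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

def grad(w, idx):
    Xi, yi = X[idx], y[idx]
    return 2.0 * Xi.T @ (Xi @ w - yi) / len(idx)

w0 = np.zeros(d)

# One full-batch step with exact line search: minimize loss(w0 - t*g)
# over t, which is closed-form for a quadratic objective.
g = grad(w0, np.arange(n))
Ag = 2.0 * X.T @ (X @ g) / n          # Hessian-vector product
t = (g @ g) / (g @ Ag)
w_full = w0 - t * g

# One epoch of minibatch SGD (batch size 20, fixed step size 0.05),
# recomputing the direction after every step.
w_sgd = w0.copy()
perm = rng.permutation(n)
for start in range(0, n, 20):
    idx = perm[start:start + 20]
    w_sgd = w_sgd - 0.05 * grad(w_sgd, idx)

print(loss(w0), loss(w_full), loss(w_sgd))
```

Recording the intermediate iterates and overlaying them on a contour plot of `loss` would give the paths described above.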
For 1-dimensional linear regression, there are 2 parameters so we can view the empirical risk function for varying amounts of data and the risk function as 3D plots and as contour plots. Suppose true distribution is X uniform on [0,1], y = 2x + N(0,SD=2).
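A sketch of computing that empirical risk surface on a grid, ready to hand to plt.contour or a 3D surface plot (grid ranges and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data-generating process from the note: X ~ Uniform[0,1], y = 2x + N(0, sd=2)
n = 1000
x = rng.uniform(0.0, 1.0, size=n)
y = 2.0 * x + rng.normal(0.0, 2.0, size=n)

# Empirical risk (mean squared error) over a grid of (slope, intercept)
slopes = np.linspace(-2.0, 6.0, 81)
intercepts = np.linspace(-4.0, 4.0, 81)
S, B = np.meshgrid(slopes, intercepts)
preds = S[..., None] * x + B[..., None]     # shape (81, 81, n)
risk = ((preds - y) ** 2).mean(axis=-1)     # empirical risk surface

# The surface can be visualized with plt.contour(S, B, risk) or plot_surface.
i, j = np.unravel_index(risk.argmin(), risk.shape)
print(S[i, j], B[i, j])  # grid minimizer, near (2, 0) for large n
```

Regenerating the surface with smaller n shows the empirical risk surface fluctuating around the true risk surface.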
Concept check question: what's the Bayes risk for square loss?
Use the Cauchy-Schwarz inequality to show that an L2 constraint on the weight vector ensures the prediction function is Lipschitz:
|f(x) - f(x')| = |<w, x> - <w, x'>| = |<w, x - x'>| <= ||w||_2 ||x - x'||_2
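A numeric sanity check of the bound, with arbitrary random values (the bound is tight when x - x' is aligned with w):

```python
import numpy as np

rng = np.random.default_rng(0)

# For the linear predictor f(x) = <w, x>, Cauchy-Schwarz gives
#   |f(x) - f(x')| = |<w, x - x'>| <= ||w||_2 * ||x - x'||_2.
w = rng.normal(size=5)
worst = 0.0
for _ in range(1000):
    x, xp = rng.normal(size=5), rng.normal(size=5)
    lhs = abs(w @ x - w @ xp)
    rhs = np.linalg.norm(w) * np.linalg.norm(x - xp)
    worst = max(worst, lhs / rhs)
print(worst)  # ratio never exceeds 1
```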
Based on Brian Dalessandro's slides. Also a special case of the same idea for RKHSs.
Several people were confused about the variable splitting in lasso and the meaning of equivalent optimization problems. Let's create a prep sheet to ease people into this, along the lines of the svm prep sheet.
Could use some help generating new versions of images on slides 19-23 of L1/L2 regularization: https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/2b.L1L2-regularization.pdf#page=19
The light blue color is very difficult to read when projected.
Also, this image is very important: http://lear.inrialpes.fr/people/mairal/resources/pdf/denis.pdf#page=26
I usually just draw it on the board, but with the board not videoing well, we should probably have our own slide version. Slide 36 is cool too -- the non-convex regularizer... and slide 38 for elastic net.
Hastie et al's book Statistical Learning with Sparsity has a very nice toy example illustrating the properties of L1/L2 regularization in the case of highly correlated groups of variables. Unfortunately their pictures are broken. Let's make a Jupyter notebook that reproduces the experiment with the correct images. My slides on the experiment setup start here:
https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/2.Lab.elastic-net.pdf#page=7
Seems like code to do this in sklearn is readily available:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
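As a starting point for that notebook, a minimal sketch contrasting ridge and lasso on duplicated features (the data and regularization strengths are arbitrary toy choices, not Hastie et al's exact experiment):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Toy setup: columns 0 and 1 are identical copies of one informative
# feature; column 2 is pure noise.
n = 100
z = rng.normal(size=n)
X = np.column_stack([z, z, rng.normal(size=n)])
y = 3.0 * z + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # ridge splits the weight evenly across the two copies
print(lasso.coef_)  # lasso typically concentrates the weight on one copy
```

Replacing the exact duplicates with highly correlated columns reproduces the qualitative behavior from the book's example.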
HTF Figure 9.4 shows no overfitting as we increase the tree size. Would like to understand what's going on here better.
Limit should be equal to f(x*).
I don't know the correct answer to this question, or whether neural networks are best left to other courses. Given that deep networks are a particularly hot field in machine learning right now, I am wondering whether a student finishing this course should have had more exposure to the concepts (I agree they shouldn't have mastery).
In the lab you give a picture proof that L2 regularization sets the coefficients to be equal when you have a duplicated feature. I think it makes sense to continue the picture discussion and see what happens when you use L1 regularization.
Related to #47 is interrogating black box predictors (perhaps using LIME). It might be interesting to roll this into coverage of feature importance -- highlighting that globally important features might not be locally important. In fact, the LIME vignette for continuous and categorical features here actually uses the adult data, so this would be an easy extension -- we could easily tack LIME (from this tutorial) onto the catboost with/without one-hot encodings.
In the lecture video you said you had left out the slide with the SGD convergence rate.
For a differentiable function that is Lipschitz continuous with constant L, give a bound on the derivative (for functions mapping R to R). (Write down the limit form of the derivative -- each quotient is bounded by the Lipschitz constant, so the limit is too, modulo some absolute values.) For functions mapping R^d to R^d, give a bound on the determinant of the Jacobian (it's L^d). (Use the same strategy but with directional derivatives in coordinate directions -- see also https://math.stackexchange.com/questions/1195715/jacobian-determinant-of-lipschitz-function.)
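A sketch of the intended argument, with the d-dimensional bound obtained via Hadamard's inequality:

```latex
% 1-d case: every difference quotient is bounded by the Lipschitz
% constant, hence so is the limit:
|f'(x)| \;=\; \lim_{h \to 0} \frac{|f(x+h) - f(x)|}{|h|}
        \;\le\; \lim_{h \to 0} \frac{L\,|h|}{|h|} \;=\; L .

% d-dimensional case: the i-th column of the Jacobian is the coordinate
% directional derivative \partial_i f(x), which by the same argument
% satisfies \|\partial_i f(x)\|_2 \le L. Hadamard's inequality then gives
|\det J_f(x)| \;\le\; \prod_{i=1}^{d} \|\partial_i f(x)\|_2 \;\le\; L^d .
```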
Michael Nielsen has a nice reference on universality for neural nets
Nearest neighbors is "guaranteed to be no worse than twice the Bayes error rate on infinite data" [Cover and Hart 1967]. Unpack this statement -- is it useful in high dimensions (e.g. images)?
Would fit nicely before trees, as an easy first fully nonparametric model...
Why are gradient vectors orthogonal to contour lines? Also, what happens at non-differentiable points? What vectors represent subgradients? Students had a difficult time understanding this subgradient slide, understandably.
Shanshan's notes may have a small bug in them, for differentiating the quadratic form
Can we amp up the explanations and/or examples? Current explanation here. We do have Homework 4 problem 6 -- should we have more? Something in python that we can more quickly visualize differences?
Given a sample, we get an estimator in the hypothesis space. The performance gap between the estimator and the best in the space is the estimation error. The estimator is a random function, so if we repeat the procedure on a new training set, we will end up with a new estimator. Show a different point for each new batch of data, clustering around the optimal. If we take a larger training set, the variance of those points should decrease. I don't know of a precise measure of this "variance". But if I draw it this way, need to point out that this is just a cartoon, in which points in the space correspond to prediction functions, and closer points correspond to prediction functions that have more similar predictions (say in L2 norm for score functions, or probability of difference for classifications).
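One way to make the cartoon concrete is a small simulation (a sketch; the model and sample sizes are arbitrary): fit the same model on many fresh training sets and watch the spread of the fitted parameters shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(n):
    # Least-squares slope (no intercept) on a fresh sample from
    # y = 2x + noise; the fitted slope is our random estimator.
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)
    return (x @ y) / (x @ x)

# Refit on 200 independent training sets for each sample size; the
# estimators cluster around the optimum, more tightly for larger n.
spreads = {}
for n in [10, 100, 1000]:
    slopes = np.array([fit_slope(n) for _ in range(200)])
    spreads[n] = slopes.std()
    print(n, slopes.mean(), spreads[n])
```

Plotting the 200 fitted slopes for each n as points on a line gives exactly the "cloud shrinking around the optimum" picture.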
Probably of relevance here is Pedro Domingos's paper on generalizing bias-variance decompositions beyond the square loss: http://homes.cs.washington.edu/~pedrod/bvd.pdf
With separable data, we seek to maximize the geometric margin, and that gives hard-margin SVM. Without separable data, we need to maintain clarity on what "geometric margin" is and how it connects to slack. In particular, rather than just the slack penalty perspective, can we formulate soft-margin SVM as: maximize the geometric margin, subject to a maximum total slack? Or, for a given geometric margin, find the separator that minimizes slack. Christoph Lambert's Kernel Methods for Object Recognition gives a nice set of slides.
Does f_F exist in the decision tree example?
This isn't an issue, but just another idea. Suppose you have x1+x2+...+xn = c and you want to solve this while minimizing ||x||_2. Suppose xi > xj. Then -ei + ej is a descent direction that lies within the constraint hyperplane.
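A quick numeric check of that claim (the vector and step size are arbitrary):

```python
import numpy as np

# Minimize ||x||_2 subject to sum(x) = c. If x_i > x_j, the note says
# d = -e_i + e_j is a feasible descent direction.
x = np.array([3.0, 1.0, 2.0])        # sum = 6 = c, and x_0 > x_1
d = np.zeros(3)
d[0], d[1] = -1.0, 1.0               # d = -e_0 + e_1

# d lies in the constraint hyperplane: moving along d keeps sum(x) fixed.
print(d.sum())                        # 0.0

# Directional derivative of ||x||^2 along d is <2x, d> = 2(x_j - x_i) < 0.
print(2 * x @ d)                      # negative

eps = 0.1
print(np.linalg.norm(x + eps * d) < np.linalg.norm(x))  # True: norm drops
```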
Software routinely gives various measures of feature importance and even marginal dependency on features. Explanation of what such things are and pointing out how they fail seems like a potentially very useful Lab topic.
I don't think the fact that w is a linear combination of the xi vectors is a surprising fact that we obtained from the analysis. I believe that follows immediately from the statement of the primal problem. What makes the linear combination result a bit more interesting is that the coefficients on the xi have a fixed sign.
Slide 36: g_d -> w_d
Add to directional derivative note the discussion about whether the gradient is a row or a column vector. From Piazza discussion:
Is the gradient a row vector or a column vector? (and does it matter?)
This is indeed a confusing issue. There are standard conventions that I will explain below, and which we will follow. But if you understand the meaning of the objects in question, it won't really matter for this class.
When we talk about the derivative of f : R^d -> R, we're talking about the Jacobian matrix of f, which for a function mapping into R ends up as a matrix with a single row, which is a row vector. The gradient is then defined as the transpose of the Jacobian matrix, and thus a column vector.
In the course webpage we link to Barnes's Matrix Differentiation notes as a reference. You'll notice the notes never use the word "gradient". Indeed, everything he writes there is about the derivative (i.e. the Jacobian). This is fine, as the gradient is just going to be the transpose of the relevant Jacobian.
Now an annoying thing: the other document on the website, simply called Appendix F: Matrix Calculus, uses the reverse convention. They define the Jacobian as the transpose of the one I've defined above and which I've found to be the standard one. Once you realize the difference is just a transpose, it's not a big deal. But it can certainly be confusing at first...
I recently found this nice website that describes how to find derivatives, but it also mentions the gradient as an aside:
http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.php
So now -- does it matter? Well, to some people, of course it matters. But in this course, we have two primary uses for the gradient:
Find the directional derivative in a particular direction. To do this, we only need to take the inner product of the gradient with the direction. If you have a row vector (i.e. the Jacobian) instead of a column vector (the gradient), it's still pretty clear what you're supposed to do. In fact, when you're programming, row and column vectors are often just represented as "vectors" rather than matrices that happen to have only 1 column or 1 row. You then just keep track yourself of whether it's a row or a column vector.
Equating the gradient to zero to find the critical points. Again, here it doesn't matter at all if you have a row or column vector (i.e. if you're working with the gradient or the derivative).
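A small numeric illustration of use 1 (the quadratic form and direction are arbitrary): in code the gradient is just a flat array, so the row-vs-column question never arises.

```python
import numpy as np

# For f(w) = w^T A w with symmetric A, the gradient is 2 A w, and the
# directional derivative along u is the inner product <grad, u>.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
f = lambda w: w @ A @ w
w = np.array([1.0, -2.0])
u = np.array([3.0, 4.0]) / 5.0       # unit direction

grad = 2.0 * A @ w                   # a flat array: neither row nor column

h = 1e-6
fd = (f(w + h * u) - f(w)) / h       # finite-difference directional derivative
print(fd, grad @ u)                  # the two values agree closely
```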
Any suggestions on how to make figures like these: (would you use Asymptote?)
https://github.com/davidrosenberg/mlcourse/blob/gh-pages/Figures/features/emailFeatures.png
https://github.com/davidrosenberg/mlcourse/blob/gh-pages/Figures/features/feature-extraction.png