Comments (8)
I generalized the d-dimensional learning rate D_n to have hyperparameters α and c:

I_hat_n = α * I_hat_{n-1} + diag(I_hat_new)
D_n = 1 / (I_hat_n)^c

The observed Fisher information I_hat_n is the sum of two terms: I_hat_{n-1} from the previous iteration, and I_hat_new, which is the outer product of the score function at the current data point with itself, i.e., \nabla l(θ_{n-1}; y_n) (\nabla l(θ_{n-1}; y_n))^T. The discount α governs how much weight to give the previous history of gradients, and c is the exponent. Two special cases are:

- "adagrad": α = 1, c = 1/2
- "d-dim": α = 0, c = 1
See the learning rate wiki for more.
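In R, the per-iteration update amounts to the following (a minimal sketch with a hypothetical helper name; the sgd package computes this internally):

# Generalized d-dimensional learning-rate update (sketch).
ddim_lr_update <- function(I_hat, grad, alpha, c) {
  # diag of the outer product grad %*% t(grad) is just grad^2 elementwise
  I_hat <- alpha * I_hat + grad^2
  list(I_hat = I_hat, D_n = 1 / I_hat^c)  # per-coordinate learning rates
}
# "adagrad": alpha = 1, c = 1/2;  "d-dim": alpha = 0, c = 1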
from sgd.
Nice, do you see the sqrt being better even in the normal model?
from sgd.
Yup.
library(sgd)
# Dimensions
N <- 1e5
d <- 1e2
# Generate data.
set.seed(42)
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
# d-dim: alpha=0, c=1 (no gradient history)
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.008434492
# d-dim-weight: alpha=1, c=1 (full history, no square root)
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim-weight"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.48697
# adagrad: alpha=1, c=1/2 (full history, square root)
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="adagrad"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.0003773736
Here d-dim-weight has α = 1, c = 1; that is, it uses the full history of gradients as AdaGrad does, but is not as conservative in shrinking the learning rate. The same pattern held for a number of other seeds: AdaGrad beats d-dim by one or two orders of magnitude, and d-dim-weight never converges. It's also worth noting that this is all with the default SGD method (implicit).
Intuitively, AdaGrad is somehow able to use the information stored in the previous history of gradients, but intelligently enough that it doesn't simply blow up the way d-dim-weight does.
from sgd.
I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function using the Mahalanobis norm, which generalizes the L2 norm. That is, we move from the standard projected stochastic gradient update

θ_n = arg min_θ || θ - (θ_{n-1} + α_n g_{n-1}) ||_2^2

to

θ_n = arg min_θ || θ - (θ_{n-1} + α_n A^{-1} g_{n-1}) ||_A^2

One can then bound the regret, which leads to solving

min_A \sum_{t=1}^T g_t^T A^{-1} g_t

The minimizing A is exactly the square root of the (empirical) Fisher information, (\sum_t g_t g_t^T)^{1/2}, so in the diagonal approximation the learning rate should be the inverse square root of the accumulated squared gradient entries. See these slides, roughly pages 9-11.
This seems to contradict the fact that the minimizer should really be the Fisher information itself and not its square root? Or perhaps it's different because we're working under the Mahalanobis norm?
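A quick numerical sanity check of the square-root claim (a sketch of the diagonal case only; variable names are mine, not from the slides):

# Diagonal case: minimize sum_j s_j / a_j over a > 0 with sum(a) = C,
# where s_j = sum_t g_{t,j}^2. The Lagrange solution is
# a_j proportional to sqrt(s_j), i.e., the square-root rule.
set.seed(1)
G <- matrix(rnorm(50 * 5), ncol = 5)  # rows are the gradients g_t
s <- colSums(G^2)
C <- 1
a_opt <- C * sqrt(s) / sum(sqrt(s))   # square-root rule, trace = C
obj <- function(a) sum(s / a)         # sum_t g_t^T A^{-1} g_t for A = diag(a)
a_alt <- runif(5); a_alt <- C * a_alt / sum(a_alt)  # another feasible a
obj(a_opt) <= obj(a_alt)              # TRUE: the square-root choice wins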
from sgd.
Nice! Yes, exactly. The inverse of the Fisher information gives the minimum-variance unbiased estimator (MVUE), but the square root gives minimum regret (an upper bound on it, at least). So AdaGrad is an inefficient estimator but should have lower regret. I suppose this roughly means that in small samples it does better than the MVUE, while in the limit the MVUE uses more information.
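For reference, a compact statement of the two optimality notions being contrasted (standard facts, not derived in this thread; the regret minimum is stated under a trace constraint tr(A) <= C):

% Efficiency: the Cramér-Rao bound, approached by preconditioning
% with the inverse Fisher information:
\[ \operatorname{Var}(\hat{\theta}_n) \succeq \tfrac{1}{n} I(\theta)^{-1} \]
% Regret: with S = \sum_{t=1}^T g_t g_t^\top, the constrained minimum
% is attained at the square root of S, not at S itself:
\[ \min_{A \succeq 0,\ \operatorname{tr}(A) \le C} \sum_{t=1}^T g_t^\top A^{-1} g_t
   = \frac{\operatorname{tr}(S^{1/2})^2}{C},
   \qquad A^{\star} = \frac{C}{\operatorname{tr}(S^{1/2})}\, S^{1/2} \]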
That's pretty cool actually. Could we try to validate this in experiments?
from sgd.
Yup, would definitely be interesting to see. That is, we check the variance of the two estimates as n -> infty through a plot.
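Something like the following sketch (reusing d and theta from the code above; a single run per n for brevity, so one would average over seeds for actual variance estimates):

# Track the MSE of each estimate as n grows.
ns <- c(1e3, 1e4, 1e5)
res <- sapply(ns, function(n) {
  X <- matrix(rnorm(n * d), ncol = d)
  y <- cbind(1, X) %*% theta + rnorm(n)
  dat <- data.frame(y = y, x = X)
  sapply(c("adagrad", "d-dim"), function(lr) {
    fit <- sgd(y ~ ., data = dat, model = "lm", sgd.control = list(lr = lr))
    mean((fit$coefficients - theta)^2)
  })
})
matplot(ns, t(res), type = "b", log = "xy", pch = 1:2, col = 1:2,
        xlab = "n", ylab = "MSE of estimate")
legend("topright", legend = c("adagrad", "d-dim"), pch = 1:2, col = 1:2)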
from sgd.
yes exactly
from sgd.
As a reminder (to self), this was looked at and briefly mentioned in the current draft of the NIPS submission. The intuition behind why AdaGrad leads to better empirical performance in practice than the non-square-rooted version is still a mystery.
from sgd.