
AdaGrad vs d-dim (airoldilab/sgd, open)

airoldilab commented on August 20, 2024
AdaGrad vs d-dim


Comments (8)

dustinvtran commented on August 20, 2024

I generalized the d-dimensional learning rate D_n to have hyperparameters α and c:

I_hat = α*I_hat + diag(I_hat_new)
D_n = 1/(I_hat)^c

The observed Fisher information I_hat is the sum of two terms: I_hat from the previous iteration (discounted by α), and I_hat_new, the outer product of the score function at the current data point with itself, i.e., \nabla l(θ_{n-1}; y_n) (\nabla l(θ_{n-1}; y_n))^T.

The discount α governs how much weight is given to the previous history of gradients, and c is the exponent. Two special cases are:

  • "adagrad": α=1, c=1/2
  • "d-dim": α=0, c=1

See the learning rate wiki for more.
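
In case it helps to see the update concretely, here is a minimal R sketch of one iteration of this rule. The names update_lr, I_hat, and grad are just for illustration (the package implements this internally), and the small eps constant is an added assumption to avoid division by zero:

# One step of the generalized diagonal learning rate:
# discount the accumulated information by alpha, add the new squared score,
# then take the element-wise inverse raised to the power c.
update_lr <- function(I_hat, grad, alpha, c, eps = 1e-8) {
  I_hat <- alpha * I_hat + grad^2   # grad^2 == diag(grad %*% t(grad))
  D_n <- 1 / (I_hat + eps)^c        # element-wise learning rate D_n
  list(I_hat = I_hat, D_n = D_n)
}

# "adagrad": alpha = 1, c = 1/2;  "d-dim": alpha = 0, c = 1
state <- update_lr(I_hat = rep(0, 5), grad = rnorm(5), alpha = 1, c = 1/2)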


ptoulis commented on August 20, 2024

Nice, do you see the sqrt being better even in the normal model?


dustinvtran commented on August 20, 2024

Yup.

library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
set.seed(42)
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

# Fit with the d-dim learning rate (alpha=0, c=1).
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.008434492

# Fit with the d-dim-weight learning rate (alpha=1, c=1).
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim-weight"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.48697

# Fit with the AdaGrad learning rate (alpha=1, c=1/2).
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="adagrad"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.0003773736

Here d-dim-weight has α=1, c=1; that is, it uses the full history of the gradients as AdaGrad does, but is not as conservative in reducing the learning rate. The same pattern held for a bunch of other seeds: AdaGrad beats d-dim by one or two orders of magnitude, and d-dim-weight never converges. It's also worth noting that this is run with the default SGD method (implicit).

Intuitively, AdaGrad is somehow able to use the information stored in the previous history of gradients, but intelligently enough that it doesn't just blow up the way d-dim-weight does.


dustinvtran commented on August 20, 2024

I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function under the Mahalanobis norm, which generalizes the L2 norm. That is, we move from the standard projected stochastic gradient update

θ_n = arg min_θ || θ - (θ_{n-1} + α_n g_{n-1}) ||_2^2

to

θ_n = arg min_θ || θ - (θ_{n-1} + α_n A^{-1} g_{n-1}) ||_A^2

One can then bound the regret function which leads to solving

min_A \sum_{t=1}^T g_t^T A^{-1} g_t

The minimizing A is exactly the square root of the Fisher information, and so the diagonal approximation to the learning rate should be the inverse square root of the accumulated squared gradients along the diagonal. See these slides, roughly pages 9-11.
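
For what it's worth, here is a short sketch of where the square root comes from, assuming the trace-constrained form of the problem used in the AdaGrad paper (the constant C below is just an arbitrary budget):

Write G = \sum_{t=1}^T g_t g_t^T, so that \sum_t g_t^T A^{-1} g_t = tr(A^{-1} G). The bound is minimized under a trace budget,

min_{A ⪰ 0} tr(A^{-1} G)   subject to   tr(A) ≤ C,

and the minimizer is A* = C G^{1/2} / tr(G^{1/2}), i.e., proportional to the square root of G. In the diagonal case A = diag(s), Cauchy-Schwarz gives (\sum_i sqrt(G_{ii}))^2 ≤ (\sum_i G_{ii}/s_i)(\sum_i s_i), with equality at s_i ∝ sqrt(G_{ii}) = sqrt(\sum_t g_{t,i}^2), so the per-coordinate rate is 1/s_i ∝ 1/sqrt(\sum_t g_{t,i}^2), which is AdaGrad. The square root therefore comes from the trace budget on A rather than from any statistical efficiency criterion.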

This seems to contradict the fact that the minimum should really be the Fisher information and not its square root. Or perhaps it's different because it's working under the Mahalanobis norm?


ptoulis commented on August 20, 2024

Nice! Yes, exactly. The inverse of the Fisher information gives the minimum-variance unbiased estimator (MVUE), but the square root gives the minimum regret (an upper bound on it, at least). So AdaGrad is an inefficient estimator but should have lower regret. I suppose this roughly means that in small samples it does better than the MVUE, but in the limit the MVUE uses more information.

That's pretty cool actually. Could we try to validate this in experiments?


dustinvtran commented on August 20, 2024

Yup, it would definitely be interesting to see. That is, we check the variance of the two estimates as n -> infty through a plot.
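
Something like the following would do it, reusing the simulation setup from the earlier snippet. This is only a sketch: the grid of sample sizes and the single replication per n are arbitrary choices, and it tracks MSE around the true theta as a proxy for variance.

library(sgd)

set.seed(42)
d <- 1e2
theta <- rep(5, d + 1)
Ns <- c(1e3, 1e4, 1e5)

# MSE of the SGD estimate for a given sample size and learning rate.
mse <- function(n, lr) {
  X <- matrix(rnorm(n * d), ncol = d)
  y <- cbind(1, X) %*% theta + rnorm(n)
  dat <- data.frame(y = y, x = X)
  fit <- sgd(y ~ ., data = dat, model = "lm", sgd.control = list(lr = lr))
  mean((fit$coefficients - theta)^2)
}

res <- sapply(Ns, function(n) c(adagrad = mse(n, "adagrad"),
                                d.dim   = mse(n, "d-dim")))

# MSE vs. n on log-log axes, one curve per learning rate.
matplot(Ns, t(res), type = "b", log = "xy", pch = 1:2,
        xlab = "n", ylab = "MSE")
legend("topright", legend = rownames(res), pch = 1:2)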


ptoulis commented on August 20, 2024

Yes, exactly.


dustinvtran commented on August 20, 2024

As a reminder (to self): this was looked at and briefly mentioned in the current draft of the NIPS submission. The intuition behind why AdaGrad leads to better empirical performance in practice than the non-square-rooted version is still a mystery.

