
AdaGrad vs d-dim (airoldilab/sgd, open)

airoldilab commented on August 20, 2024
AdaGrad vs d-dim


Comments (8)

dustinvtran commented on August 20, 2024

I generalized the d-dimensional learning rate D_n to have hyperparameters α and c:

I_hat = α*I_hat + diag(I_hat_new)
D_n = 1/(I_hat)^c

The observed Fisher information I_hat is the sum of two terms: I_hat from the previous iteration (discounted by α), and I_hat_new, the outer product of the score function at the current data point with itself, i.e., \nabla l(θ_{n-1}; y_n) (\nabla l(θ_{n-1}; y_n))^T.

The discount α governs how much weight is given to the previous history of gradients, and c is the exponent. Two special cases are:

  • "adagrad": α=1, c=1/2
  • "d-dim": α=0, c=1

See the learning rate wiki for more.
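
In case it helps to see the update concretely, here is a minimal R sketch of one iteration of this rule. The names update_lr, I_hat, and grad are just for illustration (the package implements this internally), and the small eps constant is an added assumption to avoid division by zero:

# One step of the generalized diagonal learning rate:
# discount the accumulated information by alpha, add the new squared score,
# then take the element-wise inverse raised to the power c.
update_lr <- function(I_hat, grad, alpha, c, eps = 1e-8) {
  I_hat <- alpha * I_hat + grad^2   # grad^2 == diag(grad %*% t(grad))
  D_n <- 1 / (I_hat + eps)^c        # element-wise learning rate D_n
  list(I_hat = I_hat, D_n = D_n)
}

# "adagrad": alpha = 1, c = 1/2;  "d-dim": alpha = 0, c = 1
state <- update_lr(I_hat = rep(0, 5), grad = rnorm(5), alpha = 1, c = 1/2)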


ptoulis commented on August 20, 2024

Nice, do you see the sqrt being better even in the normal model?


dustinvtran commented on August 20, 2024

Yup.

library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
set.seed(42)
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

# Fit with the d-dim learning rate (alpha=0, c=1).
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.008434492

# Fit with the d-dim-weight learning rate (alpha=1, c=1).
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim-weight"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.48697

# Fit with the AdaGrad learning rate (alpha=1, c=1/2).
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="adagrad"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.0003773736

Here d-dim-weight has α=1, c=1; that is, it uses the full history of the gradients as AdaGrad does, but is not as conservative in reducing the learning rate. The same pattern held for a bunch of other seeds: AdaGrad beats d-dim by one or two orders of magnitude, and d-dim-weight never converges. It's also worth noting that this is run with the default SGD method (implicit).

Intuitively, AdaGrad is somehow able to use the information stored in the previous history of gradients, but intelligently enough that it doesn't just blow up the way d-dim-weight does.


dustinvtran commented on August 20, 2024

I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function under the Mahalanobis norm, which generalizes the L2 norm. That is, we move from the standard projected stochastic gradient update

θ_n = arg min_θ || θ - (θ_{n-1} + α_n g_{n-1}) ||_2^2

to

θ_n = arg min_θ || θ - (θ_{n-1} + α_n A^{-1} g_{n-1}) ||_A^2

One can then bound the regret function which leads to solving

min_A \sum_{t=1}^T g_t^T A^{-1} g_t

The minimizing A is exactly the square root of the Fisher information, and so the diagonal approximation to the learning rate should be the inverse square root of the accumulated squared gradients along the diagonal. See these slides, roughly pages 9-11.
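
For what it's worth, here is a short sketch of where the square root comes from, assuming the trace-constrained form of the problem used in the AdaGrad paper (the constant C below is just an arbitrary budget):

Write G = \sum_{t=1}^T g_t g_t^T, so that \sum_t g_t^T A^{-1} g_t = tr(A^{-1} G). The bound is minimized under a trace budget,

min_{A ⪰ 0} tr(A^{-1} G)   subject to   tr(A) ≤ C,

and the minimizer is A* = C G^{1/2} / tr(G^{1/2}), i.e., proportional to the square root of G. In the diagonal case A = diag(s), Cauchy-Schwarz gives (\sum_i sqrt(G_{ii}))^2 ≤ (\sum_i G_{ii}/s_i)(\sum_i s_i), with equality at s_i ∝ sqrt(G_{ii}) = sqrt(\sum_t g_{t,i}^2), so the per-coordinate rate is 1/s_i ∝ 1/sqrt(\sum_t g_{t,i}^2), which is AdaGrad. The square root therefore comes from the trace budget on A rather than from any statistical efficiency criterion.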

This seems to contradict the fact that the minimum should really be the Fisher information and not its square root. Or perhaps it's different because it's working under the Mahalanobis norm?


ptoulis commented on August 20, 2024

Nice! Yes, exactly. The inverse of the Fisher information gives the minimum-variance unbiased estimator (MVUE), but the square root gives the minimum regret (an upper bound on it, at least). So AdaGrad is an inefficient estimator but should have lower regret. I suppose this roughly means that in small samples it does better than the MVUE, but in the limit the MVUE uses more information.

That's pretty cool actually. Could we try to validate this in experiments?


dustinvtran commented on August 20, 2024

Yup, it would definitely be interesting to see. That is, we check the variance of the two estimates as n -> infty through a plot.
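
Something like the following would do it, reusing the simulation setup from the earlier snippet. This is only a sketch: the grid of sample sizes and the single replication per n are arbitrary choices, and it tracks MSE around the true theta as a proxy for variance.

library(sgd)

set.seed(42)
d <- 1e2
theta <- rep(5, d + 1)
Ns <- c(1e3, 1e4, 1e5)

# MSE of the SGD estimate for a given sample size and learning rate.
mse <- function(n, lr) {
  X <- matrix(rnorm(n * d), ncol = d)
  y <- cbind(1, X) %*% theta + rnorm(n)
  dat <- data.frame(y = y, x = X)
  fit <- sgd(y ~ ., data = dat, model = "lm", sgd.control = list(lr = lr))
  mean((fit$coefficients - theta)^2)
}

res <- sapply(Ns, function(n) c(adagrad = mse(n, "adagrad"),
                                d.dim   = mse(n, "d-dim")))

# MSE vs. n on log-log axes, one curve per learning rate.
matplot(Ns, t(res), type = "b", log = "xy", pch = 1:2,
        xlab = "n", ylab = "MSE")
legend("topright", legend = rownames(res), pch = 1:2)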


ptoulis commented on August 20, 2024

Yes, exactly.


dustinvtran commented on August 20, 2024

As a reminder (to self): this was looked at and briefly mentioned in the current draft of the NIPS submission. The intuition behind why AdaGrad leads to better empirical performance in practice than the non-square-rooted version is still a mystery.

