
clovaai / adamp

410 stars · 13 watchers · 54 forks · 5.01 MB

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights (ICLR 2021)

Home Page: https://clovaai.github.io/AdamP/

License: MIT License

Python 100.00%
deep-learning optimizer optimizer-algorithms machine-learning pytorch iclr2021

adamp's People

Contributors

bhheo, ildoonet, sanghyukchun


adamp's Issues

The difference between Adam and AdamP in the code

Hello,

I am trying to follow the major differences between Adam and AdamP, and the only difference I can spot is the part marked # Projection in the code.

I have been running some baseline cases with AdamP (with that projection part excluded) and with PyTorch's Adam. However, the results come out somewhat different.

Is there any part that I am missing?
(The initial states and random seeds are fixed to be identical.)

Thanks!
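
For reference, a minimal comparison along these lines might look like the sketch below. The toy model and data are hypothetical, and passing delta=0.0 is used here only as a crude way to keep the projection branch from firing (based on the threshold check quoted in a later issue); only the AdamP constructor itself comes from the project README.

    import torch
    from adamp import AdamP  # pip install adamp

    def run(optimizer_cls, **opt_kwargs):
        torch.manual_seed(0)                      # identical init and data for both runs
        model = torch.nn.Linear(16, 1)
        opt = optimizer_cls(model.parameters(), lr=1e-3, **opt_kwargs)
        x, y = torch.randn(64, 16), torch.randn(64, 1)
        for _ in range(100):
            opt.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            opt.step()
        return model.weight.detach().clone()

    w_adam = run(torch.optim.Adam)
    w_adamp = run(AdamP, delta=0.0)               # projection effectively disabled
    print((w_adam - w_adamp).abs().max())         # any gap comes from the remaining differences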

Deprecation warning in PyTorch 1.5 (and maybe above?)

Hi! I encountered the following warning while using AdamP in my project:

..\torch\csrc\utils\python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha)

Might this be relevant to the AdamP update?
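
For reference, the deprecated and the current overloads differ only in how the scaling factor is passed; a minimal, generic illustration (not the AdamP source itself):

    import torch

    exp_avg, grad, beta1 = torch.zeros(3), torch.ones(3), 0.9

    # Deprecated overload that triggers the warning above:
    # exp_avg.mul_(beta1).add_(1 - beta1, grad)

    # Current signature: pass the tensor first and the scale as keyword `alpha`.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)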

Builtin cosine_similarity

Hello, I am curious whether there is any reason the projection function uses a self-defined cosine similarity rather than the built-in F.cosine_similarity, as in the version below:

    # Assumes `import math` and `import torch.nn.functional as F` at module level.
    def _projection(self, p, grad, perturb, delta, wd_ratio, eps):
        wd = 1
        expand_size = [-1] + [1] * (len(p.shape) - 1)

        # Try the channel-wise view first, then the layer-wise view.
        for view_func in [self._channel_view, self._layer_view]:
            g_view = view_func(grad)
            p_view = view_func(p.data)

            # Built-in cosine similarity instead of the hand-rolled helper.
            cosine_sim = F.cosine_similarity(g_view, p_view, dim=1, eps=eps).abs_()

            # Weight looks scale-invariant: gradient and weight are nearly
            # orthogonal under this view, so project out the radial component.
            if cosine_sim.max() < delta / math.sqrt(p_view.size(1)):
                p_n = p.data / p_view.norm(dim=1).add_(eps).view(expand_size)
                perturb -= p_n * view_func(p_n * perturb).sum(dim=1).view(expand_size)
                wd = wd_ratio
                return perturb, wd

        return perturb, wd

Question about eq (5) in section 2.3

I appreciate your work!

I have a question about eq (5) in section 2.3.

[screenshot of Eq. (5) from the paper]

Please explain why w_{t+1} - w_t = \Delta w_t in the rightmost term.
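
If the paper defines \Delta w_t as the update applied at step t, the equality is just the definition of the update rule; a one-line sketch (my reading, not a quote from the paper):

    % assuming \Delta w_t denotes the optimizer step taken at iteration t
    w_{t+1} = w_t + \Delta w_t
    \quad\Longrightarrow\quad
    w_{t+1} - w_t = \Delta w_t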

Hyperparameters of AdamW

[screenshot of Table 2 from the paper]
Table 2 of your paper shows that AdamW on ImageNet is as good as SGDM, which is very exciting. Would you mind sharing the hyperparameters with us? Thanks!

  • I guess from your paper that ResNet + AdamW means AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001, amsgrad=False); is that right? However, I ran an experiment with that setting (written out as code below) and it came out two points lower than your result. I'm confused.

  • What are the hyperparameters of MobileNetV2 + AdamW?
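
For concreteness, the setting guessed in the first bullet would be written like this in PyTorch (this reflects the questioner's guess, not a configuration confirmed by the authors; the ResNet variant is a placeholder):

    import torch
    import torchvision

    model = torchvision.models.resnet50()     # placeholder; the exact ResNet used is not stated here
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.001,
        betas=(0.9, 0.999),
        eps=1e-08,
        weight_decay=0.0001,
        amsgrad=False,
    )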

Learning rate scheduler for AdamP?

Hi,

From the paper, you use a cosine learning rate schedule to train on ImageNet; did you apply it to both AdamP and SGDP?

What is your opinion on using a constant learning rate (or other schedules) for the Adam family? In my experience, step decay or cosine works better than a constant learning rate.

However, the advantage of adaptive optimizers is that we shouldn't need to tune the learning rate schedule manually; if we do need to tune it, what is the advantage of the Adam family?
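
For reference, a cosine schedule as commonly paired with these optimizers can be set up as sketched below (a generic example; the exact ImageNet schedule and any warmup used in the paper are not specified here):

    import torch
    from adamp import AdamP  # pip install adamp

    model = torch.nn.Linear(16, 10)           # placeholder model
    optimizer = AdamP(model.parameters(), lr=0.001, weight_decay=1e-2)
    epochs = 100
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for epoch in range(epochs):
        # ... run one training epoch here ...
        scheduler.step()                      # decay the learning rate along a cosine curve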

Here are good discussions

thank you

Good bug in your implementation of AdamP

I noticed that you never use the step_size (https://github.com/clovaai/AdamP/blob/master/adamp/adamp.py#L81) that applies bias_correction1 for beta1, which would inflate the learning rate in the initial steps of training (see https://arxiv.org/abs/2110.10828). Instead, you leave the learning rate as-is and only apply bias_correction2, which inflates the estimated variance early in training and thereby lowers the learning rate in the first few epochs. I don't think you should fix this bug, but it may be helpful to add a comment noting that you skip this step, and maybe even link to my paper arguing that this is not a bug :)
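
A small numerical illustration of the effect described above (generic Adam constants; this is not the adamp.py source):

    # Compare the textbook Adam step size (lr / bias_correction1) with a raw lr,
    # as would happen when the bias_correction1 factor is skipped.
    beta1, lr = 0.9, 1e-3

    for step in (1, 10, 100):
        bias_correction1 = 1 - beta1 ** step
        print(step, lr / bias_correction1, lr)   # the corrected step size is inflated early on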

Runtime: Adam vs AdamP

Hi,
Thank you for the code release.
I am trying to run MIRNet with AdamP instead of Adam.
However, the training time per epoch nearly doubles.
Is there any way to make it faster?

I tried two environments, Python 3.7 / PyTorch 1.1 / CUDA 9.0 and Python 3.7 / PyTorch 1.4 / CUDA 10.0, but both give the same speed.

Thanks
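
A quick way to isolate the optimizer overhead, independent of MIRNet, is a micro-benchmark along the lines of the sketch below (the toy model and step count are placeholders):

    import time
    import torch
    from adamp import AdamP  # pip install adamp

    def time_steps(opt_cls, steps=200):
        torch.manual_seed(0)
        model = torch.nn.Sequential(
            torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512))
        opt = opt_cls(model.parameters(), lr=1e-3)
        x = torch.randn(128, 512)
        start = time.perf_counter()
        for _ in range(steps):
            opt.zero_grad()
            model(x).sum().backward()
            opt.step()
        return time.perf_counter() - start

    print("Adam :", time_steps(torch.optim.Adam))
    print("AdamP:", time_steps(AdamP))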

Could it be equivalent to normalize the weights?

Thanks to the authors for the very interesting paper and analysis.

I was wondering whether an equivalent fix for the weight-norm growth could be to normalize the weights of the layers that precede normalization layers during training. For example, I would renormalize the weights every 10 mini-batches, so that the operation stays cheap.
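
Sketched as code, the proposal above might look like this (a hypothetical training-loop fragment; deciding which parameters are actually scale-invariant is left open):

    import torch

    @torch.no_grad()
    def renormalize(params, eps=1e-8):
        # Rescale each weight tensor to unit norm; for scale-invariant weights
        # only the direction matters, so this should not change the function.
        for p in params:
            p.div_(p.norm().clamp_min(eps))

    # Inside the training loop (hypothetical):
    # if step % 10 == 0:
    #     renormalize(scale_invariant_params)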

example params

What is the simplest example of params in optimizer = AdamP(params, lr=0.001, betas=(0.9, 0.999), weight_decay=1e-2)?
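
Per the standard PyTorch optimizer interface, params is typically just model.parameters(); a minimal sketch with a toy model (assuming the pip-installable adamp package):

    import torch
    from adamp import AdamP  # pip install adamp

    model = torch.nn.Linear(10, 2)   # any nn.Module works; its parameters are passed as `params`
    optimizer = AdamP(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=1e-2)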
