Hyperparameters of AdamW about adamp HOT 7 CLOSED

clovaai commented on May 26, 2024

Hyperparameters of AdamW

from adamp.

Comments (7)

bhheo commented on May 26, 2024 5

Thank you for your interest in our paper.

For torch.optim.AdamW, you have to use weight_decay=0.1.
The AdamW paper decouples the weight decay, which means the decay step is w = (1 - weight_decay) * w.
The PyTorch implementation, however, uses w = (1 - lr * weight_decay) * w:
https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/optim/adamw.py#L73
This makes it easy to apply the learning rate scheduler to the weight decay as well, but it means the weight_decay value has to be rescaled.

In the paper, we followed the notation of the AdamW paper.
So lr=1e-3, weight_decay=0.1 in PyTorch corresponds to a weight decay of 1e-4 in the paper's notation (1e-3 * 0.1 = 1e-4).
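
The conversion between the two notations can be sketched as below. This is only an illustration of the arithmetic described above, assuming the relation paper_wd = lr * weight_decay at the base learning rate; the function names are hypothetical and not part of PyTorch or adamp.

```python
import math


def pytorch_wd_from_paper(paper_wd: float, lr: float) -> float:
    """Value to pass as weight_decay to torch.optim.AdamW so that the
    per-step factor (1 - lr * weight_decay) equals (1 - paper_wd)."""
    return paper_wd / lr


def paper_wd_from_pytorch(weight_decay: float, lr: float) -> float:
    """Weight decay in the paper's notation for a given PyTorch setting."""
    return lr * weight_decay


if __name__ == "__main__":
    # Setting from this comment: lr=1e-3, paper weight decay 1e-4.
    wd = pytorch_wd_from_paper(paper_wd=1e-4, lr=1e-3)
    print(math.isclose(wd, 0.1))
    # Round trip back to the paper's notation.
    print(math.isclose(paper_wd_from_pytorch(wd, lr=1e-3), 1e-4))
```

Note that the relation is stated for the base learning rate; with a scheduler, the effective per-step decay in PyTorch changes along with lr.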

You can find a similar setting in the NovoGrad paper:
https://arxiv.org/pdf/1905.11286.pdf

bhheo commented on May 26, 2024 2

5e-3 is correct.
torch.optim.AdamW(param, lr=2e-3, weight_decay=5e-3)

This is 1e-5 in the paper's notation (2e-3 * 5e-3 = 1e-5).

bhheo commented on May 26, 2024

Sorry, I missed the MobileNetV2 setting.
We used lr=2e-3, wd=5e-3, batch_size=1024 for MobileNetV2.

junlinqu commented on May 26, 2024

Thank you for the quick response.

I'm still a little confused about MobileNet. For MobileNetV2, is the PyTorch parameter for weight decay 2.5 or 5e-3?

Thx

junlinqu commented on May 26, 2024

I see, thanks!

junlinqu commented on May 26, 2024

Sorry, I missed wd_ratio and delta in AdamP. I understand that AdamW and AdamP use the same hyperparameters, except for wd_ratio and delta.
For MobileNetV2 with AdamP in Table 2, are the hyperparameters
AdamP(params, lr=2e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=5e-3, delta=0.1, wd_ratio=0.1, nesterov=True)?

bhheo commented on May 26, 2024

You are correct.
However, we didn't use nesterov, for a fair comparison with AdamW.
I think nesterov=True would give better performance.

If you want the same setting as ours, use
AdamP(params, lr=2e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=5e-3, delta=0.1, wd_ratio=0.1, nesterov=False)
with epochs=150, label_smoothing=0.1.
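
For reference, the MobileNetV2 settings mentioned in this thread can be collected in one place, roughly as follows. This is just a sketch: the dictionary names are made up for illustration, and the values are copied from the comments above rather than from the training scripts.

```python
# Optimizer arguments shared by AdamW and AdamP (values from this thread).
SHARED_KWARGS = dict(lr=2e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=5e-3)

# AdamW baseline: only the shared arguments.
ADAMW_KWARGS = dict(SHARED_KWARGS)

# AdamP: the shared arguments plus delta, wd_ratio and nesterov.
ADAMP_KWARGS = dict(SHARED_KWARGS, delta=0.1, wd_ratio=0.1, nesterov=False)

# Training settings quoted in the thread for MobileNetV2.
TRAINING = dict(epochs=150, label_smoothing=0.1, batch_size=1024)

if __name__ == "__main__":
    # Arguments that exist only on the AdamP side.
    print(sorted(set(ADAMP_KWARGS) - set(ADAMW_KWARGS)))
    # Weight decay in the paper's notation: lr * weight_decay.
    print(ADAMP_KWARGS["lr"] * ADAMP_KWARGS["weight_decay"])
```

Assuming torch and the adamp package are installed, these would be passed as torch.optim.AdamW(params, **ADAMW_KWARGS) and AdamP(params, **ADAMP_KWARGS).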
