sail-sg / adan Goto Github PK

View Code? Open in Web Editor NEW

726.0 7.0 63.0 1.33 MB

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

License: Apache License 2.0

Python 87.85% Shell 0.85% Cuda 7.11% C++ 4.19%

adan bert-model convnext deep-learning fairseq mae optimizer resnet timm vit

adan's People

Contributors

Stargazers

Watchers

adan's Issues

Whether it is applied to the training of GAN network？

Thank you for your impressive work.
I would like to ask if this work applied to unsupervised 3DGAN has the same improvement as other tasks？

Beta values are not same

According to your paper, you used adan with β1 = 0.02, β2 = 0.01, and β3 = 0.01 when fine-tuning Bert. But in your config file, they are all 0.9x like here. Which is right?

About the convergence trend comparison with Adamw in ViT-H

Hi,
Thank you very much for your brilliant work on Adan!
And from you paper, it said Adan should get a lower loss (both Train and test) than Adamw according to Figure 1. However, I got a higher training loss with Adan than AdamW in ViT-H:

Steps	Adamw_train_loss	Adan_train_loss
200	6.9077	6.9077
400	6.9074	6.9075
600	6.9068	6.9073
800	6.9061	6.907
1000	6.905	6.9064
1200	6.9036	6.9056
1400	6.9014	6.9044
1600	6.899	6.9028
1800	6.8953	6.9003
2000	6.8911	6.8971
2200	6.8848	6.8929
2400	6.8789	6.8893
2600	6.8699	6.8843
2800	6.8626	6.8805
3000	6.8528	6.8744
3200	6.8402	6.868
3400	6.8293	6.862
3600	6.8172	6.8547
3800	6.7989	6.8465
4000	6.7913	6.8405

I used the same HPs as AdamW and only changed beta from (0.9, 0.999) to (0.9, 0.92, 0.999).
I only trained for few steps to see the trend. But it seems the loss gap from AdamW is quite big, should I change other HPs to better using Adan? How can I get a lower Loss than AdamW?
I noticed that Adan prefers a large batch size in Vision tasks, should we using a larger batch size?
Or should I train with more steps to see the trend?
Thank you!

Some questions about learning rate.

Thank you for your brilliant work.

I want to ask some questions about Adan's learning rate.

Does Adan use learning rate decay in the paper?
Is the Adan optimizer sensitive to the initial learning rate?
How to set the learning rate compared with adam under the same task conditions?

Thank you!

About the pre-trained model

Could you please release the pre-trained ViT-S based on MAE?

How to implement Adan optimizer in Yolov7?

Can I use Adan optimizer in Yolov7, if yes, what are the steps to implement this.

Adan相比于SGD在前 74 epochs保持领先，但是后续收敛变慢，我改如何调整lr等超参数？

optimizer = Adan(pg0,lr = 1e-3, betas=(0.98, 0.92, 0.99), eps=1.0e-08, weight_decay=0.02)，

Black is SGD,pink is Adan.

valueError: not enough values to unpack (expected 3, got 2)

Hello! Thank you for your work. Now I have a problem.
I don't know how to solve it
Traceback (most recent call last):
File "/home/anaconda/envs/main/Lib/python5.8/site-packages/tonch/optim/optimizer.py" line 113,in wrapperreturn func(*args,**kwargs)
File "/home/anaconda/envs/ main/Lit/python3.8/site-packages/torch/autogpad/gnad_mode.py"，line 27，in decorate_contextreturn func(*args，**kwargs)
File "/home/main/adan.py", line 121,in step
beta1, beta2, beta3 = group [ ' betas ']
valueError: not enough values to unpack (expected 3, got 2)

why there is no sgd-style implementation？

why there is no sgd-style implementation or experiments？
（lemma 1 in the paper)

Some questions in step function

Thank you for your impressive work. I have some questions in your adan.py about step function.
In line 179-180, that is:

for p, copy_grad in zip(group['params'], copy_grads):
    self.state[p]['pre_grad'] = copy_grad

It seems that you want to save the corresponding pre_grad. But I have the following bug:

I think this is because the former contains all parameters, while the latter only contains parameters with gradient. So I made the following changes:

for p, copy_grad in zip(params_with_grad, copy_grads):
    self.state[p]['pre_grad'] = copy_grad

With this modification, I can run normally. Do you think what problems I have encountered and that this modification is correct? @XingyuXie

`no_prox` Flag

Hi there,

I'm just wondering about the no_prox setting.

First of all, does it stand for "approximation"?

In the paper, Algorithm 1, line 7 corresponds to no_prox=True —
why is the default setting in this repo False? Why do you include this option at all?

Were the experiments in the paper done as the algorithm states, or with no_prox=True?

Again, I really appreciate the work! Am just struggling with this detail.

`torch._foreach...` implementation

Hi, very interesting work!
The only problem i see is that your optimizer is slower that sgd/adamw which may discourage some people from using it. Do you plan adding an implementation using torch._foreach... functions? Examples could be seen in torch.optim. This would significantly speed-up your optimizer while having literally no drawbacks.

If you're interested i could take a look and implement this myself, but it would be in 1-2 weeks when i'm less busy

module 'fused_adan' has no attribute 'adan_multi_tensor'

how to solve this problem module 'fused_adan' has no attribute 'adan_multi_tensor'

Step 2 of Usage

Step 2 of Usage in the documentation says

from adam import Adan

I was wondering if it you meant

from adan import Adan

Settings for instruction-tuning

Hi, Adan是一个性能十分优秀的优化器，谢谢你们的工作。

但我最近在尝试用Adan进行指令微调时，发现loss曲线很漂亮，但是下游任务表现（GSM-8k）不如预期。
同样的数据处理和评测，AdamW大概9.63，Adan只有5.08左右。

AdamW超参数：weight_decay 0.01, lr 2e-5
Adan超参数：weight_decay 0.02，按照repo的建议lr尝试了2e-4 1e-4, GSM8k都比较低
lr scheduler都是3%升到最高然后下降到0

AdamW的训练loss曲线：

Adan的训练loss曲线：

使用的代码：

from adan import Adan
optimizer = Adan(model.parameters(), lr=args.lr, weight_decay=0.02, foreach=True, fused=True)

想知道有没有一些对指令微调的超参设置建议？

在我的cnn模型中，lr=0.01时，在20-30epoch，map可以提升的很快但是后续会成为NAN。但是如果使用0.001不会直接为NAN，但是效果不好，请问这个现象代表着什么问题？谢谢！

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

How to install without CUDA_HOME environment variable? ~~For example https://github.com/mapillary/inplace_abn don`t ask about CUDA_HOME.~~

xxx@xxx:~$ python3 -m pip install git+https://github.com/sail-sg/Adan.git
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/sail-sg/Adan.git
  Cloning https://github.com/sail-sg/Adan.git to /tmp/pip-req-build-zs78qhzq
  Running command git clone --filter=blob:none --quiet https://github.com/sail-sg/Adan.git /tmp/pip-req-build-zs78qhzq
  Resolved https://github.com/sail-sg/Adan.git to commit 8f559205f67e565b3bea09554354d69000bd819c
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-zs78qhzq/setup.py", line 5, in <module>
          cuda_extension = CUDAExtension(
        File "/home/xxx/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1047, in CUDAExtension
          library_dirs += library_paths(cuda=True)
        File "/home/xxx/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
        File "/home/xxx/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2230, in _join_cuda_home
          raise EnvironmentError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

如何设置Adan学习率

您好请问您是否有研究过将Adan用于Diffusion模型训练，其学习率应该如何设置，可否与使用AdamW的学习率一样？

Concrete weight decay configuration for GPT-2 pretraining

Dear authors:

According to the README.md of this amazing project, the weight_decay param should be 0.02, while in the configuration file attached in #32, the WD seems to be 0.05. Also, only beta3 is explicitly specified in the aforementioned configuration file, I can only inspect from https://github.com/sail-sg/Adan/blob/main/gpt2/README.md that

beta1 = 0.98
beta2 = 0.92

However, weight_decay=0.02 together with the other hyperparams above yields an inferior val loss curve compared with (that of the AdamW baseline)[https://github.com/karpathy/nanoGPT/blob/master/config/train_gpt2.py]. Thus, do you have any suggestion about the hyperparams I mentioned? Thanks!

HumanEval shall not be used for training.

HumanEval is a evaluation dataset, you shouldn't train on it and evaluate on exactly the same dataset.

Instead, you can use the github part in the Pile, or other coding source data for training. Before training, make sure the training set doesn't contain HumanEval to avoid probable data leakage.

block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.

Hi, i try Adan on a keypoints task, i got error like this:

./aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [97,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [2,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [2,0,0], thread: [33,0,0] Assertion `input_val >= zero && input_val <= one` fathread: [51,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [52,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [53,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [54,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [55,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [56,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [57,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [58,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [94,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [95,0,0] Assertion `input_val >= zero && input_val <= one` failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

my config:

BASE_LR: 0.05 # maybe 0.012?
STEPS: (40000, 65000, 70000, 85000) # step point need to carefully check
WARMUP_FACTOR: 0.001
# WARMUP_ITERS: 1200
WARMUP_ITERS: 3500
MAX_ITER: 900000
# LR_SCHEDULER_NAME: "WarmupCosineLR"
LR_SCHEDULER_NAME: "WarmupMultiStepLR"
WEIGHT_DECAY: 0.02
MOMENTUM: 0.9
BACKBONE_MULTIPLIER: 0.9
OPTIMIZER: "Adan"

this is on detectron2, config on 8 GPU

why does it happen?

RuntimeError: The detected CUDA version (12.2) mismatches the version that was used to compile PyTorch (11.8).

Hi authors,

I am trying to install Adan with the described command: "python3 -m pip install git+https://github.com/sail-sg/Adan.git", however, I couldn't install it due to the error below, I checked and already saw that torch is installed and worked, do you have any suggestion to install it? I have no idea how to fix this error.

Building wheels for collected packages: adan Building wheel for adan (setup.py) ... error error: subprocess-exited-with-error × python setup.py bdist_wheel did not run successfully. │ exit code: 1 ╰─> [54 lines of output] running bdist_wheel /root/miniconda3/envs/neurips24/lib/python3.8/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend. warnings.warn(msg.format('we could not find ninja.')) running build running build_py creating build creating build/lib.linux-x86_64-cpython-38 copying adan.py -> build/lib.linux-x86_64-cpython-38 running build_ext Traceback (most recent call last): File "", line 2, in File "", line 34, in File "/tmp/pip-req-build-wcs6gasc/setup.py", line 20, in setup( File "/root/miniconda3/envs/xxx/lib/python3.8/site-packages/setuptools/init.py", line 103, in setup return distutils.core.setup(**attrs) File "/root/miniconda3/envs/xxx/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup return run_commands(dist) ...
raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda)) RuntimeError: The detected CUDA version (12.2) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions. [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for adan Running setup.py clean for adan Failed to build adan ERROR: Could not build wheels for adan, which is required to install pyproject.toml-based projects

Check with the torch, it is ok:

Typo in the paper

Looking at arxiv version. In Appendix C in the last two lines of the Eq. 10 and the first line of the following update rule: \theta in the last term should have index k-1 instead of k.

(Not sure if this is the appropriate place to report paper typos, please tell me if there is a more sutable one).

Training yolov5 model appears nan

Embedding tensors/weight update unsupported

Hello!

I think I found a bug in the Adan optimizer, which affects embedding tables.

I implemented Adan optimzier in Tensorflow 2. You could find the implementation here

I wanted to keep the implementation as close to the original code as possible. However, there are different approaches for updating "sparse" tensors in TensorFlow and PyTorch. An example of a "sparse" tensor is an embedding matrix. Pytorch treats "sparse" data as if it was dense. TensorFlow has two functions for making updates - _resource_apply_dense for dense and _resource_apply_sparse for "sparse".

I decided to test the correctness of my implementation using the following logic:

Define a function to optimize. In case of "dense" optimization, it's simple linear regression, in case of "sparse" - make all embeddings equal to 1 (see tf_adan/test_adan_*.py)
Generate random input data and initial weights matrix.
Optimize weights matrix using official and my implementation. Optimziers have same hparams.
Compare loss history and weights after optimization. If they are equal - my implementation is correct.

I noticed that loss history and weights after optimization is the same for dense parameters. However, my implementation shows a better loss for embedding params weights after optimization isn't the same. It's especially noticeable in cases when the batch contains only a few possible categories. For example, categorical features have 2k unique values, while the batch size equals 100:

source

I think the source of the bug is the following:

For "new" gradients, i.e., categorical values gradients, for which we haven't made an update before, we replace the previous gradient with the current gradient. This logic is implemented here:

Adan/adan.py

Line 130 in d864647

if 'pre_grad' not in state or group['step'] == 1:

As I understand, prev_grad for all "new" gradients on step>1 won't be replaced with the current gradient.

The other reason is that gradient params (exp_avg, exp_avg_sq, exp_avg_diff) are updated regardless of the presence of the category in the batch. That means that for categories

I'm unsure if it's a bug in your implementation or in mine. I also tested Adam optimizer in tf and torch, see:

https://github.com/DenisVorotyntsev/Adan/blob/02e66241a98958152315ae5358ee6f364f092f8b/tf_adan/utils.py#L37

Losses for Adam optimizers in tf/torch are almost the same.

What do you think? Looking forward to your thoughts.

Is there a TensorFlow/Keras implementation?

Is there a TensorFlow/Keras implementation of Adan? If no official version, do you know of any third-party implementation? Or alternatively, how many lines would you expect an implementation to have? (If not much I may do it myself and ask for your review if you have time.)

Install Error

Hey guys, I had some problems when I installed FusedAdan.
The information is below here. It reminds me that I don't have nvcc, but actually I have. Please help me.

(MDT) root@ubuntu20:~/Adan# pip install .
Processing /root/Adan
Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from adan==0.0.2) (2.2.1+cu118)
Requirement already satisfied: filelock in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (3.9.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (4.8.0)
Requirement already satisfied: sympy in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (1.12)
Requirement already satisfied: networkx in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (3.2.1)
Requirement already satisfied: jinja2 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (3.1.2)
Requirement already satisfied: fsspec in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (2024.2.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.8.89 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.89)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.8.89 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.89)
Requirement already satisfied: nvidia-cuda-cupti-cu11==11.8.87 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.87)
Requirement already satisfied: nvidia-cudnn-cu11==8.7.0.84 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (8.7.0.84)
Requirement already satisfied: nvidia-cublas-cu11==11.11.3.6 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.11.3.6)
Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (10.9.0.58)
Requirement already satisfied: nvidia-curand-cu11==10.3.0.86 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (10.3.0.86)
Requirement already satisfied: nvidia-cusolver-cu11==11.4.1.48 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.4.1.48)
Requirement already satisfied: nvidia-cusparse-cu11==11.7.5.86 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.7.5.86)
Requirement already satisfied: nvidia-nccl-cu11==2.19.3 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (2.19.3)
Requirement already satisfied: nvidia-nvtx-cu11==11.8.86 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.86)
Requirement already satisfied: triton==2.2.0 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (2.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from jinja2->torch->adan==0.0.2) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from sympy->torch->adan==0.0.2) (1.3.0)
Building wheels for collected packages: adan
Building wheel for adan (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [7 lines of output]
running bdist_wheel
running build
running build_py
creating build/lib.linux-x86_64-cpython-310
copying adan.py -> build/lib.linux-x86_64-cpython-310
running build_ext
error: [Errno 2] No such file or directory: ':/usr/local/cuda-11.8/bin/nvcc'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for adan
Running setup.py clean for adan
Failed to build adan
ERROR: Could not build wheels for adan, which is required to install pyproject.toml-based projects

processing data for BERT experiment

The following steps are modified from Fairseq-Roberta. For completeness, we list some key steps here.

I would like to ask why you modified the dataset settings? In the original fairseq, it seems we can just download the raw data.

https://github.com/sail-sg/Adan/tree/main/NLP/BERT#ii-generate-raw-data
Can you share the code for generating raw code?

Gradient clipping option in DeepSpeed

Hi authors, I found there is an option called "gradient_clipping" in DeepSpeed's configuration options, which seems to be a clipping-by-norm method, too. Does this option has any potential interaction with the max_grad_norm param of Adan?

Restarting strategy

Hey, the repository does not implement the momentum restarting strategy from what I can tell.

If this is something you still have available, would you be so kind to add it in here? It would be super great to optimize Adan training further. :)

\epsilon not implemented as in the paper

Hi there,
$\epsilon$ is within the square root in the paper (L6 in Algorithm 1), but in the code, it is outside of the square root. Could you expand on the reason for this?

The BERT finetuning get_data file error?

Thanks for the inspiring work! However the TASK2PATH paths in download_glue_data.py under directory ./NLP/BERT/ seems to have denied permission to access. e.g. "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4"

GPU type and GPU nums and total training time on Transformer-XL, GPT-2

Hi! Thank you for sharing your code.

I would like to know for each Transformer-XL, GPT-2 settings.

which GPU did you use?
how many GPUs are used for training
total training time

I saw logs, but I didn't figure out the exact number
https://github.com/sail-sg/Adan/tree/main/gpt2#results-and-logs-on-gpt2-345m
https://github.com/sail-sg/Adan/blob/main/gpt2/pretrain.sh
https://github.com/sail-sg/Adan/tree/main/NLP/Transformer-XL/exp_results

Thank you!

Deepspeed Integration

Hi~Thanks for your excellent work. Adan optimzier has rechived great success in my different experiments.
However, I really want any suggestions for integrating Adan with deepspeed.
I tried using the ds_config with adamw and simply replacing adamw with adan (of course, I adjusted the learning rate and weight decay correspondingly), but it's pretty slow.
Thank you in advance.

Suggestions for applying to visual dense prediction tasks.

HI~Thanks for you excitring work. I would like to know the performance of Adan for visual dense prediction tasks. I notice you mention that Adan is suitable for large batchsize. So I wondered if it would also work better for visual dense prediction tasks, which are usually not possible with a large batchsize. I have tried Adan in several tasks, but the results are similar or even inferior to its sgd/adamw counterparts. I have followed the best practices you mention in the paper and repo and was wondering if you have done similar experiments or if you have suggestions for tuning the parameters.

Thanks!

Best.

sail-sg / adan Goto Github PK

adan's People

Contributors

Stargazers

Watchers

Forkers

adan's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs