
Comments (25)

forever208 commented on June 29, 2024

Hello! I want to ask what value the loss converged to for the model you trained. I trained on my own dataset, but the generated images are all noise; I can't see any content in them at all.

About 0.055. ImageNet is the most time-consuming dataset to train on; I suggest you first try CIFAR-10 or the LSUN datasets.

Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely easy way to dramatically improve FID and training speed on top of guided-diffusion; feel free to take a look.


forever208 commented on June 29, 2024

@JawnHoan hi, if you still have this issue, I suggest you decrease the learning rate.

In my experiments, I use batch_size=128 for ImageNet 64x64, and lr=1e-4 causes this NaN issue.
Therefore, I changed the learning rate from 1e-4 to 3e-5 and the problem was solved.
Hope this is helpful.


unixpickle commented on June 29, 2024


ShoufaChen commented on June 29, 2024

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.
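
For anyone wondering about the multi-node part (the question also comes up later in this thread): torch.distributed.launch with --use_env starts nproc_per_node processes on each node and exposes RANK, WORLD_SIZE, and LOCAL_RANK as environment variables, while --master_addr/--master_port point every node at the same rendezvous host. Below is a generic sketch of the per-process setup such a launch relies on; it is not guided-diffusion's own dist_util, and setup_distributed is just an illustrative name.

import os

import torch
import torch.distributed as dist

def setup_distributed(backend="nccl"):
    # torch.distributed.launch --use_env (or torchrun) sets these variables.
    rank = int(os.environ["RANK"])              # global rank: 0 .. world_size - 1
    world_size = int(os.environ["WORLD_SIZE"])  # nnodes * nproc_per_node (4 * 8 = 32 here)
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node: 0 .. 7

    torch.cuda.set_device(local_rank)
    # MASTER_ADDR / MASTER_PORT come from --master_addr / --master_port.
    dist.init_process_group(backend=backend, init_method="env://",
                            rank=rank, world_size=world_size)
    return rank, world_size, local_rank

The NCCL_IB_GID_INDEX=3 in the command above is an NCCL environment variable that selects the GID index on RoCE/InfiniBand NICs, so inter-node communication in this setup presumably goes over that fabric.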


unixpickle commented on June 29, 2024

Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening?

Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?


unixpickle commented on June 29, 2024

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))


ShoufaChen commented on June 29, 2024

Thanks for your help.

I will patch this bug and try again. I will post my results in about 2 days.


realPasu commented on June 29, 2024

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))


ShoufaChen commented on June 29, 2024

I am now at 430,180 steps and haven't met any NaNs.


realPasu commented on June 29, 2024

That's so strange. I'm training a 256*256 model with batch size 256 and learning rate 1e-4 on 8 nodes.
You say that you didn't meet NaNs. Do you mean that you don't meet NaNs at all anymore, or that you no longer hit the problem of an endlessly decreasing lg_loss_scale even when a NaN occurs?
After applying the change, my training log still looks similar to the original one. My training is resumed from a partly trained model with about 300k iterations. During training I meet NaNs after a few thousand iterations; in most cases this is handled by decreasing lg_loss_scale, but training eventually fails after about 10-20k iterations of this, and I have to stop and resume a new run from the last healthy checkpoint.


ShoufaChen commented on June 29, 2024

I am training a 128*128 ImageNet model.


forever208 commented on June 29, 2024

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44
If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

Me too; I still get NaN losses and the training fails when training on ImageNet 64*64.


forever208 commented on June 29, 2024

@ShoufaChen, hi, have you thoroughly solved this issue? Have you seen any NaN losses or training failures since?


HoJ-Onle commented on June 29, 2024

Hello! I also have this problem. Did you solve it? In my case the program still runs, and maybe the loss is not broken yet, but it keeps telling me "Found NaN".

----------------------------
| lg_loss_scale | -909     |
| loss          | 0.115    |
| loss_q0       | 0.261    |
| loss_q1       | 0.0599   |
| loss_q2       | 0.0339   |
| loss_q3       | 0.0241   |
| mse           | 0.111    |
| mse_q0        | 0.25     |
| mse_q1        | 0.0594   |
| mse_q2        | 0.0336   |
| mse_q3        | 0.0237   |
| samples       | 1.98e+03 |
| step          | 990      |
| vb            | 0.00385  |
| vb_q0         | 0.0104   |
| vb_q1         | 0.00048  |
| vb_q2         | 0.00031  |
| vb_q3         | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944

Looking forward to your reply.


forever208 commented on June 29, 2024

@JawnHoan My solution was to re-clone the whole repo and implement my own method on top of the fresh copy...

I know it is not a good idea, but it works for me.


HoJ-Onle commented on June 29, 2024

@JawnHoan My solution was to re-clone the whole repo and implement my own method on top of the fresh copy...

I know it is not a good idea, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists.
I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening while training doesn't stop.


ZGCTroy commented on June 29, 2024

@JawnHoan My solution was to re-clone the whole repo and implement my own method on top of the fresh copy...
I know it is not a good idea, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening while training doesn't stop.

I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how the NaN is handled. However, if the program keeps finding NaNs, it means that decreasing lg_loss_scale is not able to fix the problem.
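
To make that concrete, here is a minimal sketch of dynamic loss scaling in plain PyTorch. It illustrates the general technique rather than guided-diffusion's actual fp16_util code (fp16_step, state, and scale_growth are made-up names, and a real fp16 setup would also keep fp32 master copies of the weights). The fractional lg_loss_scale values in the logs in this thread, e.g. 22.278000000004006, suggest the scale is nudged up slightly after every clean step and dropped by 1 whenever a non-finite gradient shows up.

import torch

def fp16_step(model, optimizer, loss, state, scale_growth=1e-3):
    # One optimizer step with dynamic loss scaling; state holds "lg_loss_scale".
    scale = 2 ** state["lg_loss_scale"]
    optimizer.zero_grad()
    (loss * scale).backward()  # scale the loss up so fp16 gradients don't underflow

    grads_finite = all(
        p.grad is None or torch.isfinite(p.grad).all()
        for p in model.parameters()
    )
    if not grads_finite:
        # Overflow: skip this step entirely and halve the effective scale.
        state["lg_loss_scale"] -= 1
        print(f"Found NaN, decreased lg_loss_scale to {state['lg_loss_scale']}")
        return False

    # Un-scale every gradient before stepping (the patch discussed above does the
    # same for all master params instead of only master_params[0]).
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(1.0 / scale)
    optimizer.step()

    # Creep the scale back up after each clean step.
    state["lg_loss_scale"] += scale_growth
    return True

With a loop like this, an occasional "Found NaN" followed by recovery (lg_loss_scale staying around 22-23, as in the logs below) is harmless; an lg_loss_scale that keeps falling without recovering (like the -909 reported earlier) means the model itself is producing non-finite values, and lowering the scale cannot help.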


fido20160817 commented on June 29, 2024

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" is triggered; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the training?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...


fido20160817 commented on June 29, 2024

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.

Hi, how do you achieve multi-node multi-GPU training? Did you change the code? I tried multi-node multi-GPU training with another program, but it failed because of slow communication between the nodes. Did you notice this? Could you share some experience with multi-node multi-GPU training?


forever208 commented on June 29, 2024

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" is triggered; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the training?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...

@fido20160817 it is normal, no worries about it


fido20160817 commented on June 29, 2024

Thanks!🤝


ONobody commented on June 29, 2024

@forever208 Hello, may I get your contact information to ask some questions? Thank you.


forever208 commented on June 29, 2024

Hi @ONobody, of course, my email: [email protected]


hxy-123-coder commented on June 29, 2024

Hello! I want to ask what value the loss converged to for the model you trained. I trained on my own dataset, but the generated images are all noise; I can't see any content in them at all.


hxy-123-coder commented on June 29, 2024

Thanks a lot.

