
Comments (25)

forever208 commented on June 29, 2024

Hello! I want to ask what value the loss converged to for the model you trained. I trained on my own dataset, but the generated images are all noise; I can't see any content in them at all.

About 0.055. ImageNet is the most time-consuming dataset to train on; I suggest you first try CIFAR-10 or the LSUN datasets.

Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely easy way to dramatically improve FID and training speed on top of guided-diffusion; feel free to take a look.


forever208 commented on June 29, 2024

@JawnHoan hi, if you still have this issue, I suggest you decrease the learning rate.

In my experiments, I use batch_size=128 for ImageNet 64x64, and lr=1e-4 causes this NaN issue.
Therefore, I changed the learning rate from 1e-4 to 3e-5 and the problem was solved.
Hope this is helpful.


unixpickle commented on June 29, 2024


ShoufaChen commented on June 29, 2024

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.
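
For anyone wondering about the multi-node part (the question also comes up later in this thread): torch.distributed.launch with --use_env starts nproc_per_node processes on each node and exposes RANK, WORLD_SIZE, and LOCAL_RANK as environment variables, while --master_addr/--master_port point every node at the same rendezvous host. Below is a generic sketch of the per-process setup such a launch relies on; it is not guided-diffusion's own dist_util, and setup_distributed is just an illustrative name.

import os

import torch
import torch.distributed as dist

def setup_distributed(backend="nccl"):
    # torch.distributed.launch --use_env (or torchrun) sets these variables.
    rank = int(os.environ["RANK"])              # global rank: 0 .. world_size - 1
    world_size = int(os.environ["WORLD_SIZE"])  # nnodes * nproc_per_node (4 * 8 = 32 here)
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node: 0 .. 7

    torch.cuda.set_device(local_rank)
    # MASTER_ADDR / MASTER_PORT come from --master_addr / --master_port.
    dist.init_process_group(backend=backend, init_method="env://",
                            rank=rank, world_size=world_size)
    return rank, world_size, local_rank

The NCCL_IB_GID_INDEX=3 in the command above is an NCCL environment variable that selects the GID index on RoCE/InfiniBand NICs, so inter-node communication in this setup presumably goes over that fabric.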


unixpickle commented on June 29, 2024

Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening?

Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?


unixpickle commented on June 29, 2024

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))


ShoufaChen commented on June 29, 2024

Thanks for your help.

I will patch this bug and try again. I will post my results in about 2 days.


realPasu commented on June 29, 2024

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))


ShoufaChen commented on June 29, 2024

I am now at 430,180 steps and haven't met any NaNs.


realPasu commented on June 29, 2024

That's so strange. I'm training a 256*256 model with batch size 256 and learning rate 1e-4 on 8 nodes.
You say that you didn't meet NaNs. Do you mean that you don't meet NaNs at all anymore, or that you no longer hit the problem of an endlessly decreasing lg_loss_scale even when a NaN occurs?
After applying the change, my training log still looks similar to the original one. My training is resumed from a partly trained model with about 300k iterations. During training I meet NaNs after a few thousand iterations; in most cases this is handled by decreasing lg_loss_scale, but training eventually fails after about 10-20k iterations of this, and I have to stop and resume a new run from the last healthy checkpoint.


ShoufaChen commented on June 29, 2024

I am training a 128*128 ImageNet model.


forever208 commented on June 29, 2024

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44
If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

Me too; I still get NaN losses and the training fails when training on ImageNet 64*64.


forever208 commented on June 29, 2024

@ShoufaChen, hi, have you thoroughly solved this issue? Have you seen any NaN losses or training failures since?


HoJ-Onle commented on June 29, 2024

Hello! I also have this problem. Did you solve it? In my case the program still runs, and maybe the loss is not broken yet, but it keeps telling me "Found NaN".

----------------------------
| lg_loss_scale | -909     |
| loss          | 0.115    |
| loss_q0       | 0.261    |
| loss_q1       | 0.0599   |
| loss_q2       | 0.0339   |
| loss_q3       | 0.0241   |
| mse           | 0.111    |
| mse_q0        | 0.25     |
| mse_q1        | 0.0594   |
| mse_q2        | 0.0336   |
| mse_q3        | 0.0237   |
| samples       | 1.98e+03 |
| step          | 990      |
| vb            | 0.00385  |
| vb_q0         | 0.0104   |
| vb_q1         | 0.00048  |
| vb_q2         | 0.00031  |
| vb_q3         | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944

Looking forward to your reply.


forever208 commented on June 29, 2024

@JawnHoan My solution was to re-clone the whole repo and implement my own method on top of the fresh copy...

I know it is not a good idea, but it works for me.


HoJ-Onle commented on June 29, 2024

@JawnHoan My solution was to re-clone the whole repo and implement my own method on top of the fresh copy...

I know it is not a good idea, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists.
I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening while training doesn't stop.


ZGCTroy commented on June 29, 2024

@JawnHoan My solution was to re-clone the whole repo and implement my own method on top of the fresh copy...
I know it is not a good idea, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening while training doesn't stop.

I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how the NaN is handled. However, if the program keeps finding NaNs, it means that decreasing lg_loss_scale is not able to fix the problem.
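
To make that concrete, here is a minimal sketch of dynamic loss scaling in plain PyTorch. It illustrates the general technique rather than guided-diffusion's actual fp16_util code (fp16_step, state, and scale_growth are made-up names, and a real fp16 setup would also keep fp32 master copies of the weights). The fractional lg_loss_scale values in the logs in this thread, e.g. 22.278000000004006, suggest the scale is nudged up slightly after every clean step and dropped by 1 whenever a non-finite gradient shows up.

import torch

def fp16_step(model, optimizer, loss, state, scale_growth=1e-3):
    # One optimizer step with dynamic loss scaling; state holds "lg_loss_scale".
    scale = 2 ** state["lg_loss_scale"]
    optimizer.zero_grad()
    (loss * scale).backward()  # scale the loss up so fp16 gradients don't underflow

    grads_finite = all(
        p.grad is None or torch.isfinite(p.grad).all()
        for p in model.parameters()
    )
    if not grads_finite:
        # Overflow: skip this step entirely and halve the effective scale.
        state["lg_loss_scale"] -= 1
        print(f"Found NaN, decreased lg_loss_scale to {state['lg_loss_scale']}")
        return False

    # Un-scale every gradient before stepping (the patch discussed above does the
    # same for all master params instead of only master_params[0]).
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(1.0 / scale)
    optimizer.step()

    # Creep the scale back up after each clean step.
    state["lg_loss_scale"] += scale_growth
    return True

With a loop like this, an occasional "Found NaN" followed by recovery (lg_loss_scale staying around 22-23, as in the logs below) is harmless; an lg_loss_scale that keeps falling without recovering (like the -909 reported earlier) means the model itself is producing non-finite values, and lowering the scale cannot help.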


fido20160817 commented on June 29, 2024

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" is triggered; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the training?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...


fido20160817 commented on June 29, 2024

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.

Hi, how do you achieve multi-node multi-GPU training? Did you change the code? I tried multi-node multi-GPU training with another program, but it failed because of slow communication between the nodes. Did you notice this? Could you share some experience with multi-node multi-GPU training?


forever208 commented on June 29, 2024

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" is triggered; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the training?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...

@fido20160817 it is normal, no worries about it


fido20160817 commented on June 29, 2024

Thanks!🤝


ONobody commented on June 29, 2024

@forever208 Hello, may I get your contact information to ask some questions? Thank you.


forever208 commented on June 29, 2024

Hi @ONobody, of course, my email: [email protected]


hxy-123-coder commented on June 29, 2024

Hello! I want to ask what value the loss converged to for the model you trained. I trained on my own dataset, but the generated images are all noise; I can't see any content in them at all.


hxy-123-coder commented on June 29, 2024

Thanks a lot.

