Hi I am trying to train your model with the provided config for NTU-60 XSUB with --hal


Function 'LogSoftmaxBackward' returned nan values in its 0th output (ms-g3d, closed, 4 comments)

kenziyuliu commented on May 18, 2024

Function 'LogSoftmaxBackward' returned nan values in its 0th output


Comments (4)

snknitin commented on May 18, 2024

If the error is in the 0th output, your model's predictions contain NaNs, most likely because the weights have not settled yet. So the problem is not your inputs but your predictions, possibly because of an overflow or underflow. Any loss function will then return tensor(nan). One workaround is to check whether the loss is NaN and substitute a small constant so the weights can keep adjusting:

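import torch

criterion = SomeLossFunc()  # placeholder for your loss function
eps = 1e-6
loss = criterion(preds, targets)
# Replace a NaN loss with a small constant; keep the loss as a tensor
# (calling .item() would detach it and break loss.backward()).
if torch.isnan(loss):
    loss = torch.full_like(loss, eps)
loss = loss + L1_loss + ...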
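
To illustrate how a non-finite prediction can turn into NaN inside a log-softmax, here is a minimal pure-Python sketch. It is not PyTorch's implementation; the `log_softmax` helper and the example logits are made up for illustration. It shows that large but finite logits are handled by the usual max-subtraction trick, while a single `inf` logit makes every output NaN.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    shifted = [x - m for x in logits]          # largest entry becomes 0
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

# Large but finite logits are fine once the maximum is subtracted.
finite = log_softmax([1000.0, 999.0, 998.0])

# One infinite logit gives inf - inf = nan, which spreads to every output.
broken = log_softmax([float("inf"), 1.0, 2.0])

print(finite)
print(broken)
```

If the logits are already non-finite, substituting a constant for the loss hides the symptom but not the source, so checking the predictions with something like `math.isfinite` (or its tensor equivalent) before the loss may be more informative.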


saniazahan commented on May 18, 2024

Without half precision I had to reduce the batch size to 16 and the forward batch size to 8. The NaN then occurs at the 94th step: "Function 'CudnnBatchNormBackward' returned nan values in its 0th output".

You suggested I might get poor performance or unstable loss. I am not really sure why half precision will do that.
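
For context, one common reason half precision is associated with unstable loss is its narrow numeric range. This may or may not be what was meant in the earlier suggestion. The sketch below uses only Python's standard `struct` module (format code `e` is IEEE 754 half precision) to round-trip values through fp16; the `to_fp16` helper is a made-up name for illustration.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(65504.0))  # largest finite fp16 value, stored exactly
print(to_fp16(1e-8))     # below the smallest fp16 subnormal, becomes 0.0
print(to_fp16(0.1))      # only about three decimal digits survive

try:
    # Beyond the fp16 range. struct raises here; a tensor library would
    # typically produce inf instead, which can later turn into nan.
    to_fp16(70000.0)
except OverflowError as exc:
    print("overflow:", exc)
```

Small gradients that underflow to zero and large activations that overflow are the usual motivation for loss scaling in mixed-precision training.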


kenziyuliu commented on May 18, 2024

Hi @saniazahan,

Thanks for your interest. Please find below responses to the questions:

"Function 'LogSoftmaxBackward' returned nan values in its 0th output"

I've never seen this error before, could it be related to your package versions?

"... pretrained model trained on un-normalized data ..."

IIRC the data preprocessing steps should follow directly from 2s-AGCN: https://github.com/lshiwjx/2s-AGCN. Can you clarify what "normalized data" you are referring to? One particular thing to note is that following previous work there's also a BN layer at the beginning of the model to do normalization: https://github.com/kenziyuliu/MS-G3D/blob/master/model/msg3d.py#L156
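
As a rough illustration of what a BN layer at the input does in training mode, the following pure-Python sketch normalizes each feature column with the statistics of the current batch. It is not the code at the link above: it omits the learnable scale and shift and the running statistics, and the `batch_norm` helper and the sample values are hypothetical.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature column of `batch` (a list of rows)."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Biased (population) variance of the current batch.
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (col[i] - mean) / math.sqrt(var + eps)
    return out

# Two features on very different scales (made-up values).
raw = [[1200.0, 0.5], [1500.0, 1.5], [900.0, 1.0], [1800.0, 2.0]]
normed = batch_norm(raw)
for row in normed:
    print([round(v, 3) for v in row])
```

After this step both columns have roughly zero mean and unit variance, which is why differently scaled input data can still be usable when such a layer sits at the front of the model.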

"reduce the batch size to 16 and forward to 8. And the 'nan' occurs at 94th step: Function 'CudnnBatchNormBackward' returned nan values in its 0th output"

Unfortunately I have not seen this error before. In general batch size is a hyperparameter that often affects performance, so to reproduce results from the paper you should use the default settings. Note also that small batch sizes don't go well with BatchNorm.
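
One way to see why small batches and BatchNorm interact poorly is to look at how noisy the per-batch mean becomes as the batch shrinks. The sketch below is a toy simulation using only the standard library, with samples drawn from a standard normal distribution; the function name, seed, and batch counts are arbitrary choices for illustration and have no connection to the MS-G3D training setup.

```python
import random
import statistics

def spread_of_batch_means(batch_size, num_batches=2000, seed=0):
    """Std. dev. of the per-batch mean for samples drawn from N(0, 1)."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)

# The spread shrinks roughly like 1 / sqrt(batch_size).
for bs in (8, 16, 32):
    print(bs, round(spread_of_batch_means(bs), 3))
```

With a batch of 8, the statistics that BatchNorm normalizes by fluctuate about twice as much as with a batch of 32, so each step sees a noticeably different normalization.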


kenziyuliu commented on May 18, 2024

Hi there, I'll be closing this issue for now. Feel free to comment below if the issue was not resolved.
