Comments (6)
Hi @liwei9719,
Thanks for your interest in this repo. Does this happen all the time? Would you be able to load the previous checkpoint and resume training with the expected performance? In my experience, training is largely stable if the gradients are clipped properly (e.g., set clip_grad_val to 1.0).
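For reference, here is a minimal sketch of what value-based gradient clipping looks like in a generic PyTorch training step. The model, optimizer, and data below are placeholders, and I'm assuming clip_grad_val=1.0 corresponds to element-wise value clipping (it may instead map to norm clipping in the repo's config):

```python
import torch

# Placeholder model and optimizer, not AlphaNet code
torch.manual_seed(0)
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 8)
loss = (model(x) ** 2).mean() * 100.0  # inflate loss to get large gradients

optimizer.zero_grad()
loss.backward()
# Clip every gradient element into [-1.0, 1.0] before the update,
# analogous to setting clip_grad_val=1.0
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
optimizer.step()
```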
Regarding the negative training loss, I must apologize for the confusion here. The reason is that we don't log the actual alpha-divergence. Instead, we first compute the gradient we need, and then construct a surrogate loss that produces this gradient. The surrogate loss itself can be negative. Thanks.
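A toy illustration of why a surrogate loss can go negative (this is not the repo's actual loss; the weights and shapes are made up): if the gradient you want is `w * d(log p)/d(theta)` with a stop-gradient weight `w`, you can backprop through `mean(w.detach() * log p)`. Its gradient is the desired one, but since `log p <= 0`, the logged value is negative even though the true divergence is non-negative.

```python
import torch

# Toy example only, not AlphaNet's implementation
torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)
log_p = torch.log_softmax(logits, dim=-1)
w = torch.rand(4, 3)                  # stand-in per-element weights, >= 0

# Surrogate whose gradient is w * d(log p)/d(theta); its value is what
# gets logged, and it is negative here because log_p <= 0
surrogate = (w.detach() * log_p).mean()
surrogate.backward()
print(surrogate.item())               # negative logged value
```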
from alphanet.
I only trained once because of the huge cost. I have set clip_grad_val to 1.0. Could this parameter setting cause the phenomenon above?
Maybe try to resume from a previously saved checkpoint with a different random seed?
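A resume along those lines might look like the sketch below. The checkpoint keys ("state_dict", "optimizer", "epoch") are assumptions for illustration, not necessarily the repo's actual checkpoint format:

```python
import random

import numpy as np
import torch


def resume(model, optimizer, path, seed):
    """Load a checkpoint and re-seed RNGs so data order and
    augmentation differ from the crashed run."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    return ckpt.get("epoch", 0)


# Minimal demo: save a checkpoint, then resume with a new seed
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
torch.save({"state_dict": model.state_dict(),
            "optimizer": opt.state_dict(),
            "epoch": 60}, "/tmp/ckpt.pt")
epoch = resume(model, opt, "/tmp/ckpt.pt", seed=1234)
print(epoch)  # 60
```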
Hi @liwei9719 @dilinwang820,
I ran into the same problem at around epoch 60, and I tried multiple times to resume from a saved checkpoint with a different random seed, but it did not solve the issue.
@liwei9719 did you find a solution for this?
@dilinwang820 do you have an idea why this happens and what could be a good way to avoid this phenomenon?
Thanks!
@jun-fang In my experience, training is always stable with the default settings; maybe try warming up with less regularization and data augmentation, and then resume with the default settings?
I ran into the same problem as well.
Related Issues (12)
- is AlphaNet a0 ~ a6 exactly the same as the a0 ~ a6 in Attentive NAS? HOT 1
- How to modify the loss function to apply to multi-label classification tasks HOT 6
- The problem of increasing memory usage and learning rate HOT 3
- there are some files missing or I can't find them HOT 1
- How were the final architectures selected? HOT 4
- Why use the training dataset in the test stage? HOT 3
- How can I preserve the search architecture? HOT 6
- evolutionary search in a single gpu HOT 4
- The AdaptiveLossSoft become NAN HOT 4
- Can Adaptive-KD use with additional attentive sampling at the same time ? HOT 3
- Re-training code is available? HOT 2