Comments (8)

ClementPinard commented on May 25, 2024

A NaN reconstruction loss causes all the gradients to be NaN. Even if only one value is NaN over the whole diff map, you will get NaN weights at the next optimizer step, so you really want to avoid that!

This line is here to help you figure out what goes wrong if you get a NaN training loss, since as soon as the loss becomes NaN, your network is basically bound to output NaN until the end of training.
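
As a minimal sketch (not the repo's exact code), the check amounts to a small helper like this, which you can call on any loss tensor before `loss.backward()`:

```python
import torch

def check_finite(name: str, t: torch.Tensor) -> None:
    # Fail fast: raise as soon as a tensor contains NaN or Inf,
    # naming the tensor so you know which loss went bad first.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")

# e.g. check_finite("photometric_loss", photometric_loss)
```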

How it got to be NaN depends on your problem. I advise you to identify a seed on which it appears every time, and try to find where the first NaN appears.

My guess is it happens when computing the u,v coordinates here https://github.com/ClementPinard/SfmLearner-Pytorch/blob/master/inverse_warp.py#L65, since we divide by a Z value, and when it's 0 you get NaN.
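
A sketch of that projection step (hypothetical function, assuming camera-frame points of shape [B, 3, N]); clamping Z away from zero is what keeps the division finite:

```python
import torch

def project_to_uv(cam_points: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # cam_points: [B, 3, N] points in the camera frame.
    X, Y, Z = cam_points[:, 0], cam_points[:, 1], cam_points[:, 2]
    Z = Z.clamp(min=eps)  # guard the division: Z == 0 would give Inf/NaN
    u = X / Z
    v = Y / Z
    return torch.stack([u, v], dim=1)  # [B, 2, N] pixel coordinates
```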

asprasan commented on May 25, 2024

I do understand that once the loss goes to NaN, the whole training becomes pointless. Gradients go to NaN and there's no way of coming back.

However, I was intrigued by the fact that the NaN check is done only for the photometric reconstruction loss; the other loss functions are not checked. So I was wondering whether you encountered any particular scenario in which the photometric reconstruction loss became NaN?

The Z value in inverse_warp.py is clamped to 1e-3, so I don't see how a divide by zero could occur. The Kinect depth that I'm using has a lot of zeros; if division by zero were the issue, it should have happened at every iteration. Could this be due to some overflow error?

ClementPinard commented on May 25, 2024

The other loss functions are actually much simpler, since their target value is fixed (smooth loss and explainability loss), but you can check them for NaN too.

You can also try discarding 0 values in your Kinect depth, because they can produce a very large warp displacement for a particular translation.

The other potential source of NaN is the Adam optimizer, which has a second-order term that can diverge if your learning rate is too high. You should check the weight values after the optimizer step too.
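
That post-step check could look like this (a hypothetical helper, not existing repo code):

```python
import torch

def assert_weights_finite(model: torch.nn.Module) -> None:
    # Call right after optimizer.step(): catches a diverging update,
    # e.g. Adam's second-moment estimate blowing up with a too-high lr.
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise RuntimeError(f"non-finite weights in {name} after optimizer step")
```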

asprasan commented on May 25, 2024

The 0 values in the Kinect depth occur either because objects are very far away or because they don't reflect the projected IR light. Currently I'm not handling that in the warping part, but I am setting the photometric loss to zero at those pixels where the depth is zero.
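
Roughly, that masking looks like this (a sketch, assuming a per-pixel absolute difference map and a depth map of the same shape):

```python
import torch

def masked_photometric_loss(diff: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    # diff, depth: [B, H, W]; average the photometric difference only
    # over pixels where the Kinect depth is valid (non-zero).
    valid = (depth > 0).float()
    # clamp the denominator so an all-invalid batch doesn't divide by zero
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```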

I have faced issues with the Adam optimizer before. However, that may not be the case here, because only the photometric loss goes to NaN; the other losses all stay within a reasonable range.

I will have a more careful look at the code and try to log things properly whenever the loss becomes NaN.

versatran01 commented on May 25, 2024

I tried to reimplement this and also got NaNs in the photometric reconstruction loss. It only happened in the monocular case, not the stereo one. It is very annoying that training suddenly dies. I haven't tried clamping the depth; hopefully that will fix it.

anuragranj commented on May 25, 2024

Well, the depth computation is monocular here. What do you mean by the stereo case? Are the NaNs because of zero depth?

asprasan commented on May 25, 2024

It's definitely not because of zero depth: the depth is clamped to 0.001 before the division. One possibility is some overflow/underflow of the numbers.

In the LSD-SLAM paper, the authors rescale the inverse depth so that the mean inverse depth is 1 in every iteration. Maybe that could solve the problem.
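
A sketch of that normalization (my reading of the idea, not the paper's exact procedure):

```python
import torch

def normalize_inverse_depth(inv_depth: torch.Tensor) -> torch.Tensor:
    # Rescale so the mean inverse depth is 1, keeping values in a
    # well-conditioned range. Note that in a full pipeline the pose
    # translation must be rescaled consistently with the depth scale.
    return inv_depth / inv_depth.mean().clamp(min=1e-8)
```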

It's definitely very annoying to see the training stop abruptly without being able to figure out what's wrong.

versatran01 commented on May 25, 2024

I was talking about my own implementation; apologies for the confusion.
Originally I did not clamp the depth to be nonzero and I got NaNs, but after clamping it never happened again.
