Comments (8)

ClementPinard commented on May 25, 2024

A NaN reconstruction loss causes all the gradients to be NaN. Even if only one value is NaN over the whole diff map, you will get NaN weights at the next optimizer step, so you really want to avoid that!

This line is here to help you figure out what goes wrong if you get a NaN training loss, since as soon as the loss becomes NaN, your network is basically bound to output NaN until the end of training.
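
As a minimal sketch (not the repo's exact code), the check amounts to a small helper like this, which you can call on any loss tensor before `loss.backward()`:

```python
import torch

def check_finite(name: str, t: torch.Tensor) -> None:
    # Fail fast: raise as soon as a tensor contains NaN or Inf,
    # naming the tensor so you know which loss went bad first.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")

# e.g. check_finite("photometric_loss", photometric_loss)
```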

How it got to be NaN depends on your problem. I advise you to identify a seed on which it appears every time, and try to find where the first NaN appears.

My guess is it happens when computing the u,v coordinates here https://github.com/ClementPinard/SfmLearner-Pytorch/blob/master/inverse_warp.py#L65, since we divide by a Z value, and when it's 0 you get NaN.
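
A sketch of that projection step (hypothetical function, assuming camera-frame points of shape [B, 3, N]); clamping Z away from zero is what keeps the division finite:

```python
import torch

def project_to_uv(cam_points: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # cam_points: [B, 3, N] points in the camera frame.
    X, Y, Z = cam_points[:, 0], cam_points[:, 1], cam_points[:, 2]
    Z = Z.clamp(min=eps)  # guard the division: Z == 0 would give Inf/NaN
    u = X / Z
    v = Y / Z
    return torch.stack([u, v], dim=1)  # [B, 2, N] pixel coordinates
```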

asprasan commented on May 25, 2024

I do understand that once the loss goes to NaN, the whole training becomes pointless. Gradients go to NaN and there's no way of coming back.

However, I was intrigued by the fact that the NaN check is done only for the photometric reconstruction loss; the other loss functions are not checked. So I was wondering whether you encountered any particular scenario in which the photometric reconstruction loss became NaN?

The Z value in inverse_warp.py is clamped to 1e-3, so I don't see how a divide by zero could occur. The Kinect depth that I'm using has a lot of zeros; if division by zero were the issue, it should have happened at every iteration. Could this be due to some overflow error?

ClementPinard commented on May 25, 2024

The other loss functions are actually much simpler, since their target value is fixed (smooth loss and explainability loss), but you can check them for NaN too.

You can also try discarding 0 values in your Kinect depth, because they can produce a very large warp displacement for a particular translation.

The other potential source of NaN is the Adam optimizer, which has a second-order term that can diverge if your learning rate is too high. You should check the weight values after the optimizer step too.
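
That post-step check could look like this (a hypothetical helper, not existing repo code):

```python
import torch

def assert_weights_finite(model: torch.nn.Module) -> None:
    # Call right after optimizer.step(): catches a diverging update,
    # e.g. Adam's second-moment estimate blowing up with a too-high lr.
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise RuntimeError(f"non-finite weights in {name} after optimizer step")
```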

asprasan commented on May 25, 2024

The 0 values in the Kinect depth occur either because objects are very far away or because they don't reflect the projected IR light. Currently I'm not handling that in the warping part, but I am setting the photometric loss to zero at those pixels where the depth is zero.
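
Roughly, that masking looks like this (a sketch, assuming a per-pixel absolute difference map and a depth map of the same shape):

```python
import torch

def masked_photometric_loss(diff: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    # diff, depth: [B, H, W]; average the photometric difference only
    # over pixels where the Kinect depth is valid (non-zero).
    valid = (depth > 0).float()
    # clamp the denominator so an all-invalid batch doesn't divide by zero
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```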

I have faced issues with the Adam optimizer before. However, that may not be the case here, because only the photometric loss goes to NaN; the other losses all stay within a reasonable range.

I will have a more careful look at the code and try to log things properly whenever the loss becomes NaN.

versatran01 commented on May 25, 2024

I tried to reimplement this and also got NaNs in the photometric reconstruction loss. It only happened in the monocular case, not the stereo one. It is very annoying that training suddenly dies. I haven't tried clamping the depth; hopefully that will fix it.

anuragranj commented on May 25, 2024

Well, the depth computation is monocular here. What do you mean by the stereo case? Are the NaNs because of zero depth?

asprasan commented on May 25, 2024

It's definitely not because of zero depth: the depth is clamped to 0.001 before the division. One possibility is some overflow/underflow of the numbers.

In the LSD-SLAM paper, the authors rescale the inverse depth so that the mean inverse depth is 1 in every iteration. Maybe that could solve the problem.
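
A sketch of that normalization (my reading of the idea, not the paper's exact procedure):

```python
import torch

def normalize_inverse_depth(inv_depth: torch.Tensor) -> torch.Tensor:
    # Rescale so the mean inverse depth is 1, keeping values in a
    # well-conditioned range. Note that in a full pipeline the pose
    # translation must be rescaled consistently with the depth scale.
    return inv_depth / inv_depth.mean().clamp(min=1e-8)
```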

It's definitely very annoying to see the training stop abruptly without being able to figure out what's wrong.

versatran01 commented on May 25, 2024

I was talking about my own implementation; apologies for the confusion.
Originally I did not clamp the depth to be nonzero and I got NaNs, but after clamping it never happened again.
