
Comments (5)

fitsumreda commented on August 19, 2024

Hi @topriss and @sczhou,

If you train the entire FlowNet2 network directly, the loss might explode.

To avoid this:

  1. use a very small learning rate, e.g. 1e-10,
  2. enable batch normalization,
  3. or train in stages as suggested in the FlowNet2 paper, i.e., train FlowNetC first, fix (freeze) its weights, then train FlowNetS1, FlowNetS2, and so on.

from flownet2-pytorch.

fitsumreda commented on August 19, 2024

@sczhou, could you give more information? Did it run fine for a few epochs before you ran into this issue?

sczhou commented on August 19, 2024

@fitsumreda
Yes, it ran fine for a few epochs. I ran the code twice, and it hit this error at epoch 15 and epoch 156 respectively, so I think it might be random.
Thanks for your reply and help.

fitsumreda commented on August 19, 2024

I wasn't able to reproduce the error.

topriss commented on August 19, 2024

Hi, thanks for your great code.
I think I have the same problem, using these parameters:
--model FlowNet2 --loss=L2Loss --training_dataset FlyingChairs --validation_dataset MpiSintelClean
It ran into an inf error in epoch 1, batch 1302, as shown in the log below (I made some changes to the logging style).
Any response would be appreciated.

```
[12-27 06:18:15.72] Training Epoch 1 1295/2859 L2: 29.240, EPE: 29.240, lr: 1.0e-04
Operation forward : 172.17 ms
Operation backward : 725.52 ms
Operation optim step : 36.59 ms
[12-27 06:18:16.99] Training Epoch 1 1296/2859 L2: 64.268, EPE: 64.268, lr: 1.0e-04
Operation forward : 170.37 ms
Operation backward : 677.62 ms
Operation optim step : 35.81 ms
[12-27 06:18:17.88] Training Epoch 1 1297/2859 L2: 784.324, EPE: 784.324, lr: 1.0e-04
Operation forward : 170.28 ms
Operation backward : 725.62 ms
Operation optim step : 35.86 ms
[12-27 06:18:18.81] Training Epoch 1 1298/2859 L2: 19857.357, EPE: 19857.357, lr: 1.0e-04
Operation forward : 176.19 ms
Operation backward : 789.99 ms
Operation optim step : 35.88 ms
[12-27 06:18:19.82] Training Epoch 1 1299/2859 L2: 3731196.750, EPE: 3731196.750, lr: 1.0e-04
Operation forward : 171.34 ms
Operation backward : 687.62 ms
Operation optim step : 36.55 ms
[12-27 06:18:21.13] Training Epoch 1 1300/2859 L2: 18071404544.000, EPE: 18071404544.000, lr: 1.0e-04
Operation forward : 171.53 ms
Operation backward : 748.87 ms
Operation optim step : 35.85 ms
[12-27 06:18:22.09] Training Epoch 1 1301/2859 L2: 23380791132160.000, EPE: 23380791132160.000, lr: 1.0e-04
Operation forward : 170.73 ms
Operation backward : 682.66 ms
Operation optim step : 35.91 ms
[12-27 06:18:22.98] Training Epoch 1 1302/2859 L2: 7276938393550848.000, EPE: 7276938393550848.000, lr: 1.0e-04
Operation forward failed
[12-27 06:18:23.16] Operation failed

Traceback (most recent call last):
File "mymain.py", line 342, in <module>
logger=None)
File "mymain.py", line 291, in train
losses = model(data[0], target[0])
File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
RuntimeError: value cannot be converted to type double without overflow: inf
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=29 : driver shutting down
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=29 : driver shutting down
```
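
In the log above, the loss grows by more than 10x per batch for several batches before it overflows to inf. A rough sketch of a check that flags this kind of blow-up early is shown here. The function name `first_divergent_step` and the `factor` and `patience` thresholds are hypothetical choices for illustration, not part of flownet2-pytorch, and the check assumes the per-batch loss is available as a plain Python float.

```python
import math


def first_divergent_step(losses, factor=10.0, patience=3):
    """Return the index where divergence is flagged, or None.

    A step is flagged when the loss is non-finite, or when it has grown by
    more than `factor` over the previous step for `patience` steps in a row.
    Both thresholds are illustrative assumptions.
    """
    streak = 0
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
        if i > 0 and loss > factor * losses[i - 1]:
            streak += 1
            if streak >= patience:
                return i
        else:
            streak = 0
    return None


# L2 values copied from the log above (batches 1295 to 1302 of epoch 1).
logged = [
    29.240, 64.268, 784.324, 19857.357, 3731196.750,
    18071404544.000, 23380791132160.000, 7276938393550848.000,
]
step = first_divergent_step(logged)
if step is not None:
    print("divergence flagged at batch", 1295 + step)
```

With these thresholds the check fires at batch 1299, three batches before the forward pass fails, which would leave room to stop the run, save a checkpoint, or lower the learning rate.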
