Hi, When I run the example on MPISintel Final and Clean, with L1Loss on FlowNet2 model

Runtime Error: Value cannot be converted to type double without overflow: inf about flownet2-pytorch HOT 5 CLOSED

sczhou commented on August 19, 2024

Runtime Error: Value cannot be converted to type double without overflow: inf

from flownet2-pytorch.

Comments (5)

fitsumreda commented on August 19, 2024

Hi @topriss and @sczhou,

If you train FlowNet2 directly (the entire network at once), the loss might explode.

To avoid this:

use a very small learning rate (1e-10),
enable batch normalization,
or train in stages as suggested in the FlowNet2 paper, i.e., train FlowNetC first, fix its weights, then train FlowNetS1, FlowNetS2, and so on.
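The staged option can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repository's training code: `stage1` and `stage2` are hypothetical single-layer stand-ins for an already-trained first network and the next network in the stack, and the 6-channel input (two RGB frames concatenated) is an assumption made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins, NOT the real FlowNetC / FlowNetS architectures.
stage1 = nn.Conv2d(6, 2, kernel_size=3, padding=1)  # plays the role of the trained first stage
stage2 = nn.Conv2d(8, 2, kernel_size=3, padding=1)  # next stage: sees the images plus stage1's flow

# "Fix" the first stage: its parameters receive no gradients and are never updated.
for p in stage1.parameters():
    p.requires_grad_(False)
stage1.eval()

# Only the second stage's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(stage2.parameters(), lr=1e-4)


def train_step(images, target_flow):
    """Run one optimization step on stage2 while stage1 stays frozen."""
    with torch.no_grad():
        coarse_flow = stage1(images)
    refined_flow = stage2(torch.cat([images, coarse_flow], dim=1))
    loss = (refined_flow - target_flow).abs().mean()  # simple L1-style loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the optimizer only sees `stage2.parameters()`, a bad update cannot propagate into the frozen stage, which is presumably what makes stage-wise training more stable than training the whole stack at once.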


fitsumreda commented on August 19, 2024

@sczhou, could you give more information? Did it run fine for a few epochs before you ran into this issue?


sczhou commented on August 19, 2024

@fitsumreda
Yes, it ran fine for a few epochs. I ran the code twice, and it hit this error after 15 epochs and 156 epochs respectively, so I think it might be random.
Thanks for your reply and help.


fitsumreda commented on August 19, 2024

I wasn't able to reproduce the error.


topriss commented on August 19, 2024

Hi, thanks for your great code.
I think I have the same problem, using the parameters
--model FlowNet2 --loss=L2Loss --training_dataset FlyingChairs --validation_dataset MpiSintelClean
It ran into the inf error in epoch 1, batch 1302, as shown in the log below. (I made some changes to the logging style.)
Any response would be appreciated.

`
[12-27 06:18:15.72] Training Epoch 1 1295/2859 L2: 29.240, EPE: 29.240, lr: 1.0e-04
Operation forward : 172.17 ms
Operation backward : 725.52 ms
Operation optim step : 36.59 ms
[12-27 06:18:16.99] Training Epoch 1 1296/2859 L2: 64.268, EPE: 64.268, lr: 1.0e-04
Operation forward : 170.37 ms
Operation backward : 677.62 ms
Operation optim step : 35.81 ms
[12-27 06:18:17.88] Training Epoch 1 1297/2859 L2: 784.324, EPE: 784.324, lr: 1.0e-04
Operation forward : 170.28 ms
Operation backward : 725.62 ms
Operation optim step : 35.86 ms
[12-27 06:18:18.81] Training Epoch 1 1298/2859 L2: 19857.357, EPE: 19857.357, lr: 1.0e-04
Operation forward : 176.19 ms
Operation backward : 789.99 ms
Operation optim step : 35.88 ms
[12-27 06:18:19.82] Training Epoch 1 1299/2859 L2: 3731196.750, EPE: 3731196.750, lr: 1.0e-04
Operation forward : 171.34 ms
Operation backward : 687.62 ms
Operation optim step : 36.55 ms
[12-27 06:18:21.13] Training Epoch 1 1300/2859 L2: 18071404544.000, EPE: 18071404544.000, lr: 1.0e-04
Operation forward : 171.53 ms
Operation backward : 748.87 ms
Operation optim step : 35.85 ms
[12-27 06:18:22.09] Training Epoch 1 1301/2859 L2: 23380791132160.000, EPE: 23380791132160.000, lr: 1.0e-04
Operation forward : 170.73 ms
Operation backward : 682.66 ms
Operation optim step : 35.91 ms
[12-27 06:18:22.98] Training Epoch 1 1302/2859 L2: 7276938393550848.000, EPE: 7276938393550848.000, lr: 1.0e-04
Operation forward failed
[12-27 06:18:23.16] Operation failed

Traceback (most recent call last):
  File "mymain.py", line 342, in <module>
    logger=None)
  File "mymain.py", line 291, in train
    losses = model(data[0], target[0])
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
RuntimeError: value cannot be converted to type double without overflow: inf
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=29 : driver shutting down
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=29 : driver shutting down
`
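In the log above, the loss starts growing by more than an order of magnitude per batch several steps before the forward pass fails. A small check on the logged values could flag this early. The sketch below is only an illustration: the function name `first_divergent_batch`, the regular expression, and the `growth_factor` and `patience` thresholds are assumptions made for this example and are not part of flownet2-pytorch.

```python
import math
import re

# Matches lines such as "... Training Epoch 1 1295/2859 L2: 29.240, EPE: ..."
LOSS_RE = re.compile(r"Training Epoch (\d+) (\d+)/\d+ L2: ([0-9.eE+-]+|inf|nan)")


def first_divergent_batch(log_lines, growth_factor=10.0, patience=3):
    """Return (epoch, batch) where the loss first looks divergent, else None.

    A batch is flagged when its loss is non-finite, or when the loss has grown
    by more than growth_factor for `patience` consecutive logged batches.
    """
    prev = None
    streak = 0
    for line in log_lines:
        match = LOSS_RE.search(line)
        if match is None:
            continue  # timing lines and other output are ignored
        epoch, batch = int(match.group(1)), int(match.group(2))
        loss = float(match.group(3))
        if not math.isfinite(loss):
            return epoch, batch
        if prev is not None and loss > growth_factor * prev:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            return epoch, batch
        prev = loss
    return None


# Loss lines copied from the log above.
sample_log = [
    "[12-27 06:18:15.72] Training Epoch 1 1295/2859 L2: 29.240, EPE: 29.240, lr: 1.0e-04",
    "Operation forward : 172.17 ms",
    "[12-27 06:18:16.99] Training Epoch 1 1296/2859 L2: 64.268, EPE: 64.268, lr: 1.0e-04",
    "[12-27 06:18:17.88] Training Epoch 1 1297/2859 L2: 784.324, EPE: 784.324, lr: 1.0e-04",
    "[12-27 06:18:18.81] Training Epoch 1 1298/2859 L2: 19857.357, EPE: 19857.357, lr: 1.0e-04",
    "[12-27 06:18:19.82] Training Epoch 1 1299/2859 L2: 3731196.750, EPE: 3731196.750, lr: 1.0e-04",
]

print(first_divergent_batch(sample_log))  # (1, 1299)
```

With these thresholds the run would be flagged at batch 1299, three batches before the overflow at 1302. The same idea could be applied inside the training loop to stop early and save a checkpoint for inspection.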
