Comments (5)
Training the entire FlowNet2 network directly can make the loss explode.
To avoid this:
- use a very small learning rate, such as 1e-10,
- enable batch normalization,
- or train in stages as suggested in the FlowNet2 paper: train FlowNetC first, freeze it, then train FlowNetS1, FlowNetS2, and so on.
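The staged approach in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the repository's training code: `stage1` and `stage2` are hypothetical toy layers standing in for an already-trained FlowNetC and the next stacked network, `freeze` is a helper invented for this example, and the learning rate is only illustrative.

```python
import torch
from torch import nn

# Toy stand-ins (assumptions): the real FlowNetC / FlowNetS networks are far larger.
stage1 = nn.Linear(4, 4)  # plays the role of the already-trained first network
stage2 = nn.Linear(4, 2)  # plays the role of the network trained next


def freeze(module):
    """Disable gradient updates for every parameter in `module`."""
    for p in module.parameters():
        p.requires_grad_(False)
    module.eval()  # also fixes batch-norm/dropout behaviour, if present


freeze(stage1)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in stage2.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # illustrative value

x = torch.randn(8, 4)
target = torch.randn(8, 2)
before = stage1.weight.clone()

# One training step: gradients flow into stage2 only.
loss = nn.functional.mse_loss(stage2(stage1(x)), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the frozen stage's weights are unchanged and carry no gradient, so only the newly added stage is being fitted.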
from flownet2-pytorch.
@sczhou, could you give more information? Did it run fine for a few epochs before you ran into this issue?
@fitsumreda
Yes, it ran fine for a few epochs. I ran the code twice, and it hit this error at epoch 15 and epoch 156 respectively, so I think it might be random.
Thanks for your reply and help.
I wasn't able to reproduce the error.
Hi, thanks for your great code.
I think I have the same problem, using the parameters
--model FlowNet2 --loss=L2Loss --training_dataset FlyingChairs --validation_dataset MpiSintelClean
It runs into an inf error in epoch 1, batch 1302, as shown in the log below. (I made some changes to the logging style.)
Any response would be appreciated.
```
[12-27 06:18:15.72] Training Epoch 1 1295/2859 L2: 29.240, EPE: 29.240, lr: 1.0e-04
Operation forward : 172.17 ms
Operation backward : 725.52 ms
Operation optim step : 36.59 ms
[12-27 06:18:16.99] Training Epoch 1 1296/2859 L2: 64.268, EPE: 64.268, lr: 1.0e-04
Operation forward : 170.37 ms
Operation backward : 677.62 ms
Operation optim step : 35.81 ms
[12-27 06:18:17.88] Training Epoch 1 1297/2859 L2: 784.324, EPE: 784.324, lr: 1.0e-04
Operation forward : 170.28 ms
Operation backward : 725.62 ms
Operation optim step : 35.86 ms
[12-27 06:18:18.81] Training Epoch 1 1298/2859 L2: 19857.357, EPE: 19857.357, lr: 1.0e-04
Operation forward : 176.19 ms
Operation backward : 789.99 ms
Operation optim step : 35.88 ms
[12-27 06:18:19.82] Training Epoch 1 1299/2859 L2: 3731196.750, EPE: 3731196.750, lr: 1.0e-04
Operation forward : 171.34 ms
Operation backward : 687.62 ms
Operation optim step : 36.55 ms
[12-27 06:18:21.13] Training Epoch 1 1300/2859 L2: 18071404544.000, EPE: 18071404544.000, lr: 1.0e-04
Operation forward : 171.53 ms
Operation backward : 748.87 ms
Operation optim step : 35.85 ms
[12-27 06:18:22.09] Training Epoch 1 1301/2859 L2: 23380791132160.000, EPE: 23380791132160.000, lr: 1.0e-04
Operation forward : 170.73 ms
Operation backward : 682.66 ms
Operation optim step : 35.91 ms
[12-27 06:18:22.98] Training Epoch 1 1302/2859 L2: 7276938393550848.000, EPE: 7276938393550848.000, lr: 1.0e-04
Operation forward failed
[12-27 06:18:23.16] Operation failed
Traceback (most recent call last):
  File "mymain.py", line 342, in <module>
    logger=None)
  File "mymain.py", line 291, in train
    losses = model(data[0], target[0])
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lyh/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
RuntimeError: value cannot be converted to type double without overflow: inf
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=29 : driver shutting down
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=29 : driver shutting down
```
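The log above shows the loss growing by orders of magnitude per batch for several steps before the forward pass finally fails on inf. One way to catch this earlier is a small guard on the logged loss. The sketch below is only an illustration: `check_loss`, its `max_ratio` threshold, and the five-step window are assumptions made for this example, not part of flownet2-pytorch.

```python
import math


def check_loss(history, new_loss, max_ratio=100.0):
    """Return a reason string if training looks divergent, else None.

    history: recent finite loss values, most recent last.
    max_ratio: hypothetical threshold on growth relative to the recent minimum.
    """
    if not math.isfinite(new_loss):
        return "non-finite loss"
    if history and new_loss > max_ratio * min(history):
        return "loss exploded"
    return None


# L2 values copied from the log above (batches 1295 to 1302).
losses = [29.240, 64.268, 784.324, 19857.357, 3731196.750,
          18071404544.000, 23380791132160.000, 7276938393550848.000]

history = []
stopped_at = None
for step, loss in enumerate(losses, start=1295):
    # Compare against the last five accepted losses only.
    reason = check_loss(history[-5:], loss)
    if reason:
        stopped_at = (step, reason)
        break
    history.append(loss)

print(stopped_at)
```

With these values the guard trips at batch 1298, four batches before the crash, which leaves room to save a checkpoint or inspect the offending batch instead of losing the run.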