
Comments (2)

ardizzone commented on September 28, 2024

Hi!

There is a lot to say about this! :D
(This is still active research, so we haven't put much definite information out there so far, simply because we were still working it out ourselves.)

Beforehand, I want to point out:

  • We are working on re-writing the tutorial to include such things and better explain how to build a stable model.
  • If you look at the following repository, this network does all the points I mention below, and is very stable, even with >100 coupling blocks: https://github.com/VLL-HD/IB-INN You should be able to use the architecture directly, or just parts of it.
  • Here is a very big INN that even gives good performance on ImageNet, and is still completely stable (just to show it's possible): https://github.com/VLL-HD/trustworthy_GCs

For the actual answers:

  • Indeed, BatchNorm does increase the stability in most of our experiments
  • The testing error problem also took us a long while to work out. What happens is that the running average kept by the PyTorch BatchNorm layers, which is used when the network is set to .eval(), is not accurate enough (especially because of how sensitive NFs are to a shifted mean/std).
    With the network in .train() mode, the mean/std is computed for each batch and the running average is ignored, so the problem doesn't occur there.
    The way around it: for validation during training, leave the model in .train() mode (not perfect, but better than the unreliable numbers).
    At test time, keeping the network weights fixed, reset the BatchNorm running averages, set the momentum of the BatchNorm layers to None (infinite average), and run the train dataset through for one or two epochs. Then the test loss is correct.
    You can find that in https://github.com/VLL-HD/IB-INN/blob/master/evaluation/__init__.py#L18
  • Initialization also plays a big role. There is an AllInOneBlock coupling block in FrEIA (since recently) that combines coupling, scaling and permutation in one easy-to-use block (the three are almost always used together anyway, so keeping them separate only slows things down). You can find the initialization here: https://github.com/VLL-HD/IB-INN/blob/master/inn_architecture.py#L30 (note, though, that the arguments to the AllInOneBlock have changed with its inclusion in FrEIA to be more understandable; the docstring should contain everything you need to know. Specifically, try setting global_affine_init to something like 0.7, which stops the outputs from exploding).
  • Gradient clipping (as for RNNs) can also help. I tend to get good results with torch.nn.utils.clip_grad_norm_(parameters, 5.)
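The BatchNorm recalibration described above can be sketched roughly as follows. This is my own paraphrase of the procedure, not the IB-INN code (see the `evaluation/__init__.py` link for the original); the function name and loader interface are illustrative:

```python
import torch
import torch.nn as nn

def recalibrate_batchnorm(model, train_loader, epochs=2):
    """Recompute BatchNorm running statistics over the training data.

    Resets each BatchNorm layer's running mean/var and sets momentum=None,
    which makes PyTorch accumulate an exact cumulative average instead of
    an exponential moving average. Model weights are not updated.
    """
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
            m.momentum = None  # cumulative ("infinite") average

    model.train()  # running stats update in forward passes
    with torch.no_grad():  # keep the weights fixed
        for _ in range(epochs):
            for x in train_loader:
                model(x)
    model.eval()  # .eval() now uses the recalibrated statistics
    return model
```

During training itself, the gradient clipping from the last bullet goes right before the optimizer step, i.e. `torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)` between `loss.backward()` and `optimizer.step()`.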

Feel free to re-open the issue if you are still having NaN troubles after that!

from freia.

jstmn commented on September 28, 2024

Hi @ardizzone,

Thanks so much for the detailed reply. Interesting find on the testing error problem! It looks like this problem may be general to pytorch (https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561/21)

I spent a while exploring different regions of the parameter space (number of coupling layers, coefficient function network depth & width, learning rate) until I understood when training was likely to diverge.

My general approach was to:

  1. Find the smallest possible model capacity for the distribution being modeled, by increasing the number of neurons in the coefficient network until testing error no longer improves. Generally, for a fixed total number of neurons, I found no significant difference between using more coupling layers with smaller coefficient networks and using fewer coupling layers with larger coefficient networks. I can't speak to training stability as a function of this trade-off.

  2. Once the architecture is set, coarsely increase the learning rate from 1e-4 until training diverges (1e-4, 2.5e-4, ...). Then search finely upward from the largest stable learning rate. For example, if training converged at 5e-4 but diverged at 7.5e-4, test 5e-4, 5.5e-4, 6e-4, ....
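The two-stage sweep in step 2 can be sketched as a pair of small helpers. The function names and the coarse multiplier are illustrative choices of mine, not part of the original procedure:

```python
def coarse_schedule(start=1e-4, factor=2.5, limit=1e-2):
    """Coarse sweep: multiply the learning rate by `factor` until `limit`.

    In practice you train with each value until the loss diverges (NaN)
    and record the largest learning rate that stayed stable.
    """
    lrs, lr = [], start
    while lr <= limit:
        lrs.append(lr)
        lr *= factor
    return lrs

def fine_schedule(stable, diverged, steps=5):
    """Fine sweep between the largest stable LR and the first diverging one."""
    step = (diverged - stable) / steps
    return [stable + i * step for i in range(steps)]
```

For the example above, `fine_schedule(5e-4, 7.5e-4)` yields 5e-4, 5.5e-4, 6e-4, 6.5e-4, 7e-4, matching the fine-grained search between the last stable and first diverging rate.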

Cheers

