
Comments (2)

ardizzone commented on September 28, 2024

Hi!

There is a lot to say about this! :D
(This is still active research, so we haven't put much definite information out there so far, simply because we were still working it out ourselves.)

Beforehand, I want to point out:

  • We are working on re-writing the tutorial to include such things and better explain how to build a stable model.
  • If you look at the following repository, this network does all the points I mention below, and is very stable, even with >100 coupling blocks: https://github.com/VLL-HD/IB-INN You should be able to use the architecture directly, or just parts of it.
  • Here is a very big INN that even gives good performance on ImageNet, and is still completely stable (just to show it's possible): https://github.com/VLL-HD/trustworthy_GCs

For the actual answers:

  • Indeed, BatchNorm does increase the stability in most of our experiments
  • The testing error problem also took us a long while to work out. What happens is that the running average kept by the PyTorch BatchNorm layers, which is used when the network is set to .eval(), is not accurate enough (especially because of how sensitive NFs are to a shifted mean/std).
    With the network in .train() mode, the mean/std is computed for each batch and the running average is ignored, so the problem doesn't occur there.
    The way around it: for validation during training, leave the model in .train() mode (not perfect, but better than the unreliable numbers).
    At test time, keeping the network weights fixed, reset the BatchNorm running averages, set the momentum of the BatchNorm layers to None (infinite average), and run the train dataset through for one or two epochs. Then the test loss is correct.
    You can find that in https://github.com/VLL-HD/IB-INN/blob/master/evaluation/__init__.py#L18
  • Initialization also plays a big role. There is an AllInOneBlock coupling block in FrEIA (since recently) that combines coupling, scaling and permutation in one easy-to-use block (the three are almost always used together anyway, so keeping them separate only slows things down). You can find the initialization here: https://github.com/VLL-HD/IB-INN/blob/master/inn_architecture.py#L30 (note, though, that the arguments to the AllInOneBlock have changed with its inclusion in FrEIA to be more understandable; the docstring should contain everything you need to know. Specifically, try setting global_affine_init to something like 0.7, which stops the outputs from exploding).
  • Gradient clipping (as for RNNs) can also help. I tend to get good results with torch.nn.utils.clip_grad_norm_(parameters, 5.)
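The BatchNorm recalibration described above can be sketched roughly as follows. This is my own paraphrase of the procedure, not the IB-INN code (see the `evaluation/__init__.py` link for the original); the function name and loader interface are illustrative:

```python
import torch
import torch.nn as nn

def recalibrate_batchnorm(model, train_loader, epochs=2):
    """Recompute BatchNorm running statistics over the training data.

    Resets each BatchNorm layer's running mean/var and sets momentum=None,
    which makes PyTorch accumulate an exact cumulative average instead of
    an exponential moving average. Model weights are not updated.
    """
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
            m.momentum = None  # cumulative ("infinite") average

    model.train()  # running stats update in forward passes
    with torch.no_grad():  # keep the weights fixed
        for _ in range(epochs):
            for x in train_loader:
                model(x)
    model.eval()  # .eval() now uses the recalibrated statistics
    return model
```

During training itself, the gradient clipping from the last bullet goes right before the optimizer step, i.e. `torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)` between `loss.backward()` and `optimizer.step()`.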

Feel free to re-open the issue if you are still having NaN troubles after that!

from freia.

jstmn commented on September 28, 2024

Hi @ardizzone,

Thanks so much for the detailed reply. Interesting find on the testing error problem! It looks like this problem may be general to pytorch (https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561/21)

I spent a while exploring different regions of the parameter space (number of coupling layers, coefficient function network depth & width, learning rate) until I understood when training was likely to diverge.

My general approach was to:

  1. Find the smallest possible model capacity for the distribution being modeled, by increasing the number of neurons in the coefficient network until testing error no longer improves. Generally, for a fixed total number of neurons, I found no significant difference between using more coupling layers with smaller coefficient networks and using fewer coupling layers with larger coefficient networks. I can't speak to training stability as a function of this trade-off.

  2. Once the architecture is set, coarsely increase the learning rate from 1e-4 until training diverges (1e-4, 2.5e-4, ...). Then search finely upward from the largest stable learning rate. For example, if training converged at 5e-4 but diverged at 7.5e-4, test 5e-4, 5.5e-4, 6e-4, ....
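The two-stage sweep in step 2 can be sketched as a pair of small helpers. The function names and the coarse multiplier are illustrative choices of mine, not part of the original procedure:

```python
def coarse_schedule(start=1e-4, factor=2.5, limit=1e-2):
    """Coarse sweep: multiply the learning rate by `factor` until `limit`.

    In practice you train with each value until the loss diverges (NaN)
    and record the largest learning rate that stayed stable.
    """
    lrs, lr = [], start
    while lr <= limit:
        lrs.append(lr)
        lr *= factor
    return lrs

def fine_schedule(stable, diverged, steps=5):
    """Fine sweep between the largest stable LR and the first diverging one."""
    step = (diverged - stable) / steps
    return [stable + i * step for i in range(steps)]
```

For the example above, `fine_schedule(5e-4, 7.5e-4)` yields 5e-4, 5.5e-4, 6e-4, 6.5e-4, 7e-4, matching the fine-grained search between the last stable and first diverging rate.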

Cheers

