
Comments (7)

jonycgn commented on June 11, 2024


trougnouf commented on June 11, 2024

Thank you, Johannes!

Liu's code also applies bounds to the output of the hyper-decoder (1e-10, 1e10), so I don't know why adding a ReLU yields such bad results there (at least 1.5× the bpp), since it seems to do essentially the same thing. Anyway, I will leave it off; it's reassuring that it doesn't seem to be an error and is accounted for somehow.
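For reference, here is roughly what I understand Liu's bounding to be (the function name is mine; the bounds are the ones quoted above):

```python
import torch

def bound_scales(raw_scales: torch.Tensor) -> torch.Tensor:
    # Clamp the hyper-decoder output into [1e-10, 1e10] instead of applying a ReLU.
    # Note that torch.clamp, like ReLU, still has a zero gradient outside the
    # allowed range, so by itself it does not avoid the "dead unit" issue
    # discussed further down in this thread.
    return torch.clamp(raw_scales, min=1e-10, max=1e10)
```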

Likewise, I will keep the Laplace distribution since it shouldn't make a significant difference and most of my completed training runs use it.

All the best,

Benoit


jonycgn commented on June 11, 2024

One more note regarding ReLU.

It could be that you are observing the "dead unit" problem. If an activation consistently ends up in the "flat" region of the ReLU, that means that the gradient for this unit will always be zero. Hence, it will never learn anything.

This is one of the reasons we have implemented the lower_bound function, which substitutes a gradient in the flat part. IIRC, GaussianConditional should be using that, but if you manually add a ReLU it won't be effective.
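In PyTorch terms, the trick looks roughly like this (only a sketch of the idea, not the actual implementation in the library):

```python
import torch

class LowerBound(torch.autograd.Function):
    """Sketch of a lower_bound-style op: the forward pass is max(x, bound),
    the backward pass substitutes a gradient in the flat region so that units
    below the bound can still recover (unlike a plain ReLU or clamp)."""

    @staticmethod
    def forward(ctx, x, bound):
        bound = torch.as_tensor(bound, dtype=x.dtype, device=x.device)
        ctx.save_for_backward(x, bound)
        return torch.max(x, bound)

    @staticmethod
    def backward(ctx, grad_output):
        x, bound = ctx.saved_tensors
        # Let the gradient through if x is above the bound, or if the update
        # x <- x - lr * grad would move x upwards (i.e. grad_output < 0).
        pass_through = (x >= bound) | (grad_output < 0)
        return pass_through.type_as(grad_output) * grad_output, None

# e.g. scales = LowerBound.apply(raw_scales, 1e-9)
```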


larjung commented on June 11, 2024

I am also trying to do something similar to what @trougnouf described, i.e. trying to replicate Balle2018 based on the PyTorch implementation defined in https://github.com/liujiaheng/compression.
And I still can't figure it out completely.
@trougnouf didn't mention it in the original post, but in liujiaheng's repo the values of all gradients are clipped during training to the range [-5, 5], probably to avoid exploding gradients. @jonycgn's code doesn't apply such clipping as far as I know.
This looked like a somewhat too surgical intervention to me, so I tried to see whether I could still converge without this gradient clipping.
Based on this thread I tried various training configurations, e.g. (a) Gaussian/Laplacian models, (b) ReLU / exponent / nothing as the hyper-decoder's final activation, (c) applying the likelihood bound (1e-9) as implemented by @jonycgn.
With gradient clipping, all of these settings converged without much problem.
Unfortunately, none of these configurations worked when I disabled gradient clipping: they all suffered from very large gradients, causing the PSNR to oscillate and not converge.
@trougnouf, I am very curious to know if/how you managed to make training converge without gradient clipping.
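For clarity, this is the kind of clipping I mean (a minimal PyTorch sketch with a stand-in model and loss; liujiaheng's script may use a slightly different call):

```python
import torch
from torch import nn

model = nn.Linear(8, 8)                                   # stand-in for the actual networks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()                             # stand-in for the rate-distortion loss
loss.backward()

# Clip every gradient element to [-5, 5] before the optimizer step.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)
optimizer.step()
```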

Also, @trougnouf wrote above that the exponential function used by liujiaheng (for the hyper-decoder) was mentioned in the paper's equations and possibly in @jonycgn's code as well. I looked closely into it, but I couldn't find any indication of this exponent in the paper or in the code. I am curious to know where exactly you saw it.
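To be concrete, this is how I read that exponent: the hyper-decoder would predict log-scales, and exp() maps them to strictly positive scales for the Laplace. A sketch of that reading (my own names; not something I found in the paper or in @jonycgn's code):

```python
import torch

def laplace_likelihood(y_hat: torch.Tensor, log_scale: torch.Tensor) -> torch.Tensor:
    """Discretized Laplace likelihood of the quantized latents, with the
    hyper-decoder predicting log-scales so that exp() keeps the scale b > 0."""
    b = torch.exp(log_scale)                          # the exponent in question
    dist = torch.distributions.Laplace(torch.zeros_like(b), b)
    likelihood = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
    return torch.clamp(likelihood, min=1e-9)          # likelihood bound, as in (c) above
```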

Another point I observed: liujiaheng's implementation uses a simplified version of EntropyBottleneck (the non-parametric model for the hyperprior density, Balle2018 appendix 6.1), which is nearly equivalent to using filters=(1, 1, 1) instead of (3, 3, 3). So I implemented the full EntropyBottleneck in PyTorch, which I believe is fully equivalent to Balle2018's code. However, it didn't make a real difference in terms of PSNR/bpp (though training was a lot faster).
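Roughly, the per-channel density I mean looks like this (a simplified single-channel sketch of appendix 6.1 with my own names, not my actual code and not the reference code):

```python
import torch
from torch import nn
import torch.nn.functional as F

class UnivariateDensity(nn.Module):
    """Sketch of the non-parametric cumulative from Balle2018 appendix 6.1 for
    one latent channel.  filters=(3, 3, 3) matches the paper's default;
    filters=(1, 1, 1) is roughly the simplified variant."""

    def __init__(self, filters=(3, 3, 3), init_scale=10.0):
        super().__init__()
        dims = (1,) + tuple(filters) + (1,)
        scale = init_scale ** (1.0 / (len(dims) - 1))
        self.H, self.b, self.a = nn.ParameterList(), nn.ParameterList(), nn.ParameterList()
        for k in range(len(dims) - 1):
            init = float(torch.log(torch.expm1(torch.tensor(1.0 / scale / dims[k + 1]))))
            self.H.append(nn.Parameter(torch.full((dims[k + 1], dims[k]), init)))
            self.b.append(nn.Parameter(torch.zeros(dims[k + 1], 1).uniform_(-0.5, 0.5)))
            if k < len(dims) - 2:
                self.a.append(nn.Parameter(torch.zeros(dims[k + 1], 1)))

    def cdf(self, x: torch.Tensor) -> torch.Tensor:
        # Composition of monotonic layers f_k(x) = g_k(H_k x + b_k), final sigmoid.
        y = x.reshape(1, -1)
        for k in range(len(self.H)):
            y = F.softplus(self.H[k]) @ y + self.b[k]          # H constrained positive
            if k < len(self.a):
                y = y + torch.tanh(self.a[k]) * torch.tanh(y)  # a constrained to (-1, 1)
        return torch.sigmoid(y).reshape(x.shape)

    def likelihood(self, y_hat: torch.Tensor) -> torch.Tensor:
        return torch.clamp(self.cdf(y_hat + 0.5) - self.cdf(y_hat - 0.5), min=1e-9)
```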

Also, @trougnouf, if you eventually managed to replicate Balle2018 and made any other noteworthy observations in the process, it would be great if you could share those insights.

The help from both of you is much appreciated!

All the best,
Danny


trougnouf commented on June 11, 2024

Hi Danny,

Thank you for sharing these insights!

I never thought of removing the gradient clipping; in the end my configuration is pretty much the one defined by @liujiaheng. I tried different optimizers (RAdam, RangeLars = RAdam + LARS + Lookahead), but Adam seems to give me the best performance in a given amount of training time. The Gaussian distribution never gave me results as good, so I kept Laplace.

My learning rate scheduler is a bit different: I multiply the learning rate of the entropy model and/or autoencoder by 0.99 iff the overall performance and the individual bpp/PSNR get worse two test steps in a row (1 test step = 2500 train steps). I don't know whether this results in an improvement, but the rate-distortion continuously improves instead of waiting 4M steps to see a big drop.
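In code terms the rule is roughly the following (simplified to a single optimizer and a single scalar test loss; placeholder names, not my actual code):

```python
def maybe_decay_lr(optimizer, test_losses, factor=0.99, patience=2):
    # Multiply the learning rate by `factor` iff the monitored loss got worse
    # `patience` test steps in a row (1 test step = 2500 train steps here).
    if len(test_losses) > patience and all(
        test_losses[-i] > test_losses[-i - 1] for i in range(1, patience + 1)
    ):
        for group in optimizer.param_groups:
            group["lr"] *= factor
```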

With this I was getting about the same rate-distortion when training the PyTorch version as when training the provided TensorFlow 2 code for a limited number of steps, but I found the PyTorch version trained much faster, so I never trained the TF2 version for the full 6M steps. I haven't compared against the results published in the paper because I find that meaningless without the same training data.

I still have difficulties with the MS-SSIM loss: it tends to become unstable after 500k steps, or stops improving before then, whereas the lambda=4096 model that is used for pretraining trains for 6M steps and never has issues. Still, the performance is there in the end (significantly better than BPG with the MS-SSIM loss).
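For context, the MS-SSIM objective I mean has this general shape (a sketch using the pytorch_msssim package as one possible MS-SSIM implementation; the weighting is a placeholder, not my exact configuration):

```python
import torch
from pytorch_msssim import ms_ssim  # third-party package, one possible MS-SSIM implementation

def rate_msssim_loss(x, x_hat, bpp, lmbda):
    # Rate-distortion loss with an MS-SSIM distortion term; images assumed in [0, 1].
    distortion = 1.0 - ms_ssim(x_hat, x, data_range=1.0)
    return bpp + lmbda * distortion
```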


larjung commented on June 11, 2024

Hi Benoit,
Thanks for your detailed answer! It is reassuring to know that you could make it work, well done.
Regarding the Laplace distribution: although I get that Laplace probably converges slightly better, I still cannot figure out the explanation for applying the exponent to the scales (predicted by the hyper-decoder) when using Laplace. Except for @liujiaheng's code, I haven't found a clue anywhere else (I also see that you still have an open ticket there about the same issue...). Did you happen to find a plausible explanation?


trougnouf commented on June 11, 2024

No, I can't pinpoint it in Johannes Ballé's code; maybe he would have an insight.

