The model defined in <a href="https://github.com/tensorflow/compression/blob/master/mo

I am also trying to do a similar thing as <a class="user-mention notranslate" data-hov

Difference with hyperprior paper: hyperprior (decoder) loses a ReLU about compression HOT 7 CLOSED

tensorflow commented on June 11, 2024

Difference with hyperprior paper: hyperprior (decoder) loses a ReLU

from compression.

Comments (7)

jonycgn commented on June 11, 2024

Hi Benoit, sorry for the late reply. The code in bmshj2018.py uses the GaussianConditional class, which lower bounds the values appropriately. So I think that the relu in the figure is basically redundant. We noticed the difference between our paper and some other implementations and did some experiments with Laplacian vs. Gaussian as well. As far as I can tell, Laplacians may have some benefit in terms of training (converging faster or more stably), but after convergence, it didn't seem to make a big difference for us. The difference you are seeing could have to do with the exact training schedule. In some cases, we found it beneficial to train with a 2x higher lambda in the first third/half of training, and then lower it. It is possible that using Laplacians makes that unnecessary, because Laplacian gradients are numerically somewhat more predictable, which has an effect on training dynamics (magnitude is always 1, whereas Gaussian gradients magnitude grows linearly). The same holds for the absolute value in the hyper-encoder. It should not make a big difference for the zero-mean conditional model, but note that for the follow-up model which defines a conditional Gaussian with non-zero mean, it is probably better to remove it. Hope this helps! Johannes

…

On Thu, Sep 24, 2020 at 5:25 AM Benoit Brummer ***@***.***> wrote: The model defined in compression/models/bmshj2018.py <https://github.com/tensorflow/compression/blob/master/models/bmshj2018.py> differs from the one shown in the paper <https://arxiv.org/abs/1802.01436> (whose figure is shown below): [image: Screenshot from 2020-09-24 10-45-38] <https://user-images.githubusercontent.com/6899116/94122919-6b0d5d00-fe53-11ea-9fe9-02da231aa961.png> The paper has 3 ReLUs in the hyper-decoder, but the HyperSynthesisTransform <https://github.com/tensorflow/compression/blob/fa8ac89ba717a29f64e9e04a70465110131a66d8/models/bmshj2018.py#L166> class omits the last ReLU. Are the results shown in the paper based on the paper's definition, or the published code? I am trying to replicate Balle2018 to compare the results of my work, for that I used the pytorch implementation defined in https://github.com/liujiaheng/compression and I was getting good results, but I noticed several differences from the paper's definition. One is the lack of a ReLU which is present in this code as well. liujiaheng also replaces the gaussian distribution by a laplace distribution and the hyper-encoder absolute value with an exponential function (that also doesn't seem to be mentioned in a paper). Once I revert those modifications I no longer seem to get very good results (although training is still far from complete). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#53>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABYGBVBROFNRHMLE7RZJRNTSHMGARANCNFSM4RYB64WQ> .

from compression.

trougnouf commented on June 11, 2024

Thank you Johannes !

Liu's code also applies bounds to the output of the hyper-decoder (1e-10, 1e10), I don't know why adding a ReLU yields such bad results there (at least *1.5 bpp) since it seems to essentially do the same thing. Anyway I will leave it off, it's reassuring that it doesn't seem to be an error and is accounted for somehow.

Likewise I will keep the Laplace since it shouldn't make a significant difference and most of my completed trainings use it.

All the best,

Benoit

from compression.

jonycgn commented on June 11, 2024

One more note regarding ReLU.

It could be that you are observing the "dead unit" problem. If an activation consistently ends up in the "flat" region of the ReLU, that means that the gradient for this unit will always be zero. Hence, it will never learn anything.

This is one of the reasons we have implemented the lower_bound function, which substitutes a gradient in the flat part. IIRC, GaussianConditional should be using that, but if you manually add a ReLU it won't be effective.

from compression.

larjung commented on June 11, 2024

I am also trying to do a similar thing as @trougnouf described in, i.e. trying to replicate Balle2018 based on the pytorch implementation defined in https://github.com/liujiaheng/compression.
And I still can't figure it out completely.
@trougnouf didn't mention in the original post but in liujiaheng's repo they are clipping the values of all gradients during training, to the range [-5,5], probably to avoid exploding gradients. @jonycgn's code doesn't apply such clipping as far as I know.
This looked somewhat a too surgical intervention to me, so I tried to see if I can still converge without this grad clipping.
Based on this thread I tried various train configurations: e.g. (a) Gaussian / Laplacian models, (b) ReLU/exponent/nothing in hyper decoder's final activation , (c) applying the likelihood_bound (1e-9) as implemented by @jonycgn.
Using the gradient clipping, all of these settings could converge without much problem.
Unfortunately, none of these configuration worked when I disabled gradient clipping, they all suffered from very big gradients causing psnr to oscillate and not converge.
-@trougnouf, I am very curious to know if/how you managed to make training converge without gradient clipping.

Also, @trougnouf wrote above that the exponential function used by liujiaheng (for the hyper decoder), was mentioned in the paper's equations and possibly in @jonycgn's code as well. I looked closely into it, but I couldn't find any indication to this exponent in the paper nor the code. I am curious to know, where you saw it exactly.

Another point that I observed - liujiaheng's implementation uses a simplified version of EntropyBottleneck (the non-parametric model for hyperprior density, Balle2018 appendix 6.1), which is nearly equivalent to using filters=(1, 1, 1) instead of (3,3,3). So I implemented the full EntropyBottleneck in pytorch, I believe it is fully equivalent to Balle2018's code. However it didn't make a real difference in terms of psnr/bpp (yet training was a lot faster).

Also @trougnouf, if you eventually managed to replicate Balle2018 and discovered any other noteworthy observations in the process, it would be great if you could share more interesting insights.

The help from both of you is much appreciated!

All the best,
Danny

from compression.

trougnouf commented on June 11, 2024

Hi Danny ,

Thank you for sharing these insights !

I never thought of taking off the gradient clipping, in the end my configuration is pretty much that defined by @liujiaheng. I tried different optimizers (RAdam, RangeLars=RAdam+LARS+Lookahead) but Adam seems to give me the best performance in a given amount of training time. The Gaussian distribution never gave me as good results so I kept Laplace.

My learning rate scheduler is a bit different, I multiply the learning rate of the entropy model and/or autoencoder by 0.99 iff the overall performance and the individual bpp/psnr gets worse two test steps in a row. (1 test step = 2500 train steps) I don't know if this results in an improvement but the rate-distortion continuously improves instead of waiting 4M steps to see a big drop.

With this I was getting similar same rate-distortion when training the pytorch version as when I train the given tensorflow2 code after a limited number of steps, but I found the pytorch version trained much faster so I never trained the TF2 version for a full 6M steps. I haven't compared to the results published in the paper because I find it's meaningless without the same training data.

I still have difficulties with the MS-SSIM loss, it tends to get unstable after 500k steps or stops improving beforehand whereas the lambda=4096 model that's used to pretrain trains for 6M steps and never has issues. Still the performance is there in the end (significantly better than BPG with the MS-SSIM loss)

from compression.

larjung commented on June 11, 2024

Hi Benoit,
Thanks for your detailed answer! It is reassuring to know that you could make it work, well done.
Regarding using the Laplace distribution, although I get that Laplace probably converges slightly better, I still cannot figure out the explanation for applying the exponent on the scales (predicted by hyper-decoder) when using Laplace. Except for @liujiaheng's code, I haven't found a clue anywhere else (I also see that you still have an open ticket there with the same issue...). Did you happen to find a plausible explanation?

from compression.

trougnouf commented on June 11, 2024

No I can't pinpoint it in Johannes Ballé's code, maybe he would have an insight.

from compression.

Difference with hyperprior paper: hyperprior (decoder) loses a ReLU about compression HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs