Comments (7)
from compression.
Thank you, Johannes!
Liu's code also applies bounds to the output of the hyper-decoder (1e-10, 1e10), so I don't know why adding a ReLU yields such bad results there (at least 1.5× the bpp), since it seems to do essentially the same thing. Anyway, I will leave it off; it's reassuring that it doesn't seem to be an error and is accounted for somehow.
Likewise, I will keep the Laplace distribution, since it shouldn't make a significant difference and most of my completed training runs use it.
All the best,
Benoit
One more note regarding ReLU.
It could be that you are observing the "dead unit" problem. If an activation consistently ends up in the flat region of the ReLU, the gradient for that unit will always be zero, so it will never learn anything.
This is one of the reasons we implemented the lower_bound function, which substitutes a gradient in the flat part. IIRC, GaussianConditional should be using it, but if you manually add a ReLU on top, that substitution won't take effect.
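In PyTorch terms, the trick behind lower_bound looks roughly like the sketch below (my simplified reading of the idea, not the actual tensorflow_compression implementation): the forward pass is a plain clamp, but the backward pass additionally lets gradients through whenever they would push the input back above the bound, so units stuck below it can still recover.

```python
import torch

class LowerBound(torch.autograd.Function):
    """Sketch of a lower bound with an "identity if towards" gradient.
    Forward is max(x, bound); backward passes the gradient through when
    x is above the bound, OR when the gradient is negative (a gradient
    descent step would then increase x, moving it back above the bound).
    A plain ReLU/clamp would zero the gradient for all x below the bound."""

    @staticmethod
    def forward(ctx, x, bound):
        ctx.save_for_backward(x)
        ctx.bound = bound
        return torch.clamp(x, min=bound)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient if x is feasible or the update points "towards"
        # the feasible region; block it otherwise.
        pass_through = (x >= ctx.bound) | (grad_output < 0)
        return grad_output * pass_through.to(grad_output.dtype), None
```

With a plain ReLU, the first element below would stay dead forever; here it gets a gradient as soon as the loss wants it to increase.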
I am also trying to do the same thing @trougnouf described above, i.e. replicate Balle2018 based on the PyTorch implementation at https://github.com/liujiaheng/compression.
And I still can't figure it out completely.
@trougnouf didn't mention it in the original post, but in liujiaheng's repo they clip all gradient values during training to the range [-5, 5], probably to avoid exploding gradients. @jonycgn's code doesn't apply such clipping, as far as I know.
This looked like too surgical an intervention to me, so I tried to see if I could still converge without the gradient clipping.
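For concreteness, the clipping in question is per-element value clipping (not norm clipping); a minimal sketch of where it sits in a PyTorch training loop, with a hypothetical toy model standing in for the autoencoder + hyperprior:

```python
import torch

# Hypothetical stand-in for the real network; only the position of the
# clipping call in the loop matters here.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Clip each gradient element to [-5, 5] before the optimizer step.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)
optimizer.step()
```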
Based on this thread, I tried various training configurations, e.g. (a) Gaussian/Laplacian models, (b) ReLU/exponent/nothing as the hyper-decoder's final activation, (c) applying the likelihood_bound (1e-9) as implemented by @jonycgn.
With gradient clipping, all of these settings converged without much trouble.
Unfortunately, none of these configurations worked once I disabled gradient clipping: they all suffered from very large gradients, causing the PSNR to oscillate and never converge.
@trougnouf, I am very curious to know if/how you managed to make training converge without gradient clipping.
Also, @trougnouf wrote above that the exponential function liujiaheng applies (in the hyper-decoder) was mentioned in the paper's equations and possibly in @jonycgn's code as well. I looked closely, but I couldn't find any indication of this exponent in the paper or the code. I am curious to know where exactly you saw it.
Another point I observed: liujiaheng's implementation uses a simplified version of EntropyBottleneck (the non-parametric model for the hyperprior density, Balle2018 appendix 6.1), which is nearly equivalent to using filters=(1, 1, 1) instead of (3, 3, 3). So I implemented the full EntropyBottleneck in PyTorch; I believe it is fully equivalent to Balle2018's code. However, it didn't make a real difference in terms of PSNR/bpp (though training was a lot faster).
Also, @trougnouf, if you eventually managed to replicate Balle2018 and made any other noteworthy observations along the way, it would be great if you could share them.
The help from both of you is much appreciated!
All the best,
Danny
Hi Danny,
Thank you for sharing these insights!
I never thought of taking off the gradient clipping; in the end my configuration is pretty much the one defined by @liujiaheng. I tried different optimizers (RAdam, RangeLars = RAdam + LARS + Lookahead), but Adam seems to give me the best performance in a given amount of training time. The Gaussian distribution never gave me results as good, so I kept Laplace.
My learning-rate scheduler is a bit different: I multiply the learning rate of the entropy model and/or autoencoder by 0.99 iff the overall performance and the individual bpp/PSNR get worse two test steps in a row (1 test step = 2500 train steps). I don't know whether this results in an improvement, but the rate-distortion improves continuously instead of waiting 4M steps to see a big drop.
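In code, the idea is roughly this (a simplified, hypothetical sketch, not my exact implementation, and the "got worse" judgment over bpp/PSNR happens outside):

```python
import torch

class TwoStrikesDecay:
    """Multiply the learning rate by `factor` iff the monitored metrics
    got worse two test steps in a row. Hypothetical sketch."""

    def __init__(self, optimizer, factor=0.99):
        self.optimizer = optimizer
        self.factor = factor
        self.worse_streak = 0  # consecutive test steps that got worse

    def step(self, got_worse):
        # Call once per test step with the outcome of the evaluation.
        self.worse_streak = self.worse_streak + 1 if got_worse else 0
        if self.worse_streak >= 2:
            for group in self.optimizer.param_groups:
                group["lr"] *= self.factor
```

Separate instances can be kept for the entropy model's and the autoencoder's parameter groups so their rates decay independently.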
With this I was getting similar rate-distortion when training the PyTorch version as when training the given TensorFlow 2 code for a limited number of steps, but I found the PyTorch version trained much faster, so I never trained the TF2 version for a full 6M steps. I haven't compared with the results published in the paper because I find that meaningless without the same training data.
I still have difficulties with the MS-SSIM loss: it tends to become unstable after 500k steps, or stops improving before then, whereas the lambda=4096 model that's used for pretraining trains for 6M steps and never has issues. Still, the performance is there in the end (significantly better than BPG with the MS-SSIM loss).
Hi Benoit,
Thanks for your detailed answer! It is reassuring to know that you could make it work, well done.
Regarding the Laplace distribution: although I get that Laplace probably converges slightly better, I still cannot figure out the justification for applying the exponent to the scales (predicted by the hyper-decoder) when using Laplace. Except for @liujiaheng's code, I haven't found a clue anywhere else (I also see that you still have an open ticket there about the same issue...). Did you happen to find a plausible explanation?
No, I can't pinpoint it in Johannes Ballé's code; maybe he would have an insight.