Comments (11)
Hi @Maybe2022,
I have also encountered this, but only very rarely. I usually deal with it in one of the following ways: 1) slightly increasing the `lam` factor in the `anderson` function; 2) catching this exception (i.e., using `try`/`except`) and switching to a stronger solver like Broyden's method; or 3) applying some sort of DEQ stabilization (e.g., Jacobian regularization or auxiliary losses).
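A minimal sketch of option (2), with plain fixed-point iteration standing in for both solvers (the function names and signatures here are illustrative, not the repo's actual `anderson`/`broyden` API):

```python
import numpy as np

def anderson_like(f, z0, max_iter=20, tol=1e-4, lam=1e-4):
    # stand-in for the fast solver; `lam` mirrors the regularization
    # factor mentioned above (unused in this plain-iteration sketch)
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if np.linalg.norm(z_next - z) / (np.linalg.norm(z_next) + 1e-8) < tol:
            return z_next
        z = z_next
    raise RuntimeError("solver failed to converge")

def solve_with_fallback(f, z0, stronger_solver):
    try:
        return anderson_like(f, z0, max_iter=5)   # fast path
    except RuntimeError:
        return stronger_solver(f, z0)             # e.g. Broyden's method

# toy contraction with fixed point z* = 1
f = lambda z: 0.5 * z + 0.5
broyden_like = lambda f, z0: anderson_like(f, z0, max_iter=200)
z_star = solve_with_fallback(f, np.zeros(3), broyden_like)
```

Here the fast path deliberately runs out of iterations, so the call falls through to the stronger solver.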
This Anderson error usually happens as a result of growing instability in the DEQ model. You probably want to first check whether the model converges properly most of the time (i.e., measure ||f(z)-z||/||f(z)||; if this is <0.05 at the end of the solver iterations, convergence is good). If stability is an issue, fix it; if not, then I suggest you go with approaches (1) and (2) first.
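The convergence measure is a one-liner; a NumPy sketch (the actual models use PyTorch tensors, but the formula is the same, and the toy layer below is a placeholder):

```python
import numpy as np

def rel_residual(f, z):
    # ||f(z) - z|| / ||f(z)||, the convergence measure described above
    fz = f(z)
    return np.linalg.norm(fz - z) / (np.linalg.norm(fz) + 1e-8)

# toy contraction with fixed point z* = 1, standing in for a DEQ layer
f = lambda z: 0.9 * z + 0.1

good = rel_residual(f, np.full(4, 0.999))  # near the fixed point
bad = rel_residual(f, np.zeros(4))         # far from it
```

By the 5% rule of thumb above, `good` indicates proper convergence and `bad` does not.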
Let me know if this helps!
from deq.
Thank you very much for your answer,
I'm learning a combination of SNNs and DEQs, which may be unstable. I've switched to your Broyden's method now.
At first, I learned from your tutorial on Colab. The Anderson code there is written directly for 4-dimensional tensors. I don't understand the code of the Broyden solver, and it's difficult to follow the shape transformations of the data in it.
Hi @Maybe2022 ,
You probably want to consider some of the stabilization techniques then. Besides Jacobian regularization, another method is to apply auxiliary losses on the fixed-point solving process. E.g., typically we do `L(z*, y)`, where `z* = z^{30}` is the result of the root-solving; but we can also additionally add `0.2 * L(z^{10}, y)` as an extra loss term that encourages the model to converge earlier. This extra loss is memory-inefficient due to IFT, so you can apply JFB on it (see this paper).
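A PyTorch sketch of this auxiliary loss combined with Jacobian-free backprop (JFB), using a toy `tanh` layer in place of a real DEQ cell; all names and shapes here are illustrative, not the repo's API:

```python
import torch

# Toy DEQ "layer": f(z) = tanh(W z + x); small weights keep it contractive.
torch.manual_seed(0)
W = (torch.randn(4, 4) * 0.1).requires_grad_(True)
x = torch.randn(4)
y = torch.randn(4)

f = lambda z: torch.tanh(z @ W.T + x)

# Solve for the fixed point WITHOUT tracking gradients, recording iterates.
with torch.no_grad():
    z = torch.zeros(4)
    traj = []
    for _ in range(30):
        z = f(z)
        traj.append(z.clone())
z10, z30 = traj[9], traj[29]  # z^{10} and z^{30} from the solver

loss_fn = torch.nn.functional.mse_loss

# JFB: one differentiable application of f on top of each detached iterate,
# so the backward pass costs a single layer instead of unrolling the solver.
loss = loss_fn(f(z30.detach()), y) + 0.2 * loss_fn(f(z10.detach()), y)
loss.backward()  # W.grad now holds the (approximate) JFB gradient
```

The `0.2` weight on the `z^{10}` term is the example value from the comment above; in practice it is a tunable hyperparameter.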
As for Broyden's method, I'm really just following the Sherman-Morrison formula on Wikipedia, see here. Note that we should never form the J^{-1} matrix, but only keep a low-rank-update version of it (i.e., `-I + u1 v1^T + u2 v2^T + ...`).
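To illustrate the low-rank representation, a NumPy sketch that applies `-I + u1 v1^T + u2 v2^T + ...` to a vector without ever materializing the matrix (the vectors here are random placeholders for the actual Broyden update vectors):

```python
import numpy as np

def apply_inv_jacobian(us, vs, g):
    # computes (-I + sum_i u_i v_i^T) @ g using only vector operations:
    # O(k*n) work instead of the O(n^2) cost of a dense matrix-vector product
    out = -g
    for u, v in zip(us, vs):
        out = out + u * (v @ g)
    return out

rng = np.random.default_rng(0)
n, k = 6, 3
us = [rng.standard_normal(n) for _ in range(k)]
vs = [rng.standard_normal(n) for _ in range(k)]
g = rng.standard_normal(n)

# sanity check against the dense matrix (demo only; never form this in practice)
J_inv = -np.eye(n) + sum(np.outer(u, v) for u, v in zip(us, vs))
assert np.allclose(apply_inv_jacobian(us, vs, g), J_inv @ g)
```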
Let me know if you want me to clarify further!
Thank you very much for your help. I'm trying the techniques you introduced.
Another question is about multi-GPU training: the speedup is not obvious. For example, when I use 8 GPUs for training, the training time is only reduced by half. Is this caused by too much CPU participation in the fixed-point iteration? How can I solve it?
Thank you very much.
Hi @Maybe2022 ,
There shouldn't be many CPU ops in the fixed-point solving process; the Anderson/Broyden solvers run entirely on GPUs. Note that in general, when you use 2x more GPUs, you don't get a 2x speedup. There could be a few explanations for what you observed:
- The speed could be bottlenecked by things like data loading. You may want to check that.
- The JFB idea I previously mentioned should make the backward pass almost free. Assuming the model is stable, it should make the training 2x faster.
- Check your PyTorch and CUDA versions. The latest DEQ implements its backward pass with hooks, which are only fully supported in PyTorch v1.10.
- There is some cost as you use more GPUs (mostly GPU context and communication cost). Therefore, as you use more GPUs, you probably want to also adjust the batch size accordingly.
Thank you very much;
Another problem is about overfitting. When I train only on CIFAR-10, convergence on the training set is normal, but the test accuracy only reaches about 87%.
Is this related to the eps and threshold of the fixed-point iteration?
Thank you very much for your help
I took a shortcut: I only rewrote the computation graph and did not apply the hook.
Not sure what exactly went wrong, but if you just use `cls_mdeq_LARGE_reg.yaml` you should expect ~93% accuracy pretty consistently.
If you are using Jacobian-free backprop, then yes, you do want to monitor the stability of the fixed-point iterations; with such instability, because the gradients are inexact, overfitting can indeed happen. Could you check whether the convergence is proper?
Hi @jerrybai1995,
Thanks for your excellent work! I really enjoy this series of papers. Regarding Broyden's method, I notice that you seem to be using the "bad Broyden's method". May I ask whether you intentionally chose this version? Why not use the "good" one? (However, I looked up some literature, and the two versions seem to have similar performance.) Would you mind sharing some insight into this choice? Thanks in advance.
Hi @liu-jc ,
Thanks for the question. There is no specific reason why Broyden "good" was not picked; it was more of an empirical choice. When I started working on this project, I tried both Broyden "good" and "bad", found the bad one to be a bit better, and went on to write a PyTorch version of it. But as you can see, eventually Anderson acceleration also works very well; and if you apply Jacobian regularization, even naive fixed-point iteration works quite well. So I definitely don't think Broyden "good" wouldn't work 😆
Hope this clarifies things!
Hi @jerrybai1995 ,
Thanks for your prompt answer! Now I understand. Thanks for your help again :-)