This repository, maintained by woodminus, benchmarks various attention mechanisms used in Vision Transformers. It provides re-implementations of each mechanism together with a performance benchmark of their parameter counts, FLOPs, and CPU/GPU throughput.
Requirements:

- PyTorch 1.8+
- timm
- ninja
- einops
- fvcore
- matplotlib
Testing environment:

- NVIDIA RTX 3090
- Intel® Core™ i9-10900X CPU @ 3.70GHz
- Memory 32GB
- Ubuntu 22.04
- PyTorch 1.8.1 + CUDA 11.1
Benchmark settings:

- input: 14 x 14 = 196 tokens (1/16-scale feature maps, as in common ImageNet-1K training)
- batch size for speed testing (images/s): 64
- embedding dimension: 768
- number of heads: 12
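Under these settings, a benchmark input tensor would look like the following (a minimal sketch; the variable names are mine, not the repo's):

```python
import torch

batch_size = 64       # images per batch for speed testing
num_tokens = 14 * 14  # 196 tokens, i.e. 1/16-scale feature maps of a 224x224 image
embed_dim = 768
num_heads = 12        # each head then attends over 768 / 12 = 64 channels

# Dummy input in the usual (batch, tokens, channels) layout
x = torch.randn(batch_size, num_tokens, embed_dim)
```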
For example, to test HiLo attention:

```bash
cd attentions/
python hilo.py
```
By default, the script tests the model on both CPU and GPU, and FLOPs are measured with fvcore. You may want to edit the source file as needed.
Example outputs:

```bash
Number of Params: 2.2 M
FLOPs = 298.3 M
throughput averaged with 30 times
batch_size 64 throughput on CPU 1029
throughput averaged with 30 times
batch_size 64 throughput on GPU 5104
```
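The throughput numbers above are averages over 30 timed runs. A minimal sketch of such a measurement (the helper name and callable interface are mine, not the repo's):

```python
import time

def measure_throughput(run_batch, batch_size=64, repeats=30, warmup=5):
    """Average throughput (images/s) over `repeats` runs, mirroring the
    repo's printed output. `run_batch` is any callable that performs one
    forward pass on a batch (hypothetical helper, not from the repo)."""
    for _ in range(warmup):   # warm up caches / lazy CUDA kernel compilation
        run_batch()
    start = time.perf_counter()
    for _ in range(repeats):
        run_batch()
    # On GPU, call torch.cuda.synchronize() before reading the clock,
    # since CUDA kernels are launched asynchronously.
    elapsed = time.perf_counter() - start
    return int(repeats * batch_size / elapsed)
```

Usage would be something like `measure_throughput(lambda: model(x), batch_size=64)`.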
- Numerous attention mechanisms along with their respective papers and code.

| Name | Params (M) | FLOPs (M) | CPU Speed | GPU Speed | Demo |
| ---- | ---------- | --------- | --------- | --------- | ---- |

- Various attention mechanisms along with their respective computational cost (parameters, FLOPs) and speed.
Note: Each method has its own hyperparameters. For a fair comparison on 1/16-scale feature maps, all methods in the above table adopt the default 1/16-scale settings from their released code repositories. For example, on 1/16-scale feature maps, HiLo in LITv2 adopts a window size of 2 and an alpha of 0.9. Future work will consider more scales and memory benchmarking.
This repository is released under the Apache 2.0 license as found in the LICENSE file.