Authors - Diganta Misra 1†, Trikay Nalamada 1,2†, Ajay Uppili Arasanipalai 1,3†, Qibin Hou 4
1 - Landskape, 2 - IIT Guwahati, 3 - University of Illinois at Urbana-Champaign, 4 - National University of Singapore
† - Denotes Equal Contribution
Abstract - Benefiting from their capability to build inter-dependencies among channels or spatial locations, attention mechanisms have recently been studied extensively and applied broadly across a variety of computer vision tasks. In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights that captures cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies through rotation operations followed by residual transformations, encoding inter-channel and spatial information with negligible computational overhead. Our method is simple and efficient, and it can be easily plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks, including image classification on ImageNet-1k and object detection on the MS-COCO and PASCAL VOC datasets. Furthermore, we provide extensive insight into the performance of triplet attention by visually inspecting GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition on the importance of capturing dependencies across dimensions when computing attention weights.
Figure 1. (a) Squeeze-and-Excitation (SE) block. (b) Convolutional Block Attention Module (CBAM); note: GMP denotes Global Max Pooling. (c) Global Context (GC) block. (d) Triplet attention (ours).
Figure 2. GradCAM and GradCAM++ comparisons for ResNet-50, based on sample images from the ImageNet dataset.
To generate GradCAM and GradCAM++ results, please follow the code in this repository.
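For reference, below is a minimal PyTorch sketch of the three-branch structure described in the abstract: each branch compresses its view of the input to two channels via max- and average-pooling, applies a k × k convolution with batch normalization, and gates that view with a sigmoid; two of the branches operate on rotated views of the tensor to capture cross-dimension interaction. Class names and defaults here are illustrative; the code in this repository remains the reference implementation.

```python
import torch
import torch.nn as nn


class ZPool(nn.Module):
    """Concatenate channel-wise max- and mean-pooled maps (2 output channels)."""

    def forward(self, x):
        return torch.cat(
            (x.max(dim=1, keepdim=True)[0], x.mean(dim=1, keepdim=True)), dim=1
        )


class AttentionGate(nn.Module):
    """Z-pool -> k x k conv + BN -> sigmoid, then rescale the input view."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.compress = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=(kernel_size - 1) // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):
        scale = torch.sigmoid(self.conv(self.compress(x)))
        return x * scale


class TripletAttention(nn.Module):
    """Three branches capture (C, W), (C, H) and (H, W) interactions; outputs are averaged."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.cw = AttentionGate(kernel_size)  # branch on the rotated (H <-> C) view
        self.hc = AttentionGate(kernel_size)  # branch on the rotated (W <-> C) view
        self.hw = AttentionGate(kernel_size)  # plain spatial branch

    def forward(self, x):  # x: (N, C, H, W)
        # Rotate so H sits in the channel position, attend, rotate back.
        out1 = self.cw(x.permute(0, 2, 1, 3).contiguous()).permute(0, 2, 1, 3).contiguous()
        # Rotate so W sits in the channel position, attend, rotate back.
        out2 = self.hc(x.permute(0, 3, 2, 1).contiguous()).permute(0, 3, 2, 1).contiguous()
        # Ordinary spatial attention over (H, W).
        out3 = self.hw(x)
        return (out1 + out2 + out3) / 3.0


if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    y = TripletAttention(kernel_size=7)(x)
    print(y.shape)  # torch.Size([2, 64, 56, 56]) -- shape is preserved
```

The only learnable parameters in this sketch are the three two-input-channel convolutions (plus their batch norms), which is consistent with the negligible parameter overhead visible in the tables below.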
ImageNet-1k classification results:

Model | Parameters | GFLOPs | Top-1 Error | Top-5 Error | Weights |
---|---|---|---|---|---|
ResNet-18 + Triplet Attention (k = 3) | 11.69 M | 1.823 | 29.67% | 10.42% | Google Drive |
ResNet-18 + Triplet Attention (k = 7) | 11.69 M | 1.825 | 28.91% | 10.01% | Google Drive |
ResNet-50 + Triplet Attention (k = 3) | 25.56 M | 4.131 | 23.88% | 6.938% | Google Drive |
ResNet-50 + Triplet Attention (k = 7) | 25.56 M | 4.169 | 22.52% | 6.326% | Google Drive |
MobileNet v2 + Triplet Attention (k = 3) | 3.506 M | 0.322 | 27.38% | 9.23% | Google Drive |
MobileNet v2 + Triplet Attention (k = 7) | 3.51 M | 0.327 | 28.01% | 9.516% | Google Drive |
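The "+ Triplet Attention" rows above are backbones with the attention module inserted into their residual blocks. As a rough sketch only (the paper attaches the module inside each block, analogous to SE and CBAM; for brevity this snippet simply wraps each block's output), a triplet attention variant of a torchvision ResNet-50 could be assembled as follows, reusing the `TripletAttention` class from the snippet above:

```python
import torch.nn as nn
from torchvision.models import resnet50


def attach_triplet_attention(model, kernel_size=7):
    """Refine every residual block's features with triplet attention.

    Simplification: the module is applied to each block's output (after the
    skip connection) rather than inside the block; consult the repository for
    the exact placement used to produce the numbers above.
    """
    for name, stage in model.named_children():
        if name.startswith("layer"):  # layer1 .. layer4 hold the residual blocks
            for i in range(len(stage)):
                # TripletAttention is defined in the earlier sketch.
                stage[i] = nn.Sequential(stage[i], TripletAttention(kernel_size))
    return model


# k = 7 variant from the table above (torchvision >= 0.13 constructor API).
model = attach_triplet_attention(resnet50(weights=None), kernel_size=7)
```

Since each gate adds only a handful of parameters, the wrapped model's parameter count stays essentially at the 25.56 M reported above for ResNet-50.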
All models in the detection tables below are trained with the 1x learning schedule.
Object detection results (BBox AP) on MS-COCO:

Backbone | Detectors | AP | AP50 | AP75 | APS | APM | APL | Weights |
---|---|---|---|---|---|---|---|---|
ResNet-50 + Triplet Attention (k = 7) | Faster R-CNN | 39.2 | 60.8 | 42.3 | 23.3 | 42.5 | 50.3 | Google Drive |
ResNet-50 + Triplet Attention (k = 7) | RetinaNet | 38.2 | 58.5 | 40.4 | 23.4 | 42.1 | 48.7 | Google Drive |
ResNet-50 + Triplet Attention (k = 7) | Mask R-CNN | 39.8 | 61.6 | 42.8 | 24.3 | 42.9 | 51.3 | Google Drive |
Instance segmentation results (Mask AP) on MS-COCO:

Backbone | Detectors | AP | AP50 | AP75 | APS | APM | APL | Weights |
---|---|---|---|---|---|---|---|---|
ResNet-50 + Triplet Attention (k = 7) | Mask R-CNN | 35.8 | 57.8 | 38.1 | 18.0 | 38.1 | 50.7 | Google Drive |
Person keypoint detection results (Keypoint AP) on MS-COCO:

Backbone | Detectors | AP | AP50 | AP75 | APM | APL | Weights |
---|---|---|---|---|---|---|
ResNet-50 + Triplet Attention (k = 7) | Keypoint R-CNN | 64.7 | 85.9 | 70.4 | 60.3 | 73.1 | Google Drive |

BBox AP results on MS-COCO using Keypoint R-CNN:
Backbone | Detectors | AP | AP50 | AP75 | APS | APM | APL | Weights |
---|---|---|---|---|---|---|---|---|
ResNet-50 + Triplet Attention (k = 7) | Keypoint R-CNN | 54.8 | 83.1 | 59.9 | 37.4 | 61.9 | 72.1 | Google Drive |
Additional object detection results (BBox AP) on MS-COCO:

Backbone | Detectors | AP | AP50 | AP75 | APS | APM | APL | Weights |
---|---|---|---|---|---|---|---|---|
ResNet-50 + Triplet Attention (k = 7) | Faster R-CNN | 39.3 | 60.8 | 42.7 | 23.4 | 42.8 | 50.3 | Google Drive |
ResNet-50 + Triplet Attention (k = 7) | RetinaNet | 37.6 | 57.3 | 40.0 | 21.7 | 41.1 | 49.7 | Google Drive |
@misc{misra2020rotate,
title={Rotate to Attend: Convolutional Triplet Attention Module},
author={Diganta Misra and Trikay Nalamada and Ajay Uppili Arasanipalai and Qibin Hou},
year={2020},
eprint={2010.03045},
archivePrefix={arXiv},
primaryClass={cs.CV}
}