Efficient Video Transformers with Spatial-Temporal Token Selection

Official PyTorch implementation of STTS, from the following paper:

Efficient Video Transformers with Spatial-Temporal Token Selection, ECCV 2022.

Junke Wang*, Xitong Yang*, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang.

Fudan University, University of Maryland, BirenTech Research

We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
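To make the idea of two-stage selection concrete, here is a minimal sketch in plain Python that first keeps the highest-scoring frames and then the highest-scoring patches inside each kept frame. It is an illustration only: the function names, the keep ratios, and the hard-coded scores are assumptions made for this example, and it uses ordinary hard top-k rather than whatever scoring and selection mechanism the paper actually trains.

```python
def top_k_indices(scores, keep_ratio):
    """Return the indices of the highest-scoring items, in original order."""
    # Always keep at least one item, even for very small ratios.
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_tokens(frame_scores, patch_scores, temporal_ratio, spatial_ratio):
    """Keep the top frames, then the top patches inside each kept frame.

    frame_scores: one score per frame.
    patch_scores: patch_scores[t][p] is the score of patch p in frame t.
    Returns (frame_index, patch_index) pairs for the kept tokens.
    """
    kept = []
    for t in top_k_indices(frame_scores, temporal_ratio):
        for p in top_k_indices(patch_scores[t], spatial_ratio):
            kept.append((t, p))
    return kept


# Placeholder scores for a toy clip with 4 frames of 4 patches each.
frame_scores = [0.1, 0.9, 0.4, 0.8]
patch_scores = [
    [0.2, 0.7, 0.1, 0.5],
    [0.9, 0.3, 0.6, 0.1],
    [0.4, 0.4, 0.8, 0.2],
    [0.1, 0.2, 0.3, 0.9],
]
kept = select_tokens(frame_scores, patch_scores, temporal_ratio=0.5, spatial_ratio=0.5)
print(kept)  # [(1, 0), (1, 2), (3, 2), (3, 3)]
```

With both ratios at 0.5, only 4 of the 16 toy tokens survive, which is where the compute savings in a setup like this would come from.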

Model Zoo

MViT with STTS on Kinetics-400

name	acc@1	FLOPs	model
MViT-T⁰_0.9-S⁴_0.9	78.1	56.4	model
MViT-T⁰_0.8-S⁴_0.9	77.9	47.2	model
MViT-T⁰_0.6-S⁴_0.9	77.5	38.1	model
MViT-T⁰_0.5-S⁴_0.7	76.6	23.3	model
MViT-T⁰_0.4-S⁴_0.6	75.6	12.1	model

VideoSwin with STTS on Kinetics-400

name	acc@1	FLOPs	model
VideoSwin-T⁰_0.9	81.9	252.5	model
VideoSwin-T⁰_0.8	81.6	223.4	model
VideoSwin-T⁰_0.6	81.4	181.4	model
VideoSwin-T⁰_0.5	81.1	121.6	model
VideoSwin-T⁰_0.4	80.7	91.4	model
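The two tables trade accuracy against compute, so a common question is which variant to pick for a given budget. The sketch below copies the rows from the tables above and returns the most accurate variant whose FLOPs value does not exceed a limit. The helper name `best_under_budget` is hypothetical, and the budget is assumed to be expressed in the same unit as the FLOPs column; the two backbones are kept in separate lists so that no cross-table comparison is implied.

```python
# (name, acc@1, FLOPs) rows copied from the Kinetics-400 tables above.
MVIT_ZOO = [
    ("MViT-T⁰_0.9-S⁴_0.9", 78.1, 56.4),
    ("MViT-T⁰_0.8-S⁴_0.9", 77.9, 47.2),
    ("MViT-T⁰_0.6-S⁴_0.9", 77.5, 38.1),
    ("MViT-T⁰_0.5-S⁴_0.7", 76.6, 23.3),
    ("MViT-T⁰_0.4-S⁴_0.6", 75.6, 12.1),
]
VIDEOSWIN_ZOO = [
    ("VideoSwin-T⁰_0.9", 81.9, 252.5),
    ("VideoSwin-T⁰_0.8", 81.6, 223.4),
    ("VideoSwin-T⁰_0.6", 81.4, 181.4),
    ("VideoSwin-T⁰_0.5", 81.1, 121.6),
    ("VideoSwin-T⁰_0.4", 80.7, 91.4),
]


def best_under_budget(rows, max_flops):
    """Return the most accurate row with FLOPs <= max_flops, or None."""
    candidates = [row for row in rows if row[2] <= max_flops]
    return max(candidates, key=lambda row: row[1], default=None)


mvit_pick = best_under_budget(MVIT_ZOO, 40.0)
swin_pick = best_under_budget(VIDEOSWIN_ZOO, 200.0)
too_small = best_under_budget(MVIT_ZOO, 10.0)

print(mvit_pick)  # the 38.1-FLOPs MViT row, acc@1 77.5
print(swin_pick)  # the 181.4-FLOPs VideoSwin row, acc@1 81.4
print(too_small)  # None, since no MViT row fits a budget of 10.0
```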

Installation

Please see the MViT and VideoSwin subdirectories for installation instructions and data preparation.

Training and Evaluation

MViT

To train or evaluate with MViT as the backbone, run:

cd MViT

python tools/run_net.py --cfg path_to_your_config

For example, to evaluate MViT-T⁰_0.6-S⁴_0.9, run:

python tools/run_net.py --cfg configs/Kinetics/t0_0.6_s4_0.9.yaml

VideoSwin

For training, run:

cd VideoSwin

bash tools/dist_train.sh path_to_your_config $NUM_GPUS --checkpoint path_to_your_checkpoint --validate --test-last

For evaluation, run:

bash tools/dist_test.sh path_to_your_config path_to_your_checkpoint $NUM_GPUS --eval top_k_accuracy

For example, to evaluate VideoSwin-T⁰_0.9 on a single node with 8 GPUs, run:

cd VideoSwin

bash tools/dist_test.sh configs/Kinetics/t0_0.875.py ./checkpoints/t0_0.875.pth 8 --eval top_k_accuracy

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

If you find this repository helpful, please consider citing:

@inproceedings{wang2021efficient,
  title={Efficient video transformers with spatial-temporal token selection},
  author={Wang, Junke and Yang, Xitong and Li, Hengduo and Liu, Li and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={ECCV},
  year={2022}
}
