kkahatapitiya / x3d-multigrid

PyTorch implementation of X3D models with Multigrid training.

License: MIT License

Topics: x3d, multigrid, efficient-video-architectures, efficient-training

x3d-multigrid's Introduction

PyTorch Implementation of X3D with Multigrid Training

This repository contains a PyTorch implementation of "X3D: Expanding Architectures for Efficient Video Recognition" [CVPR2020] with "A Multigrid Method for Efficiently Training Video Models" [CVPR2020]. In contrast to the original repository (here) by FAIR, this repository provides a simpler, less modular, and more familiar implementation structure for faster and easier adoption.

Introduction

X3D is an efficient video architecture, searched/optimized for learning video representations. Starting from a tiny base network, the author expands it along multiple axes: space and time (of the input), and width and depth (of the network), optimizing for performance at a given complexity (params/FLOPs). It further relies on depthwise-separable 3D convolutions [1], inverted bottlenecks in residual blocks [2], squeeze-and-excitation blocks [3], swish (soft) activations [4] and sparse clip sampling (at inference) to improve its efficiency.
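
For concreteness, the sketch below shows the kind of inverted-bottleneck residual block these components combine into: a 1x1x1 expansion, a 3x3x3 depthwise convolution, a squeeze-and-excitation step and swish activations. It is illustrative only; the channel widths, SE reduction ratio and layer ordering are assumptions, not the exact block defined in x3d.py.

# Illustrative sketch of an X3D-style inverted-bottleneck block.
# Channel widths, SE reduction ratio and layer ordering are assumptions,
# not the exact implementation in x3d.py.
import torch
import torch.nn as nn

class SqueezeExcite3D(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # squeeze: global spatio-temporal pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        n, c = x.shape[:2]
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1, 1)
        return x * w  # excite: channel-wise re-weighting

class InvertedBottleneck3D(nn.Module):
    def __init__(self, dim_in, dim_inner, dim_out):
        super().__init__()
        self.branch = nn.Sequential(
            # 1x1x1 expansion (inverted bottleneck: dim_inner > dim_in)
            nn.Conv3d(dim_in, dim_inner, 1, bias=False),
            nn.BatchNorm3d(dim_inner), nn.SiLU(inplace=True),  # swish activation
            # 3x3x3 depthwise conv: groups == channels makes it channel-wise
            nn.Conv3d(dim_inner, dim_inner, 3, padding=1, groups=dim_inner, bias=False),
            nn.BatchNorm3d(dim_inner),
            SqueezeExcite3D(dim_inner), nn.SiLU(inplace=True),
            # 1x1x1 projection (the pointwise half of the separable conv)
            nn.Conv3d(dim_inner, dim_out, 1, bias=False),
            nn.BatchNorm3d(dim_out))

    def forward(self, x):
        return x + self.branch(x)  # identity shortcut; assumes dim_in == dim_out, stride 1

# Example clip batch of shape (N, C, T, H, W)
x = torch.randn(2, 24, 16, 56, 56)
print(InvertedBottleneck3D(dim_in=24, dim_inner=54, dim_out=24)(x).shape)  # (2, 24, 16, 56, 56)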

Multigrid training is a mechanism to train video architectures efficiently. Instead of using a fixed batch size, this method uses varying batch sizes according to a defined schedule, while keeping the computational budget approximately unchanged by holding batch x time x height x width constant. Hence, it follows a coarse-to-fine training process, using lower spatio-temporal resolutions at larger batch sizes and vice versa. In contrast to conventional training with a fixed batch size, multigrid training benefits from 'seeing' more inputs during a training schedule at approximately the same computational budget.
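
The scheduling logic can be sketched in a few lines: scale the batch size inversely with the spatio-temporal volume of each clip so that batch x time x height x width stays roughly constant. The base clip shape and the list of grids below are placeholders, not the exact schedule implemented in train_x3d_kinetics_multigrid.py.

# Illustrative only: the base clip shape and the grids are assumptions,
# not the schedule used in train_x3d_kinetics_multigrid.py.
BASE_BATCH, BASE_T, BASE_H, BASE_W = 16, 16, 224, 224
BASE_VOLUME = BASE_T * BASE_H * BASE_W

# (temporal_scale, spatial_scale): coarse grids first, full resolution last
grids = [(0.25, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

for t_scale, s_scale in grids:
    t = int(BASE_T * t_scale)
    h = w = int(BASE_H * s_scale)
    # larger batches at coarser resolutions keep batch*t*h*w approximately constant
    batch = int(BASE_BATCH * BASE_VOLUME / (t * h * w))
    print(f"clip {t}x{h}x{w} -> batch size {batch}")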

Our implementation achieves 62.62% Top-1 accuracy (3-view) on Kinetics-400 when trained for ~200k iterations from scratch (a 4x shorter schedule compared to the original, when adjusted with the linear scaling rule [5]), which takes only ~2.8 days on 4 Titan RTX GPUs. This is much faster than previous Kinetics-400 training schedules on a single machine. Longer schedules can achieve SOTA results. We port and include the weights trained by FAIR on a longer schedule with 128 GPUs, which achieve 71.48% Top-1 accuracy (3-view) on Kinetics-400 and can be used for fine-tuning on other datasets. For instance, we can train on Charades classification (35.01% mAP) and localization (17.71% mAP) within a few hours on 2 Titan RTX GPUs. All models and training logs are included in the repository.
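
Here, '3-view' accuracy averages predictions over multiple views of each validation video. A minimal sketch of such multi-view evaluation is given below, assuming three pre-cropped spatial views per clip; the tensor layout and the model call are placeholders, not this repository's evaluation code.

import torch

@torch.no_grad()
def multi_view_top1(model, clips, labels):
    # clips: (N, V, C, T, H, W) -- V pre-cropped views per video (e.g. 3 spatial crops)
    # Illustrative sketch only; not the evaluation code used in this repository.
    n, v = clips.shape[:2]
    logits = model(clips.flatten(0, 1))            # run every view: (N*V, num_classes)
    probs = logits.softmax(dim=-1).view(n, v, -1)  # regroup per video: (N, V, num_classes)
    video_probs = probs.mean(dim=1)                # average the views of each video
    return (video_probs.argmax(dim=-1) == labels).float().mean().item()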

Note: due to availability, the Kinetics-400 dataset we trained on contains ~220k training and ~17k validation clips, compared to ~240k and ~20k in the original dataset.

Tips and Tricks

  • 3D depthwise-separable convolutions are slow in current PyTorch releases, as identified by FAIR. Make sure to build from source with this fix; only a few files are changed, so it can easily be edited manually in whichever version of the source you use. In our setting, this fix reduced the training time from ~4 days to ~2.8 days.

  • In my experience, data-loading and preprocessing speeds rank as follows: accimage ≈ Pillow-SIMD >> Pillow > OpenCV. I have not verified this formally, but check here for some benchmarks.

  • Use the linear scaling rule [5] to adjust the learning rate and training schedule when using a different base batch size (a minimal sketch follows this list).

  • For longer schedules, enable random spatial scaling, and use the original temporal stride (we use 2x stride in the shorter schedule).
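
A minimal sketch of the linear scaling rule mentioned above, with placeholder base values rather than this repository's defaults:

# Linear scaling rule [5]: scale the learning rate linearly with the batch size,
# and scale the number of iterations inversely so the same number of samples is seen.
# BASE_* values are placeholders, not this repository's defaults.
BASE_BATCH, BASE_LR, BASE_ITERS = 64, 0.1, 200_000

my_batch = 16
my_lr = BASE_LR * my_batch / BASE_BATCH              # 0.025
my_iters = int(BASE_ITERS * BASE_BATCH / my_batch)   # 800000
print(my_lr, my_iters)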

Dependencies

  • Python 3.7.6
  • PyTorch 1.7.0 (built from source, with this fix). This issue is fixed in PyTorch >= 1.9 releases.
  • torchvision 0.8.0 (built from source)
  • accimage 0.1.1
  • pkbar 0.5

Quick Start

Edit the dataset directories to point to your data, adjust the learning rate and schedule, and then:

  • Use python train_x3d_kinetics_multigrid.py -gpu 0,1,2,3 for training on Kinetics-400.
  • Use python train_x3d_charades.py -gpu 0,1 for training on Charades classification.
  • Use python train_x3d_charades_loc.py -gpu 0,1 for training on Charades localization.

The Charades dataset can be found here. Kinetics-400 data is now only partially available on YouTube; use the annotations here. I would recommend this repo for downloading Kinetics data. If you want access to our Kinetics-400 data (~220k training and ~17k validation clips), please drop me an email.

Reference

If you find this work useful, please consider citing the original authors:

@inproceedings{feichtenhofer2020x3d,
  title={X3D: Expanding Architectures for Efficient Video Recognition},
  author={Feichtenhofer, Christoph},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={203--213},
  year={2020}
}

@inproceedings{wu2020multigrid,
  title={A Multigrid Method for Efficiently Training Video Models},
  author={Wu, Chao-Yuan and Girshick, Ross and He, Kaiming and Feichtenhofer, Christoph and Krahenbuhl, Philipp},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={153--162},
  year={2020}
}

Acknowledgements

I would like to thank the original authors for their work. Also, I thank AJ Piergiovanni for sharing his Multigrid implementation.

x3d-multigrid's People

Contributors

kkahatapitiya, piergiaj


x3d-multigrid's Issues

Dataset generation

How to generate the following files?
KINETICS_TRAIN_ANNO
KINETICS_VAL_ANNO
KINETICS_CLASS_LABELS

How to test video-level acc?

Hi, I appreciate your beautiful work! @kkahatapitiya Could you tell me how you implemented the validation that gives 71.48% Top-1 accuracy (3-view) on Kinetics-400? Have you open-sourced your video-level accuracy test code? When I test the pretrained model I get lower performance than what you report (my video-level accuracy averages the predictions over all clips of a test video).

x3d.py

Add this code to the end of the file:

if __name__ == '__main__':
    net = generate_model('S').cuda()
    # print(net)
    from torchsummary import summary
    inputs = torch.rand(8, 3, 10, 112, 112).cuda()
    output = net(inputs)
    print(output.shape)
    summary(net, input_size=(3, 10, 112, 112), batch_size=8, device='cuda')

The code runs successfully except for the summary call. The error report was:

 File "x3d.py", line 382, in <module>
    summary(net,input_size=(3,10,112,112),batch_size=8,device='cuda')
  File "D:\software\program\Anaconda3\envs\pytorch1\lib\site-packages\torchsummary\torchsummary.py", line 72, in summary
    model(*x)
  File "D:\software\program\Anaconda3\envs\pytorch1\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "x3d.py", line 324, in forward
    x = self.bn1(x)
  File "D:\software\program\Anaconda3\envs\pytorch1\lib\site-packages\torch\nn\modules\module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "x3d.py", line 52, in forward
    x = x.view(n // self.num_splits, c * self.num_splits, t, h, w)
RuntimeError: shape '[0, 192, 10, 56, 56]' is invalid for input of size 1505280

I found that the shape of x was (2, 3, 10, 112, 112) inside forward rather than (8, 3, 10, 112, 112), and I don't know why.
Do you know why?

Model conversion

Thank you a lot for sharing your implementation. It is really helpful for applying the X3D network to a custom deep learning problem.

The original repo only provides Caffe2 pretrained models. How did you convert them to PyTorch format? (I also want to try other versions of X3D.)

Performance Comparison

Hi @kkahatapitiya, thanks for your clear reproduction.
I have two questions after testing your code:

  1. What is the specific performance on Kinetics-400? You said it achieves 62.62% Top-1 accuracy (3-view) when trained for ~200k iterations from scratch, but I do not know which version of X3D got this result. How many epochs did you train to get it?

  2. According to the figure below from the original paper, X3D-M has 4.73G FLOPs, but when I measure the X3D-M in this code I get 3.76G FLOPs. Could you please explain this?

image

Why does eval mode degenerate?

Thanks for your clean implementation! @kkahatapitiya
I have two problems to consult you about:

  1. I find that the predictions in eval mode are always the same after I finish training X3D on the Kinetics-200 dataset, but they are normal if I run inference with model.train(). I could not find the reason. (base_bn_splits=8 or 1 gives the same observation; I trained the model in the normal way.)
  2. Why do some layerx.x.bnx.split_bn.running_var and running_mean stay unchanged throughout the whole training process?
    image
    As the chart above shows, why do running_mean and running_var stay the same along the whole training process?
    I'd appreciate any help.

X3D No Multigrid

I am planning to use your implementation in x3d.py in my own training environment to train X3D with a constant batch size. I don't want to use any of the multigrid features. I will be using my own dataloaders, datasets, and so on.
In the below model instantiation snippet, I am unsure about one parameter:

x3d = resnet_x3d.generate_model(x3d_version=X3D_VERSION, n_classes=400, n_input_channels=3,
                                dropout=0.5, base_bn_splits=BASE_BS_PER_GPU//CONST_BN_SIZE)

What is base_bn_splits? If I use a single GPU and a constant batch size, what value do I need to give this parameter? Thanks a lot! @kkahatapitiya

Training from scratch

Is it possible to train it from scratch? If so, which dataset format do I have to provide?

Pretrained models

Hello, which configurations of X3D have you trained and included in the repo? (X3D-M, X3D-L, X3D-XL, ...)

How to set the hyperparameters when doing validation?

Hi, thanks a lot for sharing your implementation! I want to use your pretrained model for validation. If I only have one GPU, how should I modify the hyperparameters, especially base_bn_splits used in generate_model? I would also like to know whether the model named "x3d_multigrid_kinetics_fb_pretrained.pt" is converted from the model provided by Facebook. Looking forward to your reply.

num_samples error

Hi,

When I run train_x3d_charades.py, I get the following error:

raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

I'm using the same dataset as in the code (Charades_v1_rgb). Do you have any suggestions?
Thank you.


Changing input clip length

Good day!

I am having trouble finding where to specify the input clip length when defining the X3D model. Currently I am aiming to change the number of input frames (the temporal duration parameter) to 20 for X3D-M training, so that the input clip (gamma_tau) is sampled at 10 FPS.
Please provide some insight on how that can be achieved.
