Video Classification Paper List
- Cao, Haoyuan, Shining Yu, and Jiashi Feng.
"Compressed Video Action Recognition with Refined Motion Vector." arXiv(2019).[PDF]
- TEA: Yan Li, Bin Ji, et al.
"TEA: Temporal Excitation and Aggregation for Action Recognition." CVPR(2020).[PDF]
- TPN: Ceyuan Yang, Yinghao Xu, et al.
"TPN: Temporal Pyramid Network for Action Recognition" CVPR(2020)[PDF][Code]
- DMC-Net: Shou, Zheng, et al.
"DMC-Net: Generating discriminative motion cues for fast compressed video action recognition." CVPR(2019).[PDF][Code]
- SlowFast: Feichtenhofer C, Fan H, Malik J, et al.
"Slowfast Networks for Video Recognition",ICCV(2019 oral).[PDF][Code] - TSM: Chuang Gan, Song Han,Ji Lin
"Temporal Shift Module for Efficient Video Understanding",ICCV(2019).[PDF][Code] - STM: Jiang, Boyuan, et al.
"STM: SpatioTemporal and motion encoding for action recognition." ICCV(2019).[PDF]
- bLVNet-TAM: Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, David Cox
"More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation".NIPS(2019)[PDF][Code]
- R(2+1)D: Tran, Du, et al.
"A closer look at spatiotemporal convolutions for action recognition." CVPR(2018).[PDF][Code] - CoViAR:Wu, Chao-Yuan, et al.
"Compressed video action recognition." CVPR(2018).[PDF][Code] - Non-local:Wang, Xiaolong, et al.
"Non-local neural networks." CVPR(2018).[PDF][Code]
- TrajectoryNet: Zhao, Yue, Yuanjun Xiong, and Dahua Lin.
"Trajectory convolution for action recognition." NIPS(2018)[PDF]
- I3D: Carreira, Joao, and Andrew Zisserman.
"Quo vadis, action recognition? A new model and the Kinetics dataset." CVPR(2017).[PDF][Code]
- TSN: Wang, Limin, et al.
"Temporal segment networks: Towards good practices for deep action recognition." ECCV(2016)[PDF][Code]
- Two Stream: Simonyan, Karen, and Andrew Zisserman.
"Two-stream convolutional networks for action recognition in videos." NIPS(2014).[PDF][Code]
- IDT: Wang, Heng, and Cordelia Schmid.
"Action recognition with improved trajectories." ICCV(2013).[PDF]
- UCF101
13320 videos; average length ~10s; 101 human action categories. Each class has 25 groups, and videos in the same group share some common features. The dataset is not realistic: videos are staged by actors.
- HMDB51
6849 videos; average length ~5s; 51 human action categories, each containing a minimum of 101 videos. Most clips come from movies, with a small proportion from other public datasets and web videos.
- Kinetics (because of missing videos in the Kinetics source CSV, the Non-local Net researchers offer a pre-downloaded version of Kinetics-400; here is the relevant issue)
650000 videos; average length ~10s; 700/600/400 human action categories (Kinetics-700/600/400), each with at least 600 video clips. Most videos come from YouTube.
- Something-Something v2
220847 videos; average length 2~6s; 174 basic human action categories. The dataset focuses on fine-grained human actions, such as "Putting something on a surface".
- Charades
Average ~30s per video; a long-term video dataset.
- Moments in Time
About one million videos; average length ~3s; clips involve people, animals, objects, or natural phenomena, capturing the gist of a dynamic scene.
- Limin Wang, Nanjing University