video-mac / videomac

Official code for the CVPR 2024 paper “VideoMAC: Video Masked Autoencoders Meet ConvNets”

Home Page: https://arxiv.org/abs/2402.19082

License: MIT License

Python 97.85% Shell 2.15%
convnets mae masked-autoencoder self-supervised-learning video-representation-learning video-segmentation

videomac's Introduction

VideoMAC: Video Masked Autoencoders Meet ConvNets

Abstract

Recently, the advancement of self-supervised learning techniques, like masked autoencoders (MAE), has greatly influenced visual representation learning for images and videos. Nevertheless, it is worth noting that the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed as VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we utilize ConvNets which are implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach, a dual encoder architecture comprising an online encoder and an exponential moving average target encoder, aimed to facilitate inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% J&F), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).

[Figures: Comparison, Pipeline]

An illustration of VideoMAC for ConvNet-based MVM. During pre-training, we randomly mask 75% of the patches in two frames, using a symmetric mask. In VideoMAC, the MVM of frame pairs is achieved by an online network, which is optimized by gradients under an online loss, and a target network, which is updated by EMA and has its own target loss. A reconstruction consistency loss is computed between the reconstructed patches of the frame pairs.
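The two mechanisms in this caption, symmetric masking and the EMA target update, can be illustrated with a minimal sketch. This is not the repository's implementation. It assumes that "symmetric" means the same patch positions are hidden in both frames of a pair, and the function names, the patch count of 196, and the momentum values are hypothetical choices made for illustration only.

```python
import random


def symmetric_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return one boolean mask (True = masked) to be shared by both frames.

    Assumption: "symmetric" masking hides the same patch positions in
    both frames of a sampled pair.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def ema_update(target, online, momentum=0.99):
    """Move each target parameter toward its online counterpart, in place.

    Parameters are plain floats keyed by name here; a real model would
    apply the same rule to every weight tensor.
    """
    for name, value in online.items():
        target[name] = momentum * target[name] + (1.0 - momentum) * value


# Hypothetical setup: 196 patches per frame (e.g. a 14 x 14 grid).
mask = symmetric_mask(num_patches=196, mask_ratio=0.75, seed=0)
frame_a_mask, frame_b_mask = mask, mask  # same positions in both frames

# The online network is trained by gradients; the target only follows it.
online = {"w": 1.0}
target = {"w": 0.0}
ema_update(target, online, momentum=0.9)
```

With a 75% ratio, 147 of the 196 patches are masked in each frame. The target network receives no gradient step of its own: it only tracks the online weights through the moving average, which is the usual reason for pairing an online encoder with an EMA target.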

Quantitative Results

[Image: VideoMAC_Results]

Qualitative Results

Visualization of frame reconstruction and video object segmentation on DAVIS.

Visualization of frame reconstruction and body part propagation on VIP.

Visualization of frame reconstruction and human pose tracking on JHMDB.

Acknowledgement

This repository borrows from CNXv2, MAE and MinkowskiEngine.

License

VideoMAC is released under the MIT license and inherits all licenses of the aforementioned methods. If you want to use our code for non-academic use, please check the license first.

Citation

@inproceedings{pei2024videomac,
  title={VideoMAC: Video Masked Autoencoders Meet ConvNets},
  author={Pei, Gensheng and Chen, Tao and Jiang, Xiruo and Liu, Huafeng and Sun, Zeren and Yao, Yazhou},
  booktitle={CVPR},
  year={2024}
}
