pytorch_violet's Introduction

[2023/03/09 Update] VIOLETv2

We have released our empirical study of masked visual modeling for VidL learning as VIOLETv2.

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
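For intuition, here is a minimal sketch of that data flow using stand-in modules: a linear patch projection in place of Video Swin, an embedding table in place of LE, and a plain Transformer encoder in place of CT. All class names, head names, and shapes below are hypothetical illustrations, not the repo's actual code.

import torch
import torch.nn as nn

class VioletSketch(nn.Module):
    def __init__(self, hidden=768, vocab=30522, vq_vocab=8192):
        super().__init__()
        self.video_enc = nn.Linear(3 * 32 * 32, hidden)          # stand-in for Video Swin Transformer (VT)
        self.word_emb = nn.Embedding(vocab, hidden)               # stand-in for Language Embedder (LE)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12)
        self.cross = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for Cross-modal Transformer (CT)
        self.mlm_head = nn.Linear(hidden, vocab)                  # Masked Language Modeling
        self.mvm_head = nn.Linear(hidden, vq_vocab)                # Masked Visual-token Modeling (DALL-E VQ codebook)
        self.vtm_head = nn.Linear(hidden, 2)                       # Visual-Text Matching

    def forward(self, video_patches, text_ids):
        v = self.video_enc(video_patches)                          # [B, Nv, hidden] video features
        t = self.word_emb(text_ids)                                # [B, Nt, hidden] word embeddings
        joint = torch.cat([v, t], dim=1)                           # joint video-text sequence
        fused = self.cross(joint.transpose(0, 1)).transpose(0, 1)  # cross-modal fusion
        v_out, t_out = fused[:, :v.size(1)], fused[:, v.size(1):]
        return self.mvm_head(v_out), self.mlm_head(t_out), self.vtm_head(fused[:, 0])

# toy input: 4 frames split into 7x7 patches of 32x32 RGB pixels, plus 20 text tokens
model = VioletSketch()
vq_logits, word_logits, match_logits = model(torch.randn(2, 4 * 7 * 7, 3 * 32 * 32),
                                              torch.randint(0, 30522, (2, 20)))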

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

Because we use external datasets that we cannot redistribute, we provide preprocessing tools to extract sparsely sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
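For illustration, reading a single record back through the .lineidx offsets could look like the sketch below. It is an assumption about the access pattern and column layout; the repo's actual loader may differ.

def read_tsv_row(tsv_path, lineidx_path, row):
    # Fetch one row with file.seek() instead of loading the whole .tsv into memory.
    with open(lineidx_path) as f:
        offsets = [int(x) for x in f.read().splitlines()]
    with open(tsv_path, 'rb') as f:
        f.seek(offsets[row])                       # jump straight to the requested line
        return f.readline().decode('utf-8').rstrip('\n').split('\t')

# e.g. the first column might be a video id; the exact columns are an assumption
print(read_tsv_row('msrvtt.tsv', 'msrvtt.lineidx', 0)[0])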

We provide partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help format the input data.
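To make the frame-extraction step concrete, the sketch below sparsely samples frames with OpenCV and pickles them per video id. It is only an assumption about what msrvtt.pkl contains; the actual extract_video-frame.py may store frames in a different encoding.

import glob, os, pickle
import cv2  # opencv-python

def sample_frames(video_path, num_frames=5):
    # Uniformly (sparsely) sample num_frames RGB frames across the whole video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

videos = {os.path.splitext(os.path.basename(p))[0]: sample_frames(p)
          for p in glob.glob('msrvtt/*.mp4')}
with open('msrvtt.pkl', 'wb') as f:
    pickle.dump(videos, f)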

Pretraining

Put the pretrained VT in ./_snapshot. This script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py

We release the datasets we used and the best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all downstream datasets and trained checkpoints.

Citation

@inproceedings{fu2023empirical-mvm, 
  author = {Tsu-Jui Fu* and Linjie Li* and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {{An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling}}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {{VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}


pytorch_violet's Issues

msvd-qa test

Thanks for your great work! I have some questions about the MSVD-QA test split. You use a fixed answer set for MSVD-QA; how do you handle questions whose answers are not in the answer set? Do you simply discard them? Discarding them leaves me with 11,983 QA pairs out of the 13,157 in the original test file, and with the finetuned checkpoint you provided I get a lower accuracy of 0.4554.
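Whether out-of-vocabulary answers are discarded or counted as wrong changes the denominator, which by itself accounts for part of such a gap. Below is a small sketch of the two conventions; all names and toy values are hypothetical.

def qa_accuracy(pairs, predictions, answer_set, drop_oov=True):
    # pairs: list of (question_id, ground-truth answer); predictions: dict qid -> predicted answer
    if drop_oov:
        pairs = [(qid, ans) for qid, ans in pairs if ans in answer_set]   # e.g. 13157 -> 11983 pairs
    correct = sum(predictions.get(qid) == ans for qid, ans in pairs)
    return correct / len(pairs)

answer_set = {'yes', 'no', 'man'}
pairs = [(1, 'yes'), (2, 'guitar')]                  # 'guitar' is outside the answer vocabulary
preds = {1: 'yes', 2: 'no'}
print(qa_accuracy(pairs, preds, answer_set))         # 1.0 over the in-vocabulary pair only
print(qa_accuracy(pairs, preds, answer_set, False))  # 0.5 when OOV pairs count as wrong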

Swin Base or Small

Hi, I noticed that the output dimension of the original Swin-Base is 1024, but according to your code the output is 768. Did you use Swin-Small for the experiments?

MVM for CLIP feature

Hi, I would like to know how to compute the loss between the VideoSwin and CLIP features in the latest paper. The Swin family uses a 4x4 patch size, whereas ViT uses a patch size of 16, so how is the loss (e.g., the L1 loss) computed between the two?

Thanks.
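One generic way to reconcile the two token grids is to pool (or interpolate) the finer feature map down to the coarser one before taking an element-wise loss. The sketch below shows this with assumed toy shapes; it is not a statement of what the paper actually does, and a linear projection would also be needed if the channel dimensions differ.

import torch
import torch.nn.functional as F

swin_feat = torch.randn(2, 768, 7, 7)    # [B, C, H, W] features from the video backbone (assumed grid)
clip_feat = torch.randn(2, 768, 14, 14)  # [B, C, H', W'] target CLIP ViT-B/16 features (assumed grid)

clip_down = F.adaptive_avg_pool2d(clip_feat, swin_feat.shape[-2:])  # match the coarser grid
loss = F.l1_loss(swin_feat, clip_down)
print(loss.item())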

Rank Errors in pre-training

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
When I tried to run this, a rank error occurred. Do I need to specify the rank somewhere?

The errors are the following:

usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
main_pretrain.py: error: unrecognized arguments: --local-rank=1
main_pretrain.py: error: unrecognized arguments: --local-rank=0
main_pretrain.py: error: unrecognized arguments: --local-rank=3
usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
main_pretrain.py: error: unrecognized arguments: --local-rank=2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 1499185) of binary: /bin/python
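This mismatch typically comes from newer PyTorch launchers, which pass --local-rank (with a hyphen) or set the LOCAL_RANK environment variable instead of the --local_rank flag the script expects. Below is a minimal argparse sketch that accepts all three, assuming the script parses the flag itself.

import argparse, os

parser = argparse.ArgumentParser()
# accept both spellings emitted by different launcher versions, and fall back to the env var
parser.add_argument('--local_rank', '--local-rank', type=int, dest='local_rank',
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args = parser.parse_args()
print('local rank:', args.local_rank)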

Error step_pretrain on Rank

Hello, I ran the pretraining in an environment with 4 GPUs, and "Error step_pretrain on Rank 1, 3, 2, 0" is displayed; the pretraining does not succeed.

Question about the img_tgif.pkl file.

Hello, thank you for your code.

I was trying to apply VIOLET to the NExT-QA dataset, but I could not figure out how to create the .pkl file in the 'data' folder.
I was wondering if I could get the code that generates the img_*.pkl files (e.g., img_tgif.pkl).

Performance check

Hi, thank you for sharing the code and models.

I used ckpt_violet_pretrain.pt and ckpt_violet_msrvtt-retrieval with our own data processing (5 frames with an interval of num_frames // 5) for MSRVTT text-to-video retrieval evaluation.
I got a rank@1 of 22.6/32.9, which is lower than the numbers in the paper (25.9/34.7). I also tested the CLIP model and got a similar result. Do the released models achieve the reported results?
If so, could you provide the processing pipeline or describe how to reproduce the reported performance?
Thank you!
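For reference, text-to-video R@1 is usually computed from a text-video similarity matrix as in the generic sketch below; small differences in frame sampling or score aggregation can move R@1 by a few points. This is only an illustration, not the repo's eval_retrieval.py.

import torch

def recall_at_k(sim, k=1):
    # sim[i, j] = similarity of text query i with video j; the ground truth is the diagonal
    ranks = sim.argsort(dim=1, descending=True)        # best-matching videos first
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == gt).any(dim=1)
    return hits.float().mean().item()

sim = torch.randn(1000, 1000)                          # e.g. a 1k-video test split
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))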

zero-shot evaluation (video retrieval)

Hello,

Congratulations on the amazing work. I have a few questions about zero-shot evaluation in Table-1.

  1. Which checkpoint is used for zero-shot evaluation?
  2. Does the retrieval model have fully connected layers on top of the VIOLET base model? If so, are these layers randomly initialized in zero-shot evaluation?
  3. If I need separate video and text features, which layer outputs are the most suitable (EncImg / EncTxt / cross-modal transformer)?

Thank you.

Processed data release

Hi, thanks for your great work.
Please release all the related processed data, as many other works do, if it is convenient for you.

Some metadata files are missing

Hello, where can I get the missing metadata files such as txt_msvd-retrieval.json, args_msrvtt-qa.json, or args_msvd-retrieval.json? Should I create them myself?
In addition, how can I evaluate QA? It seems that only the evaluation code for retrieval is provided.
