pytorch_violet's Introduction

[2023/03/09 Update] VIOLETv2

We have released our empirical study of masked visual modeling for VidL learning as VIOLETv2.

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
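For intuition, here is a minimal sketch of that data flow using stand-in modules: a linear patch projection in place of Video Swin, an embedding table in place of LE, and a plain Transformer encoder in place of CT. All class names, head names, and shapes below are hypothetical illustrations, not the repo's actual code.

import torch
import torch.nn as nn

class VioletSketch(nn.Module):
    def __init__(self, hidden=768, vocab=30522, vq_vocab=8192):
        super().__init__()
        self.video_enc = nn.Linear(3 * 32 * 32, hidden)          # stand-in for Video Swin Transformer (VT)
        self.word_emb = nn.Embedding(vocab, hidden)               # stand-in for Language Embedder (LE)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12)
        self.cross = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for Cross-modal Transformer (CT)
        self.mlm_head = nn.Linear(hidden, vocab)                  # Masked Language Modeling
        self.mvm_head = nn.Linear(hidden, vq_vocab)                # Masked Visual-token Modeling (DALL-E VQ codebook)
        self.vtm_head = nn.Linear(hidden, 2)                       # Visual-Text Matching

    def forward(self, video_patches, text_ids):
        v = self.video_enc(video_patches)                          # [B, Nv, hidden] video features
        t = self.word_emb(text_ids)                                # [B, Nt, hidden] word embeddings
        joint = torch.cat([v, t], dim=1)                           # joint video-text sequence
        fused = self.cross(joint.transpose(0, 1)).transpose(0, 1)  # cross-modal fusion
        v_out, t_out = fused[:, :v.size(1)], fused[:, v.size(1):]
        return self.mvm_head(v_out), self.mlm_head(t_out), self.vtm_head(fused[:, 0])

# toy input: 4 frames split into 7x7 patches of 32x32 RGB pixels, plus 20 text tokens
model = VioletSketch()
vq_logits, word_logits, match_logits = model(torch.randn(2, 4 * 7 * 7, 3 * 32 * 32),
                                              torch.randint(0, 30522, (2, 20)))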

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

Because we use external datasets that we cannot redistribute, we provide preprocessing tools to extract sparsely sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
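For illustration, reading a single record back through the .lineidx offsets could look like the sketch below. It is an assumption about the access pattern and column layout; the repo's actual loader may differ.

def read_tsv_row(tsv_path, lineidx_path, row):
    # Fetch one row with file.seek() instead of loading the whole .tsv into memory.
    with open(lineidx_path) as f:
        offsets = [int(x) for x in f.read().splitlines()]
    with open(tsv_path, 'rb') as f:
        f.seek(offsets[row])                       # jump straight to the requested line
        return f.readline().decode('utf-8').rstrip('\n').split('\t')

# e.g. the first column might be a video id; the exact columns are an assumption
print(read_tsv_row('msrvtt.tsv', 'msrvtt.lineidx', 0)[0])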

We provide partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help format the input data.
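To make the frame-extraction step concrete, the sketch below sparsely samples frames with OpenCV and pickles them per video id. It is only an assumption about what msrvtt.pkl contains; the actual extract_video-frame.py may store frames in a different encoding.

import glob, os, pickle
import cv2  # opencv-python

def sample_frames(video_path, num_frames=5):
    # Uniformly (sparsely) sample num_frames RGB frames across the whole video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

videos = {os.path.splitext(os.path.basename(p))[0]: sample_frames(p)
          for p in glob.glob('msrvtt/*.mp4')}
with open('msrvtt.pkl', 'wb') as f:
    pickle.dump(videos, f)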

Pretraining

Put the pretrained VT in ./_snapshot. This script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py

We release the datasets we used and the best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all downstream datasets and trained checkpoints.

Citation

@inproceedings{fu2023empirical-mvm, 
  author = {Tsu-Jui Fu* and Linjie Li* and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {{An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling}}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {{VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}


pytorch_violet's Issues

msvd-qa test

Thanks for your great work! I have some questions about the MSVD-QA test split. You use a fixed answer set for MSVD-QA; how do you handle questions whose answers are not in the answer set? Do you simply discard them? Discarding them leaves me with 11,983 QA pairs out of the 13,157 in the original test file, and with the finetuned checkpoint you provided I get a lower accuracy of 0.4554.
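Whether out-of-vocabulary answers are discarded or counted as wrong changes the denominator, which by itself accounts for part of such a gap. Below is a small sketch of the two conventions; all names and toy values are hypothetical.

def qa_accuracy(pairs, predictions, answer_set, drop_oov=True):
    # pairs: list of (question_id, ground-truth answer); predictions: dict qid -> predicted answer
    if drop_oov:
        pairs = [(qid, ans) for qid, ans in pairs if ans in answer_set]   # e.g. 13157 -> 11983 pairs
    correct = sum(predictions.get(qid) == ans for qid, ans in pairs)
    return correct / len(pairs)

answer_set = {'yes', 'no', 'man'}
pairs = [(1, 'yes'), (2, 'guitar')]                  # 'guitar' is outside the answer vocabulary
preds = {1: 'yes', 2: 'no'}
print(qa_accuracy(pairs, preds, answer_set))         # 1.0 over the in-vocabulary pair only
print(qa_accuracy(pairs, preds, answer_set, False))  # 0.5 when OOV pairs count as wrong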

Swin Base or Small

Hi, I noticed that the output dimension of the original Swin-Base is 1024, but according to your code the output is 768. Did you use Swin-Small for the experiments?

MVM for CLIP feature

Hi, I would like to know how to compute the loss between the VideoSwin and CLIP features in the latest paper. The Swin family uses a 4x4 patch size, whereas ViT uses a patch size of 16, so how is the loss (e.g., the L1 loss) computed between the two?

Thanks.
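One generic way to reconcile the two token grids is to pool (or interpolate) the finer feature map down to the coarser one before taking an element-wise loss. The sketch below shows this with assumed toy shapes; it is not a statement of what the paper actually does, and a linear projection would also be needed if the channel dimensions differ.

import torch
import torch.nn.functional as F

swin_feat = torch.randn(2, 768, 7, 7)    # [B, C, H, W] features from the video backbone (assumed grid)
clip_feat = torch.randn(2, 768, 14, 14)  # [B, C, H', W'] target CLIP ViT-B/16 features (assumed grid)

clip_down = F.adaptive_avg_pool2d(clip_feat, swin_feat.shape[-2:])  # match the coarser grid
loss = F.l1_loss(swin_feat, clip_down)
print(loss.item())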

Rank Errors in pre-training

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
When I tried to run this, a rank error occurred. Do I need to specify the rank somewhere?

The errors are the following:

usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
main_pretrain.py: error: unrecognized arguments: --local-rank=1
main_pretrain.py: error: unrecognized arguments: --local-rank=0
main_pretrain.py: error: unrecognized arguments: --local-rank=3
usage: main_pretrain.py [-h] [--local_rank LOCAL_RANK]
main_pretrain.py: error: unrecognized arguments: --local-rank=2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 1499185) of binary: /bin/python
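This mismatch typically comes from newer PyTorch launchers, which pass --local-rank (with a hyphen) or set the LOCAL_RANK environment variable instead of the --local_rank flag the script expects. Below is a minimal argparse sketch that accepts all three, assuming the script parses the flag itself.

import argparse, os

parser = argparse.ArgumentParser()
# accept both spellings emitted by different launcher versions, and fall back to the env var
parser.add_argument('--local_rank', '--local-rank', type=int, dest='local_rank',
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args = parser.parse_args()
print('local rank:', args.local_rank)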

Error step_pretrain on Rank

Hello, I ran the pretraining in an environment with 4 GPUs, and "Error step_pretrain on Rank 1, 3, 2, 0" is displayed; the pretraining does not succeed.

Question about the img_tgif.pkl file.

Hello, thank you for your code.

I was trying to apply VIOLET to the NExT-QA dataset, but I could not figure out how to create the .pkl file in the 'data' folder.
I was wondering if I could get the code that generates the img_*.pkl files (e.g., img_tgif.pkl).

Performance check

Hi, thank you for sharing the code and models.

I used ckpt_violet_pretrain.pt and ckpt_violet_msrvtt-retrieval with our own data processing (5 frames with an interval of num_frames // 5) for MSRVTT text-to-video retrieval evaluation.
I got a rank@1 of 22.6/32.9, which is lower than the numbers in the paper (25.9/34.7). I also tested the CLIP model and got a similar result. Do the released models achieve the reported results?
If so, could you provide the processing pipeline or describe how to reproduce the reported performance?
Thank you!
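For reference, text-to-video R@1 is usually computed from a text-video similarity matrix as in the generic sketch below; small differences in frame sampling or score aggregation can move R@1 by a few points. This is only an illustration, not the repo's eval_retrieval.py.

import torch

def recall_at_k(sim, k=1):
    # sim[i, j] = similarity of text query i with video j; the ground truth is the diagonal
    ranks = sim.argsort(dim=1, descending=True)        # best-matching videos first
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == gt).any(dim=1)
    return hits.float().mean().item()

sim = torch.randn(1000, 1000)                          # e.g. a 1k-video test split
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))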

zero-shot evaluation (video retrieval)

Hello,

Congratulations on the amazing work. I have a few questions about zero-shot evaluation in Table-1.

  1. Which checkpoint is used for zero-shot evaluation?
  2. Does the retrieval model have fully connected layers on top of the VIOLET base model? If so, are these layers randomly initialized in zero-shot evaluation?
  3. If I need separate video and text features, which layer outputs are the most suitable (EncImg / EncTxt / cross-modal transformer)?

Thank you.

Processed data release

Hi, thanks for your great work.
Please release all the related processed data, as many other works do, if it is convenient for you.

Some metadata files are missing

Hello, where can I get the missing metadata files such as txt_msvd-retrieval.json, args_msrvtt-qa.json, or args_msvd-retrieval.json? Should I create them myself?
In addition, how can I evaluate QA? It seems that only the evaluation code for retrieval is provided.
