vindlu's Introduction

VindLU

VindLU: A Recipe for Effective Video-and-Language Pretraining [arXiv] [project page]

Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Official PyTorch code for VindLU, a recipe for effective Video-and-Language (VidL) Pretraining.

News:

  • 2022-12-07: Our annotation files and trained checkpoints are available on Google Drive.

Highlights:

  • Revealed the importance of each component in VidL pretraining (see our paper for details).
  • Cheap to train: 82 V100 GPU days to train on the joint 10M video and 15M image datasets; 15 V100 GPU days on the 5M corpus.
  • State-of-the-art performance on the video retrieval and VidQA tasks. Specifically, our model achieves 61.2% (+7.8%) R@1 on DiDeMo and 55.0% (+6.1%) on ActivityNet-Captions.

Results

Text-to-Video Retrieval (R@1 accuracy).

Pretrained Data    MSR-VTT    DiDeMo    ANet    SSv2-Label    SSv2-Template    Checkpoints
5M                 43.8       54.6      51.1    51.2          82.2             model
17M                45.3       59.2      54.4    53.0          86.2             model
25M                46.5       61.2      55.0    53.1          83.3             model

Video Question Answering (Top-1 accuracy).

Pretrained Data    ANet-QA    MSRVTT-QA    MSRVTT-MC    TVQA    Checkpoints
5M                 44.2       43.6         95.2         79.0    model
17M                44.6       43.8         96.7         78.8    model
25M                44.7       44.6         97.1         79.0    model

Setup

The packages used in our experiments are listed in vl.yml; you can easily create a conda environment containing them:

# create 
conda env create -f vl.yml
# activate
conda activate vl

In your ~/.bashrc file, set the environment variables:

export VL_EXP_DIR="/path/to/ckpts_and_logs"
export VL_DATA_DIR="/path/to/data"

The datasets are stored under $VL_DATA_DIR and experiment outputs are stored under $VL_EXP_DIR. These variables are accessed by the config files in the configs/ directory.
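
For reference, here is a minimal sketch of how a config file could read these variables. This is an assumption for illustration, not the repo's exact config code:

# Hypothetical: read the data/output roots from the environment variables set
# above; the actual files in configs/ may access them differently.
import os

data_root = os.environ["VL_DATA_DIR"]   # where the datasets are stored
output_root = os.environ["VL_EXP_DIR"]  # where checkpoints and logs are written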

[Optional] Our codebase supports using wandb to monitor training. If you want to use wandb, set it up following this short instruction and set wandb.enable to True in the configs.

Data

Organize your data following the structure below:

$VL_DATA_DIR
    |-- anno_pretrain     
        |-- webvid_train.sqlite.db
        |-- ...
    |-- anno_downstream
        |-- didemo_ret_train.json
        |-- ...
    |-- videos_images
        |-- webvid_2fps_224
            |-- 1053400385.mp4
            |-- ...
        |-- ...

Our prepared annotations are available on Google Drive.

Refer to DATA.md for how to prepare the image/video datasets.

The annotation files are in json format and can be loaded as a list of dictionaries. Each dictionary is {'image': path_to_image, 'caption': image_caption} for an image-text dataset and {'image': path_to_video, 'caption': video_caption} for a video-text dataset. Note that we use the same key image for both image-text and video-text datasets for simplicity.
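
For example, a minimal snippet (illustration only; the file name is taken from the directory structure above) that loads such an annotation file:

# Load a downstream annotation file as a list of dicts. Both image-text and
# video-text datasets use the same 'image' key, as described above.
import json

with open("anno_downstream/didemo_ret_train.json") as f:
    annos = json.load(f)

print(len(annos))                              # number of (visual, caption) pairs
print(annos[0]["image"], annos[0]["caption"])  # path to the video/image and its caption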

We store the pretraining annotation files in the file-based database SQLite. SQLite allows us to load captions on demand and thus saves a large amount of CPU memory. With the json format, the dataloaders would consume more than 200GB of CPU memory for 8 GPUs with 3 workers per GPU process, because each worker keeps its own copy of the json files in memory and these files are large (~5GB, and even larger when loaded as Python objects).

You can use create_sqlite_db.py to convert the json annotation files into SQLite files.
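
For reference, a minimal sketch of such a conversion. This is not the repo's create_sqlite_db.py; the table name and column layout are assumptions for illustration:

# Convert a json annotation list into an SQLite file so captions can be loaded
# on demand instead of keeping the whole list in every worker's memory.
import json
import sqlite3

def json_to_sqlite(json_path: str, db_path: str) -> None:
    with open(json_path) as f:
        annos = json.load(f)  # list of {'image': ..., 'caption': ...}
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annos (id INTEGER PRIMARY KEY, image TEXT, caption TEXT)"
    )
    conn.executemany(
        "INSERT INTO annos (id, image, caption) VALUES (?, ?, ?)",
        [(i, a["image"], a["caption"]) for i, a in enumerate(annos)],
    )
    conn.commit()
    conn.close()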

Training and Inference

All the tasks can be launched via the python script tools/run.py.

  • Supports both slurm and local execution.

The script uses slurm if the sbatch command exists; you can force it to run locally by adding the argument --no_slurm. If slurm is not available, you need to launch the training script on each node yourself.

Usage:

python tools/run.py --slurm_args SLURM_ARGS --jobname JOBNAME \
    --dep_jobname DEP_JOBNAME \
    --nnodes NNODES --ngpus NGPUS --task TASK \
    --config CONFIG_FILE --model_args MODEL_ARGS
  • SLURM_ARGS: The additional arguments for slurm. You can set the defaults via DEFAULT_SLURM_ARGS in tools/run.py; SLURM_ARGS overrides them.
  • JOBNAME: The experiment name and job_name in slurm. All outputs (checkpoints and logs) will be written to $VL_EXP_DIR/JOBNAME.
  • DEP_JOBNAME: The dependent job. This job will start only when DEP_JOBNAME has finished. You can use this feature to submit your pretraining, finetuning, and evaluation jobs at the same time. Only valid when slurm is available.
  • NNODES: The number of nodes to use.
  • NGPUS: How many GPUs to use in each node.
  • TASK: This job will run the script tasks/TASK.py in tasks. Supported tasks:
    • "pretrain": for pretraining.
    • "retrieval": for text-to-video retrieval task.
    • "retrieval_mc": for multi-choice VidQA on MSRVTT-MC dataset.
    • "vqa": for open-ended V(id)QA task.
  • CONFIG_FILE: The path to the config file, e.g., configs/pretrain.py for pretraining and configs/ret_didemo.py for the video retrieval task on DiDeMo.
  • MODEL_ARGS: The arguments that override the predefined arguments in CONFIG_FILE. Format: "key1 value1 key2 value2 ...". A value of the form "eval(SOME_CODE)" will be evaluated with Python's eval function, as illustrated in the sketch below.
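
For illustration, a minimal sketch of how such key/value pairs and "eval(...)" values could be interpreted. This is an assumption, not the actual parser in tools/run.py:

# Hypothetical parser for MODEL_ARGS strings of the form "key1 value1 key2 value2 ...".
def parse_value(v: str):
    if v.startswith("eval(") and v.endswith(")"):
        return eval(v[len("eval("):-1])  # e.g. 'eval(["test"])' -> ["test"]
    return v

def parse_model_args(s: str) -> dict:
    tokens = s.split()
    return {k: parse_value(v) for k, v in zip(tokens[0::2], tokens[1::2])}

# parse_model_args('evaluate True test_types eval(["test"]) num_frames_test 12')
# -> {'evaluate': 'True', 'test_types': ['test'], 'num_frames_test': '12'}
# (in this sketch, non-eval values stay as strings)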

Pre-Training

Example for pretraining on webvid_cc3m (5M):

corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0"

You can use this script 1) with slurm, or 2) without slurm when only one node is used.

If using slurm, remember to add --slurm_args SLURM_ARGS according to your cluster's settings. The same applies to the following examples.

You can change corpus to "webvid_14m" for the 17M corpus or "webvid10m_14m" for the 25M corpus.

See the variable available_corpus in configs/data.py for all supported pretraining corpora. You can add your own datasets by adding entries to available_corpus, as sketched below.
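
For illustration, a hypothetical entry is sketched below. The field layout (annotation file, media directory, media type) and the variable names anno_root_pt and data_root are assumptions; check configs/data.py for the real schema:

# Hypothetical addition to configs/data.py: register a custom video-text corpus.
available_corpus["my_corpus"] = [
    (
        f"{anno_root_pt}/my_corpus_train.sqlite.db",  # annotation file (json or sqlite)
        f"{data_root}/my_corpus_2fps_224",            # directory containing the videos
        "video",                                      # media type: "video" or "image"
    ),
]
# It can then be selected at launch time via: --model_args "train_corpus my_corpus ..."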

Multi-node pretrain without slurm

The following example pretrains on 2 nodes with 4 GPUs per node without slurm.

When running locally without slurm, you need to:

  • specify MASTER_ADDR and MASTER_PORT explicitly so that all nodes use the same endpoint.
  • run the script on each node. The logs will only be displayed on the master node.
export MASTER_ADDR="ip address of master node" # change to your real ip.
export MASTER_PORT=40041 # some unused port.
corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0" \
    --no_slurm

Finetuning and Evaluation

The following examples are based on the pretrained model from the section above.

Text-to-video retrieval

Supported datasets: msrvtt, msrvtt-9k, didemo, anet. Example for msrvtt dataset:

dataset=msrvtt
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_${dataset}

if [[ "$dataset" == *"msrvtt"* ]]; then ngpus=4; else ngpus=1; fi
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}" 
Video Question Answering
  • Open-ended QA:
dataset=msrvtt # supported: msrvtt, anet
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-qa_${dataset}

ngpus=1
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}" 
  • MSRVTT-MC (multiple-choice). We directly evaluate using the finetuned retrieval model.
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_msrvtt

# evaluation
python tools/run.py --nnodes 1 --ngpus 1 --task retrieval_mc \
    --jobname ${ft_name}/eval_${nfrm_test}frm-mc --dep_jobname ${ft_name} \
    --config configs/ret_msrvtt_mc.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test 12"

Acknowledgement

This code uses resources from Singularity, transformers, ALBEF, ClipBERT, and frozen, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{cheng2022vindlu,
  title={VindLU: A Recipe for Effective Video-and-Language Pretraining},
  author={Cheng, Feng and Wang, Xizi and Lei, Jie and Crandall, David and Bansal, Mohit and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2212.05051},
  year={2022}
}


vindlu's Issues

Problem with finetuning speed

Hi, thanks for the great work.
When I tried to finetune the network on my own data, I encountered efficiency problems.

  1. If I set num_workers in the DataLoader to >0, data loading becomes extremely slow, and the loading time increases with each additional worker.
  2. The time to backpropagate through the graph (i.e., the time to execute the following line) increases in proportion to the batch size.
    scaler.scale(loss).backward()

I want to ask whether this is normal for finetuning or whether I have somehow introduced a bug. Also, is there any way to speed it up?

The zero-shot performance of pre-trained ckpt is extremely low

Thanks for the great work.

I evaluated the zero-shot performance of the 25M pretrained checkpoint on the DiDeMo dataset; my command is:

export VL_DATA_DIR=/home/renshuhuai/VindLU/
export VL_EXP_DIR=/home/renshuhuai/VindLU/output

dataset=didemo
pt_name=25M-pretrain.pth
ft_name=${pt_name}-ret_${dataset}

ngpus=4
num_frames=8
nfrm_test=8
batch_size=32

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/ret_${dataset}.py \
    --no_slurm \
    --model_args "pretrained_path /home/renshuhuai/VindLU/checkpoints/${pt_name} \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"

However, I got an extremely low result (e.g., 0.1 R@1 for video retrieval):
[screenshot of evaluation results]

I want to know whether this is normal.

Performance on action recognition

Hi Authors,

Thank you for sharing your great work. I'm curious about the performance of your models on action recognition tasks. Have you attempted to benchmark on any standard action recognition tasks such as SSV2, K400/700?

Thank you.

MVM loss

Hi, is there any code for MVM pretraining?

What about the checkpoint for MSRVTT-MC?

Hi! Thanks for your great work.
I see that for MSRVTT-MC you simply use the 7k split for training, just like Singularity. However, there isn't a checkpoint for MSRVTT-7k. Is the MSRVTT-MC result generated from the 9k split?

Question about video qa

Hello, thanks for your recipe for video-text learning.

I have a question: how can I use VindLU_QA for video description generation? In the paper I only see video QA (yes/no type and multiple-choice).

Question about the Temporal model

Hi, thanks a lot for sharing your solid work; I have learned a lot from your paper and code. I still have a question about the temporal modeling part.
I saw that you compared Timesformer and XCLIP and found that Timesformer works better. However, the XCLIP paper uses pretrained CLIP weights and seeks a trade-off between preserving the performance of the pretrained CLIP weights and adding temporal modeling.
I want to ask whether you have tested XCLIP with pretrained CLIP weights, and whether you found a way to combine Timesformer's temporal modeling with pretrained CLIP weights, which I think would beat XCLIP in theory. 😊

potential video preprocessing on DiDeMo videos

Thanks for the great work!

I used your released checkpoint 25M-retrieval-didemo.pth and directly conducted inference, but got accuracy lower than that reported in the repo:

reported (12 test frm): 61.2
re-inference (12 test frm): 59.5
re-inference (32 test frm): 59.8

I noticed that you seem to preprocess the DiDeMo videos, i.e., 2fps_360_trimed30 in the following config. Do you resize the frames to 360, and what does trimed30 mean? Could you provide the preprocessing script if it has a large impact on the final results? Thanks a lot.

train_file = [
    f"{anno_root_downstream}/didemo_ret_train.json",
    f"{data_root}/didemo_2fps_360_trimed30",
    "video",
]

Two questions about the implementation

Thanks for your great work! I really appreciate the detailed experiments. However, I found some differences between the implementation and the paper:

  1. As in #1, the MVM loss is not used. Does MVM really help?
  2. In the original paper, temporal attention is inserted before spatial attention, as in TimeSformer. But in the code, the temporal attention seems to be inserted after the FFN. Is that better?
    # second residual connection
    layer_output = self.drop_path(layer_output) + hidden_states
    layer_output = einops.rearrange(layer_output, "(b t) l c -> b t l c", b=b)
    # apply temporal modeling block
    if self.temp_model is not None and self.temporal_model_position == "last":
        layer_output = self.temp_model(layer_output)
    outputs = (layer_output,) + outputs
    return outputs

Problems about speed of pretraining

Hi, I am reproducing the pretraining in your work. The ETA shows that it needs over 20 days to complete pretraining on webvid2.5M+cc3M for 10 epochs, which is far from the 1.8 days reported in the paper. Here are all the configs I think are relevant.

8 * A10 cards without slurm; each card has 23GB (similar to the A5000)
OMP_NUM_THREADS=64 # for torchrun
Dataset = webvid2.5M+cc3M (using the .sqlite.db files); the data are pre-processed by preprocess/compress.py. Videos are sampled at 2 fps. The resolution is 224.
num_workers = 32
batch_size =  64
Model: BEIT-base + BERT-base

Now the ETA for one epoch is over 2 days, so 20+ days for 10 epochs. The following is part of the training log:

 utils.basic_utils: Train Epoch: [0]  [  200/10175]  eta: 2 days, 11:04:48  
lr: 0.000002  temperature: 0.0702  image-loss_vtc: 6.2285  
video-loss_vtc: 6.2430  image-loss_mlm: 5.3662  video-loss_mlm: 5.8240  image-loss_vtm: 0.6576  video-loss_vtm: 0.6384  
time: 40.1906  data: 38.0570  max mem: 10768 res mem: 11456

In addition, I followed #9 and set DataLoader(multiprocessing_context="spawn", ...) during pretraining, but it also hits a bug:

Traceback (most recent call last):                                                                                                                                   
  File "tasks/pretrain.py", line 285, in <module>                                                                                                                    
    main(cfg)                                                                                                                                                        
  File "tasks/pretrain.py", line 214, in main                                                                                                                        
    config,                                                                                                                                                          
  File "tasks/pretrain.py", line 59, in train                                                                                                                        
    train_loader = MetaLoader(name2loader=dict(list(zip(media_types, train_loaders))))                                                                               
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in __init__                                                                                        
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}                                                                                              
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in <dictcomp>                                                                                      
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}                                                                                              
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__                                                       
    self._iterator = self._get_iterator()                                                                                                                            
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator                                                  
    return _MultiProcessingDataLoaderIter(self)                                                                                                                      
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__                                                      
    w.start()                                                                                                                                                        
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/process.py", line 112, in start                                                                            
    self._popen = self._Popen(self)                                                                                                                                  
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen                                                                           
    return Popen(process_obj)                                                                                                                                        
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__                                                                       
    self._launch(process_obj)                                                                                                                                        
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch                                                                 
    reduction.dump(process_obj, fp)                                                                                                                                  
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump                                                                            
    ForkingPickler(file, protocol).dump(obj)                                                                                                                         
AttributeError: Can't pickle local object 'create_dataset.<locals>.<lambda>'  

Why does that happen? Thank you for your time!

pretraining logs

Hi! I'm trying to pretrain VindLU using 5M data, can you provide the pretraining logs for reference? Thanks!
