
microsoft / swinbert


Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"

Home Page: https://arxiv.org/abs/2111.13196

License: MIT License

Shell 0.19% Python 99.81%

swinbert's Introduction

SwinBERT

This is our research code for CVPR 2022 paper: SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning.

We present SwinBERT, an end-to-end transformer-based model for video captioning. SwinBERT takes video frame patches directly as inputs, and outputs a natural language description. In this repository, we provide our research code for training and testing SwinBERT for video captioning.

News

  • 05/05/2022: Initial release

Released items

  • Training and evaluation code
  • Inference code
  • Models and training logs
  • TSV dataset annotations
  • Tutorial for Frame-based TSV generation

Model Card

  • We release our best-performing checkpoints for each dataset (corresponding to Table 1 in our paper). For clarity, we report performance on both the validation and test splits below.

  • We also report our results on the private test splits, where the scores are obtained from the VALUE Leaderboard Evaluation Server.

Dataset    Checkpoint    CIDEr (val split)    CIDEr (test split)    CIDEr (private test split)
VATEX      URL           84.4                 73.0                  74.35
MSRVTT     URL           55.1                 53.8                  N/A
MSVD       URL           160                  120.6                 N/A
TVC        URL           57.0                 N/A                   49.74
YouCook2   URL           109                  N/A                   101.39

  • We also release our 32-frame models below.

Dataset    Checkpoint    CIDEr (val split)    CIDEr (test split)    CIDEr (private test split)
VATEX      URL           82.1                 71.6                  73.06
MSRVTT     URL           55.1                 53.8                  N/A
MSVD       URL           147.6                109.4                 N/A
TVC        URL           53.8                 N/A                   47.6
YouCook2   URL           104.8                N/A                   97.69
  • Note: All results are based on a single model; no CIDEr optimization was used in our experiments.

Requirements

We provide a Docker image for easier reproduction. Please make sure Docker and NVIDIA container GPU support (e.g., the NVIDIA Container Toolkit) are installed before proceeding.

We only support Linux with NVIDIA GPUs. We tested on Ubuntu 18.04 with V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended. Our scripts require the user to have docker group membership so that Docker commands can be run without sudo.
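
A quick sanity check could look like the following (a minimal sketch; the CUDA image tag is only an example, pick one that matches your driver):

# Verify that Docker runs without sudo and can see the GPUs
docker run --rm --gpus all nvidia/cuda:11.1.1-base-ubuntu18.04 nvidia-smi
# If this fails with a permission error, add your user to the docker group
sudo usermod -aG docker $USER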

Download

  1. Create folders that store pretrained models, datasets, and predictions.

    export REPO_DIR=$PWD
    mkdir -p $REPO_DIR/models  # pre-trained models
    mkdir -p $REPO_DIR/datasets  # datasets
    mkdir -p $REPO_DIR/predictions  # prediction outputs
  2. Download pretrained models.

    Our pre-trained models can be downloaded with the following command.

    cd $REPO_DIR
    bash scripts/download_models.sh

    The script will download our models trained on VATEX, MSRVTT, MSVD, TVC, and YouCook2, respectively. It will also download our training logs and output predictions.

    The resulting data structure should follow the hierarchy below.

    ${REPO_DIR}  
    |-- models  
    |   |-- table1
    |   |   |-- vatex
    |   |   |   |-- best-checkpoint
    |   |   |   |   |-- model.bin
    |   |   |   |   |-- optmizer_state.bin
    |   |   |   |   |-- pred.*
    |   |   |   |-- tokenizer
    |   |   |   |   |-- added_tokens.json
    |   |   |   |   |-- special_tokens_map.json
    |   |   |   |   |-- vocab.txt
    |   |   |   |-- log
    |   |   |   |   |-- log.txt
    |   |   |   |   |-- args.json
    |   |   |-- msrvtt
    |   |   |-- msvd
    |   |   |-- tvc
    |   |   |-- youcook2
    |   |-- 32frm
    |   |   |-- vatex
    |   |   |   |-- best-checkpoint
    |   |   |   |   |-- model.bin
    |   |   |   |   |-- optmizer_state.bin
    |   |   |   |   |-- pred.*
    |   |   |   |-- tokenizer
    |   |   |   |   |-- added_tokens.json
    |   |   |   |   |-- special_tokens_map.json
    |   |   |   |   |-- vocab.txt
    |   |   |   |-- log
    |   |   |   |   |-- log.txt
    |   |   |   |   |-- args.json
    |   |   |-- msrvtt
    |   |   |-- msvd
    |   |   |-- tvc
    |   |   |-- youcook2
    |-- docs 
    |-- src
    |-- scripts 
    |-- README.md 
    |-- ... 
    |-- ... 
    
  3. Download pretrained Video Swin Transformers.

    To run our code smoothly, please visit Video Swin Transformer to download the pre-trained weights.

    Download swin_base_patch244_window877_kinetics*_22k.pth and place the files under the ${REPO_DIR}/models/video_swin_transformer directory (see the sketch after the tree below). The data structure should follow the hierarchy below.

    ${REPO_DIR}  
    |-- models  
    |   |-- video_swin_transformer
    |   |   |-- swin_base_patch244_window877_kinetics600_22k.pth
    |   |   |-- swin_base_patch244_window877_kinetics400_22k.pth
    |   |-- table1
    |   |-- 32frm
    |-- docs 
    |-- src
    |-- scripts 
    |-- README.md 
    |-- ... 
    |-- ... 
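
    Assuming the two .pth files have already been downloaded from the Video Swin Transformer repository into the current directory, placing them could look like this (a minimal sketch):

    mkdir -p $REPO_DIR/models/video_swin_transformer
    mv swin_base_patch244_window877_kinetics600_22k.pth \
       swin_base_patch244_window877_kinetics400_22k.pth \
       $REPO_DIR/models/video_swin_transformer/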
    
  4. Download prediction files that were evaluated on the VALUE Leaderboard Evaluation Server.

    The prediction files can be downloaded with the following command.

    cd $REPO_DIR
    bash scripts/download_value_preds.sh

    You can submit the prediction files to the VALUE Leaderboard to reproduce our results.

  5. Download datasets for training and evaluation

    In this project, we provide our pre-parsed annotation files in TSV format. To download the files, please use the following command.

    cd $REPO_DIR
    bash scripts/download_annotations.sh

    Following prior studies, we use the standard train/val/test splits for each dataset. Here, we simply reorganize the data into TSV files to better fit our codebase.

    Due to copyright issues, we cannot release the raw videos. We suggest downloading the original raw videos from the official dataset websites. Please place the downloaded videos under the raw_videos or videos directory of each dataset folder.

    The datasets directory structure should follow the hierarchy below.

    ${ROOT}  
    |-- datasets  
    |   |-- VATEX  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- raw_videos  <<< please place the downloaded videos under this folder 
    |   |   |   |-- val_all
    |   |   |   |   |-- *.mp4
    |   |   |   |-- holdout_test
    |   |   |   |   |-- test
    |   |   |   |   |   |-- *.mp4
    |   |-- MSRVTT-v2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- videos <<< please place the downloaded videos under this folder 
    |   |   |   |-- *.mp4 
    |   |-- MSVD  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- videos <<< please place the downloaded videos under this folder 
    |   |   |   |-- *.avi 
    |   |-- TVC  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- videos <<< please place the downloaded videos under this folder 
    |   |   |   |-- bbt_new
    |   |   |   |-- castle
    |   |   |   |-- friends
    |   |   |   |-- grey
    |   |   |   |-- house
    |   |   |   |-- met 
    |   |-- YouCook2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- training <<< please place the downloaded training videos under this folder 
    |   |   |   |-- *.mp4 
    |   |   |-- validation <<< please place the downloaded validation videos under this folder 
    |   |   |   |-- *.mp4 
    |   |   |-- testing <<< please place the downloaded testing videos under this folder 
    |   |   |   |-- *.mp4 
    |-- docs
    |-- src
    |-- scripts
    |-- models 
    |-- README.md 
    |-- ... 
    |-- ... 
    
    

    We also provide example scripts to reproduce our annotation TSV files; they are organized as shown below (see also the usage sketch after the tree).

    ${ROOT}  
    |-- prepro  
    |   |-- tsv_preproc_vatex.py
    |   |-- tsv_preproc_msrvtt.py
    |   |-- tsv_preproc_msvd.py
    |   |-- tsv_preproc_tvc.py
    |   |-- tsv_preproc_youcook2.py
    |-- docs
    |-- src
    |-- scripts
    |-- README.md 
    |-- ... 
    |-- ... 
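
    As an example, regenerating the MSRVTT annotation TSVs could look like the following. This is only a sketch under the assumption that the raw annotation files are already in place under the datasets folder; the scripts may expect specific input locations, so please check their contents before running.

    cd $REPO_DIR
    python prepro/tsv_preproc_msrvtt.py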
    
    

Before Running Code: Launch Docker Container

We provide a Docker image for easier reproduction. Please launch the Docker container before running our code.

export REPO_DIR=$PWD
DATASETS=$REPO_DIR'/datasets/'
MODELS=$REPO_DIR'/models/'
OUTPUT_DIR=$REPO_DIR'/output/'
source launch_container.sh $DATASETS $MODELS $OUTPUT_DIR

Our latest Docker image linjieli222/videocap_torch1.7:fairscale supports the following mixed-precision training methods:

  • torch.amp (with limited GPU memory optimization; deprecated in this codebase)
  • NVIDIA Apex O2
  • DeepSpeed (best setting on VATEX: DeepSpeed fp16 with zero_opt_stage=1)
  • FairScale
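
The method is selected through command-line flags of the training script; the flag combinations below are taken from the training commands later in this README.

# DeepSpeed fp16 (used for VATEX, MSRVTT, MSVD, and YouCook2 below):
#   --mixed_precision_method deepspeed --deepspeed_fp16 --zero_opt_stage 1
# Apex O2 (used for TVC below):
#   --mixed_precision_method apex --amp_opt_level 2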

Quick Demo

We provide a demo to run end-to-end inference on a test video.

Our inference code takes a video as input and generates a video caption.

# After launching the docker container 
EVAL_DIR='./models/table1/vatex/best-checkpoint/'
CHECKPOINT='./models/table1/vatex/best-checkpoint/model.bin'
VIDEO='./docs/G0mjFqytJt4_000152_000162.mp4'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert_inference.py \
       --resume_checkpoint $CHECKPOINT  \
       --eval_model_dir $EVAL_DIR \
       --test_video_fname $VIDEO \
       --do_lower_case \
       --do_test 

The prediction should look like

Prediction: a young boy is showing how to make a paper airplane.

Evaluation

We provide example scripts to evaluate the pre-trained checkpoints.

VATEX

# Assume in the docker container 
EVAL_DIR='./models/table1/vatex/best-checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml VATEX/public_test_128frames.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR

Notes: Our dataloader supports two different modes:

  • Online decoding: extracts video frames on the fly during experiments; requires less data preprocessing effort.
  • Offline decoding: requires storing all extracted frames in a TSV file, but usually runs faster.

For online decoding, please use VATEX/public_test.yaml. For offline decoding, please use VATEX/public_test_128frames.yaml.
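
For example, switching the VATEX evaluation above to online decoding only requires changing the --val_yaml argument (a sketch based on the command above):

# Assume in the docker container 
EVAL_DIR='./models/table1/vatex/best-checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml VATEX/public_test.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR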

MSRVTT

# Assume in the docker container 
EVAL_DIR='./models/table1/msrvtt/best-checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml MSRVTT-v2/val_128frames.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR

For online decoding, please use MSRVTT-v2/val.yaml. For offline decoding, please use MSRVTT-v2/val_128frames.yaml.

YouCook2

# Assume in the docker container 
EVAL_DIR='./models/table1/youcook2/best-checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml YouCook2/testing_128frames.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR

For online decoding, please use YouCook2/testing.yaml. For offline decoding, please use YouCook2/testing_128frames.yaml.

MSVD

# Assume in the docker container 
EVAL_DIR='./models/table1/msvd/best-checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml MSVD/val_32frames.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR

For online decoding, please use MSVD/val.yaml. For offline decoding, please use MSVD/val_32frames.yaml.

TVC

# Assume in the docker container 
EVAL_DIR='./models/table1/tvc/best-checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml TVC/val_128frames.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR

For online decoding, please use TVC/val.yaml. For offline decoding, please use TVC/val_128frames.yaml.

Training

We provide example scripts to train our model (with 32-frame inputs and soft sparse attention).

VATEX

# Assume in the docker container 
python src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/vatex_8frm_default.json \
        --train_yaml VATEX/train_32frames.yaml \
        --val_yaml VATEX/public_test_32frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 15 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 1 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output

MSRVTT

# Assume in the docker container 
python src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/msrvtt_8frm_default.json \
        --train_yaml MSRVTT-v2/train_32frames.yaml \
        --val_yaml MSRVTT-v2/val_32frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 15 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 4 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output

YouCook2

# Assume in the docker container 
python src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/youcook2_8frm_default.json \
        --train_yaml YouCook2/training_128frames.yaml \
        --val_yaml YouCook2/validation_128frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 40 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 4 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output

MSVD

# Assume in the docker container 
python src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/msvd_8frm_default.json \
        --train_yaml MSVD/train_32frames.yaml \
        --val_yaml MSVD/val_32frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 15 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 1 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output

TVC

# Assume in the docker container 
python src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/tvc_8frm_default.json \
        --train_yaml TVC/train_128frames.yaml \
        --val_yaml TVC/val_128frames.yaml \
        --per_gpu_train_batch_size 6 \
        --per_gpu_eval_batch_size 6 \
        --num_train_epochs 40 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method apex \
        --amp_opt_level 2 \
        --gradient_accumulation_steps 1 \
        --learn_mask_enabled \
        --loss_sparse_w 0.1 \
        --output_dir ./output

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{lin2021end-to-end,
title={SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning},
author={Lin, Kevin and Li, Linjie and Lin, Chung-Ching and Ahmed, Faisal and Gan, Zhe and Liu, Zicheng and Lu, Yumao and Wang, Lijuan},
booktitle = {CVPR},
year = {2022},
}

License

Our research code is released under the MIT license.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Acknowledgments

We thank Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhengyuan Yang, Ehsan Azarnasab, Yue Cao, Lei Ji, Huaishao Luo and Ze Liu for their helpful discussions.

We also thank the anonymous reviewers for their constructive feedback.

Our code is built on top of open-source GitHub repositories. We thank all the authors who made their code public, which tremendously accelerates our project progress. If you find these works helpful, please consider citing them as well.

  • huggingface/transformers
  • jayleicn/ClipBERT
  • linjieli222/HERO
  • Microsoft/Oscar&VinVL
  • Microsoft/DeepSpeed
  • Nvidia/Apex
  • FAIR/SlowFast
  • FAIR/FairScale

swinbert's People

Contributors

kevinlin311tw, microsoft-github-operations[bot], microsoftopensource


swinbert's Issues

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/MSVD/frame_tsv/train_32frames.img.tsv'

05/09/2023 23:31:40 - INFO - __main__ - Init model from scratch.
05/09/2023 23:31:40 - INFO - __main__ - Model total parameters: 136106810
05/09/2023 23:31:41 - INFO - __main__ - yaml_file:MSVD/train_32frames.yaml
Traceback (most recent call last):
  File "src/tasks/run_caption_VidSwinBert.py", line 679, in <module>
    main(args)
  File "src/tasks/run_caption_VidSwinBert.py", line 657, in main
    train_dataloader = make_data_loader(args, args.train_yaml, tokenizer, args.distributed, is_train=True)
  File "/videocap/src/datasets/vl_dataloader.py", line 87, in make_data_loader
    dataset = build_dataset(args, yaml_file, tokenizer, is_train=is_train)
  File "/videocap/src/datasets/vl_dataloader.py", line 22, in build_dataset
    return dataset_class(args, yaml_file, tokenizer, tensorizer, is_train, args.on_memory)
  File "/videocap/src/datasets/vision_language_tsv.py", line 364, in __init__
    super(VisionLanguageTSVYamlDataset, self).__init__(
  File "/videocap/src/datasets/vision_language_tsv.py", line 44, in __init__
    self.visual_tsv = self.get_tsv_file(self.visual_file)
  File "/videocap/src/datasets/vision_language_tsv.py", line 129, in get_tsv_file
    tsv_path = find_file_path_in_yaml(tsv_file, self.root)
  File "/videocap/src/utils/load_files.py", line 73, in find_file_path_in_yaml
    raise FileNotFoundError(
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/MSVD/frame_tsv/train_32frames.img.tsv'

When I try to train with the MSVD dataset, the above error is reported.
What should I do to solve this problem?

raise KeyError(key) from None KeyError: 'RANK'

I was running the training command for the VATEX part and got this error. I searched online and found that manually assigning a value to os.environ['RANK'] skips this error, but then a KeyError is raised for os.environ['WORLD_SIZE']. I suspect this problem is not simple and I cannot figure it out. Could anyone advise me on how to get the program to run? That is the first step. Thank you.

File "src/tasks/run_caption_VidSwinBert.py", line 689, in <module>
  main(args)
File "src/tasks/run_caption_VidSwinBert.py", line 675, in main
  args, vl_transformer, optimizer, scheduler = mixed_precision_init(args, vl_transformer)
File "src/tasks/run_caption_VidSwinBert.py", line 105, in mixed_precision_init
  model, optimizer, _, _ = deepspeed.initialize(
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/__init__.py", line 129, in initialize
  dist.init_distributed(dist_backend=dist_backend, dist_init_required=dist_init_required)
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
  init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
  rank = int(os.environ["RANK"])
File "/home/bwang/anaconda3/envs/qysu_vc/lib/python3.8/os.py", line 675, in __getitem__
  raise KeyError(key) from None
KeyError: 'RANK'

What are all the packages and dependencies for running the code? Has anyone run the code successfully in a local conda environment?

Hi! I want to run your code in a conda environment, so I generated environment.yaml from the base conda environment in the Docker container. But when I run conda env create -f environment.yaml, there are always many errors about conflicting packages, which is driving me crazy. Could you do me a favor and explain how to run the code in a local conda environment? Thank you very much!

The reproduced results are inconsistent with the results in the paper

Thanks for your contributions.
I have successfully run the inference code and got the following results on MSRVTT (B4: 41.8, M: 29.9, R: 62.1, C: 54.4).
I found the results were inconsistent with the results in the paper.
I ran the code with:

EVAL_DIR='./models/table1/msrvtt/best_checkpoint/'
CUDA_VISIBLE_DEVICES=0 python src/tasks/run_caption_VidSwinBert.py \
       --val_yaml MSRVTT-v2/test_32frames.yaml  \
       --do_eval true \
       --do_train false \
       --eval_model_dir $EVAL_DIR

Do I need to modify hyperparameters or other settings if I want to reproduce the results in the paper?
Thanks again for your contribution!

Can you share the multi-GPU training command?

I saw you provide the single-GPU training command and ran it successfully.
But I ran into some trouble with multi-GPU training. Could you provide the multi-GPU training command, for example on the MSRVTT dataset?
Thanks for your work!

evalcap preparation

Thanks for sharing this amazing work. I have to run the script in a conda environment instead of Docker. From setup.sh, it seems that /evalcap/coco_caption and /evalcap/cider are required in the project, as the error below shows:

Traceback (most recent call last):
  File "/home/SwinBERT/src/tasks/run_caption_VidSwinBert.py", line 22, in <module>
    from src.evalcap.utils_caption_evaluate import evaluate_on_coco_caption
  File "/home/SwinBERT/src/evalcap/utils_caption_evaluate.py", line 13, in <module>
    from .coco_caption.pycocotools.coco import COCO
ModuleNotFoundError: No module named 'src.evalcap.coco_caption'

Is there any instruction to prepare coco_caption and cider? Thanks for your attention!

TVC Dataset

Hi,

Thanks for the great work and publicly available code.

For the TVC dataset, 3 FPS video frames are provided officially due to copyright issues. According to your code, it seems that you use videos from the TVC dataset. I am wondering how you obtained the videos.

Thanks in advance.

Question about learning rate in Multi-gpu training

Hello, thanks for your awesome work, @kevinlin311tw!

I noticed that in your official tutorial for multi-GPU training, with 2 GPUs you set args.learning_rate = 3e-4 and args.backbone_coef_lr = 0.05, which means the backbone learning rate reaches 1.5e-5 after the warm-up epoch.

In the official TensorBoard log extracted from msrvtt-table1, your model used 16 GPUs. After the warm-up epoch, TensorBoard shows the learning rate also reached 1.5e-5, the same value as in the 2-GPU case above.

Shouldn't the learning rate be changed according to the world size? In my opinion, the learning rate should be larger with a larger world size, but I haven't seen any such operation in your code.

Looking forward to your reply!

Missing Caption Files For YouCook2 Dataset

Hi,

I am unable to evaluate SwinBert on the YouCook2 dataset due to missing coco captions files. The evaluation expects the following files:

validation.caption_coco_format.json
testing.caption_coco_format.json
training.caption_coco_format.json

However, the annotations downloaded for YouCook2 using the provided scripts do not contain these files.

I'm training/evaluating with the default configurations provided in the source code and README, so I expect I should be able to run the code without modifying the evaluation logic for this dataset. Please advise on how to train/evaluate on this dataset. Thanks!

Running the code

Could you share run commands that do not require setting up Docker, for running on a single-GPU server?

Missing swin_small.py

Could you please provide the file msvd-retrieval_train-val-test.json

Thanks for your inspiring work!

When I run an experiment on MSVD, I get FileNotFoundError: [Errno 2] No such file or directory: 'datasets/MSVD/frame_tsv/val_32frames.img.tsv'.

Since datasets/MSVD/frame_tsv is empty, I tried to run tsv_preproc_msvd.py myself.

Could you please provide the file './datasets/MSVD/msvd-retrieval_train-val-test.json',

or the frame_tsv directory instead?

Train MSVD dataset using VATEX pretrained model? Thanks

Hi, I am trying to reproduce the reported performance on the MSVD dataset (CIDEr 120.6), but there is a gap. In my experiment, the first evaluation after initialization is poor; the initial CIDEr is almost 0:
11/02/2022 10:02:18 - INFO - __main__ - evaluation result: {'Bleu_1': 0.0006309148264980254, 'Bleu_2': 2.0612095211839896e-11, 'Bleu_3': 6.744216480610146e-14, 'Bleu_4': 3.9307266631435836e-15, 'METEOR': 0.010075708541163993, 'ROUGE_L': 0.0009159159159159158, 'CIDEr': 4.846693412127264e-06, 'SPICE': 0.0015076134016082418}

But in the provided log (./models/table1/msvd/log/log.txt), the first evaluation shows a high CIDEr of 50.88:
11/11/2021 21:22:26 - INFO - __main__ - evaluation result: {'Bleu_1': 0.6993166287010636, 'Bleu_2': 0.5518608344658137, 'Bleu_3': 0.43743626324685403, 'Bleu_4': 0.33875510186099217, 'METEOR': 0.36200563653140405, 'ROUGE_L': 0.6269033486705, 'CIDEr': 0.508808824754587, 'SPICE': 0.09748587107161746}

To further check this problem in the log.txt, it seems the training on MSVD is initialized from a model pretrained on the VATEX dataset, by using a specific pretrained_checkpoint ("pretrained_checkpoint": "/xiyin1wu2_maskrcnn/keli/debug_output/videocap_github/vatex/20211023_VidSwinBert_base_224_seq_len50_bsz6_ep15_lr3e-4_mul0.05_featD512_frame32_wkr10_k600_2d0_mlm_mask0.5_45_grad_accu1_sparsemask0.5//checkpoint-15-40605/"), as

11/11/2021 21:21:23 - INFO - __main__ -   Init model from scratch.
11/11/2021 21:21:23 - INFO - __main__ -   Model total parameters: 136106810
11/11/2021 21:21:23 - INFO - __main__ -   video swin (config path): src/modeling/video_swin/swin_base_patch244_window877_kinetics600_22k.py
11/11/2021 21:21:26 - INFO - __main__ -   Loading state dict from checkpoint /xiyin1wu2_maskrcnn/keli/debug_output/videocap_github/vatex/20211023_VidSwinBert_base_224_seq_len50_bsz6_ep15_lr3e-4_mul0.05_featD512_frame32_wkr10_k600_2d0_mlm_mask0.5_45_grad_accu1_sparsemask0.5//checkpoint-15-40605/model.bin

Is it necessary to reproduce the MSVD performance using a VATEX pretrained model? Thanks for your time and attention!

BTW, the TVC log is also trained from a pretrained_checkpoint, but the MSRVTT, YouCook2 and VATEX logs are trained from scratch.

Inconsistent reproduced results with the results reported in the paper

Hi,
when I reproduced the test process, I found that the results were inconsistent with the results reported in your paper, especially on the MSVD dataset, where there was a large gap.
When I use test.yaml as the "--val_yaml", the result I got is CIDEr: 109.4, which is the same as the result you reported on github page (https://github.com/microsoft/SwinBERT).
When I use test_32frames.yaml, the result I got is CIDEr: 120.6, which is also the same as the result you reported on github page (https://github.com/microsoft/SwinBERT).

However, these two results are far from the results you reported in the paper (CIDEr: 149.4).
Are there any parameter settings or tricks used during testing? And why is there such a big performance gap?

149.4 vs 120.6 for CIDEr on MSVD dataset

I saw that in the CVF open access version (Table 2), the CIDEr performance on the MSVD dataset is 149.4.
However, in the latest arXiv version, the performance is only 120.6.

I wonder which one is correct: the CVF open access version or the arXiv version?

This is also written in the CVF version:
"Specifically, SWINBERT brings significant CIDEr improvements on MSVD (i.e., +54.2 higher than the prior arts)."
It seems that you compared 149.4 to 95.2 (ORG-TL).

CVF open access version:
https://openaccess.thecvf.com/content/CVPR2022/papers/Lin_SwinBERT_End-to-End_Transformers_With_Sparse_Attention_for_Video_Captioning_CVPR_2022_paper.pdf

How can I get frame_tsv/train_32frames.img.lineidx for MSVD?

Hi,

There is no train_32frames.img.lineidx in the provided code. How can I get train_32frames.img.tsv, train_128frames_img_size256.img.tsv, test_32frames.img.lineidx, and train_32frames.img.lineidx?

What is the difference between train_32frames.img.tsv and train_128frames_img_size256.img.tsv?

Questions about the value of 'loss_sparse_w' in command

I guess it's the regularization hyperparameter of $Loss_{SPARSE}$, i.e., the $\lambda$ in your paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda$ = 5. But why is the value of 'loss_sparse_w' in the command 0.5? Do we need to adjust it to 5? Thank you!

Training problems when reproduce the results of MSRVTT dataset

Hi, I want to reproduce the results on the MSRVTT dataset by training the model from scratch. Before doing so, I reproduced the MSRVTT results using the officially released checkpoint (CIDEr 54.7 on val, CIDEr 54.3 on test). Then I used the provided code to train the model; the problem is that the MLM accuracy suddenly drops after a few training epochs. The training logs are as follows:
In epoch 3, MLM accuracy drops to around 0.1 and val-set CIDEr drops to 0.0.
I trained with the Apex O1 or Apex O0 method.

05/26/2022 21:20:08 - INFO - __main__ -   Save checkpoint to ./experiments/output_msrvtt_new/checkpoint-2-10854
05/26/2022 21:20:08 - INFO - __main__ -   Perform evaluation at iteration 43416, global_step 10854
05/26/2022 21:21:27 - INFO - __main__ -   Inference model computing time: 0.9087514216641346 seconds per batch
05/26/2022 21:21:52 - INFO - __main__ -   evaluation result: {'Bleu_1': 0.7691577416525643, 'Bleu_2': 0.6165830281270223, 'Bleu_3': 0.47283724241898767, 'Bleu_4': 0.34628522979313686, 'METEOR': 0.25367347312377875, 'ROUGE_L': 0.5866882628069936, 'CIDEr': 0.3412248963946503, 'SPICE': 0.049780527329883785}
05/26/2022 21:21:52 - INFO - __main__ -   evaluation result saved to ./experiments/output_msrvtt_new/checkpoint-2-10854/pred.MSRVTT-v2.val_32frames.beam1.max20.eval.json
05/26/2022 21:22:30 - INFO - __main__ -   eta: 4 days, 20:42:24  iter: 43440  global_step: 10860  speed: 0.5 images/sec  loss: 4.2006 (4.8763)  loss_sparsity: 0.2658 (0.3960)  acc: 0.3333 (0.2884)  batch_time: 1.5384 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.78e-04  max mem: 38796
05/26/2022 21:24:35 - INFO - __main__ -   eta: 4 days, 20:50:39  iter: 43520  global_step: 10880  speed: 1.0 images/sec  loss: 4.2026 (4.8756)  loss_sparsity: 0.2658 (0.3957)  acc: 0.2647 (0.2885)  batch_time: 1.5414 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.78e-04  max mem: 38796
05/26/2022 21:26:41 - INFO - __main__ -   eta: 4 days, 20:43:37  iter: 43600  global_step: 10900  speed: 1.0 images/sec  loss: 4.6793 (4.8763)  loss_sparsity: 0.2660 (0.3955)  acc: 0.2727 (0.2883)  batch_time: 1.5409 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.78e-04  max mem: 38796
05/26/2022 21:28:46 - INFO - __main__ -   eta: 4 days, 20:38:21  iter: 43680  global_step: 10920  speed: 1.0 images/sec  loss: 3.9995 (4.8757)  loss_sparsity: 0.2661 (0.3953)  acc: 0.3214 (0.2883)  batch_time: 1.5404 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.77e-04  max mem: 38796
........
05/26/2022 22:39:55 - INFO - __main__ -   eta: 4 days, 19:14:52  iter: 46400  global_step: 11600  speed: 1.0 images/sec  loss: 4.2682 (4.8405)  loss_sparsity: 0.2615 (0.3876)  acc: 0.3429 (0.2903)  batch_time: 1.5413 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:42:01 - INFO - __main__ -   eta: 4 days, 19:00:52  iter: 46480  global_step: 11620  speed: 1.0 images/sec  loss: 5.2393 (4.8417)  loss_sparsity: 0.2613 (0.3874)  acc: 0.1944 (0.2901)  batch_time: 1.5403 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:44:06 - INFO - __main__ -   eta: 4 days, 19:05:06  iter: 46560  global_step: 11640  speed: 1.0 images/sec  loss: 5.4463 (4.8426)  loss_sparsity: 0.2615 (0.3872)  acc: 0.2000 (0.2899)  batch_time: 1.5401 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:46:11 - INFO - __main__ -   eta: 4 days, 18:53:37  iter: 46640  global_step: 11660  speed: 1.0 images/sec  loss: 5.8740 (4.8429)  loss_sparsity: 0.2616 (0.3870)  acc: 0.1081 (0.2899)  batch_time: 1.5391 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:48:17 - INFO - __main__ -   eta: 4 days, 18:51:16  iter: 46720  global_step: 11680  speed: 1.0 images/sec  loss: 6.3330 (4.8455)  loss_sparsity: 0.2613 (0.3868)  acc: 0.0741 (0.2895)  batch_time: 1.5399 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:50:22 - INFO - __main__ -   eta: 4 days, 18:46:13  iter: 46800  global_step: 11700  speed: 1.0 images/sec  loss: 6.1326 (4.8479)  loss_sparsity: 0.2611 (0.3866)  acc: 0.0526 (0.2892)  batch_time: 1.5395 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:52:28 - INFO - __main__ -   eta: 4 days, 18:34:46  iter: 46880  global_step: 11720  speed: 1.0 images/sec  loss: 6.0650 (4.8498)  loss_sparsity: 0.2600 (0.3864)  acc: 0.1143 (0.2889)  batch_time: 1.5386 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:54:33 - INFO - __main__ -   eta: 4 days, 18:36:20  iter: 46960  global_step: 11740  speed: 1.0 images/sec  loss: 5.8276 (4.8519)  loss_sparsity: 0.2579 (0.3861)  acc: 0.1250 (0.2886)  batch_time: 1.5383 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:56:38 - INFO - __main__ -   eta: 4 days, 18:56:10  iter: 47040  global_step: 11760  speed: 1.0 images/sec  loss: 5.7397 (4.8539)  loss_sparsity: 0.2557 (0.3859)  acc: 0.1064 (0.2883)  batch_time: 1.5380 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:58:44 - INFO - __main__ -   eta: 4 days, 19:09:41  iter: 47120  global_step: 11780  speed: 1.0 images/sec  loss: 6.1058 (4.8561)  loss_sparsity: 0.2536 (0.3857)  acc: 0.1111 (0.2880)  batch_time: 1.5382 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 23:00:50 - INFO - __main__ -   eta: 4 days, 18:31:12  iter: 47200  global_step: 11800  speed: 1.0 images/sec  loss: 6.1348 (4.8582)  loss_sparsity: 0.2515 (0.3855)  acc: 0.0857 (0.2877)  batch_time: 1.5371 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.36e-05  lr (LM): 2.73e-04  max mem: 38796
.........
05/27/2022 06:49:13 - INFO - __main__ -   ModelSaver save trial NO. 0
05/27/2022 06:49:17 - INFO - __main__ -   Save checkpoint to ./experiments/output_msrvtt_new/checkpoint-3-16281
05/27/2022 06:49:17 - INFO - __main__ -   Perform evaluation at iteration 65124, global_step 16281
05/27/2022 06:51:28 - INFO - __main__ -   Inference model computing time: 1.5383756045835564 seconds per batch
05/27/2022 06:51:47 - INFO - __main__ -   evaluation result: {'Bleu_1': 0.1594008495416768, 'Bleu_2': 0.008687056706128082, 'Bleu_3': 2.1171728390057318e-08, 'Bleu_4': 3.3589638000914805e-11, 'METEOR': 0.05546962152086046, 'ROUGE_L': 0.21322680785039416, 'CIDEr': 1.1242001133315049e-05, 'SPICE': 0.0}
05/27/2022 06:51:47 - INFO - __main__ -   evaluation result saved to ./experiments/output_msrvtt_new/checkpoint-3-16281/pred.MSRVTT-v2.val_32frames.beam1.max20.eval.json

About Attention Mask

Hi, thanks for your code sharing.

I am confused about the values of attention_mask in your code, which are either 0 or 1 when they are not learnable.

As the attention_mask is added to the attention_scores before the softmax operation, shouldn't the values of attention_mask be either 0 or a large negative value (e.g., -1e6)?

By adding large negative values to some positions, we can make sure that those positions are not observable (i.e., they will not be attended to) by the query token. Your implementation doesn't seem to guarantee that. Do I get it wrong?

gpu memory requirements?

Hi, thanks for the great resource.

I am trying to train on the YouCook2 dataset and am facing an error related to GPU resources.
I am using a single 3080Ti 10GiB GPU and modified the training command by changing the batch size as below.

python src/tasks/run_caption_VidSwinBert.py \
        --config src/configs/VidSwinBert/youcook2_8frm_default.json \
        --train_yaml YouCook2/training_32frames.yaml \
        --val_yaml YouCook2/validation_32frames.yaml \
        --per_gpu_train_batch_size 1 \
        --per_gpu_eval_batch_size 1 \
        --num_train_epochs 40 \
        --learning_rate 0.0003 \
        --max_num_frames 32 \
        --pretrained_2d 0 \
        --backbone_coef_lr 0.05 \
        --mask_prob 0.5 \
        --max_masked_token 45 \
        --zero_opt_stage 1 \
        --mixed_precision_method deepspeed \
        --deepspeed_fp16 \
        --gradient_accumulation_steps 4 \
        --learn_mask_enabled \
        --loss_sparse_w 0.5 \
        --output_dir ./output

Although I set the training batch size to 1, the command returns the following error.

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 9.78 GiB total capacity; 7.25 GiB already allocated; 40.00 MiB free; 7.44 GiB reserved in total by PyTorch)

Can I ask what the minimum GPU memory requirement is?
Thanks,

couldn't find msrvtt-32frm.zip

When I run bash scripts/download_models.sh, I get this error:

--2022-05-23 21:26:20-- https://datarelease.blob.core.windows.net/swinbert/models/msrvtt-32frm.zip
Resolving datarelease.blob.core.windows.net (datarelease.blob.core.windows.net)... 20.150.35.196
Connecting to datarelease.blob.core.windows.net (datarelease.blob.core.windows.net)|20.150.35.196|:443... connected.
HTTP request sent, awaiting response... 404 The specified blob does not exist.
2022-05-23 21:26:21 ERROR 404: The specified blob does not exist..

Archive: ../SwinBERT/models/32frm/msrvtt-32frm.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of ../SwinBERT/models/32frm/msrvtt-32frm.zip or
../SwinBERT/models/32frm/msrvtt-32frm.zip.zip, and cannot find ..SwinBERT/models/32frm/msrvtt- 32frm.zip.ZIP, period.

anaconda environment instead of Docker

Thank you for your amazing work.
Can I use an Anaconda environment (conda env) instead of the Docker container environment, with Ubuntu 20.04 and an RTX 3060 GPU?

Question for training process

Thanks for your contributions.
When I train the model following the README, I run into this issue:
#########################################
SwinBERT/src/modeling/load_bert.py", line 12, in get_bert_model
config.img_feature_type = 'frcnn'
AttributeError: 'NoneType' object has no attribute 'img_feature_type'
#########################################
When I debug the code, I find that it seems to need the models/bert-base-uncased/ data.
Is this error caused by missing files?

Thanks again for your contribution

One data typo

Hi,

The numbers reported for MSRVTT open-book (red mark) appear to be wrong (they seem to be mixed up with VATEX open-book). According to the Open-Book paper (the second picture), B4, M, and R should be 42.8, 29.3, and 61.7.


Paper: open-book

Inference error with CPU

Hello guys,
thank you for your amazing work and code.

I have replicated the container environment within a conda env. Everything works fine; the inference code works well with CUDA. However, when I set the device to cpu (in models/table1/vatex/log/args.json), I run into the following error:

Traceback (most recent call last):
  File "src/tasks/run_caption_VidSwinBert_inference.py", line 231, in <module>
    main(args)
  File "src/tasks/run_caption_VidSwinBert_inference.py", line 226, in main
    inference(args, args.test_video_fname, vl_transformer, tokenizer, tensorizer)
  File "src/tasks/run_caption_VidSwinBert_inference.py", line 99, in inference
    outputs = model(**inputs)
  File "~/miniconda3/envs/swinbert/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~/SwinBERT/src/modeling/video_captioning_e2e_vid_swin_bert.py", line 53, in forward
    video_attention = (1. - diag_mask)*learn_att
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The inference command line is the following:
python src/tasks/run_caption_VidSwinBert_inference.py --resume_checkpoint models/table1/vatex/best-checkpoint/model.bin --eval_model_dir models/table1/vatex/best-checkpoint/ --test_video_fname docs/G0mjFqytJt4_000152_000162.mp4 --do_lower_case --do_test

The relevant packages within the virtual environment are as follows:

python                    3.8.5
pytorch                   1.8.0
torchvision              0.9.0

However, by editing line 52 of src/modeling/video_captioning_e2e_vid_swin_bert.py with

            if kwargs['attention_mask'].is_cuda:
                diag_mask = torch.diag(torch.ones(vid_att_len)).cuda()
            else:
                diag_mask = torch.diag(torch.ones(vid_att_len))

everything works fine even with cpu.

Looking forward to your feedback. Thank you for the awesome work!

Questions about masked_loss_img

Hi,
I find the masked_loss_img is not used in the code since masked_pos_img is always None.
Does it mean that the MSE loss between the img_feats and their corresponding BERT output (whose size is indicated by M in the paper) is not considered in the final loss?
Thanks a lot!

Questions About Frames extracting

In line 85 of SwinBERT/create_image_frame_tsv.py:
" current_image_path = previous_image_path "

Does this mean that when the number of extracted images is less than num_frames, you pad them to num_frames with the last image? This step is a little confusing to me. Is the result different from not copying the last image?

Do the input lengths have to be fixed during training?

Thank you for your contribution. I'm curious whether fixed-length inputs (such as the 32-frame inputs in your example) are the only inputs that can be used during the training phase, or whether inputs of any length can be used. Thank you!

The link to raw videos

Hello, guys! Thank you for providing the code. Could you provide a link to download the raw videos?

Trimming the YouCook2 videos + "./datasets/YouCook2/yc2_subtitles.jsonl"?

Thanks for your awesome work.
I have two questions about running the code on the YouCook2 dataset.

(1) It seems that to run on the YouCook2 dataset, we need to trim the downloaded YC2 videos with segments in "youcookii_annotations_trainval.json", right?
(i.e., GLd3aX16zBg -> GLd3aX16zBg_0, GLd3aX16zBg_1, GLd3aX16zBg_2, GLd3aX16zBg_3, GLd3aX16zBg_4)
Can you share the method of how you did it?
I have tried "ffmpeg -i input.mp4 -ss $START_TIME -to $END_TIME -c copy trim.mp4", but the command converts some frames into black images.

(2) Where can we get "./datasets/YouCook2/yc2_subtitles.jsonl" in "prepro/tsv_preproc_youcook2.py"?
I couldn't find it on the YC2 website.

Using the VATEX setup to train on my own Chinese data

I want to use the VATEX setup to train on my own Chinese data. I count the length of Chinese words based on the Unicode format, and I also used a JSON parser to check the format of each line, which appears to be correct. Training completes the first epoch successfully, but I get an error in the second epoch.

self.image_keys[img_idx]: datasets/VATE2/raw_videos/val_all/A19_2.mp4
train(args, train_dataloader, val_dataloader, vl_transformer, tokenizer, training_saver, optimizer, scheduler)
  File "src/tasks/run_caption_VidSwinBert.py", line 146, in train
    for iteration, (img_keys, batch, meta_data) in enumerate(train_dataloader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1179, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
TypeError: __init__() missing 2 required positional arguments: 'doc' and 'pos'

Training on my own dataset, but the metrics all equal 0 and the output file (COCO JSON format) does not contain any captions.

I want to train on my own dataset, which I built in the same format as VATEX. After training, the output file pred.LS.public_test_32frames.beam1.max20_coco_format.json contains no captions (compared with the best VATEX checkpoint). Has anyone faced this issue and can help with this situation? Here is my log.
07/20/2022 01:51:44 - INFO - main - ModelSaver save trial NO. 0
07/20/2022 01:51:56 - INFO - main - Save checkpoint to output/ls_default1/checkpoint-8-232
07/20/2022 01:51:56 - INFO - main - Perform evaluation at iteration 232, global_step 232
07/20/2022 01:51:57 - INFO - main - Inference model computing time: 0.05405998229980469 seconds per batch
07/20/2022 01:52:06 - INFO - main - evaluation result: {'Bleu_1': 0.0, 'Bleu_2': 0.0, 'Bleu_3': 0.0, 'Bleu_4': 0.0, 'METEOR': 0.0, 'ROUGE_L': 0.0, 'CIDEr': 0.0, 'SPICE': 0.0}
07/20/2022 01:52:06 - INFO - main - evaluation result saved to output/ls_default1/checkpoint-8-232/pred.LS.public_test_32frames.beam1.max20.eval.json
07/20/2022 01:52:08 - INFO - main - eta: 0:00:17 iter: 240 global_step: 240 speed: 16.4 images/sec loss: 4.3013 (5.2290) acc: 0.1379 (0.1169) batch_time: 0.3488 (0.3479) data_time: 0.0001 (0.0002) lr (Visual Encoder): 2.99e-06 lr (LM): 5.98e-05 max mem: 7632
07/20/2022 01:52:16 - INFO - main - eta: 0:00:10 iter: 260 global_step: 260 speed: 68.0 images/sec loss: 4.1436 (5.1596) acc: 0.1250 (0.1181) batch_time: 0.3514 (0.3483) data_time: 0.0001 (0.0002) lr (Visual Encoder): 1.84e-06 lr (LM): 3.68e-05 max mem: 7632
07/20/2022 01:52:16 - INFO - main - ModelSaver save trial NO. 0
07/20/2022 01:52:28 - INFO - main - Save checkpoint to output/ls_default1/checkpoint-9-261
07/20/2022 01:52:28 - INFO - main - Perform evaluation at iteration 261, global_step 261
07/20/2022 01:52:30 - INFO - main - Inference model computing time: 0.05387449264526367 seconds per batch
07/20/2022 01:52:38 - INFO - main - evaluation result: {'Bleu_1': 0.0, 'Bleu_2': 0.0, 'Bleu_3': 0.0, 'Bleu_4': 0.0, 'METEOR': 0.0, 'ROUGE_L': 0.0, 'CIDEr': 0.0, 'SPICE': 0.0}
07/20/2022 01:52:38 - INFO - main - evaluation result saved to output/ls_default1/checkpoint-9-261/pred.LS.public_test_32frames.beam1.max20.eval.json
07/20/2022 01:52:45 - INFO - main - eta: 0:00:03 iter: 280 global_step: 280 speed: 16.3 images/sec loss: 4.0845 (5.0965) acc: 0.1481 (0.1193) batch_time: 0.3515 (0.3487) data_time: 0.0001 (0.0002) lr (Visual Encoder): 6.90e-07 lr (LM): 1.38e-05 max mem: 7632
07/20/2022 01:52:49 - INFO - main - eta: 0:00:00 iter: 290 global_step: 290 speed: 136.6 images/sec loss: 4.1788 (5.0666) acc: 0.1600 (0.1207) batch_time: 0.3501 (0.3488) data_time: 0.0001 (0.0002) lr (Visual Encoder): 1.15e-07 lr (LM): 2.30e-06 max mem: 7632
07/20/2022 01:52:49 - INFO - main - ModelSaver save trial NO. 0
07/20/2022 01:53:02 - INFO - main - Save checkpoint to output/ls_default1/checkpoint-10-290
07/20/2022 01:53:02 - INFO - main - Perform evaluation at iteration 290, global_step 290
07/20/2022 01:53:03 - INFO - main - Inference model computing time: 0.05416369438171387 seconds per batch
07/20/2022 01:53:12 - INFO - main - evaluation result: {'Bleu_1': 0.0, 'Bleu_2': 0.0, 'Bleu_3': 0.0, 'Bleu_4': 0.0, 'METEOR': 0.0, 'ROUGE_L': 0.0, 'CIDEr': 0.0, 'SPICE': 0.0}
07/20/2022 01:53:12 - INFO - main - evaluation result saved to output/ls_default1/checkpoint-10-290/pred.LS.public_test_32frames.beam1.max20.eval.json
07/20/2022 01:53:12 - INFO - main - Total training time: 0:05:55.848689 (1.2271 s / iter)

Using my own data to train

Thanks for your excellent code.

I would like to know: if I want to train on my own dataset to fit my use case, do you have any steps or suggestions for this?
