ttengwang / pdvc

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

License: MIT License

Python 78.08% Shell 5.83% C++ 1.46% Cuda 14.64%
dense-video-captioning activitynet-captions youcook2 video-paragraph-captioning

pdvc's Introduction

PDVC

Official implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

[paper] [VALSE paper digest (in Chinese)]

This repo supports:

  • two video captioning tasks: dense video captioning and video paragraph captioning
  • two datasets: ActivityNet Captions and YouCook2
  • three types of video features: C3D, TSN, and TSP
  • visualization of the generated captions on your own videos


Updates

  • (2021.11.19) added code for running PDVC on raw videos and visualizing the generated captions (supports Chinese and other non-English languages)
  • (2021.11.19) added pretrained models with TSP features. They achieve 9.03 METEOR (2021) and 6.05 SODA_c, very competitive results on ActivityNet Captions without self-critical sequence training.
  • (2021.08.29) added TSN pretrained models and YouCook2 support

Introduction

PDVC is a simple yet effective framework for end-to-end dense video captioning with parallel decoding. It formulates dense caption generation as a set prediction task. Without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captioning results, surpassing state-of-the-art methods while its localization accuracy is on par with them.

(pdvc.jpg: framework overview)

Preparation

Environment: Linux, GCC>=5.4, CUDA >= 9.2, Python>=3.7, PyTorch>=1.5.1
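
A quick way to confirm the environment matches these requirements is to print the versions Python and PyTorch report (a minimal check, nothing repo-specific):

# Minimal environment check (assumes PyTorch is already installed).
import sys
import torch

print("Python :", sys.version.split()[0])   # expect >= 3.7
print("PyTorch:", torch.__version__)        # expect >= 1.5.1
print("CUDA   :", torch.version.cuda)       # expect >= 9.2
print("GPU    :", torch.cuda.is_available())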

  1. Clone the repo
git clone --recursive https://github.com/ttengwang/PDVC.git
  2. Create a virtual environment with conda
conda create -n PDVC python=3.7
source activate PDVC
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirement.txt
  3. Compile the deformable attention layer (requires GCC >= 5.4).
cd pdvc/ops
sh make.sh
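
After make.sh finishes, a simple import is enough to confirm the extension is usable (a sanity-check sketch; the module name MultiScaleDeformableAttention is taken from the build logs quoted in the issues below):

# Sanity check for the compiled deformable attention op.
import torch
import MultiScaleDeformableAttention  # an ImportError here means the build failed

print("Deformable attention extension loaded; CUDA available:", torch.cuda.is_available())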

Running PDVC on Your Own Videos

Download a pretrained model (GoogleDrive) with TSP features and put it into ./save. Then run:

video_folder=visualization/videos
output_folder=visualization/output
pdvc_model_path=save/anet_tsp_pdvc/model-best.pth
output_language=en
bash test_and_visualize.sh $video_folder $output_folder $pdvc_model_path $output_language

Check the $output_folder; you will see a new video with embedded captions. Note that non-English captions are generated by translating the English captions with Google Translate. To produce Chinese captions, set output_language=zh-cn. For other languages, find the abbreviation of your language at this url, and you may also need to download a font that supports your language and put it into ./visualization.

(demo.gif: example video with embedded captions)
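
The raw captions are also written to a JSON file under the output folder: the visualization script reads visualization/output/generated_captions/dvc_results.json and its 'results' key (see the issue quoting visualization.py below). A minimal inspection sketch; the per-event keys 'timestamp' and 'sentence' are assumed field names:

# Print the generated dense captions (sketch).
import json

with open("visualization/output/generated_captions/dvc_results.json") as f:
    results = json.load(f)["results"]          # same key visualization.py reads

for video_id, events in results.items():
    print(video_id)
    for event in events:
        # 'timestamp' and 'sentence' are assumed per-event fields
        print("  ", event.get("timestamp"), event.get("sentence"))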

Training and Validation

Download Video Features

cd data/anet/features
bash download_anet_c3d.sh
# bash download_anet_tsn.sh
# bash download_i3d_vggish_features.sh
# bash download_tsp_features.sh

The preprocessed C3D features have been uploaded to a Baidu Yun drive.
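
Each video's features are stored as an individual .npy array with one row per temporal segment (the provided C3D features have 500 columns; see the C3D feature question in the issues below). A minimal sketch for inspecting one file; the path and file name here are illustrative placeholders, not the repo's exact layout:

# Peek at one downloaded feature file (replace the placeholder file name with an
# actual "v_<ActivityNet video id>.npy" produced by the download scripts).
import numpy as np

feat = np.load("data/anet/features/c3d/v_VIDEO_ID.npy")
print(feat.shape)   # (num_segments, feature_dim), e.g. feature_dim = 500 for C3D
print(feat.dtype)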

Dense Video Captioning

  1. PDVC with learnt proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}
# The script will evaluate the model for every epoch. The results and logs are saved in `./save`.

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
  2. PDVC with ground-truth proposals
# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}
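
Training writes the best checkpoint to model-best.pth inside the experiment folder under ./save (the folder name matches eval_folder). A quick way to inspect it (a sketch; the 'model' key is the one eval.py loads, as shown in an issue traceback below):

# Inspect a saved checkpoint (sketch).
import torch

ckpt = torch.load("save/anet_c3d_pdvc/model-best.pth", map_location="cpu")
print("checkpoint keys:", list(ckpt.keys()))
state_dict = ckpt["model"]   # eval.py calls model.load_state_dict(ckpt['model'])
print(len(state_dict), "parameter tensors in the model state dict")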

Video Paragraph Captioning

  1. PDVC with learnt proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID} 

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
  2. PDVC with ground-truth proposals
# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}

Performance

Dense video captioning (with learnt proposals)

| Model | Features | config_path | Url | Recall | Precision | BLEU4 | METEOR (2018) | METEOR (2021) | CIDEr | SODA_c |
|---|---|---|---|---|---|---|---|---|---|---|
| PDVC_light | C3D | cfgs/anet_c3d_pdvcl.yml | Google Drive | 55.30 | 58.42 | 1.55 | 7.13 | 7.66 | 24.80 | 5.23 |
| PDVC | C3D | cfgs/anet_c3d_pdvc.yml | Google Drive | 55.20 | 57.36 | 1.82 | 7.48 | 8.09 | 28.16 | 5.47 |
| PDVC_light | TSN | cfgs/anet_tsn_pdvcl.yml | Google Drive | 55.34 | 57.97 | 1.66 | 7.41 | 7.97 | 27.23 | 5.51 |
| PDVC | TSN | cfgs/anet_tsn_pdvc.yml | Google Drive | 56.21 | 57.46 | 1.92 | 8.00 | 8.63 | 29.00 | 5.68 |
| PDVC_light | TSP | cfgs/anet_tsp_pdvcl.yml | Google Drive | 55.24 | 57.78 | 1.77 | 7.94 | 8.55 | 28.25 | 5.95 |
| PDVC | TSP | cfgs/anet_tsp_pdvc.yml | Google Drive | 55.79 | 57.39 | 2.17 | 8.37 | 9.03 | 31.14 | 6.05 |


Video paragraph captioning (with learnt proposals)

| Model | Features | config_path | BLEU4 | METEOR | CIDEr |
|---|---|---|---|---|---|
| PDVC | C3D | cfgs/anet_c3d_pdvc.yml | 9.67 | 14.74 | 16.43 |
| PDVC | TSN | cfgs/anet_tsn_pdvc.yml | 10.18 | 15.96 | 20.66 |
| PDVC | TSP | cfgs/anet_tsp_pdvc.yml | 10.46 | 16.42 | 20.91 |

Notes:

  • Paragraph-level scores are evaluated on the ActivityNet Entity ae-val set.

Citation

If you find this repo helpful, please consider citing:

@inproceedings{wang2021end,
  title={End-to-End Dense Video Captioning with Parallel Decoding},
  author={Wang, Teng and Zhang, Ruimao and Lu, Zhichao and Zheng, Feng and Cheng, Ran and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6847--6857},
  year={2021}
}
@ARTICLE{wang2021echr,
  author={Wang, Teng and Zheng, Huicheng and Yu, Mingjing and Tian, Qian and Hu, Haifeng},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Event-Centric Hierarchical Representation for Dense Video Captioning}, 
  year={2021},
  volume={31},
  number={5},
  pages={1890-1900},
  doi={10.1109/TCSVT.2020.3014606}}

Acknowledgement

The implementation of the Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. We thank the authors for their efforts.

pdvc's People

Contributors

ttengwang

pdvc's Issues

BLEU4/CIDEr

I'm sorry to bother you again, professor. Regarding the BLEU4 and CIDEr metrics for the dense video captioning task reported in the paper, how can I obtain these two results?

About the experimental results

Hello, when I train PDVC with learnt proposals, should I compare my results with the "Dense video captioning (with learnt proposals)" results in the README, or with the "Predicted proposals" results in the paper? And what is the difference between predicted proposals and learnt proposals? I would appreciate it if you could clear this up for me.

A question about the experimental results

Hello, I am trying to reproduce the experimental results in the paper, but the BLEU4 and METEOR scores I get are higher than those in the paper, while the CIDEr score is much lower. Could you tell me what might cause this?

Does the code support multi-gpu training?

Hi, thanks for your great work! I use the command python train.py --cfg_path ${config_path} --gpu_id 0,1,2,3 to train the model; however, it seems that only the first GPU is working. Does the code support multi-GPU training? Could you share the multi-GPU training command? Thanks a lot.

Ablation study of auxiliary losses?

Hello,
I was wondering about the role of the auxiliary losses on each intermediate decoder layer. Do they help accelerate model convergence, or do they serve another purpose?
Thanks!

about "pred_event_count"

Thank you for the great work!

I am trying to run the model on different videos, but "pred_event_count" always seems to be 3. Is this just a coincidence, or have I done something wrong?

I am using the pretrained TSP features provided in the repo, and the model works well on the demo video ("pred_event_count" is 3 there as well).

"Running PDVC on Your Own Videos": Did i miss something?

Hi,

Thank you for your great work.

I loaded your pretrained model and ran your code on my own video dataset (SumMe, a video summarization benchmark), but the results are really strange: most captions do not reflect the visual content.

(screenshots and the sample video Cooking.mp4 attached)

I just loaded your models and ran them on the video dataset, and most of the generated captions are very strange. Did I miss something?

Thank you.

How do I train on my own dataset?

Thank you very much for your outstanding work. I have my own batch of video data and would like to annotate it and reproduce your method; how should I annotate my dataset? Looking forward to your reply!

visualization

Hello, Professor Wang,
I want to use my own model to visualize some videos from ActivityNet Captions, so that I can compare the sentences generated by the original model and by my own model.
What should I do? Looking forward to your reply. Thanks.

Error when run make.sh

running build
running build_py
running build_ext
building 'MultiScaleDeformableAttention' extension
Emitting ninja build file /home/binzheng/code/PDVC-main/pdvc/ops/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /usr/bin/nvcc -DWITH_CUDA -I/home/binzheng/code/PDVC-main/pdvc/ops/src -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include/TH -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include/THC -I/home/binzheng/anaconda3/envs/PDVCC/include/python3.7m -c -c /home/binzheng/code/PDVC-main/pdvc/ops/src/cuda/ms_deform_attn_cuda.cu -o /home/binzheng/code/PDVC-main/pdvc/ops/build/temp.linux-x86_64-3.7/home/binzheng/code/PDVC-main/pdvc/ops/src/cuda/ms_deform_attn_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=MultiScaleDeformableAttention -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -std=c++14
FAILED: /home/binzheng/code/PDVC-main/pdvc/ops/build/temp.linux-x86_64-3.7/home/binzheng/code/PDVC-main/pdvc/ops/src/cuda/ms_deform_attn_cuda.o 
/usr/bin/nvcc -DWITH_CUDA -I/home/binzheng/code/PDVC-main/pdvc/ops/src -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include/TH -I/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/include/THC -I/home/binzheng/anaconda3/envs/PDVCC/include/python3.7m -c -c /home/binzheng/code/PDVC-main/pdvc/ops/src/cuda/ms_deform_attn_cuda.cu -o /home/binzheng/code/PDVC-main/pdvc/ops/build/temp.linux-x86_64-3.7/home/binzheng/code/PDVC-main/pdvc/ops/src/cuda/ms_deform_attn_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=MultiScaleDeformableAttention -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -std=c++14
nvcc fatal   : Unsupported gpu architecture 'compute_75'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 70, in <module>
    cmdclass={"build_ext": torch.utils.cpp_extension.BuildExtension},
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 670, in build_extensions
    build_ext.build_extensions(self)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
    _build_ext.build_extension(self, ext)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
    depends=ext.depends)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 500, in unix_wrap_ninja_compile
    with_cuda=with_cuda)
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1255, in _write_ninja_file_and_compile_objects
    error_prefix='Error compiling objects for extension')
  File "/home/binzheng/anaconda3/envs/PDVCC/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

This error message appeared when I ran make.sh. How can I solve it? Thank you!

Some questions regarding the dataset

I would like to ask a question about the dataset. Looking at related papers, I found that some evaluate model performance on the ActivityNet Captions validation set, while others evaluate on the ActivityNet Captions test split. What is the difference between the ActivityNet Captions validation set and the test split?

Paper understanding

Hello,
What are the exact dimensions of the input to the deformable transformer encoder?
From what I understood:

  • The input video is sampled to T frames
  • Frame features are extracted with TSN/C3D, giving $\{x_t\}_{t=1}^{T}$
  • Then L convolutional layers are applied to the frame features, giving $\{f_l\}_{l=1}^{L}$

So the input has a temporal dimension of $T \times L$, right?

No such file or directory: 'visualization/output/generated_captions/dvc_results.json'

Hi, when I run the code below:

video_folder=visualization/videos
output_folder=visualization/output
pdvc_model_path=save/anet_tsp_pdvc/model-best.pth
output_language=en
bash test_and_visualize.sh $video_folder $output_folder $pdvc_model_path $output_language

the following error is generated:

from densevid_eval3.SODA.soda import SODA

ModuleNotFoundError: No module named 'densevid_eval3.SODA.soda'
START VISUALIZATION
Traceback (most recent call last):
File "visualization/visualization.py", line 154, in
d = json.load(open(opt.dvc_file))['results']
FileNotFoundError: [Errno 2] No such file or directory: 'visualization/output/generated_captions/dvc_results.json'

So where is dvc_results.json, and how can I get it?

Thanks.

About running PDVC on my own videos

Hello, I am a sophomore who is very interested in this project; thank you for your effort and generous contribution to it. I would like to ask a few questions and would be grateful for your advice.
I tried running PDVC on my own videos and succeeded by following the steps in the README, but after reading test_and_visualize.sh carefully I found that this pipeline only works with the TSP model,
while the model I have been training uses C3D features.
This leads to question 1:
I also noticed that the key step is "START Dense-Captioning", which runs eval.py in Python and uses the feature files generated in the previous steps to produce dvc_caption.json. How can I use my own trained C3D model to generate the .npy features and then use eval.py to generate captions?

I also tried the TSP model and downloaded the files referenced by download_tsp_features.sh. It says to "download the following files and reformat them into data/features/tsp/VIDEO_ID.npy where VIDEO_ID starts with 'v_'", but I don't know how to convert these TSP h5 files into .npy files; only convert_c3d_h5_to_npy.py is provided, and it cannot be used for TSP.

Question 2:
How can I convert the TSP feature files directly into .npy files? I noticed that training reads .npy files.
My guess regarding question 2:
Do I need to follow the TSP README, download the whole ActivityNet dataset, organize it with fiftyone, and then run extract_features_from_a_released_checkpoint.sh to obtain the TSP feature files for training? If so, are the .h5 files used at all?

These questions may be basic and long-winded, but your answers would really help me a lot! Thanks again.

C3D features

Hello, Professor Wang. Looking at the C3D .npy features, I found that every feature file has 500 columns, while the number of rows varies. What exactly do the columns and rows represent? Is each row the feature of one temporal segment of the video? Is there a relationship between the number of columns and rows? Looking forward to your reply; thank you.

A question about object detection

Thank you so much for this wonderful project. When I tried to run your code on my validation set, I ran into some problems. For example, in one video a cat runs out of a Christmas gift box, but the prediction is "a woman runs out of the Christmas gift box". Another of my videos shows some sheep walking, and the prediction says that some horses are walking. So the model can recognize the action but not the type of object. I think it may be a limitation of ActivityNet, because the animal categories in the dataset only contain dogs and horses. Could you please provide pretrained weights obtained by pre-training on ImageNet-22K? I think this may really help the model with object recognition. Finally, thank you for your contribution.

Some questions for your paper

Hello, Teng

I have read your PDVC paper and run the code; it is very good work! However, there are some points in the paper I can't understand; could you explain them?

  1. I can't see how the N queries in the flow chart are obtained. The paper seems to use no anchors; are the queries also produced from pre-set anchors ordered by confidence score?
  2. In row 9 of Table 3 (MT [31] with TSN features), why is it re-evaluated? Is that due to different evaluation tools or different features? Also, the METEOR score of [31] in Table 1 is 9.25, which differs from the re-evaluated value of 4.98; could you help me with this?
  3. Could you explain the difference between the PDVC_light and PDVC methods?

You can use Chinese if you prefer. Thank you very much!

A question about demo video

Thanks for sharing your wonderful work.
I haven't read your paper yet, so based on the demo video I have some questions:
1. Can your PDVC model be considered live video captioning?
2. Is the caption for each event generated directly, without reading all the video frames?
3. How long does it take to generate the caption for one event?

Results on YouCook2 varies with different Runs - Seed is same!

Hi @ttengwang !

thanks for nice work on DVC!

I am able to run the code on YouCook2 with small configuration changes; however, I get different results when running the same code multiple times with the same seed. The metric is SODA_c.

| Method | Validation (SODA_c) |
|---|---|
| Run 1 | 4.171 |
| Run 2 | 3.933 |
| Run 3 | 4.322 |
| Run 4 | 3.958 |
| Average | 4.096 ± 0.159 |

Please let me know your thoughts!

Regards
Anil

Paragraph captioning results with GT proposals are lower than expected

Hello! I am using the TSP features and the pretrained model. When testing with predicted proposals I get results close to those in the README (BLEU4: 10.46, METEOR: 16.43, CIDEr: 20.92), which are better than those in Table 4 of the paper, probably because better features are used.

However, when I test with GT proposals using the same features and model, I get (BLEU4: 11.17, METEOR: 15.58, CIDEr: 22.70), which is clearly worse than the results in Table 4 of the paper. Why is that? Is the model used for testing GT proposals different from the one used for predicted proposals?

If convenient, could you send me the model's prediction results under both the predicted-proposal and GT-proposal settings? We plan to collect results from several models for a human evaluation. My email is [email protected]. Thanks!

Few questions about training

Hello @ttengwang ,
I am trying to train your model from scratch (just for learning purposes). However, I am facing a few issues:

  1. The train_caption_file / val_caption_file does not contain the labels used in video_dataset.py (and in the class loss). Am I using the wrong file?
  2. I tried using labels from the action-proposal dataset (with the captioning-related parts removed), but loss_ce does not decrease at all, in both train and val (did you face anything like this?). Also, loss_ce is in the range of 300-400.
  3. How many epochs did you train before getting decent captions?

A question about MultiScaleDeformableAttention

Hello, after installing MultiScaleDeformableAttention following the steps in the README, I get an import error at runtime. Have you encountered a similar problem, and how can it be solved?

error in ms_deformable_im2col_cuda: invalid device function

48%|████▊ | 2358/4917 [02:40<03:01, 14.12it/s]error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
48%|████▊ | 2360/4917 [02:40<02:55, 14.58it/s]error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function

Hello, what does the above output mean during training? The program can still run, though.

About model evaluation results

Hello, when I evaluate with the trained model you provided, the following happens. Do you know what the cause might be?

/home/yy/anaconda3/envs/DVC1/bin/python /home/yy/桌面/PDVC/eval.py --eval_folder=anet_c3d_pdvc --eval_model_path=model-best.pth
{'eval_save_dir': 'save', 'eval_mode': 'eval', 'test_video_feature_folder': None, 'test_video_meta_data_csv_path': None, 'eval_folder': 'anet_c3d_pdvc', 'eval_model_path': 'model-best.pth', 'eval_tool_version': '2018', 'eval_caption_file': 'data/anet/captiondata/val_1.json', 'eval_proposal_type': 'gt', 'eval_transformer_input_type': 'queries', 'gpu_id': ['0'], 'eval_device': 'cuda'}
load info from save/anet_c3d_pdvc/info.json
load translator, total_vocab: %d 5747
load captioning file, %d captioning loaded 4917
/home/yy/anaconda3/envs/DVC1/lib/python3.7/site-packages/torch/nn/modules/rnn.py:61: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1
"num_layers={}".format(dropout, num_layers))
all decoder layers share the same caption head
Loading model from save/anet_c3d_pdvc/model-best.pth
alpha: 1.0, temp: 2.0
loss: OrderedDict([('loss_ce', 0.19), ('loss_counter', 0.116), ('loss_bbox', 0.151), ('loss_giou', 0.302), ('loss_self_iou', 0.144), ('cardinality_error', 3.56), ('loss_ce_0', 0.18), ('loss_counter_0', 0.117), ('loss_bbox_0', 0.331), ('loss_giou_0', 0.503), ('loss_self_iou_0', 0.242), ('cardinality_error_0', 3.56), ('total_loss', 4.075)])
available video number 4917
PTBTokenizer tokenized 610661 tokens at 1464671.40 tokens per second.
PTBTokenizer tokenized 583002 tokens at 1426850.56 tokens per second.
Traceback (most recent call last):
File "/home/yy/桌面/PDVC/eval.py", line 144, in
main(opt)
File "/home/yy/桌面/PDVC/eval.py", line 109, in main
logger, alpha=opt.ec_alpha, dvc_eval_version=opt.eval_tool_version, device=opt.eval_device, debug=False, skip_lang_eval=False)
File "/home/yy/桌面/PDVC/eval_utils.py", line 224, in evaluate
dvc_eval_version=dvc_eval_version
File "/home/yy/桌面/PDVC/eval_utils.py", line 124, in eval_metrics
dvc_score = eval_dvc(json_path=dvc_filename, reference=gt_filenames, version=dvc_eval_version)
File "/home/yy/桌面/PDVC/densevid_eval3/eval_dvc.py", line 13, in eval_dvc
score = eval_func(args)
File "/home/yy/桌面/PDVC/densevid_eval3/evaluate2018.py", line 261, in main
evaluator.evaluate()
File "/home/yy/桌面/PDVC/densevid_eval3/evaluate2018.py", line 113, in evaluate
scores = self.evaluate_tiou(tiou)
File "/home/yy/桌面/PDVC/densevid_eval3/evaluate2018.py", line 237, in evaluate_tiou
score, scores = scorer.compute_score(gts[vid_id], res[vid_id])
File "/home/yy/桌面/PDVC/densevid_eval3/pycocoevalcap/meteor/meteor.py", line 37, in compute_score
stat = self._stat(res[i][0], gts[i])
File "/home/yy/桌面/PDVC/densevid_eval3/pycocoevalcap/meteor/meteor.py", line 57, in _stat
self.meteor_p.stdin.flush()
BrokenPipeError: [Errno 32] Broken pipe

Comparison with Base Transformer on YouCook2

Hi @ttengwang

Appreciate you for sharing the code.

I am wondering whether you trained the base Transformer + LSTM on the YouCook2 dataset, i.e., similar to rows 1 and 2 in Table 7(a).

I am also wondering whether the current code supports training the base transformer.

Thanks

The video is shown with a white screen

In the inference stage I followed the README instructions (Running PDVC on Your Own Videos), but for different videos it always generates captions with the same sentence: "The video is shown with a white screen."
(screenshots attached)

anet_tsn_pdvc best model fails to load

Thanks for sharing your wonderful work! I want to use your best TSN model on my own video, but the following error occurred while loading the model (maybe the released model parameters and structure do not match):
Loading model from save/anet_tsn_pdvc/model-best.pth
Traceback (most recent call last):
File "eval.py", line 111, in
main(opt)
File "eval.py", line 70, in main
model.load_state_dict(loaded_pth['model'], strict=True)
File "/data11/zq/vc_envs/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for PDVC:
size mismatch for transformer.pos_trans.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).

About evaluation metrics

I would like to ask whether the reproduced model only reports the video paragraph captioning metrics. If I want to obtain the dense video captioning metrics, how should I do that?

Is there any limit on batch_size?

I tried to train the model with TSN features, but it only uses 2 GB of GPU memory, so I tried training with batch_size = 8. However, I get errors like:

/opt/conda/conda-bld/pytorch_1614378098133/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [41,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1614378098133/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [41,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1614378098133/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [41,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
  0%|                                                                                                                                                                                                   | 0/2502 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 317, in <module>
    train(opt)
  File "train.py", line 181, in train
    output, loss = model(dt, criterion, opt.transformer_input_type)
  File "/home/anaconda3/envs/PDVC-main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/axxddzh/dat/axxddzh/PDVC-main/models/pdvc.py", line 166, in forward
    disable_iterative_refine)
  File "/media/axxddzh/dat/axxddzh/PDVC-main/models/pdvc.py", line 299, in parallel_prediction_matched
    others, self.opt.caption_decoder_type, indices)
  File "/media/axxddzh/dat/axxddzh/PDVC-main/models/pdvc.py", line 387, in caption_prediction
    cap_prob = cap_head(hs[:, feat_bigids], reference[:, feat_bigids], others, seq)
RuntimeError: CUDA error: device-side assert triggered

I run into the same problem whenever batch_size is not 1.

Why does the test-set loss barely change?

I trained directly following the README and found that the evaluation loss stays almost constant, roughly at the values below:
loss: OrderedDict([('loss_ce', 0.187), ('loss_counter', 0.114), ('loss_bbox', 0.151), ('loss_giou', 0.299), ('loss_self_iou', 0.138), ('cardinality_error', 3.56), ('loss_ce_0', 0.173), ('loss_counter_0', 0.115), ('loss_bbox_0', 0.343), ('loss_giou_0', 0.511), ('loss_self_iou_0', 0.242), ('cardinality_error_0', 3.56), ('total_loss', 4.075)])
What is going on here?

i3d+vggish results

Hello professor,

When I reproduced your 'i3d+vggish' model, I could not achieve the same results as in the original paper. I don't know if there is something wrong with my settings.

Thanks

Issue on inference

Hi,
I tried to perform inference on my own videos by simply putting those videos in the /visualization/videos folder, then running the provided scripts in this repo.

However, when loading the model (Loading model from save/anet_tsp_pdvc/model-best.pth), my terminal shows this error:

visualization/output/r2plus1d_34-tsp_on_activitynet_stride_16/sample_vid.npy not exists, use zero padding.
all feature files of video sample-vid do not exist

Then the generated captions in dvc_results.json just talk about a black screen, a white screen, or a credits scene. I assume this is due to the zero padding.
It seems there is a problem when extracting features from my videos, but I am not sure. Is there any step I might have missed, or any step that is not included in the scripts?

Any help is appreciated. Thank you~

How to further train on our own dataset?

Hello,
Thank you so much for this amazing GitHub repo. I want to use the pretrained model and further train it on my own dataset of videos and corresponding captions. Do you have any suggestions on how I can do this?

Should I change other parameters when I change batch_size?

I cloned your repository and trained the model on ActivityNet. When I changed batch_size to a higher number, the program crashed at line 387 of pdvc.py; after debugging, I found an out-of-bounds tensor index there. Should I change other parameters when I change batch_size?

caption my custom video

Hi @ttengwang ~
Thanks for sharing your wonderful work! I want to caption my own videos, but unfortunately most of the captioning code starts from extracted features, and few instructions are provided for the feature extraction step. This is inconvenient for me because I'm not familiar with the captioning task and just want to use the tool for some applications. Could you please give me detailed instructions on how to get captions from a raw video? I would appreciate it a lot!

Thanks,
Zhihong

questions about counter_class_rate

Hi!

I found that a predefined list called 'counter_class_rate' in ./pdvc/criterion.py is used as a weight in the counter loss.
I'm curious how this list was obtained. Is it the frequency of event counts in the dataset?

I'd appreciate it if you could answer my question!

Running PDVC on Your Own Videos

Hello! I'm very glad to see your excellent work!
However, I ran into a problem while testing the model. In the "Running PDVC on Your Own Videos" part, I used the pretrained model you provided and the prepared test video "xukun", but the results did not match what is shown in your README. Could something be wrong with my procedure?

Looking forward to your reply! Thank you!

Error at START FEATURE EXTRACTION

Hello, after running your code, "RuntimeError: CUDA error: unknown error" appears at the START FEATURE EXTRACTION stage; there are no errors anywhere else. How can I solve this?

Question about the result difference of video paragraph captioning

Thanks for the great work!
I notice that in Table 4 of your paper, PDVC achieves "B@4 11.80 | M 15.93 | C 27.27" on the ActivityNet Captions ae-val set, but the README reports "B@4 10.18 | M 15.96 | C 20.66" for PDVC with TSN features. Is the difference between the two datasets (ActivityNet Captions vs. ActivityNet Entity) what leads to such different results? Looking forward to your reply.

Could not reproduce the example results from the README

Hello, after running your code on the "xukun" dancing demo video, I did not get the same results as you; most of the time the caption is "the credits of the video are shown".
For other videos, this sentence is also shown most of the time. Did I make a mistake at some step? I am using the pretrained model from your cloud drive.
