antoyang / vidchapters

[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale

Home Page: http://arxiv.org/abs/2309.13952

License: MIT License

Python 47.74% Shell 0.34% C++ 0.35% Cuda 3.52% Jupyter Notebook 48.06%

dense-video-captioning multimodal-learning pre-training temporal-language-grounding video-captioning video-understanding vision-and-language weakly-supervised-learning vid2seq video-chapter-generation

vidchapters's Introduction

VidChapters-7M: Video Chapters at Scale

Webpage • Paper

In this work, we present VidChapters-7M, a large-scale dataset of user-chaptered videos. We study three tasks on top of this dataset and show that video chapter generation models trained on VidChapters-7M transfer well to dense video captioning.

This repository provides the code for our paper, including:

Environment setup
Data collection pipeline for VidChapters-7M (in case you want to collect your own set of chaptered videos)
Data downloading instructions and processed data files
Data processing and analysis scripts (in case you want to reproduce the preprocessing)
Training and evaluation scripts for the tasks of video chapter generation without or with ground-truth boundaries and video chapter grounding on VidChapters-7M, and dense video captioning on YouCook2 and ViTT
Pretrained model checkpoints
A demo to chapter or densely caption the video of your choice with a pretrained Vid2Seq model

This codebase also includes a PyTorch implementation of Vid2Seq (notably in model/vid2seq.py). There are a few differences with the original Jax implementation, including:

Usage of t5-base instead of t5-v1_1-base, which also results in a few architectural differences (is_gated_act=False instead of True)
Addition of a normalization of the weights related to time tokens at every optimization step
No random temporal cropping during training
Whisper ASR instead of Google ASR

Paths and Requirements

Fill the empty paths in the file args.py (and if you wish to use PDVC / Moment-DETR, in the scripts in PDVC/cfgs / moment_detr/moment_detr/scripts/).

To use the evaluation scripts with the METEOR captioning metric, you also need Java.

To install requirements (originally done in Python 3.7), run:

pip install -r requirements.txt

Notes:

The Whisper ASR extraction is done with a separate conda environment created as specified in WhisperX, with Python 3.10 and PyTorch 2.0.
The PDVC experiments are run with a separate conda environment, as suggested by PDVC, so as to compile the deformable attention layer.

Data collection pipeline

To start, you should get a bunch of YouTube video IDs (that do not necessarily contain video chapters) and use yt-dlp to download descriptions from YouTube, e.g., yt-dlp https://www.youtube.com/watch?v=<VIDEO_ID> --write-description --skip-download.
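For a list of video IDs, the per-video command above can be generated programmatically. The following is a minimal sketch that only builds the argument lists and does not call yt-dlp itself; the helper name description_command is hypothetical, while the two flags are the ones quoted above.

```python
from typing import List


def description_command(video_id: str) -> List[str]:
    """Build the yt-dlp call that fetches only the description of one video."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    return ["yt-dlp", url, "--write-description", "--skip-download"]


# Placeholder IDs for illustration; replace them with your own list.
video_ids = ["jARPzYkjp3g", "PRpr0_Iz4dI"]
commands = [description_command(video_id) for video_id in video_ids]
for command in commands:
    print(" ".join(command))
```

Each argument list can be passed to subprocess.run if you prefer to launch the downloads from Python.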

Then, assuming the descriptions are downloaded as .txt files in SSD_DIR/chapters_descriptions, you can run python collection/desc2chapters.py to extract chapters from descriptions. The output file maps video IDs of user-chaptered videos to the chapter titles and timestamps. You can then download the YouTube video content of videos with chapters with yt-dlp, e.g., yt-dlp https://www.youtube.com/watch?v=<VIDEO_ID>.
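To give an idea of what extracting chapters from a description involves, here is a minimal sketch. It is not the code of collection/desc2chapters.py: the regular expression, the function name parse_chapters, and the sanity filter (first chapter at 0:00, strictly increasing start times) are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# A line that starts with "M:SS", "MM:SS" or "H:MM:SS", followed by a title.
TIMESTAMP_LINE = re.compile(r"^\s*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s*[-:|]?\s*(.+?)\s*$")


def parse_chapters(description: str) -> List[Tuple[int, str]]:
    """Return (start_in_seconds, title) pairs found in a video description."""
    chapters = []
    for line in description.splitlines():
        match = TIMESTAMP_LINE.match(line)
        if match is None:
            continue
        hours, minutes, seconds, title = match.groups()
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        chapters.append((start, title))
    # Sanity filter chosen for this sketch: chapters must start at 0:00
    # and have strictly increasing start times.
    starts = [start for start, _ in chapters]
    if len(chapters) < 2 or starts[0] != 0 or starts != sorted(set(starts)):
        return []
    return chapters


example = "My video\n0:00 Intro\n1:15 - Setup\n1:02:03 Wrap-up\nThanks for watching!"
print(parse_chapters(example))
```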

Data downloading

VidChapters-7M: We provide the dataset annotations and ASR at this link. You should download the annotations in DATA_DIR/AllChapters. We also provide processed annotations here.

HowTo100M: We use a sentencified version of the dataset. You should download it in DATA_DIR/howto100m.

ViTT: Download it from the data providers. You will also need to download the mapping between 4-character IDs from YouTube-8M to YouTube video IDs. You should download these in DATA_DIR/ViTT. We also provide processed annotations, ASR and visual features here.

YouCook2: Download it from the data providers. You should download these in DATA_DIR/YouCook2. We also provide processed annotations, ASR and visual features here.

Data processing

Visual Feature Extraction

We follow FrozenBiLM to extract CLIP ViT-L/14 features (224-pixel input) at 1 FPS for all videos. For VidChapters-7M and HowTo100M, we store them as one file per video in SSD_DIR/chapters_clipvitl14_features and SSD_DIR/howto100m_clip_features, respectively. For YouCook2 and ViTT, we gather the features of all videos in a single .pth file per dataset.
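As an illustration of what sampling at 1 FPS means, the sketch below picks, for each whole second of a video, the index of the nearest source frame. The actual extraction pipeline may resample differently (for example by re-encoding the video), so the function name and the rounding rule here are assumptions.

```python
from typing import List


def one_fps_frame_indices(num_frames: int, native_fps: float) -> List[int]:
    """Indices of the source frames closest to t = 0 s, 1 s, 2 s, ..."""
    if num_frames <= 0 or native_fps <= 0:
        return []
    duration = num_frames / native_fps
    indices = []
    t = 0
    while t < duration:
        # Clamp to the last frame in case rounding overshoots.
        indices.append(min(num_frames - 1, round(t * native_fps)))
        t += 1
    return indices


print(one_fps_frame_indices(num_frames=90, native_fps=30.0))
```

A video of N seconds therefore yields roughly N feature vectors, one per sampled frame.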

ASR Extraction

To extract ASR, given a csv file prepared like for the visual feature extraction and an output_path where to store the extracted ASR, we run on a single GPU:

conda activate whisperX_env
python asr_extract/whisper_inference.py --csv=<csv> --output_path=<output_path> --faster

You may parallelize this over many jobs. Note that this requires having downloaded the Whisper Large-V2 model weights in <MODEL_DIR>.
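One simple way to parallelize is to split the rows of the csv file across jobs and write one smaller csv per job. The sketch below uses a round-robin split; the column names video_path and feature_path and the helper name shard_rows are assumptions, so adapt them to your own csv.

```python
import csv
import io
from typing import Dict, List


def shard_rows(rows: List[Dict[str, str]], num_jobs: int, job_index: int) -> List[Dict[str, str]]:
    """Return the rows handled by one job, using round-robin assignment."""
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index must be in [0, num_jobs)")
    return rows[job_index::num_jobs]


# In-memory stand-in for the csv file (assumed columns).
csv_text = "video_path,feature_path\n" + "".join(
    f"videos/v{i}.mp4,features/v{i}.npy\n" for i in range(10)
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
shards = [shard_rows(rows, num_jobs=4, job_index=j) for j in range(4)]
print([len(shard) for shard in shards])
```

Each shard can then be written back with csv.DictWriter and passed to one job through --csv.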

We then gather the extracted ASR into a single file by running:

python asr_extract/merge_asr_whisper.py <output_path> DATA_DIR/AllChapters/whisper.pkl

To extract word-level timestamps and segment the ASR into sentences, we run on a single GPU:

conda activate whisperX_env
python asr_extract/whisper_align.py --csv=<csv> --asr=DATA_DIR/AllChapters/whisper.pkl --output_path=<align_output_path>

You may parallelize this over many jobs. Note that this requires having downloaded the alignment model weights for all languages from WhisperX in <MODEL_DIR>.

Finally, we merge the aligned ASR into a single file by running:

python asr_extract/merge_asr_whisper_align.py <align_output_path> DATA_DIR/AllChapters/asr.pkl DATA_DIR/AllChapters/whisper.pkl

Annotation files

To preprocess annotation files, use:

python preproc/chapters_to_dvc.py
python preproc/chapters_to_vmr.py
python preproc/vitt.py
python preproc/youcook.py

Analysis

To detect languages from ASR or chapters, we run on single GPUs:

python analysis/language.py

You may parallelize this over many jobs.

To obtain gender statistics, we run on a CPU:

python analysis/gender.py

To detect videos with NSFW frames or toxic chapter titles or ASR, we run on single GPUs (for this, you will also need detoxify==0.5.1 that you can pip install):

python analysis/nsfw.py

You may parallelize this over many jobs. Note that this requires having downloaded this NSFW classifier and the Detoxify language model.

You can also find the code for the paper plots in the notebook analysis/plots.ipynb, and the details of the manual assessment presented in the paper in analysis/manual_assessment.xlsx.

Model checkpoints

For HowTo100M pretraining, the full video chapter generation task, and dense video captioning tasks, we release the following Vid2Seq checkpoints and report their corresponding SODA performance.

Training data	VidChapters-7M (test)	YouCook2 (val)	ViTT (test)	url	size
HowTo100M				Drive	1.1GB
VidChapters-7M	10.6			Drive	1.1GB
HowTo100M + VidChapters-7M	11.4			Drive	1.1GB
HowTo100M + VidChapters-7M + YouCook2		10.3		Drive	1.1GB
HowTo100M + VidChapters-7M + ViTT			15.0	Drive	1.1GB

For the task of video chapter generation with ground-truth boundaries, we release the following Vid2Seq checkpoint and report its corresponding CIDEr performance.

Training data	VidChapters-7M (test)	url	size
HowTo100M + VidChapters-7M	120.5	Drive	1.1GB

For the task of video chapter grounding, we release the following Moment-DETR checkpoint and report its corresponding R@10s performance.

Training data	VidChapters-7M (test)	url	size
VidChapters-7M	21.8	Drive	0.9GB
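As a rough illustration of the R@10s metric, the sketch below assumes it is the percentage of queries whose predicted start time falls within 10 seconds of the ground-truth start time. This reading of the metric name is an assumption; the paper gives the exact definition, and the function name is hypothetical.

```python
from typing import Sequence


def recall_at_seconds(pred_starts: Sequence[float], gt_starts: Sequence[float], threshold: float = 10.0) -> float:
    """Percentage of predictions whose start time is within `threshold` seconds of the ground truth."""
    if len(pred_starts) != len(gt_starts):
        raise ValueError("one prediction is expected per ground-truth chapter")
    if not gt_starts:
        return 0.0
    hits = sum(abs(p - g) <= threshold for p, g in zip(pred_starts, gt_starts))
    return 100.0 * hits / len(gt_starts)


# Toy values, not results from the paper.
score = recall_at_seconds([0.0, 65.0, 130.0, 300.0], [5.0, 60.0, 150.0, 295.0])
print(score)
```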

Training and evaluation

Unless stated otherwise, to load a pretrained checkpoint with the following scripts, you can use --load=<CHECKPOINT>, and evaluation can be done with the same scripts as below but specifying --eval.

Note that most of our training runs were done using A100 GPUs with 80GB of memory. You may need to adapt the batch size if you are using lower memory GPUs.

Also, to use BLIP-2-based scripts, you need to download raw videos from the corresponding datasets and prepare a video_paths.json file that maps video IDs to the video path.
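A video_paths.json file of this kind can be produced with a short script. The sketch below assumes each video is stored as <VIDEO_ID>.<extension> somewhere under a single directory; the function names and the list of extensions are illustrative choices, not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict

VIDEO_EXTENSIONS = {".mp4", ".mkv", ".webm"}  # assumed container formats


def build_video_paths(video_dir: str) -> Dict[str, str]:
    """Map each video ID (the file name without extension) to its absolute path."""
    paths = {}
    for path in sorted(Path(video_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            paths[path.stem] = str(path.resolve())
    return paths


def write_video_paths(video_dir: str, output_json: str = "video_paths.json") -> Dict[str, str]:
    """Write the ID-to-path mapping as JSON and return it."""
    mapping = build_video_paths(video_dir)
    with open(output_json, "w") as f:
        json.dump(mapping, f, indent=2)
    return mapping
```

Calling write_video_paths("<VIDEO_DIR>") then writes the mapping to video_paths.json in the current directory.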

Vid2Seq Pretraining on HowTo100M

Run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env dvc.py --epochs=5 \
--fraction_warmup_steps=0.01 --lr=3e-4 --print_freq=1000 --save_dir=howto100m \
--combine_datasets htm --batch_size=8 --clip_max_norm=0.1

Video Chapter Generation

For Vid2Seq, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env dvc.py --epochs=10 \
--lr=3e-4 --save_dir=chapters --combine_datasets chapters --combine_datasets_val chapters \
--batch_size=8 --batch_size_val=8 --clip_max_norm=0.1 --schedule="cosine_with_warmup"

Multiple baselines reported in the paper can also be found in args.py, e.g. using only visual or speech input with --no_speech or --no_video, or training only using ASR with --gen_asr.
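For readers unfamiliar with the "cosine_with_warmup" schedule named in the commands above, the sketch below shows the usual shape of such a schedule: a linear warmup followed by a half-cosine decay of the learning-rate multiplier. The exact formula used by the training scripts is not reproduced here, so treat this as an assumption; the argument fraction_warmup_steps mirrors the flag of the same name.

```python
import math


def cosine_with_warmup(step: int, total_steps: int, fraction_warmup_steps: float) -> float:
    """Learning-rate multiplier in [0, 1] at a given optimization step."""
    warmup_steps = int(fraction_warmup_steps * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Half-cosine decay from 1 to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 3e-4  # value passed as --lr in the commands above
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * cosine_with_warmup(step, total_steps=1000, fraction_warmup_steps=0.1))
```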

For PDVC, run:

cd PDVC
conda activate PDVC_env
python train.py --cfg_path cfgs/chapters_clip_pdvc.yml --gpu_id=0 --epoch=5 --no_self_iou --lr=1e-4

Test inference with PDVC can be done by setting the evaluation paths to the test data in the config, using the same script, and setting the parameters --load=<CHECKPOINT> and --epoch=0.

For the text tiling + LLaMA zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env zs_speechvcg.py --combine_datasets=chapters \
--combine_datasets_val=chapters --save_dir=chapters_texttilingllama --model_name <MODEL_DIR>/7BHF

Pass --random to the previous command to run the random baseline.

For the shot detection + BLIP-2 zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env zs_visualvcg.py --combine_datasets=chapters \
--combine_datasets_val=chapters --save_dir=chapters_shotdetectblip2 --model_name Salesforce/blip2-flan-t5-xl

Video Chapter Generation with Ground-Truth Boundaries

For Vid2Seq, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env vc.py --epochs=20 --lr=3e-4 \
--save_dir=chapters_vcggt --combine_datasets chapters --combine_datasets_val chapters --batch_size=64 \
--batch_size_val=1 --schedule="cosine_with_warmup" --max_input_tokens=256 --max_output_tokens=32

For the LLaMA zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env vc.py --model_name=<MODEL_DIR>/7BHF \
--save_dir=chapters_vcggt_zeroshotllama --combine_datasets chapters --combine_datasets_val chapters \
--batch_size_val=1 --max_input_tokens=256 --max_output_tokens=32 --eval

Pass --random to the previous command to run the random baseline.

For the BLIP-2 zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env vc.py --model_name=Salesforce/blip2-flan-t5-xl \
--save_dir=chapters_vcggt_zeroshotblip2 --combine_datasets chapters --combine_datasets_val chapters \
--batch_size_val=1 --max_input_tokens=256 --max_output_tokens=32 --eval

Video Chapter Grounding

For Moment-DETR, run:

cd moment_detr
bash moment_detr/scripts/chapters.sh --max_v_l=1200 --downsample --clip_length=3 --lr=3e-4 \
--n_epoch=50 --max_es_cnt=50 --exp_id=chapters --bsz=256 --eval_bsz=256 --num_workers=16

Inference with Moment-DETR can be run with the script moment_detr/scripts/chapters_inference.sh, the same parameters, and a parameter --resume=<CHECKPOINT>.

For the CLIP zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env zs_vcgr.py --save_dir=chapters_vcgr_clip \
--combine_datasets chapters --combine_datasets_val chapters

For the BERT zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env zs_vcgr.py --save_dir=chapters_vcgr_bert \
--combine_datasets chapters --combine_datasets_val chapters --no_video

For the random zero-shot baseline, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env zs_vcgr.py --save_dir=chapters_vcgr_random \
--combine_datasets chapters --combine_datasets_val chapters --random

Dense Video Captioning

For Vid2Seq on YouCook2/ViTT, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env dvc.py --epochs=40 \
--lr=3e-4 --save_dir=youcook --combine_datasets youcook --combine_datasets_val youcook \
--batch_size=2 --batch_size_val=2 --schedule="cosine_with_warmup"
python -m torch.distributed.launch --nproc_per_node 8 --use_env dvc.py --epochs=20 \
--lr=3e-4 --save_dir=vitt --combine_datasets vitt --combine_datasets_val vitt \
--batch_size=2 --batch_size_val=2 --schedule="cosine_with_warmup"

The zero-shot evaluation can be simply done by loading a checkpoint pretrained on VidChapters-7M for evaluation using the arguments --load=<CHECKPOINT> --eval.

For PDVC on YouCook2/ViTT, run:

cd PDVC
conda activate PDVC_env
python train.py --cfg_path=cfgs/yc2_clip_pdvc.yml --gpu_id=0
python train.py --cfg_path=cfgs/vitt_clip_pdvc.yml --gpu_id=0

To load a pretrained PDVC checkpoint, set the parameters --load=<CHECKPOINT> and --load_vocab data/vocabulary_allchapters.json.
Test inference with PDVC can be done by setting the evaluation paths to the test data in the config, using the same script, and setting the parameters --load=<CHECKPOINT> and --epoch=0.

Demo

To run a pretrained Vid2Seq model (for video chapter generation or dense video captioning) on the video of your choice, you first need to extract ASR with the following command:

conda activate whisperX_env
python demo_asr.py --video_example=<VIDEO_PATH> --asr_example <OUTPUT_ASR_PATH> --combine_datasets chapters

Then you can run the model inference:

python demo_vid2seq.py --load=<CHECKPOINT> --video_example=<VIDEO_PATH> --asr_example <OUTPUT_ASR_PATH> --combine_datasets chapters

Licenses

This code is released under the MIT License. The licenses for datasets used in the paper are available at the following links: VidChapters-7M, HowTo100M, YouCook2, and ViTT.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2023vidchapters,
title={VidChapters-7M: Video Chapters at Scale},
author={Antoine Yang and Arsha Nagrani and Ivan Laptev and Josef Sivic and Cordelia Schmid},
booktitle={NeurIPS},
year={2023}}


vidchapters's Issues

Error with dependencies in requirements.txt

First of all, thank you for this huge and interesting work!

I've found some errors with the versions of the dependencies in the requirements.txt file:

The Purpose of Time Token Weights Normalization

@antoyang Thanks for the wonderful work! Could you please explain the purpose of normalizing time token weights?

Inference of dense caption

Hi!
Congratulations on a great job!
If I want to run dense captioning inference with this project, should I modify something in "python demo_vid2seq.py --load= --video_example=<VIDEO_PATH> --asr_example <OUTPUT_ASR_PATH> --combine_datasets chapters"? The captioning result currently looks just like video chapters.

Replace with faster-whisper

I prefer using faster-whisper. If you want to run the demo with it too, here is the revised code:

import argparse
import pickle

import whisperx
from faster_whisper import WhisperModel

from args import get_args_parser, MODEL_DIR

# Args
parser = argparse.ArgumentParser(parents=[get_args_parser()])
args = parser.parse_args()

print("load Whisper model")
asr_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

print("extract ASR")
# transcribe() returns a lazy generator of segments and an info object
# that carries the detected language.
segments, info = asr_model.transcribe(
    args.video_example,
    without_timestamps=True,
    word_timestamps=False,
    beam_size=5,
    initial_prompt='Please！ add punctuations。',
    vad_filter=True,
)

print("load align model")
align_model, metadata = whisperx.load_align_model(language_code=info.language, device=args.device, model_dir=MODEL_DIR)

print("extract audio")
audio = whisperx.load_audio(args.video_example)

print("align ASR")
# Convert the faster-whisper segments into the dicts expected by whisperx.align.
the_segments = [{'text': s.text, 'start': s.start, 'end': s.end} for s in segments]
print(the_segments[:3])

print("whisperx.......")
aligned_asr = whisperx.align(the_segments, align_model, metadata, audio, args.device, return_char_alignments=False)

print("saving")
with open(args.asr_example, 'wb') as f:
    pickle.dump(aligned_asr, f)

Hallucinations in chapter title generation after some time

Hello, First of all, thank you for an awesome project and making source code and models available. I really appreciate it.

I was able to run the demo_asr.py and demo_vid2seq.py. I tested HowTo100M + VidChapters-7M + YouCook2 and VidChapters-7M models with it.

On the few videos I tested, the VidChapters-7M model seems to work better. However, after some time, it starts to repeat the same chapter name for the rest of the video. I know it's not using ChatGPT behind the scenes, so the term "hallucination" is not a perfect fit for the title.

Here is an example:

input video: https://www.youtube.com/watch?v=jARPzYkjp3g

output:

[
    {'sentence': 'The four pillars of happiness.', 'timestamp': [0.0, 14.901636363636365]}, 
    {'sentence': 'The four pillars of happiness.', 'timestamp': [14.901636363636365, 69.5409696969697]}, 
    {'sentence': 'Belonging.', 'timestamp': [69.5409696969697, 94.37703030303031]}, 
    {'sentence': 'Purpose.', 'timestamp': [94.37703030303031, 139.0819393939394]}, 
    {'sentence': 'Religion.', 'timestamp': [139.0819393939394, 173.85242424242423]}, 
    {'sentence': 'Fast casual restaurants.', 'timestamp': [173.85242424242423, 213.59012121212123]}, 
    {'sentence': 'The sacramento.', 'timestamp': [213.59012121212123, 243.39339393939395]}, 
    {'sentence': 'The sacramento.', 'timestamp': [243.39339393939395, 268.22945454545453]}, 
    {'sentence': 'The sacramento.', 'timestamp': [268.22945454545453, 307.9671515151515]}, 
    {'sentence': 'The sacramento.', 'timestamp': [307.9671515151515, 342.73763636363634]}, 
    {'sentence': 'The sacramento.', 'timestamp': [342.73763636363634, 377.50812121212124]}, 
    {'sentence': 'The sacramento.', 'timestamp': [377.50812121212124, 407.31139393939395]}, 
    {'sentence': 'The sacramento.', 'timestamp': [407.31139393939395, 437.11466666666666]}, 
    {'sentence': 'The sacramento.', 'timestamp': [437.11466666666666, 461.9507272727273]}, 
    {'sentence': 'The sacramento.', 'timestamp': [461.9507272727273, 471.88515151515156]}, 
    {'sentence': 'The sacramento.', 'timestamp': [471.88515151515156, 491.754]}
]
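The chapter boundaries in this output are all multiples of the video duration divided by 99, which is what one would expect if times are quantized into 100 equally spaced bins relative to the video duration. The sketch below reproduces the first boundaries under that assumption; the bin count and the helper name are inferred from the numbers, not taken from the code.

```python
def bin_to_seconds(bin_index: int, duration: float, num_bins: int = 100) -> float:
    """Convert a relative time bin back to seconds (assumed quantization scheme)."""
    return bin_index / (num_bins - 1) * duration


duration = 491.754  # end time of the last chapter in the output above
boundaries = [bin_to_seconds(b, duration) for b in (0, 3, 14, 19, 28)]
print(boundaries)
```

Under this reading, the temporal resolution of a prediction is about duration / 99, that is, roughly 5 seconds for this video.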

Another example:
Input: https://www.youtube.com/watch?v=PRpr0_Iz4dI

Output:

[
	{'sentence': 'Intro.', 'timestamp': [0.0, 7.774242424242424]}, 
	{'sentence': 'Why we hate the new logos.', 'timestamp': [7.774242424242424, 116.61363636363636]}, 
	{'sentence': 'Why people are missing the bigger picture.', 'timestamp': [116.61363636363636, 174.92045454545453]}, 
	{'sentence': "Why people aren't logical.", 'timestamp': [174.92045454545453, 244.88863636363635]}, 
	{'sentence': "Why people aren't logical.", 'timestamp': [244.88863636363635, 384.82499999999993]}
]

Do you know if I can do something to improve it a little?
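One model-agnostic workaround is to post-process the predictions and merge consecutive chapters that share the same title. The sketch below does this for the output format shown above; it does not address why the model repeats itself, and the function name is hypothetical.

```python
from typing import Dict, List


def merge_repeated_chapters(chapters: List[Dict]) -> List[Dict]:
    """Merge consecutive chapters with identical titles into a single chapter."""
    merged: List[Dict] = []
    for chapter in chapters:
        start, end = chapter["timestamp"]
        if merged and merged[-1]["sentence"] == chapter["sentence"]:
            # Same title as the previous chapter: extend its end time.
            merged[-1]["timestamp"][1] = end
        else:
            merged.append({"sentence": chapter["sentence"], "timestamp": [start, end]})
    return merged


predictions = [
    {"sentence": "Intro.", "timestamp": [0.0, 7.77]},
    {"sentence": "Why people aren't logical.", "timestamp": [174.92, 244.89]},
    {"sentence": "Why people aren't logical.", "timestamp": [244.89, 384.82]},
]
print(merge_repeated_chapters(predictions))
```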

tokenizer in demo_vid2seq.py

Hello, I want to use demo_vid2seq.py to get video captions, but there are several things I don't understand. First, when I run demo_vid2seq.py, I get this error:

load Vid2Seq model
Traceback (most recent call last):
File "demo_vid2seq.py", line 55, in
tokenizer = _get_tokenizer(args.model_name, args.num_bins)
File "/root/tzp/codes/VidChapters-main/model/vid2seq.py", line 12, in _get_tokenizer
tokenizer = T5Tokenizer.from_pretrained(tokenizer_path, local_files_only=True)
File "/root/.local/conda/envs/vidchapter/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
OSError: Can't load tokenizer for 't5-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 't5-base' is the correct path to a directory containing all relevant files for a T5Tokenizer tokenizer.

I don't know how to set up and load the tokenizer. Can you help me?

How to run on a single video

How can I run it on a single video?

Inference on a video

Could you show me how to run the demo on a video? Your instructions for the demo are really ambiguous. Do I need to fill in args.py before running the demo?

Will the model weights pretrained on YT-Temporal-1B be released?

Hello! Thanks for releasing the PyTorch implementation of Vid2Seq. I was wondering if the model weights pretrained on YT-Temporal-1B would be released.

inference without speech

Hi!
May I know how to run inference without speech?

I set --no_speech, but the output is then [].
And when I run inference on the ActivityNet and Charades datasets, the output looks as if it only considers the speech features.

Thank you!

Dense video captioning on ActivityNet Captions dataset?

Great work Antoine! In your last paper Vid2Seq, you also tested the pre-trained model on the ActivityNet captions dataset, but in VidChapters you only show on ViTT and YouCook2. I am wondering if there is any particular reason to pick ViTT and YouCook2. Is it because the ActivityNet captions dataset is larger than these two (i.e. longer training time) or is it because it contains more diverse activities which makes it a harder dataset?

Thank you!

test on vitt

Thank you for your work. I would like to know whether the file clipvit14.pth contains the features of the ViTT test set. When testing, I encountered the following error:

File "/public/home/code/VidChapters-main/VidChapters-main/dataset/dvc_dataset.py", line 65, in _get_video
assert video_id in self.features, video_id
AssertionError: 0_-0zE4NDkuYo

About the generative objective and the denoising objective

Hi! Thank you for this huge and interesting work!
Could you please provide some guidance on the implementation of both the generative objective and the denoising objective mentioned in the Vid2Seq paper? I can't find them in the code.

Video Id for Each Category

Thank you for the awesome work.

There are 12 video categories present in this dataset. Do you have the video ids for each category separately?

Run demo for videos without speech

How can I run demo_asr.py for videos without speech?
I tried it by pickling a JSON object such as:
{ "segments": [], "word_segments": [] }
But I got a runtime error when running demo_vid2seq.py with it:
load Vid2Seq model
loading visual backbone
extracting visual features
visual features extracted
load ASR
ASR to tokens
Traceback (most recent call last):
File "demo_vid2seq.py", line 150, in <module>
input_tokens = torch.cat(input_tokens, 0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

Video request

Thanks for your wonderful work!

I would like to download videos of VidChapters. Could you provide some cmd tools to quickly download these videos? And how much storage do you use for these videos?

Thanks~

indexSelectLargeIndex: Device-side assertion `srcIndex < srcSelectDimSize' failed.

Anyone having this problem?

___2____Input tokenized shape: torch.Size([1, 57])
/var/lib/jenkins/pytorch/aten/src/ATen/native/hip/Indexing.hip:1294: indexSelectLargeIndex: Device-side assertion `srcIndex < srcSelectDimSize' failed.

        if self.use_video:
            print("___1____Video shape:", video.size())
            video = self.visual_encoder(video)  # B T D
            if self.proj_v2t is not None:
                video = self.proj_v2t(video)
            atts_vis = torch.ones(video.size()[:-1], dtype=torch.long).to(video.device)
            print("___1____atts_vis shape:", atts_vis.size())

        if self.use_speech:
            print("___2____Input tokenized shape:", input_tokenized['input_ids'].size())
            text = self.t5_model.encoder.embed_tokens(input_tokenized['input_ids'])  # B L D
            print("___2____Text embedded shape:", text.size())
            encoded = self.t5_model.encoder(
                attention_mask=input_tokenized['attention_mask'],
                inputs_embeds=text,
            )
            print("___2____Encoded shape:", encoded.last_hidden_state.size())

Code issue

We ran demo_vid2seq.py.
In vid2seq.py, line 41, we found:
self.t5_model.resize_token_embeddings(len(tokenizer) - num_bins) # remove the weights of the 28 tokens that are not used (32128 vs 32100 in the tokenizer)
self.t5_model.resize_token_embeddings(len(tokenizer)) # add time tokens

These two lines of code look the same to us. We commented out one of them and got the following error:
File "demo_vid2seq.py", line 170, in
temperature=1)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/vid2seq.py", line 163, in generate
num_return_sequences=num_captions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 1534, in generate
**model_kwargs,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 2814, in beam_search
output_hidden_states=output_hidden_states,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1698, in forward
return_dict=return_dict,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1082, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 710, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 616, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 528, in forward
query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

May I ask how we should resolve it?
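Note that the two calls quoted above take different arguments, so they are not equivalent: the first shrinks the embedding table to the size of the original tokenizer vocabulary, and the second grows it again to make room for the time tokens. The arithmetic can be sketched as follows, using the 32128 and 32100 figures from the code comment. The value num_bins = 100 is assumed for illustration, and the sketch presumes, as the comment suggests, that len(tokenizer) already includes the time tokens.

```python
MODEL_EMBEDDING_ROWS = 32128  # rows in the pretrained embedding table (from the code comment)
TOKENIZER_TOKENS = 32100      # tokens in the original tokenizer (from the code comment)
num_bins = 100                # assumed number of time tokens, for illustration only

len_tokenizer = TOKENIZER_TOKENS + num_bins  # tokenizer size after adding the time tokens

# First call: resize to len(tokenizer) - num_bins, dropping the unused rows.
rows_after_first_call = len_tokenizer - num_bins
dropped_rows = MODEL_EMBEDDING_ROWS - rows_after_first_call

# Second call: resize to len(tokenizer), appending one new row per time token.
rows_after_second_call = len_tokenizer
new_rows = rows_after_second_call - rows_after_first_call

print(rows_after_first_call, dropped_rows, rows_after_second_call, new_rows)
```

With only the second call, the final table would have the same number of rows, but the first 28 time tokens would reuse pretrained rows that were never trained for any token instead of freshly initialized ones. Whether this explains the CUDA error above is not established here.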

Demo for dense video captioning purposes

As I understand it, demo_vid2seq.py is mainly intended for video chapter generation.
How can I adapt it for dense video captioning? Or could you please add a new demo for this kind of inference?

Will you release chapters_clipvitl14_features?

Thanks for your amazing work! Will you release chapters_clipvitl14_features so that we can have a quickstart ;)