tencentarc / umt

UMT is a unified and flexible framework that can handle different input modality combinations and output video moment retrieval and/or highlight detection results.

License: Other

Python 100.00%

umt's Introduction

Unified Multi-modal Transformers


This repository maintains the official implementation of the paper UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection by Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, and Xiaohu Qie, which has been accepted by CVPR 2022.

Installation

Please refer to the environment settings we use below. You may need to install these packages yourself if you run into problems during automatic installation; a minimal version sanity check is sketched at the end of this section.

  • CUDA 11.5.0
  • CUDNN 8.3.2.44
  • Python 3.10.0
  • PyTorch 1.11.0
  • NNCore 0.3.6

Install from source

  1. Clone the repository from GitHub.
git clone https://github.com/TencentARC/UMT.git
cd UMT
  2. Install dependencies.
pip install -r requirements.txt
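
As a quick sanity check, the following minimal sketch (an unofficial helper, not part of the repository) prints the installed versions so you can compare them against the list above:

# check_env.py -- print installed versions for comparison with the list above (unofficial helper)
from importlib.metadata import version

import torch

print('PyTorch:', torch.__version__)              # expected 1.11.x
print('CUDA (torch build):', torch.version.cuda)  # expected 11.x
print('cuDNN:', torch.backends.cudnn.version())   # expected 8.3.x (reported as e.g. 8302)
print('NNCore:', version('nncore'))               # expected 0.3.x
print('GPU available:', torch.cuda.is_available())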

Getting Started

Download and prepare the datasets

  1. Download and extract the datasets.
  2. Prepare the files in the following structure.
UMT
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── *features
│   │   ├── highlight_{train,val,test}_release.jsonl
│   │   └── subs_train.jsonl
│   ├── charades
│   │   ├── *features
│   │   └── charades_sta_{train,test}.txt
│   ├── youtube
│   │   ├── *features
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── *features
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
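
Before training, you can verify the layout with a minimal sketch like the one below (an unofficial helper, assumed to be run from the repository root); it only checks the annotation files listed above, not the extracted features:

# check_data.py -- verify the annotation files from the layout above (unofficial helper)
from pathlib import Path

expected = [
    'data/qvhighlights/highlight_train_release.jsonl',
    'data/qvhighlights/highlight_val_release.jsonl',
    'data/qvhighlights/highlight_test_release.jsonl',
    'data/qvhighlights/subs_train.jsonl',
    'data/charades/charades_sta_train.txt',
    'data/charades/charades_sta_test.txt',
    'data/youtube/youtube_anno.json',
    'data/tvsum/tvsum_anno.json',
]

missing = [path for path in expected if not Path(path).is_file()]
print('All annotation files found.' if not missing else f'Missing files: {missing}')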

Train a model

Run the following command to train a model using a specified config.

# Single GPU
python tools/launch.py ${path-to-config}

# Multiple GPUs
torchrun --nproc_per_node=${num-gpus} tools/launch.py ${path-to-config}

Test a model and evaluate results

Run the following command to test a model and evaluate results.

python tools/launch.py ${path-to-config} --checkpoint ${path-to-checkpoint} --eval

Pre-train with ASR captions on QVHighlights

Run the following command to pre-train a model using ASR captions on QVHighlights.

torchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py

Model Zoo

We provide multiple pre-trained models and training logs here. All the models are trained with a single NVIDIA Tesla V100-FHHL-16GB GPU and are evaluated using the default metrics of the datasets.

QVHighlights (MR mAP / HD mAP)

UMT-B           38.59 / 39.85    model | metrics
UMT-B w/ PT     39.26 / 40.10    model | metrics

Charades-STA (R1@0.5 / R1@0.7 / R5@0.5 / R5@0.7)

UMT-B  V + A    48.31 / 29.25 / 88.79 / 56.08    model | metrics
UMT-B  V + O    49.35 / 26.16 / 89.41 / 54.95    model | metrics

YouTube Highlights (HD mAP)

UMT-S  Dog          65.93    model | metrics
UMT-S  Gymnastics   75.20    model | metrics
UMT-S  Parkour      81.64    model | metrics
UMT-S  Skating      71.81    model | metrics
UMT-S  Skiing       72.27    model | metrics
UMT-S  Surfing      82.71    model | metrics

TVSum (HD mAP)

UMT-S  VT    87.54    model | metrics
UMT-S  VU    81.51    model | metrics
UMT-S  GA    88.22    model | metrics
UMT-S  MS    78.81    model | metrics
UMT-S  PK    81.42    model | metrics
UMT-S  PR    86.96    model | metrics
UMT-S  FM    75.96    model | metrics
UMT-S  BK    86.89    model | metrics
UMT-S  BT    84.42    model | metrics
UMT-S  DS    79.63    model | metrics

Here, w/ PT means initializing the model using pre-trained weights on ASR captions. V, A, and O indicate video, audio, and optical flow, respectively.
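
If you want to peek inside a downloaded checkpoint before passing it to tools/launch.py via --checkpoint, a minimal sketch is shown below; the file name is a hypothetical placeholder and the key names are assumptions about a typical PyTorch checkpoint layout:

# inspect_checkpoint.py -- peek inside a downloaded checkpoint (hypothetical path, assumed layout)
import torch

ckpt = torch.load('path/to/umt_checkpoint.pth', map_location='cpu')  # hypothetical path

if isinstance(ckpt, dict):
    print('Top-level keys:', list(ckpt.keys()))
    # 'state_dict' is a common but assumed key; fall back to the raw dict otherwise.
    state = ckpt.get('state_dict', ckpt)
    print('Number of entries:', len(state))
else:
    print('Loaded object of type:', type(ckpt))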

Citation

If you find this project useful for your research, please kindly cite our paper.

@inproceedings{liu2022umt,
  title={UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection},
  author={Liu, Ye and Li, Siyuan and Wu, Yang and Chen, Chang Wen and Shan, Ying and Qie, Xiaohu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={3042--3051},
  year={2022}
}

umt's People

Contributors

arcer · mofafa · yeliudev


umt's Issues

Pretraining Problem

Could you please provide the specific method of pre-training? I am getting the following error when using your pre-training command:

[screenshot of the error]

validate

How can I get the moment boundaries during validation?

save epoch problems

When I run the code, a checkpoint is saved every epoch, which takes up a lot of space. How can I configure it so that parameters are only saved every N epochs?

feature extraction (i3d and optical flow)

Hello, which codebase do you use for the I3D and optical-flow feature extraction mentioned in the dataset paper? I want to reproduce it and then test my own video.

How to prepare the data

I used the feature extraction method you provided, and all the resulting feature dimensions are [1, 1024]. They are different from the feature dimensions you provide. What modifications did you make? I would also like to know what the model requires of the feature dimensions.
Thank you so much!

results visualized

1. How are the results visualized?

[figure from the issue]

2. Does the Y-axis of the line plot represent the predicted saliency scores?

qvhighlights/umt_base_pretrain_100e_asr.py

I get an error when running launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py:

Traceback (most recent call last):
  File "E:\SF5\UMT-main\tools\launch.py", line 67, in <module>
    main()
  File "E:\SF5\UMT-main\tools\launch.py", line 63, in main
    engine.launch(eval=args.eval)
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\nncore\engine\engine.py", line 529, in launch
    self.run_stage()
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\nncore\engine\engine.py", line 468, in run_stage
    self.train_epoch()
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\nncore\engine\engine.py", line 408, in train_epoch
    for data in self.data_loader:
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
    data = self._next_data()
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\utils\data\dataloader.py", line 1204, in _next_data
    return self._process_data(data)
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\utils\data\dataloader.py", line 1250, in _process_data
    data.reraise()
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\utils\data\_utils\fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\nncore\parallel\collate.py", line 69, in collate
    return {
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\nncore\parallel\collate.py", line 70, in <dictcomp>
    k: collate([d[k] for d in batch], samples_per_gpu)
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\nncore\parallel\collate.py", line 52, in collate
    stacked.append(default_collate(padded))
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\utils\data\_utils\collate.py", line 136, in default_collate
    storage = elem.storage()._new_shared(numel)
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\storage.py", line 487, in _new_shared
    untyped_storage = module._UntypedStorage._new_shared(size * cls().element_size())
  File "D:\ProgramData\Anaconda3\envs\MUT\lib\site-packages\torch\storage.py", line 172, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: Couldn't open shared file mapping: <0000020412B6EEA2>, error code: <1455>

Inference code

Sorry to bother you. I want to know how to make a prediction with my own video and a corresponding query.
Thank you very much.

Inference mode

Hi,

I am a little confused about how to use your model, and first of all I wanted some clarification about the models.
I want to try my own video with the UMT-B V + A model trained on Charades-STA, by giving it a query and a video.
In your repo you only mention training and evaluation; for the test part, I didn't see a test configuration after looking through your code.
Can you explain, with the model mentioned above, how to make a prediction given a video and a query?

Thank you

Query feature in TVSum highlight detection

Hi. Another question about the query feature in the TVSum dataset. Does this query feature come from the title of each video? I assume the model does not require text input for highlight detection.

How to extract each modal feature.

Hello, I want to test an original video, but I don't know your feature extraction method. Will the code for extracting image, audio, and text features be released?

Hello, questions about text feature extraction.

  1. When using CLIP to extract text features, is the loaded model ViT-B/32?
  2. When extracting text features with CLIP, is the text input the value of "query" in the highlight_train_release.jsonl file?
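
For reference, the sketch below shows one way to extract a query embedding with the openai/CLIP package and the ViT-B/32 checkpoint mentioned above; whether this matches the authors' exact preprocessing is an assumption, and the query string is hypothetical:

# clip_text_features.py -- sketch of CLIP ViT-B/32 text encoding (assumed setup, not the official script)
import clip
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, _ = clip.load('ViT-B/32', device=device)

query = 'A man is cooking dinner in the kitchen.'  # hypothetical query string
tokens = clip.tokenize([query]).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)  # shape [1, 512] for ViT-B/32

print(text_features.shape)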

bug?? if (num_gt := sum(label)) == 0:

if (num_gt := sum(label)) == 0:
            ^
SyntaxError: invalid syntax

I modified the code as follows:

num_gt = sum(label)  # added

if num_gt == 0:
    print("????????")
    collected.append(0)
    continue

But the mAP for each evaluation run is different. Why?
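
For context, the two forms are semantically identical; the walrus operator := simply requires Python 3.8 or newer, so the rewrite by itself should not change any results. A minimal, standalone sketch (not tied to the repository code):

# walrus_equivalence.py -- both branches below print the same value (first form needs Python >= 3.8)
label = [0, 0, 0]

# Assignment-expression form, as in the original code:
if (num_gt := sum(label)) == 0:
    print('walrus form, num_gt =', num_gt)

# Equivalent rewrite for older interpreters:
num_gt = sum(label)
if num_gt == 0:
    print('plain form, num_gt =', num_gt)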

feature extraction

I saw in the paper that "Since each feature vector captures 32 consecutive frames, we follow and consider the feature vector belonging to a clip if their overlap is more than 50%." What is the relationship between these 32 frames and the 100 frames in the .json annotation files? Are the sound feature vectors also extracted over 32 frames with 50% overlap?
How should the sound features be processed so that they correspond to the video features?

Audio feature extraction

Hello, regarding the preprocessing done for audio feature extraction:

  1. Can you provide the code that calls the get_features function provided in issue #22?
  2. Do you call this method for every audio file in the YouTube Highlights dataset? Some files produce errors.
  3. In the batches computation, why do you do integer division and then cast the result to an integer?
  4. How do you determine the values of feature_time provided in issue #22?
  5. I am trying to regenerate the audio features of YouTube Highlights using PANN; can you provide the steps you followed for feature extraction?

Thanks in advance.

Code for the audio feature extraction part

Hello, could you provide the code for the feature extraction part, especially the audio part? I want to train on my own dataset, but this part of the code is missing. Thanks!

Something seems wrong in the head.py

Thanks for sharing your wonderful work!
I am running UMT on the QVHighlights dataset following your instructions:

python tools/launch.py ./configs/qvhighlights/umt_base_200e_qvhighlights.py

However, I got the following error:

[screenshot of the error]

The shape of center_pred I got is [32, 75], which does not have a length dimension, so I tried to fix it by simply adding one:

center_pred = center_pred.unsqueeze(1)

This triggered another error:

[screenshot of the error]

Could you help me solve this problem? Thanks again, have a nice day.

metric methods

Hello, I am very interested in your research.
In the evaluation code, after you sort the predicted scores, you do not use them any further, but only use the ground-truth labels corresponding to the sorted scores. Why?
Also, there is no detailed explanation of the mAP metric in the paper. Is there any reference?

Audio feature extraction

Hi. Question about audio feature extraction.

I have read issue #22. I am wondering:

  1. Why is sr always 32000? What does it mean?
  2. If I want to extract a feature for every frame of the audio, what should feature_time and sr be in this case?

TVSum training problem

Hi. Thanks for your great work.
When I try to train the model on the TVSum dataset, I get this error:

[screenshot of the error]

The command I used is:

torchrun --nproc_per_node=4 tools/launch.py configs/tvsum/umt_small_500e_tvsum_ga.py

And the environment of my workspace:

[screenshot of the environment]

Please check if you can reproduce the same result. Thanks.

How can I annotate my own dataset?

I am very interested in your research. I have a lot of questions; thank you for taking the time to answer!

  1. How do I create the .json annotation file for my own dataset?

  2. How do I annotate the labels in a video? What is the meaning of "match" in the .json annotation file? I read in the reference that "Label 1 denotes matched clip, label -1 denotes unmatched clip, and label 0 denotes borderline cases." What does the borderline case 0 mean? Does a clip with match 0 contain both positive and negative samples? How should I tag my videos?

  3. How should the video be segmented? Referring to the YouTube .json annotation file, a clip contains 100 frames with 50% overlap. But the paper says "Since each feature vector captures 32 consecutive frames, we follow and consider the feature vector belonging to a clip if their overlap is more than 50%." What is the relationship between 32 frames and 100 frames? How many frames should I use for a clip?

  4. Questions about my own experiments. My dataset consists of 5-second video segments (train: 1554, val: 389), and each video contains only positive or only negative samples. Each clip is 50 frames with 50% overlap. But in my results the loss changes while mAP and best mAP do not (as shown in the attached figures). What could be the reason? 1) Is it related to overfitting? 2) Is it related to the division of frames? 3) Does it have something to do with each video containing only one kind of sample, i.e. should a video segment have both positive and negative samples? If a video segment contains both, how should I label it?

[two figures from the issue]

  5. Where can I download the raw video data of YouTube Highlights?

Thank you very much!

extract audio features

I want to use other methods to extract audio features, such as AST or SSAST, rather than PANN.

Will the results be similar to those in the paper?

.json annotation

Hello, I am very interested in your research.
Can you give a detailed description of the dataset annotation files, such as those for YouTube Highlights and TVSum? Thanks a lot.

My dataset

I have my own dataset and want to process it so that it is suitable for your model. Is there any reference code to convert my .mp4 files and score vectors into your dataset format?
Thank you.

Attention map visualization

How can I plot an attention map like Figure 6 of the MBT paper?
I see that the feature sequences are extracted by pre-trained models; how can I map the attention back onto the original video?
I hope you can give me some advice.

How to align the audio and video at the clip level

Your paper says "Visual and audio features are temporally aligned at clip level". For example, in the YouTube Highlights dataset the video is divided into clips of 100 frames with 50% overlap, and I extract audio features with the codebase you provided.
How do you align the audio and video features at the clip level? How did you do it?

How do I make my dataset

I have a lot of questions:
  1. I want to build a dataset similar to QVHighlights for my research direction. What do I need to do?
  2. What method was used to extract the text features of QVHighlights?

How can I download the data via wget

Thanks for your great work. I want to reproduce the work using the provided code and data. However, when I use wget to download the files to my Linux server, I get the following error:

Resolving connectpolyu-my.sharepoint.com (connectpolyu-my.sharepoint.com)... 52.105.223.41
Connecting to connectpolyu-my.sharepoint.com (connectpolyu-my.sharepoint.com)|52.105.223.41|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-04-11 12:24:52 ERROR 403: Forbidden.

I also tried using a proxy, but the error persists.

Error. TypeError: '>=' not supported between instances of 'DataContainer' and 'int'

When I train or pre-train the model, the error "TypeError: '>=' not supported between instances of 'DataContainer' and 'int'" occurs.

But eval works.

If I use .data to get the tensor, train and pre-train work, but eval no longer works.

File "UMT/models/model.py", line 34, in forward
    mask = torch.where(data['saliency'] >= 0, 1, 0)
TypeError: '>=' not supported between instances of 'DataContainer' and 'int'

model test

  1. How do I test on highlight_test_release.jsonl, and how do I output predictions in that format?

  2. Can you explain the metrics reported on the validation set?
     "MR-long-mAP", "MR-middle-mAP", "MR-short-mAP", "HL-min-Fair-mAP", "HL-min-Fair-Hit1", "HL-min-Good-mAP"

how to align the audio feature and video feature?

If the size of the video features is [14, 2048], I need to extract audio features with the same size [14, 2048].

Following you, I use the PANN_inference project to extract audio features from the raw wave file.
Because of the video clipping and overlap operations, the first dimension of the video features is 14. How do I align the audio features with the video features?

I found that the size of the audio features depends on the sample rate, window size, hop size, and so on. What should I set these parameters to?
I want to know more details about how to extract the audio features. Thank you.

Model applicability

Is the model suitable for processing long videos, for example 30, 40, or 50 minutes long?

Text embedding on charadesSTA dataset and some minor questions

Hi, first of all, thanks for your great work!

I plan to run experiments on the Charades-STA dataset with the features you provide,
but I notice that there are no text embedding files, although the other features (video, optical flow, audio) are available.

Can you provide the text embeddings you used for your experiments?

Also, I have some minor questions about data preprocessing.
Similar to the issue in #29 (comment),
I found that the length of the optical-flow features differs from that of the video features.

Did you simply crop the features to the shorter length, as with the audio features?

Thanks.

Misalignment between video and audio for QVhighlight

Thank you for the great work.

However, when I use the features provided by this repo,
some video and audio features are misaligned in their temporal length.

An example is attached below, in the order "vid, video shape, audio shape":
B3yOejNbNks_210.0_360.0 torch.Size([71, 2816]) torch.Size([70, 2048])

Can you explain how to align these features?

Thank you.
Best regards
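
For what it's worth, below is a minimal sketch of the workaround hinted at in the Charades-STA issue above, i.e. cropping both modalities to the shorter temporal length; whether this matches the authors' preprocessing is an assumption:

# align_features.py -- truncate paired features to a common length (assumed workaround, not the official pipeline)
import torch

video = torch.randn(71, 2816)  # e.g. video features of B3yOejNbNks_210.0_360.0
audio = torch.randn(70, 2048)  # corresponding audio features, one clip shorter

common_len = min(video.size(0), audio.size(0))
video, audio = video[:common_len], audio[:common_len]

print(video.shape, audio.shape)  # torch.Size([70, 2816]) torch.Size([70, 2048])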

How do I use the trained models available in model zoo

Hi,

Can you please provide a short example of how to use one of the models from the model zoo, where I just pass in the path to a video and a relevant text query to get the highlights? I know the process might not be as straightforward as I make it sound, but if there are any steps, please point me in the right direction and I will get started.

Looking forward to your help.

Thanks

retrieve a video in real time

Hello, can this method retrieve highlights from a video in real time?

The paper says "On YouTube Highlights and TVSum, we obtain clip-level visual features using an I3D [4] pre-trained on Kinetics 400 [13]", which suggests that for a previously unseen video, the audio and video features would have to be extracted offline first. Is that correct?
How can the highlighted part of a video be retrieved in real time?

Thanks a lot.

Any idea of model's general highlight effectiveness

Hi, thanks a lot for the decent work.

I want to ask about the model's general highlight-detection performance.

In the paper, in the evaluations on the QVHighlights and Charades datasets, the model seems to perform quite well at detecting highlights in videos even without text queries. However, you have only done domain-specific evaluation (i.e. 7 domains and 10 domains for the YouTube Highlights and TVSum datasets, respectively).

Is there a reason why you trained the model on each domain separately? It seems the model is capable of detecting highlights in a multi-domain setup.

I would like to build a general model for detecting highlights in videos.
Do you think that training on YouTube Highlights and TVSum in a multi-domain setting would just result in a poor model, even accounting for the performance degradation expected from the multi-domain setting?

The forward method of UMT

Why is the ground-truth saliency used as a mask during evaluation of the model?
See line 34 of models/model.py.
