gabeur / mmt
Multi-Modal Transformer for Video Retrieval
Home Page: http://thoth.inrialpes.fr/research/MMT/
License: Apache License 2.0
Could you share the full config file (.json) for training on the MSRVTT dataset?
Hi friends, when I run the code with the following command (after changing "n_gpu": 1 to "n_gpu": 2 in LSMDC_full_trainval.json), I get some error messages. I hope you can give me some suggestions.
CUDA_VISIBLE_DEVICES=0,1 python -m train --config configs_pub/eccv20/LSMDC_full_trainval.json
Thank you for your generous sharing.
I want to know the difference between ‘features.audio’ and 'features_t.audio' in the H5 file.
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
It shows: ERROR 502: Server UnReachable.
How can I get these datasets? Thank you.
Would you please provide your feature extraction code?
When ReduceDim
Lines 715 to 725 in 0d848cd
is applied to the video expert embeddings, it calls F.normalize on a tensor of shape (batch_size, num_tokens, embedding_dim) and reduces along the default dim=1 (num_tokens):
Lines 431 to 434 in 0d848cd
But here it is applied to a tensor of shape (batch_size, embedding_dim):
Lines 423 to 426 in 0d848cd
so the normalization runs along the embedding_dim axis.
Is this a mistake or not?
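To illustrate the asymmetry being asked about, here is a pure-Python sketch (shapes and values are arbitrary, and this only mirrors the semantics of F.normalize(dim=1), not the repo's actual code): with dim=1, a 3D (batch, tokens, dim) input is normalized across tokens, while a 2D (batch, dim) input is normalized across the embedding dimension.

```python
import math

def l2_normalize_axis1_3d(x):
    # x: nested list with shape (batch, tokens, dim).
    # The L2 norm is taken along axis 1 (tokens), mirroring
    # F.normalize(x, dim=1) on a 3D tensor.
    batch, tokens, dim = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * dim for _ in range(tokens)] for _ in range(batch)]
    for b in range(batch):
        for d in range(dim):
            norm = math.sqrt(sum(x[b][t][d] ** 2 for t in range(tokens)))
            for t in range(tokens):
                out[b][t][d] = x[b][t][d] / norm
    return out

def l2_normalize_axis1_2d(x):
    # x: nested list with shape (batch, dim).
    # Here axis 1 is the embedding dimension, so each row
    # (each embedding vector) ends up with unit L2 norm.
    out = []
    for row in x:
        norm = math.sqrt(sum(v ** 2 for v in row))
        out.append([v / norm for v in row])
    return out

# Same dim=1 argument, two different normalization axes:
print(l2_normalize_axis1_2d([[3.0, 4.0]]))      # normalizes each embedding
print(l2_normalize_axis1_3d([[[3.0], [4.0]]]))  # normalizes across tokens
```

So the two call sites really do normalize along different semantic axes, which is what the question is getting at.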
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
The above links are not available. ERROR 502: Server UnReachable.
Hi,
TypeError: Can't instantiate abstract class MixDataset with abstract methods configure_train_test_splits, load_features, sanity_checks.
This TypeError shows up when I run python -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json. Please help with a solution.
Could you provide configs for the different datasets? That would make it easier to reproduce the results.
Sorry to bother again.
How can I pre-train my own model on the HowTo100M dataset? Is there any chance that you will release the processed dataset sometime?
Hi, thanks for sharing the code,
Could you please share the S3D code you use for extracting the motion feature? Thanks.
I only find a non-official S3D implementation at https://github.com/kylemin/S3D.
I would appreciate your reply.
Thank you for your generous sharing of your wonderful research results! But I have some problems reproducing them.
I found that using the ‘HowTo100M_full_train.pth’ you provided cannot achieve the desired results on the ActivityNet dataset. What should I do if I want to finetune on ActivityNet? Is there any chance that you will release a ‘HowTo100M_full_train.pth’ for the ActivityNet dataset sometime?
Thank you!
Thank you for sharing your great work ^_^
Could you kindly provide the video features for the YouCook2 and DiDeMo datasets?
Thanks for sharing this project!!
But after running the following commands, I ran into a segmentation fault (core dumped) error:
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
tar -xvf MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
python -m train --config configs_pub/eccv20/prtrn_MSRVTT_jsfusion_trainval.json --only_eval --load_checkpoint data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
My environment exactly matches the one you described in the README.
Could you help me with this error?
Dear Valentin,
first of all thank you for sharing your code, I think it is a pretty powerful gated model and you have good results, congrats on that!
I would like to ask you about features_t.s3d. In my examination of the MSRVTT .h5 files I didn't find timestamps for the S3D model (same for VGGish). From your ablation studies I see that this expert makes an important contribution to model performance. Could you please explain how the model handles the absence of S3D (VGGish) timestamps, and whether they are needed at all?
Cheers,
Vladyslav
Hi, I noticed a small typo in https://github.com/gabeur/mmt/blob/master/utils/util.py#L397, the variable name "query_suffling" should be corrected to "query_shuffling".
Do you have the HowTo100M pretrained model? We are not able to reproduce the results without it.
Hello, I would like to ask whether you have implemented a routine in the code that returns the top-N video filenames, not just the rankings?
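For what it's worth, given a vector of query-to-video similarities and the list of video identifiers in the same order (both hypothetical here; the repo's actual variable names may differ), mapping rankings to filenames is a small post-processing step:

```python
def top_n_filenames(similarities, filenames, n=5):
    # similarities: one score per candidate video, higher = better match.
    # filenames: video identifiers in the same order as the scores.
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return [filenames[i] for i in ranked[:n]]

# Example with made-up scores: the query matches video7391 best.
scores = [0.12, 0.87, 0.55]
names = ["video2001", "video7391", "video1024"]
print(top_n_filenames(scores, names, n=2))  # ['video7391', 'video1024']
```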
Thank you for your generous sharing of your code!
But I ran into an issue when running your code directly.
An error was thrown while loading the data (AttributeError: 'Dataset' object has no attribute 'value', at video_data[f"raw_captions.{i}"].value).
I tried to modify the attribute access, but I could not find the proper replacement.
Thank you!
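If I'm reading the error right, this is the removal of the deprecated Dataset.value property in h5py 3.0: indexing the dataset with an empty tuple reads the same data. A minimal sketch (the dataset name and contents below are only illustrative, built in memory rather than from the real feature file):

```python
import io
import h5py

# Build a tiny in-memory HDF5 file standing in for the real feature file.
buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    f.create_dataset("raw_captions.0", data=b"a person cooks pasta")

with h5py.File(buf, "r") as f:
    # Pre-3.0 h5py:   caption = f["raw_captions.0"].value
    # h5py >= 3.0:    index with an empty tuple instead.
    caption = f["raw_captions.0"][()]

print(caption)  # b'a person cooks pasta'
```

Alternatively, pinning h5py below 3.0 should restore the `.value` attribute without code changes.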
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
it reports 404 Not Found.
How can I get this dataset? Thanks.
Hi. Thank you for generously sharing your work. When I trained the model on MSRVTT with one V100, I found the GPU utilization cannot reach 100% (it stays around 60%). Do you have any tips? Thank you.
Hi, thanks for sharing the code.
I noticed that the speech features are missing for all video clips in the LSMDC.tar.gz. But the paper mentioned that
I watched some original video clips from the LSMDC dataset and found that they all contain audio from which speech transcripts could be extracted using the Google Cloud Speech-to-Text API.
My question is therefore: did you train the MMT model with speech features on the LSMDC dataset? If so, what was the result, and would you please share the speech feature files? If not, why didn't you utilize the speech features?
I would appreciate your reply.
How are the expert embeddings generated? Is E^n (of dimension d_model) produced from F^n(v) = [F^n_1, ..., F^n_K]? (This part of the paper's description is not very clear.)
Moreover, how are the temporal embeddings {T_1, ..., T_D} aligned with F^n(v) = [F^n_1, ..., F^n_K]?
For example, D = 8 s, and then K = 30.
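One plausible reading of the alignment (this is an assumption on my part, not something confirmed by the repo): each of the K extracted features carries a start timestamp, and feature F^n_k receives the temporal embedding T_d whose one-second bin contains that timestamp, so K and D never need to match. A toy sketch of that bucketing:

```python
def assign_temporal_ids(timestamps, max_seconds):
    # timestamps: start time (in seconds) of each of the K extracted
    # features. Returns, for each feature, the 1-based index d of the
    # temporal embedding T_d it would receive under the assumed scheme
    # (features past D seconds are clamped to T_D).
    ids = []
    for t in timestamps:
        d = int(t) + 1            # second [0,1) -> T_1, [1,2) -> T_2, ...
        ids.append(min(d, max_seconds))
    return ids

# K = 6 features over a D = 3 second clip, sampled every 0.5 s:
stamps = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(assign_temporal_ids(stamps, max_seconds=3))  # [1, 1, 2, 2, 3, 3]
```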
First of all, thanks a lot for sharing such a great work. It is really interesting reading your paper and work.
Furthermore, if I want to analyze the performance of MMT on other databases like V3C1, how would I extract the expert embeddings for raw videos? Do I have to extract them on my own first, and does your code only work with pre-computed features, or does it also extract the expert features from the videos?
As far as I can tell from the code, pre-computed features are provided for databases such as MSRVTT and LSMDC, but there does not seem to be any file for extracting embedded features with the pretrained experts that you used.
Would it be possible for you to share the models (pre-trained experts) and code you used for extracting expert embeddings from videos?
Thank you for your generous sharing of these wonderful research results! But I have some small problems reproducing them.
After training the MMT model, how can I use it to retrieve the corresponding video clip given a text query that never appears in the training set?
In other words, does this model output a regression result for the start and end time of the corresponding video clip, or does it just return the most similar video clip from the dataset it learned?
Line 343 in ef81f96
The class transformers.tokenization_bert.BertTokenizer does not have a cls_token_ids attribute; it must be cls_token_id here.
Due to this mistake, none of the text examples get their SEP and CLS tokens.
Hi,
Could you please provide the feature extraction code?
Thanks a lot for sharing such great work.
While reading the code, I ran into some questions I could not resolve. I would like to ask if you can give me some tips. Thank you very much.
Q: Could you explain the input parameters of the forward function on line 312 of model/model.py? It's really hard for me to understand the meaning of the features_t and features_ind parameters. Looking forward to your suggestions.