gabeur / mmt

Multi-Modal Transformer for Video Retrieval

Home Page: http://thoth.inrialpes.fr/research/MMT/

License: Apache License 2.0

Language: Python 100.00%
Topics: fusion, language, multimodal, nlp, video, vision

mmt's Issues

TypeError

Hi,
TypeError: Can't instantiate abstract class MixDataset with abstract methods configure_train_test_splits, load_features, sanity_checks.
This TypeError shows up when I run python -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json. Please help with a solution.
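
For context on what Python is reporting: a subclass of abc.ABC can only be instantiated once every @abstractmethod has a concrete override, so the error means MixDataset (as loaded) lacks implementations of those three methods. A minimal sketch with illustrative class names, not the repo's actual hierarchy:

```python
from abc import ABC, abstractmethod

class BaseDataset(ABC):
    @abstractmethod
    def load_features(self): ...

class BrokenDataset(BaseDataset):
    pass  # load_features is not overridden

try:
    BrokenDataset()
except TypeError as e:
    print(e)  # Can't instantiate abstract class BrokenDataset ...

class FixedDataset(BaseDataset):
    def load_features(self):
        return []

FixedDataset()  # fine once every abstract method is implemented
```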

Wrong dimension for L2 normalization?

When ReduceDim

mmt/model/model.py

Lines 715 to 725 in 0d848cd

```python
class ReduceDim(nn.Module):
    def __init__(self, input_dimension, output_dimension):
        super(ReduceDim, self).__init__()
        self.fc = nn.Linear(input_dimension, output_dimension)

    def forward(self, x):
        x = self.fc(x)
        x = F.normalize(x)
        return x
```

is applied to the video expert embeddings, F.normalize operates on tensors of shape (batch_size, num_tokens, embedding_dim) and, by default, normalizes along dim=1 (the num_tokens axis):

mmt/model/model.py

Lines 431 to 434 in 0d848cd

```python
if self.vid_inp in ['both', 'temp', 'all']:
    for mod in self.modalities:
        layer = self.video_dim_reduce[mod]
        experts_feats[mod] = layer(experts_feats[mod])
```

But here it is applied to tensors of shape (batch_size, embedding_dim):

mmt/model/model.py

Lines 423 to 426 in 0d848cd

```python
for mod in self.modalities:
    layer = self.video_dim_reduce[mod]
    mnp_experts[mod] = layer(pooled_experts[f'{mod}_avgpool'])
    maxp_experts[mod] = layer(pooled_experts[f'{mod}_maxpool'])
```

so in this case the normalization runs along the embedding_dim axis.

Is the first case a mistake, or is it intentional?
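
To make the difference concrete, here is a small self-contained check of F.normalize's default dim=1 behaviour on both shapes (illustrative sizes):

```python
import torch
import torch.nn.functional as F

x3d = torch.randn(2, 30, 512)  # (batch_size, num_tokens, embedding_dim)
x2d = torch.randn(2, 512)      # (batch_size, embedding_dim)

# Default dim=1 is the token axis for the 3D tensor but the embedding
# axis for the 2D tensor.
print(F.normalize(x3d).norm(dim=1)[0, :3])           # unit norms across tokens
print(F.normalize(x3d, dim=-1).norm(dim=-1)[0, :3])  # unit norms across embeddings
print(F.normalize(x2d).norm(dim=1))                  # unit norms across embeddings
```

If per-embedding L2 normalization is intended at both call sites, passing dim=-1 explicitly would make ReduceDim shape-agnostic.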

[CLS] [SEP] bug

if not hasattr(self.tokenizer, "cls_token_ids"):

The class transformers.tokenization_bert.BertTokenizer does not have a cls_token_ids attribute; it should be cls_token_id.
Due to this mistake, none of the text examples get [CLS] and [SEP] tokens.
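
A quick check of the attribute names on the Hugging Face tokenizer (assuming the standard bert-base-cased vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(hasattr(tokenizer, "cls_token_ids"))  # False: this hasattr check never passes
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102 for this vocab
```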

H5 files with video features

Thank you for your generous sharing.
I want to know the difference between 'features.audio' and 'features_t.audio' in the H5 file.

How to run code with multiple GPUs

Hi friends, when I run the code with the following command, after changing "n_gpu": 1 to "n_gpu": 2 in LSMDC_full_trainval.json, I get some error messages. I hope you can give me some suggestions.

Command

CUDA_VISIBLE_DEVICES=0,1 python -m train --config configs_pub/eccv20/LSMDC_full_trainval.json

Error

(screenshot of the error message)
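
For reference, a minimal sketch of how multi-GPU data parallelism is typically enabled in PyTorch (the module below is an illustrative stand-in for the MMT model; presumably the repo's trainer wires this up from the n_gpu config value, and the actual failure can't be diagnosed without the error text):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the MMT model
if torch.cuda.device_count() > 1:
    # Replicate the module across all visible GPUs and split each batch.
    model = nn.DataParallel(model)
model = model.to("cuda")
```

With CUDA_VISIBLE_DEVICES=0,1, torch.cuda.device_count() should report 2; if it reports 1, the problem is likely in the environment rather than the config.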

About MSRVTT_full

Could you share the config file (.json) for training on the full MSRVTT dataset?

Expert embeddings and Temporal embeddings

How are the expert embeddings generated? Is $E^n$ (of dimension $d_{model}$) computed from $F^n(v) = [F^n_1, \ldots, F^n_K]$? (This part of the paper's description is not very clear.)
Moreover, how are the temporal embeddings $\{T_1, \ldots, T_D\}$ aligned with $F^n(v) = [F^n_1, \ldots, F^n_K]$?
For example, when $D = 8$ s and $K = 30$.
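
One plausible reading of the alignment, sketched below; this is my assumption from the paper's description, not the authors' confirmed implementation: each of the K features carries a timestamp within the D-second video and is summed with the learned temporal embedding of the second it falls into.

```python
import torch
import torch.nn as nn

D, K, d_model = 8, 30, 512
temporal_emb = nn.Embedding(D, d_model)        # learned T_1 ... T_D, one per second
feat_times = torch.linspace(0, D, K + 1)[:-1]  # assumed timestamp of each feature
second_idx = feat_times.long().clamp(max=D - 1)
features = torch.randn(K, d_model)             # stand-in for F^n_1 ... F^n_K
features = features + temporal_emb(second_idx) # each F^n_k gets its aligned T_d
```

Under this reading, several of the K = 30 features share each of the D = 8 temporal embeddings.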

How to train MMT from scratch on other databases, e.g. V3C1

First of all, thanks a lot for sharing such great work. It is really interesting to read your paper and work.

Furthermore, if I want to analyze the performance of MMT on other databases like V3C1, how can I extract the expert embeddings for raw videos? Do I have to extract them on my own first, i.e. does your code only work with pre-computed features, or does it also extract the expert features from the videos?
From the code it seems that pre-computed features are provided for databases such as MSRVTT and LSMDC, and there is no file for extracting embedded features with the pretrained experts you used.

Would it be possible for you to share the models (pre-trained experts) and the code you used for extracting the expert embeddings from videos?
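
For orientation, a generic sketch of what per-expert feature extraction looks like, using a recent torchvision backbone as a stand-in (the paper's actual experts, e.g. S3D and VGGish, are separate pretrained models that are not bundled with this repo):

```python
import torch
import torchvision.models as models

# Stand-in appearance expert: an ImageNet ResNet with the classifier removed.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

frames = torch.randn(30, 3, 224, 224)  # 30 sampled frames of one video
with torch.no_grad():
    feats = backbone(frames)           # (30, 2048) frame-level features
```

As the question notes, the repo appears to consume such pre-computed features from .h5 files rather than extracting them from raw video.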

About Dataloader

Thank you for your generous sharing of your code!
But I ran into an issue when running your code directly.
An error is thrown while loading data (AttributeError: 'Dataset' object has no attribute 'value', at video_data[f"raw_captions.{i}"].value).
I tried to modify the attribute, but I did not find a proper replacement.

Thank you!
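
A likely fix, given that h5py deprecated and then removed Dataset.value in version 3.0: index the dataset with an empty tuple to read the full array. The file name below is illustrative.

```python
import h5py

with h5py.File("captions.h5", "r") as video_data:  # illustrative file name
    i = 0
    caption = video_data[f"raw_captions.{i}"][()]  # replaces the removed .value
```

Alternatively, pinning h5py < 3.0 should keep the original .value code working.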

HowTo100M pre-processed version

Sorry to bother again.
How can I pre-train my own model on the HowTo100M dataset? Is there any chance that you will release the processed dataset sometime?

About inference

Thank you for your generous sharing of these wonderful research results! But I have a few small problems in reproducing them.

After training the MMT model, how can I use it to predict the corresponding video clip for a text query that never appears in the training set?

In other words, does this model output a regression of the start and end times of the corresponding video clip, or does it just return the most similar video clip from the set it has learned, given a query?
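
For reference, a minimal sketch of retrieval-style inference; MMT ranks a gallery of candidate videos by similarity to the text query rather than regressing clip boundaries, and the tensors below are illustrative placeholders for the model's actual embeddings:

```python
import torch
import torch.nn.functional as F

text_emb = torch.randn(1, 512)       # embedding of one unseen text query
video_embs = torch.randn(1000, 512)  # embeddings of the candidate video gallery

sims = F.normalize(text_emb) @ F.normalize(video_embs).t()  # cosine similarities
ranking = sims.argsort(dim=1, descending=True)              # best match first
print(ranking[0, :5])  # indices of the top-5 retrieved videos
```

The gallery can be any set of embedded videos, not just the training set, so unseen queries and unseen videos are both handled by the same similarity ranking.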

Segmentation fault (core dumped) when Evaluating

Thanks for sharing this project!!
But after I run the following commands, I ran into a segmentation fault (core dumped) error:
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
tar -xvf MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
python -m train --config configs_pub/eccv20/prtrn_MSRVTT_jsfusion_trainval.json --only_eval --load_checkpoint data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
(screenshot of the segmentation fault output)

I have exactly the same environment setup as you mentioned in the README.
Could you help me with this error?
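
A generic way to get more context from a crash like this (Python standard library, not repo-specific): enable faulthandler so the interpreter dumps the Python-level traceback when the segmentation fault happens.

```python
# Either add this near the top of the training entry point, or run the
# command as `python -X faulthandler -m train ...`:
import faulthandler

faulthandler.enable()  # dump the Python traceback if the process segfaults
```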

How to speed up the training process?

Hi. Thank you for generously sharing your work. When I trained the model on MSRVTT with one V100, I found that GPU utilization cannot reach 100% (it stays around 60%). Do you have any tips? Thank you.
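
Generic PyTorch input-pipeline tuning that often helps when utilization is bottlenecked on data loading (standard advice with illustrative values, not settings the authors have endorsed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 512))  # stand-in for the MSRVTT dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # load and collate features on CPU workers
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # avoid re-spawning workers every epoch (PyTorch >= 1.7)
)
```

Profiling one epoch with and without these settings shows quickly whether data loading is the bottleneck.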

missing speech features for LSMDC dataset

Hi, thanks for sharing the code.

I noticed that the speech features are missing for all video clips in LSMDC.tar.gz, but the paper mentions that:
(screenshot of the relevant paper excerpt)

I watched some original video clips from the LSMDC dataset and found that they all have audio, from which speech transcripts can be extracted using the Google Cloud Speech-to-Text API.

Therefore, my question is: did you train the MMT model with speech features on the LSMDC dataset? If so, what are the results, and would you please share the speech feature files? If not, why didn't you utilize the speech features?

I would appreciate your reply.
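
For anyone attempting the extraction themselves, a minimal sketch using the Google Cloud Speech-to-Text client library (assuming valid credentials and 16 kHz LINEAR16 audio; the file name is illustrative):

```python
from google.cloud import speech

client = speech.SpeechClient()
with open("clip_audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```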

Unable to reproduce the results

Do you have the HowTo100M pretrained model? We are not able to reproduce the results, since we don't have the pretrained model.

MSRVTT features_t.s3d and ablation studies

Dear Valentin,

first of all, thank you for sharing your code. I think it is a pretty powerful gated model and you have good results, congrats on that!
I would like to ask about features_t.s3d. When examining the MSRVTT .h5 files, I did not find timestamps for the s3d model (the same holds for vggish). From your ablation studies I see that this expert makes an important contribution to the model's performance. Could you please explain how the model handles the absence of s3d (vggish) timestamps, and whether they are needed at all?

Cheers,
Vladyslav
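
A quick way to investigate is to list every dataset stored in one of the provided .h5 files and see which experts have features_t.* entries (the path below is illustrative):

```python
import h5py

with h5py.File("MSRVTT/features.h5", "r") as f:  # illustrative path
    f.visit(print)  # prints every group/dataset name in the file
```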

About finetuning from a HowTo100M pretrained model on ActivityNet dataset

Thank you for your generous sharing of your wonderful research results! But I have some problems in reproducing them.

I found that using the 'HowTo100M_full_train.pth' checkpoint you provided cannot achieve the desired results on the ActivityNet dataset. What should I do if I want to finetune on ActivityNet? Is there any chance that you will release a 'HowTo100M_full_train.pth' suited to the ActivityNet dataset sometime?

Thank you!


Explanation of the input parameters of the forward function of the model

Thanks a lot for sharing such great work.
While reading the code, I ran into some questions I could not figure out. I would like to ask if you can give me some tips. Thank you very much.

Q: Could you explain the input parameters of the forward function on line 312 of model/model.py? It's really hard for me to understand the meaning of the parameters features_t and features_ind. Looking forward to your suggestions.

(screenshot of the forward function signature)
