gabeur / mmt

Multi-Modal Transformer for Video Retrieval

Home Page: http://thoth.inrialpes.fr/research/MMT/

License: Apache License 2.0

Language: Python 100.00%
Topics: fusion, language, multimodal, nlp, video, vision

mmt's Issues

TypeError

Hi,
TypeError: Can't instantiate abstract class MixDataset with abstract methods configure_train_test_splits, load_features, sanity_checks.
This TypeError shows up when I run python -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json. Please help with a solution.
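
For context on what Python is reporting: a subclass of abc.ABC can only be instantiated once every @abstractmethod has a concrete override, so the error means MixDataset (as loaded) lacks implementations of those three methods. A minimal sketch with illustrative class names, not the repo's actual hierarchy:

```python
from abc import ABC, abstractmethod

class BaseDataset(ABC):
    @abstractmethod
    def load_features(self): ...

class BrokenDataset(BaseDataset):
    pass  # load_features is not overridden

try:
    BrokenDataset()
except TypeError as e:
    print(e)  # Can't instantiate abstract class BrokenDataset ...

class FixedDataset(BaseDataset):
    def load_features(self):
        return []

FixedDataset()  # fine once every abstract method is implemented
```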

Wrong dimension for L2 normalization?

When ReduceDim

mmt/model/model.py

Lines 715 to 725 in 0d848cd

```python
class ReduceDim(nn.Module):
    def __init__(self, input_dimension, output_dimension):
        super(ReduceDim, self).__init__()
        self.fc = nn.Linear(input_dimension, output_dimension)

    def forward(self, x):
        x = self.fc(x)
        x = F.normalize(x)
        return x
```

is applied to the video expert embeddings, F.normalize operates on tensors of shape (batch_size, num_tokens, embedding_dim) and, by default, normalizes along dim=1 (the num_tokens axis):

mmt/model/model.py

Lines 431 to 434 in 0d848cd

```python
if self.vid_inp in ['both', 'temp', 'all']:
    for mod in self.modalities:
        layer = self.video_dim_reduce[mod]
        experts_feats[mod] = layer(experts_feats[mod])
```

But here it is applied to tensors of shape (batch_size, embedding_dim):

mmt/model/model.py

Lines 423 to 426 in 0d848cd

```python
for mod in self.modalities:
    layer = self.video_dim_reduce[mod]
    mnp_experts[mod] = layer(pooled_experts[f'{mod}_avgpool'])
    maxp_experts[mod] = layer(pooled_experts[f'{mod}_maxpool'])
```

so in this case the normalization runs along the embedding_dim axis.

Is the first case a mistake, or is it intentional?
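
To make the difference concrete, here is a small self-contained check of F.normalize's default dim=1 behaviour on both shapes (illustrative sizes):

```python
import torch
import torch.nn.functional as F

x3d = torch.randn(2, 30, 512)  # (batch_size, num_tokens, embedding_dim)
x2d = torch.randn(2, 512)      # (batch_size, embedding_dim)

# Default dim=1 is the token axis for the 3D tensor but the embedding
# axis for the 2D tensor.
print(F.normalize(x3d).norm(dim=1)[0, :3])           # unit norms across tokens
print(F.normalize(x3d, dim=-1).norm(dim=-1)[0, :3])  # unit norms across embeddings
print(F.normalize(x2d).norm(dim=1))                  # unit norms across embeddings
```

If per-embedding L2 normalization is intended at both call sites, passing dim=-1 explicitly would make ReduceDim shape-agnostic.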

[CLS] [SEP] bug

if not hasattr(self.tokenizer, "cls_token_ids"):

The class transformers.tokenization_bert.BertTokenizer does not have a cls_token_ids attribute; it should be cls_token_id.
Due to this mistake, none of the text examples get [CLS] and [SEP] tokens.
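
A quick check of the attribute names on the Hugging Face tokenizer (assuming the standard bert-base-cased vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(hasattr(tokenizer, "cls_token_ids"))  # False: this hasattr check never passes
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102 for this vocab
```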

H5 files with video features

Thank you for your generous sharing.
I want to know the difference between 'features.audio' and 'features_t.audio' in the H5 file.

How to run code with multiple GPUs

Hi friends, when I run the code with the following command, after changing "n_gpu": 1 to "n_gpu": 2 in LSMDC_full_trainval.json, I get some error messages. I hope you can give me some suggestions.

Command

CUDA_VISIBLE_DEVICES=0,1 python -m train --config configs_pub/eccv20/LSMDC_full_trainval.json

Error

(screenshot of the error message)
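
For reference, a minimal sketch of how multi-GPU data parallelism is typically enabled in PyTorch (the module below is an illustrative stand-in for the MMT model; presumably the repo's trainer wires this up from the n_gpu config value, and the actual failure can't be diagnosed without the error text):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the MMT model
if torch.cuda.device_count() > 1:
    # Replicate the module across all visible GPUs and split each batch.
    model = nn.DataParallel(model)
model = model.to("cuda")
```

With CUDA_VISIBLE_DEVICES=0,1, torch.cuda.device_count() should report 2; if it reports 1, the problem is likely in the environment rather than the config.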

About MSRVTT_full

Could you share the config file (.json) for training on the full MSRVTT dataset?

Expert embeddings and Temporal embeddings

How are the expert embeddings generated? Is $E^n$ (of dimension $d_{model}$) computed from $F^n(v) = [F^n_1, \ldots, F^n_K]$? (This part of the paper's description is not very clear.)
Moreover, how are the temporal embeddings $\{T_1, \ldots, T_D\}$ aligned with $F^n(v) = [F^n_1, \ldots, F^n_K]$?
For example, when $D = 8$ s and $K = 30$.
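
One plausible reading of the alignment, sketched below; this is my assumption from the paper's description, not the authors' confirmed implementation: each of the K features carries a timestamp within the D-second video and is summed with the learned temporal embedding of the second it falls into.

```python
import torch
import torch.nn as nn

D, K, d_model = 8, 30, 512
temporal_emb = nn.Embedding(D, d_model)        # learned T_1 ... T_D, one per second
feat_times = torch.linspace(0, D, K + 1)[:-1]  # assumed timestamp of each feature
second_idx = feat_times.long().clamp(max=D - 1)
features = torch.randn(K, d_model)             # stand-in for F^n_1 ... F^n_K
features = features + temporal_emb(second_idx) # each F^n_k gets its aligned T_d
```

Under this reading, several of the K = 30 features share each of the D = 8 temporal embeddings.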

How to train MMT from scratch on other databases, e.g. V3C1

First of all, thanks a lot for sharing such great work. It is really interesting to read your paper and work.

Furthermore, if I want to analyze the performance of MMT on other databases like V3C1, how can I extract the expert embeddings for raw videos? Do I have to extract them on my own first, i.e. does your code only work with pre-computed features, or does it also extract the expert features from the videos?
From the code it seems that pre-computed features are provided for databases such as MSRVTT and LSMDC, and there is no file for extracting embedded features with the pretrained experts you used.

Would it be possible for you to share the models (pre-trained experts) and the code you used for extracting the expert embeddings from videos?
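
For orientation, a generic sketch of what per-expert feature extraction looks like, using a recent torchvision backbone as a stand-in (the paper's actual experts, e.g. S3D and VGGish, are separate pretrained models that are not bundled with this repo):

```python
import torch
import torchvision.models as models

# Stand-in appearance expert: an ImageNet ResNet with the classifier removed.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

frames = torch.randn(30, 3, 224, 224)  # 30 sampled frames of one video
with torch.no_grad():
    feats = backbone(frames)           # (30, 2048) frame-level features
```

As the question notes, the repo appears to consume such pre-computed features from .h5 files rather than extracting them from raw video.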

About Dataloader

Thank you for your generous sharing of your code!
But I ran into an issue when running your code directly.
An error is thrown while loading data (AttributeError: 'Dataset' object has no attribute 'value', at video_data[f"raw_captions.{i}"].value).
I tried to modify the attribute, but I did not find a proper replacement.

Thank you!
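
A likely fix, given that h5py deprecated and then removed Dataset.value in version 3.0: index the dataset with an empty tuple to read the full array. The file name below is illustrative.

```python
import h5py

with h5py.File("captions.h5", "r") as video_data:  # illustrative file name
    i = 0
    caption = video_data[f"raw_captions.{i}"][()]  # replaces the removed .value
```

Alternatively, pinning h5py < 3.0 should keep the original .value code working.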

HowTo100M pre-processed version

Sorry to bother again.
How can I pre-train my own model on the HowTo100M dataset? Is there any chance that you will release the processed dataset sometime?

About inference

Thank you for your generous sharing of these wonderful research results! But I have a few small problems in reproducing them.

After training the MMT model, how can I use it to predict the corresponding video clip for a text query that never appears in the training set?

In other words, does this model output a regression of the start and end times of the corresponding video clip, or does it just return the most similar video clip from the set it has learned, given a query?
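
For reference, a minimal sketch of retrieval-style inference; MMT ranks a gallery of candidate videos by similarity to the text query rather than regressing clip boundaries, and the tensors below are illustrative placeholders for the model's actual embeddings:

```python
import torch
import torch.nn.functional as F

text_emb = torch.randn(1, 512)       # embedding of one unseen text query
video_embs = torch.randn(1000, 512)  # embeddings of the candidate video gallery

sims = F.normalize(text_emb) @ F.normalize(video_embs).t()  # cosine similarities
ranking = sims.argsort(dim=1, descending=True)              # best match first
print(ranking[0, :5])  # indices of the top-5 retrieved videos
```

The gallery can be any set of embedded videos, not just the training set, so unseen queries and unseen videos are both handled by the same similarity ranking.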

Segmentation fault (core dumped) when Evaluating

Thanks for sharing this project!!
But after I run the following commands, I ran into a segmentation fault (core dumped) error:
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
tar -xvf MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
python -m train --config configs_pub/eccv20/prtrn_MSRVTT_jsfusion_trainval.json --only_eval --load_checkpoint data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
(screenshot of the segmentation fault output)

I have exactly the same environment setup as you mentioned in the README.
Could you help me with this error?
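
A generic way to get more context from a crash like this (Python standard library, not repo-specific): enable faulthandler so the interpreter dumps the Python-level traceback when the segmentation fault happens.

```python
# Either add this near the top of the training entry point, or run the
# command as `python -X faulthandler -m train ...`:
import faulthandler

faulthandler.enable()  # dump the Python traceback if the process segfaults
```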

How to speed up the training process?

Hi. Thank you for generously sharing your work. When I trained the model on MSRVTT with one V100, I found that GPU utilization cannot reach 100% (it stays around 60%). Do you have any tips? Thank you.
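
Generic PyTorch input-pipeline tuning that often helps when utilization is bottlenecked on data loading (standard advice with illustrative values, not settings the authors have endorsed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 512))  # stand-in for the MSRVTT dataset

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # load and collate features on CPU workers
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # avoid re-spawning workers every epoch (PyTorch >= 1.7)
)
```

Profiling one epoch with and without these settings shows quickly whether data loading is the bottleneck.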

missing speech features for LSMDC dataset

Hi, thanks for sharing the code.

I noticed that the speech features are missing for all video clips in LSMDC.tar.gz, but the paper mentions that:
(screenshot of the relevant paper excerpt)

I watched some original video clips from the LSMDC dataset and found that they all have audio, from which speech transcripts can be extracted using the Google Cloud Speech-to-Text API.

Therefore, my question is: did you train the MMT model with speech features on the LSMDC dataset? If so, what are the results, and would you please share the speech feature files? If not, why didn't you utilize the speech features?

I would appreciate your reply.
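
For anyone attempting the extraction themselves, a minimal sketch using the Google Cloud Speech-to-Text client library (assuming valid credentials and 16 kHz LINEAR16 audio; the file name is illustrative):

```python
from google.cloud import speech

client = speech.SpeechClient()
with open("clip_audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```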

Unable to reproduce the results

Do you have the HowTo100M pretrained model? We are not able to reproduce the results, since we don't have the pretrained model.

MSRVTT features_t.s3d and ablation studies

Dear Valentin,

first of all, thank you for sharing your code. I think it is a pretty powerful gated model and you have good results, congrats on that!
I would like to ask about features_t.s3d. When examining the MSRVTT .h5 files, I did not find timestamps for the s3d model (the same holds for vggish). From your ablation studies I see that this expert makes an important contribution to the model's performance. Could you please explain how the model handles the absence of s3d (vggish) timestamps, and whether they are needed at all?

Cheers,
Vladyslav
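
A quick way to investigate is to list every dataset stored in one of the provided .h5 files and see which experts have features_t.* entries (the path below is illustrative):

```python
import h5py

with h5py.File("MSRVTT/features.h5", "r") as f:  # illustrative path
    f.visit(print)  # prints every group/dataset name in the file
```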

About finetuning from a HowTo100M pretrained model on ActivityNet dataset

Thank you for your generous sharing of your wonderful research results! But I have some problems in reproducing them.

I found that using the 'HowTo100M_full_train.pth' checkpoint you provided cannot achieve the desired results on the ActivityNet dataset. What should I do if I want to finetune on ActivityNet? Is there any chance that you will release a 'HowTo100M_full_train.pth' suited to the ActivityNet dataset sometime?

Thank you!


Explanation of the input parameters of the forward function of the model

Thanks a lot for sharing such great work.
While reading the code, I ran into some questions I could not figure out. I would like to ask if you can give me some tips. Thank you very much.

Q: Could you explain the input parameters of the forward function on line 312 of model/model.py? It's really hard for me to understand the meaning of the parameters features_t and features_ind. Looking forward to your suggestions.

(screenshot of the forward function signature)
