gabeur / mmt
Multi-Modal Transformer for Video Retrieval
Home Page: http://thoth.inrialpes.fr/research/MMT/
License: Apache License 2.0
Could you share the full config file (.json) for training on the MSRVTT dataset?
Hi friends, when I run the code with the following command (after changing "n_gpu": 1 to "n_gpu": 2 in LSMDC_full_trainval.json), I get some error messages. I hope you can give me some suggestions.
CUDA_VISIBLE_DEVICES=0,1 python -m train --config configs_pub/eccv20/LSMDC_full_trainval.json
Thank you for your generous sharing.
I want to know the difference between ‘features.audio’ and 'features_t.audio' in the H5 file.
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
It shows: ERROR 502: Server UnReachable.
How can I get these datasets? Thank you.
Would you please provide your feature extraction code?
When ReduceDim
Lines 715 to 725 in 0d848cd
is applied to the video expert embeddings, it calls F.normalize on a tensor of shape (batch_size, num_tokens, embedding_dim) and reduces along the default dim=1 (num_tokens):
Lines 431 to 434 in 0d848cd
But here it is applied to a tensor of shape (batch_size, embedding_dim):
Lines 423 to 426 in 0d848cd
so the normalization runs along the embedding_dim axis.
Is this a mistake or not?
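To illustrate the asymmetry being asked about, here is a pure-Python sketch (shapes and values are arbitrary, and this only mirrors the semantics of F.normalize(dim=1), not the repo's actual code): with dim=1, a 3D (batch, tokens, dim) input is normalized across tokens, while a 2D (batch, dim) input is normalized across the embedding dimension.

```python
import math

def l2_normalize_axis1_3d(x):
    # x: nested list with shape (batch, tokens, dim).
    # The L2 norm is taken along axis 1 (tokens), mirroring
    # F.normalize(x, dim=1) on a 3D tensor.
    batch, tokens, dim = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * dim for _ in range(tokens)] for _ in range(batch)]
    for b in range(batch):
        for d in range(dim):
            norm = math.sqrt(sum(x[b][t][d] ** 2 for t in range(tokens)))
            for t in range(tokens):
                out[b][t][d] = x[b][t][d] / norm
    return out

def l2_normalize_axis1_2d(x):
    # x: nested list with shape (batch, dim).
    # Here axis 1 is the embedding dimension, so each row
    # (each embedding vector) ends up with unit L2 norm.
    out = []
    for row in x:
        norm = math.sqrt(sum(v ** 2 for v in row))
        out.append([v / norm for v in row])
    return out

# Same dim=1 argument, two different normalization axes:
print(l2_normalize_axis1_2d([[3.0, 4.0]]))      # normalizes each embedding
print(l2_normalize_axis1_3d([[[3.0], [4.0]]]))  # normalizes across tokens
```

So the two call sites really do normalize along different semantic axes, which is what the question is getting at.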
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
The above links are not available. ERROR 502: Server UnReachable.
Hi,
TypeError: Can't instantiate abstract class MixDataset with abstract methods configure_train_test_splits, load_features, sanity_checks.
This TypeError shows up when I run python -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json. Please help with a solution.
Could you provide configs for the different datasets? That would make it easier to reproduce the results.
Sorry to bother again.
How can I pre-train my own model on the HowTo100M dataset? Is there any chance that you will release the processed dataset sometime?
Hi, thanks for sharing the code,
Could you please share the S3D code you use for extracting the motion feature? Thanks.
I only find a non-official S3D implementation at https://github.com/kylemin/S3D.
I would appreciate your reply.
Thank you for your generous sharing of your wonderful research results! But I have some problems reproducing them.
I found that using the ‘HowTo100M_full_train.pth’ you provided cannot achieve the desired results on the ActivityNet dataset. What should I do if I want to finetune on ActivityNet? Is there any chance that you will release a ‘HowTo100M_full_train.pth’ for the ActivityNet dataset sometime?
Thank you!
Thank you for sharing your great work ^_^
Could you kindly provide the video features for the YouCook2 and DiDeMo datasets?
Thanks for sharing this project!!
But after running the following commands, I ran into a segmentation fault (core dumped) error:
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
tar -xvf MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
python -m train --config configs_pub/eccv20/prtrn_MSRVTT_jsfusion_trainval.json --only_eval --load_checkpoint data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth
My environment exactly matches the one you described in the README.
Could you help me with this error?
Dear Valentin,
first of all thank you for sharing your code, I think it is a pretty powerful gated model and you have good results, congrats on that!
I would like to ask you about features_t.s3d. In my examination of the MSRVTT .h5 files I didn't find timestamps for the S3D model (same for VGGish). From your ablation studies I see that this expert makes an important contribution to model performance. Could you please explain how the model handles the absence of S3D (VGGish) timestamps, and whether they are needed at all?
Cheers,
Vladyslav
Hi, I noticed a small typo in https://github.com/gabeur/mmt/blob/master/utils/util.py#L397, the variable name "query_suffling" should be corrected to "query_shuffling".
Do you have the HowTo100M pretrained model? We are not able to reproduce the results without it.
Hello, I would like to ask whether you have implemented a routine in the code that returns the top-N video filenames, not just the rankings?
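For what it's worth, given a vector of query-to-video similarities and the list of video identifiers in the same order (both hypothetical here; the repo's actual variable names may differ), mapping rankings to filenames is a small post-processing step:

```python
def top_n_filenames(similarities, filenames, n=5):
    # similarities: one score per candidate video, higher = better match.
    # filenames: video identifiers in the same order as the scores.
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return [filenames[i] for i in ranked[:n]]

# Example with made-up scores: the query matches video7391 best.
scores = [0.12, 0.87, 0.55]
names = ["video2001", "video7391", "video1024"]
print(top_n_filenames(scores, names, n=2))  # ['video7391', 'video1024']
```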
Thank you for your generous sharing of your code!
But I ran into an issue when running your code directly.
An error was thrown while loading the data (AttributeError: 'Dataset' object has no attribute 'value', at video_data[f"raw_captions.{i}"].value).
I tried to modify the attribute access, but I could not find the proper replacement.
Thank you!
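If I'm reading the error right, this is the removal of the deprecated Dataset.value property in h5py 3.0: indexing the dataset with an empty tuple reads the same data. A minimal sketch (the dataset name and contents below are only illustrative, built in memory rather than from the real feature file):

```python
import io
import h5py

# Build a tiny in-memory HDF5 file standing in for the real feature file.
buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    f.create_dataset("raw_captions.0", data=b"a person cooks pasta")

with h5py.File(buf, "r") as f:
    # Pre-3.0 h5py:   caption = f["raw_captions.0"].value
    # h5py >= 3.0:    index with an empty tuple instead.
    caption = f["raw_captions.0"][()]

print(caption)  # b'a person cooks pasta'
```

Alternatively, pinning h5py below 3.0 should restore the `.value` attribute without code changes.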
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
it reports 404 Not Found.
How can I get this dataset? Thanks.
Hi. Thank you for generously sharing your work. When I trained the model on MSRVTT with one V100, I found the GPU utilization cannot reach 100% (it stays around 60%). Do you have any tips? Thank you.
Hi, thanks for sharing the code.
I noticed that the speech features are missing for all video clips in the LSMDC.tar.gz. But the paper mentioned that
I watched some original video clips from the LSMDC dataset and found that they all contain audio from which speech transcripts could be extracted using the Google Cloud Speech-to-Text API.
My question is therefore: did you train the MMT model with speech features on the LSMDC dataset? If so, what was the result, and would you please share the speech feature files? If not, why didn't you utilize the speech features?
I would appreciate your reply.
How are the expert embeddings generated? Is E^n (of dimension d_model) produced from F^n(v) = [F^n_1, ..., F^n_K]? (This part of the paper's description is not very clear.)
Moreover, how are the temporal embeddings {T_1, ..., T_D} aligned with F^n(v) = [F^n_1, ..., F^n_K]?
For example, D = 8 s, and then K = 30.
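One plausible reading of the alignment (this is an assumption on my part, not something confirmed by the repo): each of the K extracted features carries a start timestamp, and feature F^n_k receives the temporal embedding T_d whose one-second bin contains that timestamp, so K and D never need to match. A toy sketch of that bucketing:

```python
def assign_temporal_ids(timestamps, max_seconds):
    # timestamps: start time (in seconds) of each of the K extracted
    # features. Returns, for each feature, the 1-based index d of the
    # temporal embedding T_d it would receive under the assumed scheme
    # (features past D seconds are clamped to T_D).
    ids = []
    for t in timestamps:
        d = int(t) + 1            # second [0,1) -> T_1, [1,2) -> T_2, ...
        ids.append(min(d, max_seconds))
    return ids

# K = 6 features over a D = 3 second clip, sampled every 0.5 s:
stamps = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(assign_temporal_ids(stamps, max_seconds=3))  # [1, 1, 2, 2, 3, 3]
```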
First of all, thanks a lot for sharing such a great work. It is really interesting reading your paper and work.
Furthermore, if I want to analyze the performance of MMT on other databases like V3C1, how would I extract the expert embeddings for raw videos? Do I have to extract them on my own first, and does your code only work with pre-computed features, or does it also extract the expert features from the videos?
As far as I can tell from the code, pre-computed features are provided for databases such as MSRVTT and LSMDC, but there does not seem to be any file for extracting embedded features with the pretrained experts that you used.
Would it be possible for you to share the models (pre-trained experts) and code you used for extracting expert embeddings from videos?
Thank you for your generous sharing of these wonderful research results! But I have some small problems reproducing them.
After training the MMT model, how can I use it to retrieve the corresponding video clip given a text query that never appears in the training set?
In other words, does this model output a regression result for the start and end time of the corresponding video clip, or does it just return the most similar video clip from the dataset it learned?
Line 343 in ef81f96
The class transformers.tokenization_bert.BertTokenizer does not have a cls_token_ids attribute; it must be cls_token_id here.
Due to this mistake, none of the text examples get their SEP and CLS tokens.
Hi,
Could you please provide the feature extraction code?
Thanks a lot for sharing such great work.
While reading the code, I ran into some questions I could not resolve. I would like to ask if you can give me some tips. Thank you very much.
Q: Could you explain the input parameters of the forward function on line 312 of model/model.py? It's really hard for me to understand the meaning of the features_t and features_ind parameters. Looking forward to your suggestions.