jayleicn / moment_detr
[NeurIPS 2021] Moment-DETR code and QVHighlights dataset
Home Page: https://arxiv.org/abs/2107.09609
License: MIT License
Hi, thanks for your excellent work! I found that the provided video features include both CLIP features and SlowFast features. However, run_on_video/run.py only extracts the CLIP features. Is this a mistake? Also, could you please provide a version of run.py that extracts both CLIP and SlowFast features? Thank you.
Excuse me, CodaLab only allows 5 uploads. How can I evaluate the V+A results? My username is Lonicerin.
Hi, could you kindly share the best weights that you reported in the paper?
Hello, I sent a request to register for your competition on CodaLab. Could you have a look at it? My username is Pupil_Ling. Thank you very much.
Hi,
Where could we find the original videos? Thanks!
Hi!
I have recently submitted a registration request for the QVHighlights Codalab competition using the username 'icefree'.
I would greatly appreciate it if you could review my application at your earliest convenience.
Hi, I sent a request to join your competition on CodaLab several days ago. Could you have a look at it? My username is Young_tz.
competition link
Hi there, thanks for sharing your great work!
Following issues #11 and #8, I'm trying to train the model on Charades-STA.
However, I got the error message 'raise ValueError("Sample larger than population or is negative")'.
I think the dataset loader cannot sample negative clips because some ground-truth moments cover the entire video.
moment_detr/moment_detr/start_end_dataset.py
Lines 104 to 117 in 1e67364
Thanks.
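For readers hitting the same error: "Sample larger than population or is negative" is the message Python's random.sample raises when asked for more items than the pool holds. The sketch below is not the repository's code; sample_negative_clips and its arguments are hypothetical names used only to illustrate one way to guard the sampling step when the ground-truth moment leaves too few clips outside it.

```python
import random


def sample_negative_clips(num_clips, gt_start, gt_end, k=2):
    """Pick up to k clip indices outside the ground-truth span [gt_start, gt_end].

    Hypothetical helper for illustration; not taken from start_end_dataset.py.
    """
    # Clips that fall outside the ground-truth moment form the negative pool.
    neg_pool = [i for i in range(num_clips) if i < gt_start or i > gt_end]
    if len(neg_pool) < k:
        # random.sample(neg_pool, k) would raise ValueError here, so return
        # whatever is available (possibly nothing) and let the caller decide.
        return list(neg_pool)
    return random.sample(neg_pool, k)


# Ground truth covers every clip: the pool is empty, so no negatives are returned.
full_cover = sample_negative_clips(num_clips=10, gt_start=0, gt_end=9)
# Ground truth covers clips 2..5: two negatives are drawn from the remaining clips.
partial = sample_negative_clips(num_clips=10, gt_start=2, gt_end=5)
```

Whether an empty result should skip the sample, fall back to positive clips, or disable the related loss term is a design choice this sketch leaves open.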
First of all, I thank you for your contribution to the community.
Can you explain why Moment-DETR only supports videos up to 150s? When I comment out line 41 of run_on_video/run.py, it still works.
Hi, thanks for your great work! I have a question: how do you fuse the image features from a 2-second clip into a single clip-level video feature, given that ViT is a feature extraction model for images, not videos?
Hi, I recently generated the test and validation results and submitted them to CodaLab with the following structure.
--Submit.zip
----hl_val_submission.jsonl
----hl_test_submission.jsonl
The CodaLab gave me the error IOError: [Errno 2] No such file or directory: '/tmp/codalab/tmphfqu8Q/run/input/res/hl_test_submission.jsonl'
How can I solve this problem?
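One possible cause, judging only from the path in the error message, is that the scorer looks for the .jsonl files directly under its res/ directory, so an archive that wraps them in a folder would not match. The sketch below assumes that reading; make_flat_submission is a hypothetical helper, not part of the repository, and it writes each file at the top level of the archive using the standard zipfile module.

```python
import zipfile
from pathlib import Path


def make_flat_submission(files, zip_path):
    """Write each file at the top level of the archive, dropping parent folders.

    Hypothetical helper for illustration; returns the archive's member names.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            # arcname=f.name stores only the file name, without any directory prefix.
            zf.write(f, arcname=f.name)
    # Re-open the archive to report what it actually contains.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

Calling it with the paths of hl_val_submission.jsonl and hl_test_submission.jsonl should return just those two file names, which makes it easy to check the archive layout before uploading.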
Hi, thanks for your great work. I submitted my request many days ago. Could you approve it? Sorry for any inconvenience! My username is "nameless".
Hello.
Thanks for the great work.
Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition. (user name:Jin_Yang) But the maximum number of submissions has been reached. Can you help me obtain more submissions?
Thank you!!!
Btw, Sorry for bothering you.
Regards.
I really appreciate your work. I have a small question.
I trained the model following the given instructions (bash moment_detr/scripts/train.sh), and I also ran predictions on my own videos using the checkpoint you provided (PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py). Both worked.
But how can I run predictions on my own videos using a model I trained myself? (The training dataset is still yours.)
Simply changing the model path in run_example to the model I trained does not work.
So my questions are: what is the difference between the model I trained and the one you provide? How can I run predictions with my own trained model, and which command should I use to train it?
Hi, I sent a request to join your competition on CodaLab. Could you have a look at it? My username is old_tz.
competition link
Thanks for the impressive work.
I'm working on the video grounding task and want to measure the score on your test set.
I hope I can get approval for your competition. (user name: jinhyunj)
Thank you!
Hello.
Thanks for the great work.
Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition. (user name:Jin_Yang)
Thank you!!!
Btw, Sorry for bothering you.
Regards.
Hi,
I am a little confused about feature extraction.
If I understand correctly, there are two kinds of features: OpenAI CLIP and HERO_VIDEO_FEATURE_EXTRACTOR.
I would like to know the difference between the two, and the purpose of CLIP.
Also, I have run HERO_VIDEO_FEATURE_EXTRACTOR and I am left with 4 files:
Thank you
Hello.
Thanks for the great work. Motivated by the work and the interesting topic, my team sincerely hopes to get approved to be in the competition. (user name: IAIR) Thank you!!!
Btw, Sorry for bothering you.
Regards.
Hi @jayleicn, many thanks for sharing this great work! I was wondering whether your baseline models (e.g., MCN, XML, XML+) in Table 3 used the same feature extractors as Moment-DETR? Thanks!
Hi, I noticed that you also conduct experiments on CharadesSTA dataset. I'm wondering how you prepare the video feature in CharadesSTA dataset? Could you share the feature files you prepared?
Hello,
I recently submitted a registration request for the QVHighlights Codalab competition under the username 'ez615'.
Could you please review my application at your earliest convenience?
Thank you for your wonderful work!
However, when I tried to run your demo in folder run_on_video, the file bpe_simple_vocab_16e6.txt.gz for the tokenizer is missing.
Can you provide this file?
FileNotFoundError: [Errno 2] No such file or directory: 'moment_detr/run_on_video/clip/bpe_simple_vocab_16e6.txt.gz'
Hi authors, great work. I want to train the model on the Charades-STA dataset, and I found the 'opt.json' hyper-parameter file you provided in #11. In that configuration file, the parameter 'clip_len' is set to 2. What does it mean?
What does the parameter "clip_len" mean? What determines it? Does it depend on how we extract video features?
Are the CLIP text features extracted per word, and do the features capture the relationships between each word in the query and the other words? Thank you
Hi, I have a question about the "tef" in vision feature:
    if self.use_tef:
        tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
        tef_ed = tef_st + 1.0 / ctx_l
        tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
        if self.use_video:
            model_inputs["video_feat"] = torch.cat(
                [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
        else:
            model_inputs["video_feat"] = tef
What does "tef" mean in the visual feature? Thanks in advance.
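Reading the snippet literally, tef holds two numbers per clip: the clip's start position and end position, each divided by the number of clips ctx_l, so both lie in [0, 1]. The name probably abbreviates "temporal endpoint feature", a term used in earlier moment-localization work, but that expansion is an assumption here. A minimal pure-Python sketch of the same arithmetic, with temporal_endpoint_features as a hypothetical name:

```python
def temporal_endpoint_features(ctx_l):
    """Return [[start, end], ...] for ctx_l clips, normalized to [0, 1].

    Mirrors the arithmetic of the quoted snippet without torch.
    """
    # Normalized start position of each clip: 0/ctx_l, 1/ctx_l, ...
    tef_st = [i / ctx_l for i in range(ctx_l)]
    # Normalized end position: start plus the width of one clip.
    tef_ed = [st + 1.0 / ctx_l for st in tef_st]
    return [[st, ed] for st, ed in zip(tef_st, tef_ed)]


tef = temporal_endpoint_features(4)
# tef == [[0.0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1.0]]
```

In the quoted code these two columns are concatenated to each clip's visual feature, which is why the comment gives the shape as (Lv, Dv+2).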
What happens if neg_pool in get_saliency_labels_sub_as_query is empty because the ground-truth moment of a video sample in a video retrieval task spans the whole video, from start to end?
Thank you for your great work and open-source code.
I have a question about the GT saliency scores (only localized 2-sec clips). Can you please explain them briefly?
Also, how do the predicted saliency scores (for all 2-sec clips) correspond to the GT saliency scores?
Thanks!
Best,
Kevin
Build models...
Loading feature extractors...
Loading CLIP models
Loading trained Moment-DETR model...
Run prediction...
------------------------------idx0
>> query: Chef makes pizza and cuts it up.
>> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
>> GT moments: [[106, 122]]
>> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
[49.967, 64.9129, 0.9421],
[66.4396, 81.0731, 0.9271],
[105.9434, 122.0372, 0.9234],
[93.2057, 103.3713, 0.2222],
...,
[45.3834, 52.2183, 0.0005]
]
>> GT saliency scores (only localized 2-sec clips): # what does this mean?
[[2, 3, 3], [2, 3, 3], ...]
>> Predicted saliency scores (for all 2-sec clip): # how do these correspond to the GT saliency scores?
[-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]
Hello,
I had sent a request to participate in the Codalab evaluation server (username: noga), on 29.04. Could you please approve the request?
Best,
Noga
Traceback (most recent call last):
  File "moment_detr/train.py", line 255, in <module>
    best_ckpt_path, eval_split_name, eval_path, debug = start_training()
  File "moment_detr/train.py", line 246, in start_training
    model, criterion, optimizer, lr_scheduler = setup_model(opt)
  File "/home/ciivam/IE643_PROJECT/moment_detr/moment_detr/inference.py", line 195, in setup_model
    model, criterion = build_model(opt)
  File "/home/ciivam/IE643_PROJECT/moment_detr/moment_detr/model.py", line 445, in build_model
    criterion.to(device)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 688, in _apply
    self._buffers[key] = fn(buf)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
I get this error when running bash moment_detr/scripts/train.sh. Can't we run it without a GPU? I know it would take a lot of time.
Hi, I sent a request to join your competition on CodaLab several days ago. Could you have a look at it?
competition link
Hi @jayleicn, thanks for your great work! I noticed that in the annotation files, as shown below, the duration of a video (126s) does not match the duration implied by its ID (810s - 660s = 150s). Should I crop the original video to 126s before processing in this case?
{
  "qid": 8737,
  "query": "A family is playing basketball together on a green court outside.",
  "duration": 126,
  "vid": "bP5KfdFJzC4_660.0_810.0",
  "relevant_windows": [[0, 16]],
  "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7],
  "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}
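The fields in this annotation appear to be tied together by the 2-second clip length mentioned elsewhere on this page: the window [0, 16] spans eight 2-second clips, matching the eight entries in relevant_clip_ids and the eight score triples in saliency_scores. The sketch below checks that relationship for this one record; windows_to_clip_ids is a hypothetical helper, and it assumes window boundaries are multiples of clip_len.

```python
def windows_to_clip_ids(relevant_windows, clip_len=2):
    """Expand [start, end] windows (in seconds) into clip indices.

    Hypothetical helper; assumes start and end are multiples of clip_len.
    """
    clip_ids = []
    for start, end in relevant_windows:
        # Clip i covers seconds [i * clip_len, (i + 1) * clip_len).
        clip_ids.extend(range(int(start // clip_len), int(end // clip_len)))
    return clip_ids


# Fields copied from the annotation quoted above.
annotation = {
    "duration": 126,
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7],
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2],
                        [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]],
}

clip_ids = windows_to_clip_ids(annotation["relevant_windows"])
# Pair each relevant clip with its three saliency scores.
scores_by_clip = dict(zip(annotation["relevant_clip_ids"],
                          annotation["saliency_scores"]))
# Number of 2-second clips implied by the annotated duration.
num_clips = annotation["duration"] // 2
```

Here clip_ids reproduces relevant_clip_ids exactly, and a 126-second duration corresponds to 63 clips under the same assumption.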
>> Predicted saliency scores (for all 2-sec clip):
[-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]
Hi, @jayleicn , how do you divide the dataset? Is it divided by sample id or video id?
Hello. First, congratulations, and thank you for sharing this work. It's really cool!
I had a question regarding feature extraction. The paper and the training script, train.sh, suggest that two sets of video features are used: SlowFast and CLIP.
I confirmed that the shared moment_detr_features.tar.gz file contains both the SlowFast and CLIP features as well.
However, in the inference script run.py, only the ClipFeatureExtractor is used. Do we not need SlowFast features during inference? Or am I missing something?
hi,
We think that Moment-DETR has great potential, but looking at Table 6 in the paper, we find that the moment retrieval metrics on the Charades-STA dataset are not much higher than those of IVG-DCL (in particular, IVG-DCL uses C3D features for the video extractor and GloVe for text embedding, whereas your work uses CLIP + SlowFast features). Have you tested on other video grounding datasets, such as ActivityNet?
The paper says that there are 3 categories of videos in the QVHighlights dataset. But, I am unable to find any category information in the annotation or elsewhere. Could you share the category information of the videos?
Is it possible to release the training code for Charades-STA (the code for loading the dataset), as stated in the paper? Thanks.
Good job. I have read the paper and the GitHub repository, but I still don't understand how the features under the features folder (clip_features, clip_sub_features, clip_text_features, slowfast_features, etc.) are extracted, or the details of the extraction. Could you describe this in detail if convenient?
Hi, Congrats on the amazing work. How to use CLIP to extract QVHIGHLIGHTS text features ? Can you provide the specific code?
Hi, I tried to run your code on my local machine, which has an RTX 3060 with 6 GB of RAM. I have attached a screenshot for your reference. For some reason, training gets stuck. Please point me in the direction I should look for a solution.
https://drive.google.com/file/d/1nNinH4WnGygF6CvhT5Zj7ExKGIKdhMss/view?usp=sharing
Hello.
Thanks for the great work.
Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition.
Thank you!!!
Btw, Sorry for bothering you.
Regards.
Hi, thanks a lot for the great work.
While working through it, I realised that all the videos are cropped starting at 60 seconds (i.e., the first segment of every video starts at 60 seconds).
Is there a reason why the videos are preprocessed this way? I couldn't find any mention of it in the paper.
Does this mean the model is not trained on the first 60 seconds of each video?
Thanks in advance. And sorry if I have just missed this point in the paper/repository.
Hi, congrats on the amazing work. I want to build a dataset similar to QVHighlights for my research area, and I have a few questions.
1. What annotation tools did you use? Could you share details of the annotation process?
2. How do you use CLIP to extract the QVHighlights text features? Can you provide the specific code?
I would like to see where the model focuses attention on video features for text queries.
How can I visualize the cross-attention heatmap?
Hi, thanks for your good work and released code!
I have a question regarding the feature extractor:
Which setting did you adopt for the QVHighlights SlowFast features, e.g., SLOWFAST_8x8_R50?
Thanks!
Kevin
Can you share the link to the videos with me?
Thank you for your excellent work! I noticed that in the eval_moment_retrieval function (https://github.com/jayleicn/moment_detr/blob/main/standalone_eval/eval.py#L136), there are four predefined time ranges corresponding to 'short', 'middle', 'long', and 'full'. How did you choose these four intervals? If I want to evaluate on my own dataset, would I need to modify the corresponding time ranges?
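To judge whether a set of length ranges suits another dataset, it can help to count how many ground-truth windows fall into each range. The sketch below is a standalone illustration, not the code in standalone_eval/eval.py: length_buckets is a hypothetical helper, the values in example_ranges are placeholders chosen for this example, and the "low < length <= high" boundary rule is an assumption.

```python
def length_buckets(windows, ranges):
    """Count windows whose length (end - start) falls in each named range.

    Hypothetical helper; a window may be counted in several ranges
    if the ranges overlap (as a catch-all 'full' range does).
    """
    counts = {name: 0 for name in ranges}
    for start, end in windows:
        length = end - start
        for name, (low, high) in ranges.items():
            # Assumed boundary rule: exclusive lower bound, inclusive upper bound.
            if low < length <= high:
                counts[name] += 1
    return counts


# Placeholder ranges in seconds, for illustration only.
example_ranges = {
    "short": (0, 10),
    "middle": (10, 30),
    "long": (30, 150),
    "full": (0, 150),
}

# Window lengths here are 16, 16, 4 and 100 seconds.
counts = length_buckets([[0, 16], [106, 122], [5, 9], [0, 100]], example_ranges)
# counts == {"short": 1, "middle": 2, "long": 1, "full": 4}
```

If most windows of a new dataset land in a single bucket, the per-range metrics carry little extra information, which would be one reason to adjust the ranges.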