jayleicn / moment_detr

[NeurIPS 2021] Moment-DETR code and QVHighlights dataset

Home Page: https://arxiv.org/abs/2107.09609

License: MIT License

Python 98.34% Shell 1.66%

pytorch video-retrieval

moment_detr's Issues

Video feature extraction

Hi, thanks for your excellent work! I found that the provided video features include both clip_features and slow_fast features. However, run_on_video/run.py only extracts the CLIP features. Is this a mistake? Also, could you please provide a version of run.py that extracts both CLIP and SlowFast features? Thank you.

Eval

Excuse me, CodaLab only allows 5 uploads. How can I evaluate the V+A results? My username is Lonicerin.

About pretrained weight

Hi, could you kindly share the best weights that you reported in the paper?

Request for the competition in codalab

Hello, I sent a request to register for your competition on CodaLab. Could you have a look at it? My username is Pupil_Ling. Thank you very much.

Original Videos?

Hi,

Where could we find the original videos? Thanks!

Request for QVHighlights Evaluation

Hi!
I have recently submitted a registration request for the QVHighlights Codalab competition using the username 'icefree'.
I would greatly appreciate it if you could review my application at your earliest convenience.

Request of the competition in codalab

Hi, I sent a request to join your competition on CodaLab several days ago. Could you have a look at it? My username is Young_tz.
competition link

Experiments on Charades-STA dataset

Hi there, thanks for sharing your great work!

Following issues #11 and #8, I'm trying to train the model on Charades-STA.

However, I got the error message 'raise ValueError("Sample larger than population or is negative")'.

I think the model cannot sample negative clips because some ground-truth moments cover the entire video.

moment_detr/moment_detr/start_end_dataset.py

Lines 104 to 117 in 1e67364

    def get_saliency_labels_sub_as_query(self, gt_window, ctx_l, max_n=2):
        gt_st = int(gt_window[0] / self.clip_len)
        gt_ed = max(0, min(int(gt_window[1] / self.clip_len), ctx_l) - 1)
        if gt_st > gt_ed:
            gt_st = gt_ed

        if gt_st != gt_ed:
            pos_clip_indices = random.sample(range(gt_st, gt_ed+1), k=max_n)
        else:
            pos_clip_indices = [gt_st, gt_st]

        neg_pool = list(range(0, gt_st)) + list(range(gt_ed+1, ctx_l))
        neg_clip_indices = random.sample(neg_pool, k=max_n)
        return pos_clip_indices, neg_clip_indices

Thanks.
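
One way to avoid the ValueError is to cap the sample at the pool size and fall back to sampling with replacement when a pool is too small. The following is a minimal sketch under that assumption, not the repository's fix. The standalone function name `sample_saliency_clip_indices` and the choice to return an empty negative list when no negative clip exists are hypothetical.

```python
import random


def sample_saliency_clip_indices(gt_window, ctx_l, clip_len=2, max_n=2):
    """Sample positive/negative clip indices without raising on small pools."""
    gt_st = int(gt_window[0] / clip_len)
    gt_ed = max(0, min(int(gt_window[1] / clip_len), ctx_l) - 1)
    if gt_st > gt_ed:
        gt_st = gt_ed

    pos_pool = list(range(gt_st, gt_ed + 1))
    neg_pool = list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))

    def draw(pool):
        if not pool:
            # Hypothetical choice: signal "nothing to sample" with an empty list.
            return []
        if len(pool) >= max_n:
            return random.sample(pool, k=max_n)
        # Pool smaller than max_n: sample with replacement instead of failing.
        return random.choices(pool, k=max_n)

    return draw(pos_pool), draw(neg_pool)
```

With this variant the caller has to handle an empty negative list, for example by skipping the saliency term for that sample.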

Maximum video length

First of all, thank you for your contribution to the community.
Can you explain why Moment-DETR only supports videos up to 150s? When I comment out line 41 of run_on_video/run.py, it still works.

Question about the video encoder ViT

Hi, thanks for your great work! I have a question: how do you fuse the image features from a 2-second clip into a single clip-level video feature, given that ViT is a feature extraction model for images, not videos?

CodaLab Submission Error

Hi, I recently generated the test and validation results and submitted them to CodaLab with the following structure.

--Submit.zip
----hl_val_submission.jsonl
----hl_test_submission.jsonl

CodaLab gave me the following error: IOError: [Errno 2] No such file or directory: '/tmp/codalab/tmphfqu8Q/run/input/res/hl_test_submission.jsonl'

How can I solve this problem?
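
The path in the error suggests the evaluator looks for the .jsonl files directly under its input directory, so one thing worth checking is whether the archive contains an extra top-level folder. The sketch below, which assumes a flat layout is what the server expects, writes both files at the root of the zip with the standard-library `zipfile` module. `make_submission_zip` is a hypothetical helper name.

```python
import os
import zipfile


def make_submission_zip(jsonl_paths, zip_path):
    """Write each file at the root of the archive (no enclosing folder)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in jsonl_paths:
            # arcname drops any directory component from the stored name.
            zf.write(path, arcname=os.path.basename(path))
    # Re-open the archive and return the stored names so they can be inspected.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

The returned list should contain only the two bare file names, with no directory prefix.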

Request for approval of the competition

Hi, thanks for your great work. I submitted my request many days ago. Could you approve it? Sorry for any inconvenience! My username is "nameless".

About more submissions in CodaLab

Hello.

Thanks for the great work.
Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition (username: Jin_Yang). However, the maximum number of submissions has been reached. Could you help me obtain more submissions?

Thank you!!!
Btw, Sorry for bothering you.

Regards.

diff between given model and trained model

I really appreciate your work. I have a few questions.
I trained the model following the given instructions (bash moment_detr/scripts/train.sh), and I also ran predictions on my own videos using the checkpoint you provided (PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py). Both worked.
But how can I run predictions on my own videos using a model I trained myself? (The training dataset is still yours.)
Simply changing the model path in run_example to the model I trained does not work.
So my questions are: what is the difference between the model I trained and the model you provide? How can I run predictions with my own trained model, and what command should I use to train it?

Request for the competition in codalab

Hi, I sent a request to join your competition on CodaLab. Could you have a look at it? My username is old_tz.
competition link

Approval on the competition

Thanks for the impressive work.
I'm working on the video grounding task and want to measure the score on your test set.
I hope I can get approval for your competition. (username: jinhyunj)
Thank you!

[Request for the approval in competition] Can you approve the request? Thanks.

Hello.

Thanks for the great work.
Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition. (user name:Jin_Yang)

Thank you!!!
Btw, Sorry for bothering you.

Regards.

CLIP or HERO feature extraction

Hi,

I am a little confused about feature extraction.
If I understand correctly, there are two kinds of features: OpenAI CLIP and HERO_VIDEO_FEATURE_EXTRACTOR.
I would like to know the difference between the two, and the purpose of CLIP.
Also, I ran HERO_VIDEO_FEATURE_EXTRACTOR and ended up with 4 files:

clip-vit_feature
mil-nce_feature
resnet_feature
slowfast_feature
In the features folder of this repo there are also 4 files:
clip_feature
clip_sub_feature
clip_text_feature
slowfast_feature
Can you tell me which files correspond to each other across these two lists? (I presume slowfast_feature is the obvious first match.)

Thank you

[Request for the approval in competition] Can you approve the request? Thanks.

Hello.

Thanks for the great work. Motivated by the work and the interesting topic, my team sincerely hopes to get approved to be in the competition. (username: IAIR) Thank you!!!

Btw, Sorry for bothering you.
Regards.

Did the baselines use the same (CLIP + SlowFast) feature extractors as Moment-DETR?

Hi @jayleicn, many thanks for sharing this great work! I was wondering whether your baseline models (e.g., MCN, XML, XML+) in Table 3 used the same feature extractors as Moment-DETR? Thanks!

About experiments on CharadesSTA dataset

Hi, I noticed that you also conducted experiments on the Charades-STA dataset. How did you prepare the video features for it? Could you share the feature files you prepared?

Request for the QVHighlights competition in codalab

Hello,
I recently submitted a registration request for the QVHighlights Codalab competition under the username 'ez615'.
Could you please review my application at your earliest convenience?

About File missing in run_on_video

Thank you for your wonderful work!
However, when I tried to run your demo in folder run_on_video, the file bpe_simple_vocab_16e6.txt.gz for the tokenizer is missing.
Can you provide this file?

FileNotFoundError: [Errno 2] No such file or directory: 'moment_detr/run_on_video/clip/bpe_simple_vocab_16e6.txt.gz'

About eval loss.

I tried to replicate the experiment with the default settings, and the results I got were similar to those reported in the paper. However, the eval loss increased as training progressed. I am confused about why this happens.

a question about training on the charades-sta dataset

Hi authors, great work. I want to train the model on the Charades-STA dataset, and I found the 'opt.json' hyper-parameter file you provided in #11. In that configuration file the parameter 'clip_len' is set to 2. What does it mean?

What does the parameter "clip_len" mean?

What does the parameter "clip_len" mean? What determines it? Does it depend on how we extract video features?
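
Judging from how `clip_len` is used in `get_saliency_labels_sub_as_query` and from the annotation sample quoted elsewhere on this page (a `relevant_windows` entry of `[0, 16]` paired with `relevant_clip_ids` 0 to 7), it appears to be the duration in seconds of one feature clip. A minimal sketch under that reading follows; `window_to_clip_ids` is a hypothetical helper, not a function from the repository.

```python
def window_to_clip_ids(window, clip_len=2):
    """Map a [start, end] window in seconds to the indices of the clips it covers."""
    start, end = window
    first = int(start // clip_len)
    last = int(-(-end // clip_len))  # ceiling division, exclusive upper bound
    return list(range(first, last))


print(window_to_clip_ids([0, 16]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this reading, `clip_len` would have to match the temporal stride used when the video features were extracted.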

CLIP Features

Are the CLIP text features extracted per word? Do the features for each word in the query also encode its relationships with the other words? Thank you.

The meaning of "tef"

Hi, I have a question about the "tef" in the visual features:

if self.use_tef:
    tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
    tef_ed = tef_st + 1.0 / ctx_l
    tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
    if self.use_video:
        model_inputs["video_feat"] = torch.cat(
            [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
    else:
        model_inputs["video_feat"] = tef

What does "tef" mean in the visual feature? Thanks in advance.
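
Reading the snippet, `tef` holds, for each of the `ctx_l` clips, its normalized start and end position in the video, and these two numbers are appended to the clip's feature vector. A plain-Python sketch of the same arithmetic, without torch and for illustration only (`temporal_endpoints` is a hypothetical name):

```python
def temporal_endpoints(ctx_l):
    """Return [start, end] in [0, 1] for each of ctx_l clips."""
    return [[i / ctx_l, (i + 1) / ctx_l] for i in range(ctx_l)]


print(temporal_endpoints(4))
# [[0.0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1.0]]
```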

A question about saliency loss

What happens if neg_pool in get_saliency_labels_sub_as_query is empty because the ground-truth moment of a sample in a video retrieval task spans the whole video, from start to end?

Meaning of GT saliency scores

Thank you for your great work and open-source code.

I have a question about the GT saliency scores (only localized 2-sec clips). Can you please explain them briefly?
Also, how do the predicted saliency scores (for all 2-sec clips) correspond to the GT saliency scores?

Thanks!

Best,
Kevin

Build models...
Loading feature extractors...
Loading CLIP models
Loading trained Moment-DETR model...
Run prediction...
------------------------------idx0
>> query: Chef makes pizza and cuts it up.
>> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
>> GT moments: [[106, 122]]
>> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
    [49.967, 64.9129, 0.9421], 
    [66.4396, 81.0731, 0.9271], 
    [105.9434, 122.0372, 0.9234], 
    [93.2057, 103.3713, 0.2222], 
    ..., 
    [45.3834, 52.2183, 0.0005]
   ]
>> GT saliency scores (only localized 2-sec clips):  # what it means?
    [[2, 3, 3], [2, 3, 3], ...]
>> Predicted saliency scores (for all 2-sec clip):  # how this correspond to the GT saliency scores?
    [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]
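
The log labels suggest that each predicted saliency score belongs to one consecutive 2-second clip, while the ground-truth triples exist only for clips inside the annotated moment. Assuming score `i` covers seconds `[2*i, 2*i + 2)`, the sketch below lists the highest-scoring clips with their time spans. Only the relative order of the scores is used, so negative values are not a problem here. `top_salient_clips` is a hypothetical helper.

```python
def top_salient_clips(scores, clip_len=2, k=3):
    """Return (start_sec, end_sec, score) for the k highest-scoring clips."""
    # Rank clip indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i * clip_len, (i + 1) * clip_len, scores[i]) for i in ranked[:k]]
```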

Codalab participation request

Hello,

I had sent a request to participate in the Codalab evaluation server (username: noga), on 29.04. Could you please approve the request?

Best,
Noga

Can't Run without GPU

Traceback (most recent call last):
  File "moment_detr/train.py", line 255, in <module>
    best_ckpt_path, eval_split_name, eval_path, debug = start_training()
  File "moment_detr/train.py", line 246, in start_training
    model, criterion, optimizer, lr_scheduler = setup_model(opt)
  File "/home/ciivam/IE643_PROJECT/moment_detr/moment_detr/inference.py", line 195, in setup_model
    model, criterion = build_model(opt)
  File "/home/ciivam/IE643_PROJECT/moment_detr/moment_detr/model.py", line 445, in build_model
    criterion.to(device)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 688, in _apply
    self._buffers[key] = fn(buf)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

I am getting this while running bash moment_detr/scripts/train.sh. Can't we run it without a GPU? I know it would take a lot of time.
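
The traceback shows the failure comes from moving the criterion to a CUDA device on a machine without an NVIDIA driver. A common PyTorch pattern is to fall back to the CPU when CUDA is unavailable, and the sketch below shows only that check. Whether the training script exposes a device option is not stated in this thread, so how to pass the result into train.py is left open, and `pick_device_name` is a hypothetical helper.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def pick_device_name():
    """Return "cuda" only when PyTorch is installed and sees a usable GPU."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device_name())
```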

Request of the competition in codalab

Hi, I sent a request to join your competition on CodaLab several days ago. Could you have a look at it?
competition link

About the annotations

Hi @jayleicn, thanks for your great work! I noticed that in the annotation files, as shown below, the duration of a video (126s) does not match the duration implied by its ID (810s - 660s = 150s). Should I crop the original video to 126s before processing in this case?

{
    "qid": 8737, 
    "query": "A family is playing basketball together on a green court outside.", 
    "duration": 126, 
    "vid": "bP5KfdFJzC4_660.0_810.0", 
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7], 
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}
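
Before deciding how to crop, it may help to check which records are internally consistent. The sketch below assumes the field names shown in the record above and assumes that the `vid` suffix encodes the cut points in the source video. `check_annotation` is a hypothetical helper and the checks are illustrative, not the dataset's official validation.

```python
def check_annotation(ann):
    """Return a list of human-readable problems found in one annotation record."""
    problems = []
    if len(ann["relevant_clip_ids"]) != len(ann["saliency_scores"]):
        problems.append("relevant_clip_ids and saliency_scores differ in length")
    for start, end in ann["relevant_windows"]:
        if not 0 <= start < end <= ann["duration"]:
            problems.append(f"window [{start}, {end}] lies outside the duration")
    # Assumption: vid has the form <video id>_<start sec>_<end sec>.
    _, cut_start, cut_end = ann["vid"].rsplit("_", 2)
    cut_len = float(cut_end) - float(cut_start)
    if cut_len != ann["duration"]:
        problems.append(f"vid spans {cut_len:g}s but duration is {ann['duration']}s")
    return problems
```

For the record above, the only problem this reports is the 150s versus 126s mismatch.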

predicted saliency scores

How are the predicted saliency scores (for all 2-sec clips) calculated?

>> Predicted saliency scores (for all 2-sec clip): 
    [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]

Are they the average of the three annotators' scores? And why are the predicted saliency scores (for all 2-sec clips) negative?

how do you divide the dataset?

Hi @jayleicn, how do you divide the dataset? Is it divided by sample ID or by video ID?

Question Regarding Feature Extraction Discrepancy Between Training & Inference

Hello. Firstly, congratulations, and thank you for sharing this work. It's really cool!

I have a question regarding feature extraction. The paper and the training script train.sh suggest that two sets of video features are used: SlowFast and CLIP.
I confirmed that the shared moment_detr_features.tar.gz file has both the SlowFast & CLIP features available as well.

However, in the inference script run.py, only the ClipFeatureExtractor is used. Do we not need SlowFast features during inference? Or am I missing something?

About paper

Hi,
We think that Moment-DETR has great potential, but looking at Table 6 in the paper, we find that the moment retrieval metrics on the Charades-STA dataset are not much higher than those of IVG-DCL. Notably, IVG-DCL uses C3D features for video and GloVe for text embedding, whereas your work uses CLIP + SlowFast features. Have you tested on other video grounding datasets, such as ActivityNet?

category information

The paper says that there are 3 categories of videos in the QVHighlights dataset. But, I am unable to find any category information in the annotation or elsewhere. Could you share the category information of the videos?

Training on Charades-STA

Is it possible to release the training code for Charades-STA (the code for loading the dataset), as stated in the paper? Thanks.

About dataset?

Good job. I have read the paper and the GitHub repository, but I still don't understand how the features under the features folder (clip_features, clip_sub_features, clip_text_features, slowfast_features, etc.) are extracted, or what the extracted features contain. Could you describe this in detail if convenient?

Text feature extraction

Hi, congrats on the amazing work. How do I use CLIP to extract the QVHighlights text features? Could you provide the specific code?

Training process getting stuck

Hi, I tried to run your code on my local machine, which has an RTX 3060 with 6GB of memory. I have attached a screenshot for your reference. For some reason, training gets stuck. Please advise where I should look for a solution.

https://drive.google.com/file/d/1nNinH4WnGygF6CvhT5Zj7ExKGIKdhMss/view?usp=sharing

[Request for the approval in competition] Hello. can you approve the request?

Hello.

Thanks for the great work.
Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition.

Thank you!!!
Btw, Sorry for bothering you.

Regards.

What happened to the first 60 seconds of the video?

Hi, thanks a lot for the decent work.

As I was working through your code, I realised that all the videos are cropped starting at 60 seconds (i.e., the first segment of every video starts at 60 seconds).
Is there any reason why the video formats are preprocessed this way? Because I couldn't find any mention in the paper.

Does this mean the model is not trained with the first 60 seconds of each video?

Thanks in advance. And sorry if I have just missed this point in the paper/repository.

How do I make my own dataset?

Hi, congrats on the amazing work. I want to build a dataset similar to QVHighlights for my own research area, and I have some questions:
1. What annotation tools did you use? Could you share details of the annotation process?
2. How do I use CLIP to extract the QVHighlights text features? Could you provide the specific code?

how to visualize the cross-attention map

I would like to see where the model focuses attention on video features for text queries.
How can I visualize the cross-attention heatmap?

Slowfast config setting

Hi, thanks for your good work and released code!

I have a question regarding the feature extractor:
which setting did you adopt for the QVHighlight slowfast feature? e.g., SLOWFAST_8x8_R50.

Thanks!

Kevin

Can you share the raw videos of QVHighlights?

Can you share the link to the videos with me?

Question about `eval_moment_retrieval` function

Thank you for your excellent work! I noticed that the eval_moment_retrieval function https://github.com/jayleicn/moment_detr/blob/main/standalone_eval/eval.py#L136 uses four predefined time ranges corresponding to 'short', 'middle', 'long', and 'full'. How did you choose these four intervals? If I want to evaluate on my own dataset, would I need to modify the time ranges?
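
If the evaluation buckets ground-truth moments by their length, adapting it to another dataset would mostly mean choosing ranges that fit that dataset's moment-length distribution. The sketch below illustrates such bucketing. The range values, the half-open boundary convention, and the name `bucket_windows` are assumptions made for illustration, not values copied from eval.py.

```python
LENGTH_RANGES = {  # seconds; placeholder values, tune them to your own dataset
    "short": (0, 10),
    "middle": (10, 30),
    "long": (30, 150),
    "full": (0, 150),
}


def bucket_windows(windows, ranges=LENGTH_RANGES):
    """Group [start, end] windows by length; a window may fall in several buckets."""
    buckets = {name: [] for name in ranges}
    for start, end in windows:
        length = end - start
        for name, (lo, hi) in ranges.items():
            if lo < length <= hi:  # assumed convention: (lo, hi]
                buckets[name].append([start, end])
    return buckets
```

Metrics can then be computed separately on each bucket, so that a dataset with mostly short moments is not judged only on its few long ones.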
