
jayleicn / moment_detr


[NeurIPS 2021] Moment-DETR code and QVHighlights dataset

Home Page: https://arxiv.org/abs/2107.09609

License: MIT License

Languages: Python 98.34%, Shell 1.66%
Topics: pytorch, video-retrieval

moment_detr's People

Contributors

jayleicn



moment_detr's Issues

Video feature extraction

Hi, thanks for your excellent work! I found that the provided video features include both clip_features and slow_fast features. However, run_on_video/run.py only extracts the CLIP features. Is there a mistake here? Also, could you please provide a version of run.py that extracts both CLIP and SlowFast features? Thank you.

Eval

Excuse me, CodaLab only allows 5 submissions. How can I evaluate the results of V+A? My username is Lonicerin.

Request for QVHighlights Evaluation

Hi!
I have recently submitted a registration request for the QVHighlights Codalab competition using the username 'icefree'.
I would greatly appreciate it if you could review my application at your earliest convenience.

Experiments on Charades-STA dataset

Hi there, thanks for sharing your great work!

Following issues #11 and #8, I'm trying to train the model on Charades-STA.

However, I got the error 'ValueError: Sample larger than population or is negative'.

I think the model cannot sample negative clips because some ground-truth moments cover the entire video.

def get_saliency_labels_sub_as_query(self, gt_window, ctx_l, max_n=2):
    gt_st = int(gt_window[0] / self.clip_len)
    gt_ed = max(0, min(int(gt_window[1] / self.clip_len), ctx_l) - 1)
    if gt_st > gt_ed:
        gt_st = gt_ed
    if gt_st != gt_ed:
        pos_clip_indices = random.sample(range(gt_st, gt_ed + 1), k=max_n)
    else:
        pos_clip_indices = [gt_st, gt_st]
    neg_pool = list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))
    neg_clip_indices = random.sample(neg_pool, k=max_n)
    return pos_clip_indices, neg_clip_indices

Thanks.
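
A minimal sketch of one possible workaround, not the repository's official fix: the function name, the positive-repetition behaviour, and the -1 dummy negatives below are illustrative, and the caller would need to mask the dummies out of the saliency loss.

import random

def sample_saliency_clips(gt_st, gt_ed, ctx_l, max_n=2):
    # Positive pool: clips inside the ground-truth window; negative pool: the rest.
    pos_pool = list(range(gt_st, gt_ed + 1))
    neg_pool = list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))

    # Repeat positives when the ground-truth span is shorter than max_n.
    if len(pos_pool) >= max_n:
        pos_clip_indices = random.sample(pos_pool, k=max_n)
    else:
        pos_clip_indices = [random.choice(pos_pool) for _ in range(max_n)]

    # Fall back to dummy indices when too few negatives exist, e.g. when the
    # ground-truth moment covers the whole video.
    if len(neg_pool) >= max_n:
        neg_clip_indices = random.sample(neg_pool, k=max_n)
    else:
        neg_clip_indices = [-1] * max_n

    return pos_clip_indices, neg_clip_indices

print(sample_saliency_clips(gt_st=0, gt_ed=74, ctx_l=75))  # whole-video moment -> dummy negatives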

Maximum video length

First of all, thank you for your contribution to the community.
Can you explain why Moment-DETR only supports videos up to 150s? When I comment out line 41 of run_on_video/run.py, it still works.

Question about the video encoder ViT

Hi, thanks for your great work! I have a question: how do you fuse the image features from a 2-second clip into a single clip-level video feature, since ViT is a feature extractor for images, not videos?
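
For context, one common way to get a clip-level feature out of an image model is to encode sampled frames with CLIP's ViT image encoder and mean-pool them. This is a sketch only; the repo's extractor may use a different frame rate and pooling.

import torch
import clip                      # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_segment(frame_paths):
    # Encode every sampled frame of a 2-second segment with the ViT image
    # encoder, then mean-pool the frame features into one clip-level feature.
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(images)   # (n_frames, 512) for ViT-B/32
    return frame_feats.mean(dim=0)                 # (512,) clip-level feature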

CodaLab Submission Error

Hi, I recently generated the test and validation results and submitted them to CodaLab with the following structure.

--Submit.zip
----hl_val_submission.jsonl
----hl_test_submission.jsonl

CodaLab gave me the error: IOError: [Errno 2] No such file or directory: '/tmp/codalab/tmphfqu8Q/run/input/res/hl_test_submission.jsonl'

How can I solve this problem?
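
For reference, a frequent cause of this kind of CodaLab error is an archive that contains a top-level folder instead of the .jsonl files at its root. A minimal sketch, assuming the file names above, that zips the files flat:

import zipfile

# Place the prediction files at the root of the archive (no parent folder),
# so the scoring program can find them directly.
files = ["hl_val_submission.jsonl", "hl_test_submission.jsonl"]
with zipfile.ZipFile("submit.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in files:
        zf.write(f, arcname=f)   # arcname drops any directory prefix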

Request for approval of the competition

Hi, thanks for your great work. I submitted my request many days ago; could you approve it? Sorry for any inconvenience! My username is "nameless".

About more submissions on CodaLab.

Hello.

Thanks for the great work.
Motivated by this work and the interesting topic, we sincerely hope to be approved for the competition (username: Jin_Yang). However, the maximum number of submissions has been reached. Can you help me obtain more submissions?

Thank you!!!
By the way, sorry for bothering you.

Regards.

diff between given model and trained model

I really appreciate your work; I have a small question.
I trained the model following the given instructions (bash moment_detr/scripts/train.sh), and I also ran predictions on my own videos using the checkpoint you provide (PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py); both worked.
But how can I run predictions on my own videos using a model I trained myself (still on your training dataset)?
I found that simply changing the model path in run_example to my trained model does not work.
So my questions are: what is the difference between the model I trained and the model you provide? How can I run predictions with my own trained model, and how should I set the command to train such a model?

Approval on the competition

Thanks for the impressive work.
I'm working on the video grounding task and want to measure the score on your test set.
I hope I can get approval for your competition (username: jinhyunj).
Thank you!

CLIP or HERO feature extraction

Hi,

I am a little confused about feature extraction.
If I understand correctly, there are two kinds of features: OpenAI CLIP and HERO_VIDEO_FEATURE_EXTRACTOR.
I wanted to know the difference between those two, and the purpose of CLIP.
Also, I have run HERO_VIDEO_FEATURE_EXTRACTOR and I am left with 4 files:

  • clip-vit_feature
  • mil-nce_feature
  • resnet_feature
  • slowfast_feature

In this repo, the features folder also contains 4 files:

  • clip_feature
  • clip_sub_feature
  • clip_text_feature
  • slowfast_feature

Can you tell me which files match between these two lists? (Of course, slowfast_feature is the first obvious match, I presume.)

Thank you

About experiments on CharadesSTA dataset

Hi, I noticed that you also conduct experiments on the Charades-STA dataset. I'm wondering how you prepared the video features for Charades-STA. Could you share the feature files you prepared?

About File missing in run_on_video

Thank you for your wonderful work!
However, when I tried to run your demo in the run_on_video folder, the tokenizer file bpe_simple_vocab_16e6.txt.gz was missing.
Could you provide this file?

FileNotFoundError: [Errno 2] No such file or directory: 'moment_detr/run_on_video/clip/bpe_simple_vocab_16e6.txt.gz'

About eval loss.

I tried to replicate the experiment with the default settings, and the results I got were similar to those reported in the paper. However, the eval loss increased as training went on. I am confused about why this happens.

a question about training on the charades-sta dataset

Hi authors, great work! I want to train the model on the Charades-STA dataset, and I found the 'opt.json' file with hyper-parameters that you provided in #11. In that config file you set the parameter 'clip_len' to 2; what does this mean?

CLIP Features

Are the text features extracted with CLIP based on individual words, and do the features capture the connections between each word in the query and the other words? Thank you.

The meaning of "tef"

Hi, I have a question about the "tef" in the vision feature:

if self.use_tef:
    tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
    tef_ed = tef_st + 1.0 / ctx_l
    tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
    if self.use_video:
        model_inputs["video_feat"] = torch.cat(
            [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
    else:
        model_inputs["video_feat"] = tef

What does "tef" mean in the visual feature? Thanks in advance.
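
For reference, "tef" is commonly read as "temporal endpoint feature": each clip gets its normalized (start, end) position within the video appended to its visual feature. A small standalone sketch of what the snippet above computes:

import torch

ctx_l = 4                                       # e.g. a video made of 4 clips
tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l    # tensor([0.00, 0.25, 0.50, 0.75])
tef_ed = tef_st + 1.0 / ctx_l                   # tensor([0.25, 0.50, 0.75, 1.00])
tef = torch.stack([tef_st, tef_ed], dim=1)      # (ctx_l, 2)
print(tef)
# Each row is the normalized (start, end) of one clip within the video,
# concatenated to that clip's visual feature as two extra dimensions.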

A question about saliency loss

What if neg_pool in get_saliency_labels_sub_as_query is empty because the ground-truth moment of a video sample (in a video retrieval task) spans the video from start to end?

Meaning of GT saliency scores

Thank you for your great work and open-source code.

I have a question about the GT saliency scores (only localized 2-sec clips); could you briefly explain what they are?
Also, how do the Predicted saliency scores (for all 2-sec clip) correspond to them?

Thanks!

Best,
Kevin

Build models...
Loading feature extractors...
Loading CLIP models
Loading trained Moment-DETR model...
Run prediction...
------------------------------idx0
>> query: Chef makes pizza and cuts it up.
>> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
>> GT moments: [[106, 122]]
>> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
    [49.967, 64.9129, 0.9421], 
    [66.4396, 81.0731, 0.9271], 
    [105.9434, 122.0372, 0.9234], 
    [93.2057, 103.3713, 0.2222], 
    ..., 
    [45.3834, 52.2183, 0.0005]
   ]
>> GT saliency scores (only localized 2-sec clips):  # what does this mean?
    [[2, 3, 3], [2, 3, 3], ...]
>> Predicted saliency scores (for all 2-sec clip):  # how does this correspond to the GT saliency scores?
    [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]  

Codalab participation request

Hello,

I sent a request to participate in the CodaLab evaluation server (username: noga) on 29.04. Could you please approve the request?

Best,
Noga

Can't Run without GPU

Traceback (most recent call last):
  File "moment_detr/train.py", line 255, in <module>
    best_ckpt_path, eval_split_name, eval_path, debug = start_training()
  File "moment_detr/train.py", line 246, in start_training
    model, criterion, optimizer, lr_scheduler = setup_model(opt)
  File "/home/ciivam/IE643_PROJECT/moment_detr/moment_detr/inference.py", line 195, in setup_model
    model, criterion = build_model(opt)
  File "/home/ciivam/IE643_PROJECT/moment_detr/moment_detr/model.py", line 445, in build_model
    criterion.to(device)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 688, in _apply
    self._buffers[key] = fn(buf)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/ciivam/anaconda3/envs/moment_detr/lib/python3.7/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

I get this while running bash moment_detr/scripts/train.sh. Can't we run it without a GPU? I know it would take a lot of time.
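
For reference, the crash happens because the model and criterion are moved to a CUDA device even when no driver is present. A minimal sketch of the usual CPU fallback pattern; the stand-in model and criterion below are illustrative, not the repo's setup_model():

import torch
import torch.nn as nn

# Illustrative only: stand-ins for the real Moment-DETR model and criterion.
# The point is to select CUDA only when a driver is actually available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss().to(device)

x = torch.randn(4, 10, device=device)
targets = torch.zeros(4, dtype=torch.long, device=device)
print(criterion(model(x), targets))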

About the annotations

Hi @jayleicn, thanks for your great work! I notice that in the annotation files, as shown below, the duration of a video (126s) does not match the actual duration (810s - 660s = 150s). Should I crop the original video to 126s before processing in this case?

{
    "qid": 8737, 
    "query": "A family is playing basketball together on a green court outside.", 
    "duration": 126, 
    "vid": "bP5KfdFJzC4_660.0_810.0", 
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7], 
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}

predicted saliency scores

  1. How are the Predicted saliency scores (for all 2-sec clip) calculated?
>> Predicted saliency scores (for all 2-sec clip): 
    [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]  
  2. Are they the average of the three annotators' scores? And why are the predicted saliency scores (for all 2-sec clip) negative?

Question Regarding Feature Extraction Discrepancy Between Training & Inference

Hello. Firstly, congratulations and thank you for sharing this work; it's really cool!

I have a question regarding feature extraction. The paper and the training script train.sh suggest that two sets of video features are used -- SlowFast and CLIP.
I confirmed that the shared moment_detr_features.tar.gz file has both the SlowFast & CLIP features available as well.

However, in the inference script run.py, only the ClipFeatureExtractor is used. Do we not need SlowFast features during inference? Or am I missing something?

About paper

Hi,
We think that Moment-DETR has great potential, but looking at Table 6 in the paper, we find that the moment retrieval metrics on the Charades-STA dataset are not much higher than those of IVG-DCL (notably, IVG-DCL uses C3D features for the video extractor and GloVe for text embedding, while your work uses CLIP + SlowFast features). Have you tested on other video grounding datasets, such as ActivityNet?

category information

The paper says that there are 3 categories of videos in the QVHighlights dataset. However, I am unable to find any category information in the annotations or elsewhere. Could you share the category information of the videos?

Training on Charades-STA

Is it possible to release the training code for Charades-STA (the code for loading the dataset), as stated in the paper? Thanks.

About dataset?

Good job. I have read the paper and the GitHub repository, but I still don't understand how the features under the features folder (clip_features, clip_sub_features, clip_text_features, slowfast_features, etc.) are extracted, or what the details of the extracted features are. Could you describe this in detail if it is convenient?

Text feature extraction

Hi, congrats on the amazing work. How do you use CLIP to extract the QVHighlights text features? Can you provide the specific code?
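
For reference, a minimal sketch of text-feature extraction with the OpenAI CLIP package. Illustrative only: the repo may use token-level CLIP text features rather than the pooled sentence embedding shown here.

import torch
import clip   # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

queries = ["Chef makes pizza and cuts it up."]
tokens = clip.tokenize(queries).to(device)    # (1, 77) token ids

with torch.no_grad():
    text_feats = model.encode_text(tokens)    # (1, 512) pooled sentence feature
print(text_feats.shape)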

What happened to the first 60 seconds of the video?

Hi, thanks a lot for the decent work.

As I am working through your work, I just realised that all videos are cropped starting at 60 seconds (i.e., the first segment of every video starts at 60 seconds).
Is there any reason why the videos are preprocessed this way? I couldn't find any mention of it in the paper.

Does this mean the model is not trained with the first 60 seconds of each video?

Thanks in advance. And sorry if I have just missed this point in the paper/repository.

How do I make my dataset?

Hi, congrats on the amazing work. I want to build a dataset similar to QVHighlights for my research direction, and I have a few questions:
1. What annotation tools do you use, and what are the details of the annotation process?
2. How do you use CLIP to extract the QVHighlights text features? Can you provide the specific code?

Slowfast config setting

Hi, thanks for your good work and released code!

I have a question regarding the feature extractor:
Which setting did you adopt for the QVHighlights SlowFast features (e.g., SLOWFAST_8x8_R50)?

Thanks!

Kevin
