rowanz / merlot_reserve Goto Github PK

View Code? Open in Web Editor NEW

135.0 135.0 32.0 4.67 MB

Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

License: MIT License

Python 98.28% Shell 0.51% Dockerfile 1.20%

merlot_reserve's People

Contributors

Stargazers

Watchers

merlot_reserve's Issues

Video Segment Count in Demo

Hi Rowan,

In the code, you state the code only supports at most 8 segments. I would like to learn how you handle these segment counts in your demo at https://merlot.apps.allenai.org/

Thank you for your time and attention,

Mustafa

saving intermediate tensors in jitted function

Hi Rowan,

I intend to save intermediate tensors (e.g. the embedding of Layer 11 of the joint transformer) when fine-tuning the tvqa dataset, so I can understand how the internal representations change during the time. However, I cannot save the concrete values of the layers' representations because they are encoded as Traced in jitted function (I got an error like The NumPy.ndarray conversion method array() was called on the JAX Tracer object).

I was wondering if you have already found a good solution to save them when you designed your codes. Thank you!

Best,
Dota

Zero-shot classification without audio

Thank you for the great repository.

How can we however run the model in zero-shot setup without audio?
Concretly, function model.embed_video in demo_video.py requires argument audio_clips.

What can we do in order to not use the audio?
Thank you!

Best,
Tomas

hi,question about frozen embedding

Hi，

Thank you for your excellent work! I really want to ask you a question: now I want to only use your model to encode the video frame and the corresponding dialogue segment of the video, and then design the model by myself. Do I just copy the SPAN_encr and vision_enc model codes and download the checkpoint? :)

Best,
Jun

Questions about input of TVQA?

Hi, I have noticed sample of TVQA inputs "1 to 28 What is Janice Holding on to after Chandler sends Joey to his room? Chandler's tie. MASK[subtitles or audio]".

Does this mean that the input would be "TIME STAMP" + "QUESTION" + "ANSWER" + "MASK TOKEN" + "SUB or AUDIO"?
Besides, I cannot fully understand why we need a mask token here?

Thanks in advance.

Code for Action Recognition

Hi, is there any plan to release the code for K600 classification?

Can't access Google Drive files.

I'm trying to get VCR data in google drive address: gs://merlotreserve/finetune_data/vcr/, and I find AccessDeniedException because I cannot access to your Google Cloud Storage bucket using my personal account. Is there any way to get permission to access the VCR data?

Pretraining loss being negative

Is it possible to get negative loss for each task during pretraining? Also can you share the pretraining log file (mostly the loss of each task, i.e., audio2text, audio_text_matching etc.)?

IndexError: list index out of range

Getting this error with demo_video.py, with the video downloaded from youtube-dl when trying to read in video with ID "pmjPjZZRhNQ.mp4". Using CUDA 11.6 with python 3.8 in the mreserve conda environment.

Relative Location of input for TVQA

Hi, I have a question about the relative location for TVQA.
`t_start = midpoint - segment_size * 0.5
t_end = midpoint + segment_size * 0.5

# Try to extend by 3 segments in either direction of the middle
times_used0 = [{'start_time': t_start, 'end_time': t_end}]
for i in range(6):
    for delta in [-segment_size, segment_size]:
        t0 = t_start + delta * (i+1)
        t1 = t_end + delta * (i+1)

        t0 = round(t0 * 3) / 3
        t1 = round(t1 * 3) / 3

        if t1 < 0:
            continue
        if t0 > max_time:
            continue
        if len(times_used0) < 7:
            times_used0.append({'start_time': t0, 'end_time': t1})
times_used0 = sorted(times_used0, key=lambda x: x['start_time'])

# Figure out the relative position of the annotation
my_duration = times_used0[-1]['end_time'] - times_used[0]['start_time']
rel_localized_tstart = (ts0 - times_used[0]['start_time']) / my_duration
rel_localized_tend = (ts1 - times_used[0]['start_time']) / my_duration
qa_item['rel_localization'] = (rel_localized_tstart, rel_localized_tend)`

For the above code, I suspect that the rel_localized_tstart could be greater than rel_localized_tend since the "midpoint - segment_size * 0.5" could less than zero?

Besides, does the rel_localized_tstart or rel_localized_tend can be a negative number?

Finetuning on GPUs

Hi!
I'm currently creating a dataset that I'd like to finetune this model on, but I don't have access to TPUs. I'm also not too familiar with Jax, so I was wondering if you roughly know what needs to be changed in the finetuning pipeline to be able to use GPUs.

Thanks for your work!
-Samuel

demo with custom videos

Could not automatically determine credentials.

Hi,

I tried to run the demo script but encountered the following error, it cannot download the model checkpoints.

(mreserve) yueyang1@nlpgpu01:/nlp/data/yueyang/merlot_reserve/demo> CUDA_VISIBLE_DEVICES=1 python demo_video.py
Traceback (most recent call last):
  File "demo_video.py", line 14, in <module>
    model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)
  File "/mnt/nlpgridio3/data/yueyang/merlot_reserve/demo/../mreserve/modeling.py", line 968, in from_pretrained
    storage_client = storage.Client()
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/storage/client.py", line 123, in __init__
    super(Client, self).__init__(
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 318, in __init__
    _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 266, in __init__
    project = self._determine_default(project)
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 285, in _determine_default
    return _determine_default_project(project)
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/_helpers.py", line 186, in _determine_default_project
    _, project = google.auth.default()
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/auth/_default.py", line 488, in default
    raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

Hope to get a solution, thank you!

Yue

Still cannot get TVQA data

Hi Rowan,

I also could not open this link for tvqa data "https://storage.googleapis.com/merlotreserve/finetune_data/tvqa" Could you provide more details on the access to your tvqa data used in the paper. Thank you!

Best,
Dota

How could I get reference "txt.jsonl.zst" and What role does the "random text" in pretraining steps ?

Hi,

Thanks for releasing your work.

I'm currently trying to run your data/process.py code with customed crawled video.

And everything works well except the text_iterator().

I thought it is because I couldn't make "txt.jsonl.zst" which is going to use as random_text for pretraining batch

So I was wondering if there is any reference code or sample data to make "text.jsonl.zst" for my own ?

If it isn't possible, could you be able to explain the role of "random_text" in pretraining step for understanding your work ?
(Because I couldn't understand to align the "random text" with MERLOT-Reserve pre-training objectives)

Thank you,
Haena

TVQA Dataset

Hi Rowan,

Thank you for this great resource! I'm trying to reproduce the finetuning results on TVQA. I can't seem to access the google storage link though, and it looks like the TVQA dataset only gives access to video frames. Would you mind letting me know where you got the audio frames, or if there's anything not included in this link (once I get access)?
https://tvqa.cs.unc.edu/download_tvqa.html

Best,
Alex

Failed to restore checkpoint

Hi, I installed the package following your guidance. However, when I ran the demo_video.py, it raised a ValueError: Unpack failed: incomplete input when doing state=checkpoints.restore_checkpoint(ckpt_dir_path, target_state, step=step, prefix='ckpt_', parallel=True) in line 125 of mreserve/checkpoint.py. What should I do?

My flax version is 0.3.4 and the large_resadapt checkpoint is auto-downloaded.

How can I download TVQA audio?

Thank you for sharing this code.

I am trying to finetune on TVQA.

It seems like that audio is not available on the TVQA homepage.

How can I download TVQA audio?

Do ASR transcripts have a cleaned version like YT-Temporal 180M ?

The ASR transcripts in YT-Temporal 180M have a cleaned version. The cleaned transcripts have punctuation and are much more fluent than the original ASR. Does YT-Temporal 1B has such transcripts?

Can MERLOT-reserve be applied to short videos?

Hi，

Thank you for your excellent work!

I have noticed that you mention the limitations of the model in your paper: “Our model only learns from 40-second long videos”. So I wonder if this model can be applied to short video clips (like 5 seconds)? Is it feasible to reduce the time interval (5s) and number of video segments (16)?

Best,
Fan

Release infill templates

Hello dear author,

Could you please release the infilled questions, i.e. the questions transformed to statements with <|MASK|> using GPT-3? I would be especially interested in the statements for MSRVTT-QA and TVQA.

It would be very helpful to release them, so other researchers don't have to run and pay GPT-3 for the same task again.

Thanks for consideration,

Simon

request for training data examples

Hello,

I am trying to process a dataset for training using data/process.py. Can you please share some example inputs? For example, what is the format of the youtube_dump/{video_id}/{video_id}.v2.info.json.gz file (in function load_video(), line 212)?

Thank you!

In case I missed it, may I ask where the script to download all the youtube video is? I just found the processing script in the data/ folder.

rowanz / merlot_reserve Goto Github PK

merlot_reserve's People

Contributors

Stargazers

Watchers

Forkers

merlot_reserve's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs