
rowanz / merlot_reserve


Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

License: MIT License

Languages: Python 98.28%, Shell 0.51%, Dockerfile 1.20%


merlot_reserve's Issues

Saving intermediate tensors in a jitted function

Hi Rowan,

I intend to save intermediate tensors (e.g., the embedding from layer 11 of the joint transformer) while fine-tuning on the TVQA dataset, so I can understand how the internal representations change over time. However, I cannot save the concrete values of the layers' representations because they are traced values inside the jitted function (I get an error like `The numpy.ndarray conversion method __array__() was called on the JAX Tracer object`).

I was wondering whether you already found a good solution for saving them when you designed the code. Thank you!

Best,
Dota
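
(For readers hitting the same error, here is a minimal sketch, not code from this repo, of the standard workaround: values inside a jitted function are abstract tracers and cannot be converted to NumPy arrays in place, but they can be returned as extra outputs, which materializes them as concrete arrays. In older JAX versions, jax.experimental.host_callback is another option.)

```python
# Minimal sketch (not repo code): capture an intermediate activation by
# returning it from the jitted function instead of converting it inside.
import jax
import jax.numpy as jnp

def layer(x):
    return jnp.tanh(x)  # stand-in for one transformer layer

@jax.jit
def forward(x):
    intermediates = {}
    for i in range(12):
        x = layer(x)
        if i == 10:
            intermediates['layer_11'] = x  # becomes a concrete array once returned
    return x, intermediates

out, saved = forward(jnp.ones((4, 8)))
print(saved['layer_11'].shape)  # (4, 8); safe to convert and save to disk now
```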

Hi, question about frozen embedding

Hi,

Thank you for your excellent work! I have a question: I want to use your model only to encode video frames and the corresponding dialogue segments of a video, and then design a model on top myself. Is it enough to copy the SPAN_encr and vision_enc model code and download the checkpoint? :)

Best,
Jun
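
(A hedged sketch of the simpler path, grounded only in the from_pretrained call visible in a traceback later on this page: rather than copying encoder modules out, you can load the full pretrained model and use its encoders directly. The grid size below is a hypothetical value; see demo/demo_video.py for the exact setup and encoder entry points.)

```python
# Sketch (assumptions noted): load the released checkpoint and use the model's
# own encoders instead of copying module code out of the repo.
import sys
sys.path.append('../')  # assumes running from the demo/ directory, as demo_video.py does
from mreserve.modeling import PretrainedMerlotReserve

grid_size = (18, 32)  # hypothetical; the demo derives this from the video resolution
model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)
# From here, demo/demo_video.py shows how to embed video segments and text jointly.
```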

Questions about the input format for TVQA?

Hi, I have noticed this sample of the TVQA inputs: "1 to 28 What is Janice Holding on to after Chandler sends Joey to his room? Chandler's tie. MASK[subtitles or audio]".

Does this mean the input is "TIME STAMP" + "QUESTION" + "ANSWER" + "MASK TOKEN" + "SUB or AUDIO"? Besides, I cannot fully understand why we need a mask token here.

Thanks in advance.

Can't access Google Cloud Storage files.

I'm trying to get the VCR data at the Google Cloud Storage path gs://merlotreserve/finetune_data/vcr/, but I get an AccessDeniedException because I cannot access your Google Cloud Storage bucket with my personal account. Is there any way to get permission to access the VCR data?

Pretraining loss being negative

Is it possible to get a negative loss for a task during pretraining? Also, could you share the pretraining log file (mainly the loss of each task, i.e., audio2text, audio_text_matching, etc.)?

IndexError: list index out of range

I am getting this error from demo_video.py when trying to read in the video with ID "pmjPjZZRhNQ.mp4", downloaded via youtube-dl. I am using CUDA 11.6 with Python 3.8 in the mreserve conda environment.

Relative Location of input for TVQA

Hi, I have a question about the relative location for TVQA.
```python
t_start = midpoint - segment_size * 0.5
t_end = midpoint + segment_size * 0.5

# Try to extend by 3 segments in either direction of the middle
times_used0 = [{'start_time': t_start, 'end_time': t_end}]
for i in range(6):
    for delta in [-segment_size, segment_size]:
        t0 = t_start + delta * (i+1)
        t1 = t_end + delta * (i+1)

        t0 = round(t0 * 3) / 3
        t1 = round(t1 * 3) / 3

        if t1 < 0:
            continue
        if t0 > max_time:
            continue
        if len(times_used0) < 7:
            times_used0.append({'start_time': t0, 'end_time': t1})
times_used0 = sorted(times_used0, key=lambda x: x['start_time'])

# Figure out the relative position of the annotation
# (note: `times_used` in the original quote is presumably `times_used0`)
my_duration = times_used0[-1]['end_time'] - times_used0[0]['start_time']
rel_localized_tstart = (ts0 - times_used0[0]['start_time']) / my_duration
rel_localized_tend = (ts1 - times_used0[0]['start_time']) / my_duration
qa_item['rel_localization'] = (rel_localized_tstart, rel_localized_tend)
```

In the above code, I suspect that rel_localized_tstart could be greater than rel_localized_tend, since `midpoint - segment_size * 0.5` can be less than zero.

Besides, can rel_localized_tstart or rel_localized_tend be negative?
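
(To make the question concrete, here is a quick numeric check of the quoted logic with hypothetical values: with a midpoint near the start of the clip, t_start goes negative, so the first window starts before 0. The relative localization stays non-negative for an annotation starting at 0.0, but an annotation starting before times_used0[0]['start_time'] would map to a negative value.)

```python
# Hypothetical values; a runnable check of the quoted snippet's behavior.
midpoint, segment_size, max_time = 1.0, 5.0, 60.0
t_start = midpoint - segment_size * 0.5   # -1.5: the first window starts before 0
t_end = midpoint + segment_size * 0.5     #  3.5

times_used0 = [{'start_time': t_start, 'end_time': t_end}]
for i in range(6):
    for delta in [-segment_size, segment_size]:
        t0 = round((t_start + delta * (i + 1)) * 3) / 3
        t1 = round((t_end + delta * (i + 1)) * 3) / 3
        if t1 < 0 or t0 > max_time:
            continue
        if len(times_used0) < 7:
            times_used0.append({'start_time': t0, 'end_time': t1})
times_used0 = sorted(times_used0, key=lambda x: x['start_time'])

ts0, ts1 = 0.0, 2.0  # hypothetical annotation span
my_duration = times_used0[-1]['end_time'] - times_used0[0]['start_time']
print((ts0 - times_used0[0]['start_time']) / my_duration)  # ~0.025: small but >= 0
```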

Finetuning on GPUs

Hi!
I'm currently creating a dataset that I'd like to finetune this model on, but I don't have access to TPUs. I'm also not too familiar with JAX, so I was wondering if you know roughly what needs to be changed in the finetuning pipeline to be able to use GPUs.

Thanks for your work!
-Samuel
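
(Not an authoritative answer, but a starting point: jax.pmap, which the finetuning scripts appear to rely on for TPU parallelism, also shards across local GPUs, so the first step is usually verifying that jaxlib was installed with CUDA support. A minimal sanity check:)

```python
# Minimal check (assumes a CUDA-enabled jaxlib install): confirm GPUs are
# visible to JAX and that pmap can shard a batch across them.
import jax
import jax.numpy as jnp

print(jax.devices())          # should list GPU devices, not just CPU
n = jax.local_device_count()  # pmap expects a leading axis of this size

batch = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)
out = jax.pmap(lambda x: x * 2)(batch)  # one shard per device
print(out.shape)              # (n, 4)
```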

Could not automatically determine credentials.

Hi,

I tried to run the demo script but encountered the following error; it cannot download the model checkpoints.

```
(mreserve) yueyang1@nlpgpu01:/nlp/data/yueyang/merlot_reserve/demo> CUDA_VISIBLE_DEVICES=1 python demo_video.py
Traceback (most recent call last):
  File "demo_video.py", line 14, in <module>
    model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)
  File "/mnt/nlpgridio3/data/yueyang/merlot_reserve/demo/../mreserve/modeling.py", line 968, in from_pretrained
    storage_client = storage.Client()
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/storage/client.py", line 123, in __init__
    super(Client, self).__init__(
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 318, in __init__
    _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 266, in __init__
    project = self._determine_default(project)
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 285, in _determine_default
    return _determine_default_project(project)
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/_helpers.py", line 186, in _determine_default_project
    _, project = google.auth.default()
  File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/auth/_default.py", line 488, in default
    raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
```

Hope to get a solution, thank you!

Yue
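
(One possible workaround, hedged since it is not from the repo: if the checkpoint bucket is publicly readable, an anonymous Google Cloud Storage client sidesteps the credentials lookup; otherwise, running `gcloud auth application-default login` once sets up the default credentials that storage.Client() looks for.)

```python
# Hedged sketch, not repo code: read a public GCS bucket without credentials.
# The blob path below is hypothetical; see PretrainedMerlotReserve.from_pretrained
# in mreserve/modeling.py for the actual checkpoint layout.
from google.cloud import storage

client = storage.Client.create_anonymous_client()
bucket = client.bucket('merlotreserve')
blob = bucket.blob('ckpts/large')  # hypothetical checkpoint path
blob.download_to_filename('large.ckpt')
```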

How can I get a reference "txt.jsonl.zst", and what role does the "random text" play in pretraining?

Hi,

Thanks for releasing your work.

I'm currently trying to run your data/process.py code on my own crawled videos.

Everything works well except text_iterator().

I think this is because I couldn't make the "txt.jsonl.zst" file, which is used as the random text for the pretraining batch.

So I was wondering whether there is any reference code or sample data for making "txt.jsonl.zst" on my own?

If that isn't possible, could you explain the role of the "random text" in the pretraining step? (I couldn't work out how the random text aligns with the MERLOT Reserve pretraining objectives.)

Thank you,
Haena
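
(While waiting for an answer: the extension suggests newline-delimited JSON compressed with zstandard, so a file in that shape can be produced as below. The per-record schema here, a single "text" field, is an assumption, not the repo's documented format.)

```python
# Sketch under stated assumptions: write newline-delimited JSON records
# compressed with zstandard. The {'text': ...} schema is a guess.
import json
import zstandard as zstd

records = [{'text': 'first random text document'},
           {'text': 'second random text document'}]

with open('txt.jsonl.zst', 'wb') as f:
    with zstd.ZstdCompressor().stream_writer(f) as writer:
        for rec in records:
            writer.write((json.dumps(rec) + '\n').encode('utf-8'))
```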

TVQA Dataset

Hi Rowan,

Thank you for this great resource! I'm trying to reproduce the finetuning results on TVQA. I can't seem to access the Google Storage link, though, and it looks like the TVQA download only gives access to video frames. Would you mind letting me know where you got the audio, or whether there's anything not included in this link (once I get access)?
https://tvqa.cs.unc.edu/download_tvqa.html

Best,
Alex

Failed to restore checkpoint

Hi, I installed the package following your guidance. However, when I ran demo_video.py, it raised `ValueError: Unpack failed: incomplete input` when executing `state = checkpoints.restore_checkpoint(ckpt_dir_path, target_state, step=step, prefix='ckpt_', parallel=True)` on line 125 of mreserve/checkpoint.py. What should I do?

My flax version is 0.3.4 and the large_resadapt checkpoint is auto-downloaded.

How can I download TVQA audio?

Thank you for sharing this code.

I am trying to finetune on TVQA.

It seems that the audio is not available on the TVQA homepage.

How can I download TVQA audio?

Can MERLOT-reserve be applied to short videos?

Hi,

Thank you for your excellent work!

I have noticed that you mention the limitations of the model in your paper: "Our model only learns from 40-second long videos". So I wonder whether this model can be applied to short video clips (around 5 seconds)? Is it feasible to reduce the time interval (5 s) and the number of video segments (16)?

Best,
Fan

Release infill templates

Hello dear author,

Could you please release the infilled questions, i.e., the questions transformed into statements with <|MASK|> using GPT-3? I would be especially interested in the statements for MSRVTT-QA and TVQA.

It would be very helpful to release them, so other researchers don't have to run and pay for GPT-3 on the same task again.

Thanks for your consideration,

Simon

Request for training data examples

Hello,

I am trying to process a dataset for training using data/process.py. Could you please share some example inputs? For example, what is the format of the youtube_dump/{video_id}/{video_id}.v2.info.json.gz file (used in load_video(), line 212)?

Thank you!
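
(A guess at the file layout while waiting for examples: the name pattern suggests a gzipped youtube-dl/yt-dlp info dictionary per video. The keys below are standard youtube-dl fields, but exactly which fields process.py reads is an assumption; check load_video() for the ones it indexes.)

```python
# Hedged sketch: produce a youtube_dump/{video_id}/{video_id}.v2.info.json.gz
# file. The video ID and the fields included here are hypothetical.
import gzip
import json
import os

video_id = 'abc123'  # hypothetical ID
os.makedirs(f'youtube_dump/{video_id}', exist_ok=True)

info = {'id': video_id, 'title': 'example video', 'duration': 60.0}
with gzip.open(f'youtube_dump/{video_id}/{video_id}.v2.info.json.gz', 'wt') as f:
    json.dump(info, f)
```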

Download dataset

Hi Rowan,

Really nice work and thanks for sharing the code!

In case I missed it, may I ask where the script to download all the YouTube videos is? I only found the processing script in the data/ folder.
