rowanz / merlot_reserve
Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"
License: MIT License
Hi Rowan,
In the code, you state that at most 8 segments are supported. I would like to learn how you handle segment counts beyond that in your demo at https://merlot.apps.allenai.org/
Thank you for your time and attention,
Mustafa
Hi Rowan,
I intend to save intermediate tensors (e.g. the embeddings from layer 11 of the joint transformer) when fine-tuning on the TVQA dataset, so I can understand how the internal representations change over time. However, I cannot save the concrete values of the layers' representations because they are traced values inside a jitted function (I get an error like "The numpy.ndarray conversion method __array__() was called on the JAX Tracer object").
I was wondering if you found a good solution for saving them when you designed your code. Thank you!
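One workaround I'm experimenting with (a minimal sketch, not the repo's actual model; names and shapes below are made up) is to return the intermediate as an extra output of the jitted function, so it becomes a concrete array outside the trace:

```python
import jax
import jax.numpy as jnp
import numpy as np

# Returning the intermediate alongside the output lets jit materialize it as
# a concrete array, avoiding the "__array__ called on Tracer" error.
@jax.jit
def forward(x, w1, w2):
    hidden = jnp.tanh(x @ w1)   # stand-in for a "layer 11" activation
    out = hidden @ w2
    return out, hidden          # expose the intermediate as an extra output

x = jnp.ones((2, 4))
w1 = jnp.ones((4, 3)) * 0.1
w2 = jnp.ones((3, 1))
out, hidden = forward(x, w1, w2)
np.save('/tmp/layer11.npy', np.asarray(hidden))  # now a concrete array
```

The downside is having to thread the extra output through the loss/apply functions, but it avoids any host-callback machinery.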
Best,
Dota
Thank you for the great repository.
How can we run the model in a zero-shot setup without audio? Concretely, the function model.embed_video in demo_video.py requires the argument audio_clips. What can we do to avoid using audio?
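One workaround I've considered (purely an assumption on my side; I haven't verified the audio format the model actually expects) is to feed silent audio so the audio_clips argument is satisfied without contributing real signal:

```python
import numpy as np

# Hypothetical shapes: the real spectrogram layout expected by
# model.embed_video may differ; zeros stand in for silence.
num_audio_clips, n_frames, n_mels = 16, 60, 65
silent_audio = np.zeros((num_audio_clips, n_frames, n_mels), dtype=np.float32)
# out = model.embed_video(frames, audio_clips=silent_audio)  # hypothetical call
```

Is this a reasonable approach, or does the model behave badly on all-zero audio?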
Thank you!
Best,
Tomas
Hi,
Thank you for your excellent work! I have a question: I want to use your model only to encode the video frames and the corresponding dialogue segments, and then design the rest of the model myself. Can I just copy the SPAN_encr and vision_enc model code and download the checkpoint? :)
Best,
Jun
Hi, I have noticed a sample of TVQA inputs: "1 to 28 What is Janice Holding on to after Chandler sends Joey to his room? Chandler's tie. MASK[subtitles or audio]".
Does this mean that the input is "TIME STAMP" + "QUESTION" + "ANSWER" + "MASK TOKEN" + "SUB or AUDIO"?
Also, I cannot fully understand why a mask token is needed here.
Thanks in advance.
Hi, is there any plan to release the code for K600 classification?
I'm trying to get the VCR data at the Google Cloud Storage address gs://merlotreserve/finetune_data/vcr/, but I get an AccessDeniedException because I cannot access your Google Cloud Storage bucket with my personal account. Is there a way to get permission to access the VCR data?
Is it possible to get a negative loss for each task during pretraining? Also, can you share the pretraining log file (mostly the loss of each task, i.e., audio2text, audio_text_matching, etc.)?
I'm getting this error with demo_video.py when trying to read in the video with ID "pmjPjZZRhNQ.mp4", downloaded with youtube-dl. I'm using CUDA 11.6 with Python 3.8 in the mreserve conda environment.
Hi, I have a question about the relative location for TVQA.
```python
t_start = midpoint - segment_size * 0.5
t_end = midpoint + segment_size * 0.5
# Try to extend by 3 segments in either direction of the middle
times_used0 = [{'start_time': t_start, 'end_time': t_end}]
for i in range(6):
    for delta in [-segment_size, segment_size]:
        t0 = t_start + delta * (i + 1)
        t1 = t_end + delta * (i + 1)
        t0 = round(t0 * 3) / 3
        t1 = round(t1 * 3) / 3
        if t1 < 0:
            continue
        if t0 > max_time:
            continue
        if len(times_used0) < 7:
            times_used0.append({'start_time': t0, 'end_time': t1})
times_used0 = sorted(times_used0, key=lambda x: x['start_time'])
# Figure out the relative position of the annotation
my_duration = times_used0[-1]['end_time'] - times_used0[0]['start_time']
rel_localized_tstart = (ts0 - times_used0[0]['start_time']) / my_duration
rel_localized_tend = (ts1 - times_used0[0]['start_time']) / my_duration
qa_item['rel_localization'] = (rel_localized_tstart, rel_localized_tend)
```
For the above code, I suspect that rel_localized_tstart could be greater than rel_localized_tend, since "midpoint - segment_size * 0.5" can be less than zero.
Also, can rel_localized_tstart or rel_localized_tend be a negative number?
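To illustrate my concern, I wrapped the snippet into a runnable function and fed it a midpoint near the start of the video (all numbers below are made up):

```python
def rel_localization(midpoint, segment_size, max_time, ts0, ts1):
    # Self-contained re-implementation of the snippet above for experimentation.
    t_start = midpoint - segment_size * 0.5
    t_end = midpoint + segment_size * 0.5
    times_used0 = [{'start_time': t_start, 'end_time': t_end}]
    for i in range(6):
        for delta in [-segment_size, segment_size]:
            t0 = round((t_start + delta * (i + 1)) * 3) / 3
            t1 = round((t_end + delta * (i + 1)) * 3) / 3
            if t1 < 0 or t0 > max_time:
                continue
            if len(times_used0) < 7:
                times_used0.append({'start_time': t0, 'end_time': t1})
    times_used0 = sorted(times_used0, key=lambda x: x['start_time'])
    my_duration = times_used0[-1]['end_time'] - times_used0[0]['start_time']
    rel_s = (ts0 - times_used0[0]['start_time']) / my_duration
    rel_e = (ts1 - times_used0[0]['start_time']) / my_duration
    return rel_s, rel_e

# Annotation near the very start of the video: the first window begins at a
# negative time (t_start = -1.5), but rel_s stays non-negative as long as
# ts0 is at or after that window start.
rel_s, rel_e = rel_localization(midpoint=1.0, segment_size=5.0,
                                max_time=100.0, ts0=0.0, ts1=2.0)
```

So the first window's start time can indeed be negative; what I'm unsure about is whether ts0 is ever smaller than it, which would make the relative positions negative.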
Hi!
I'm currently creating a dataset that I'd like to finetune this model on, but I don't have access to TPUs. I'm also not too familiar with Jax, so I was wondering if you roughly know what needs to be changed in the finetuning pipeline to be able to use GPUs.
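From what I understand so far (please correct me if I'm wrong), jax.pmap-based code generally runs on GPUs unchanged once a CUDA build of jaxlib is installed; the main things to adjust seem to be the device count and any TPU-specific flags. A toy check:

```python
import jax
import jax.numpy as jnp

# pmap maps over the leading axis, one slice per local device; on a
# single-GPU (or CPU-only) machine local_device_count() is 1 and this still
# runs. Batch dimensions in the pipeline must be divisible by this count.
n = jax.local_device_count()
x = jnp.arange(n * 2.0).reshape(n, 2)
y = jax.pmap(lambda v: v * 2.0)(x)
```

Does the finetuning pipeline need anything beyond adjusting batch sizes to the local device count?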
Thanks for your work!
-Samuel
Hi,
I tried to run the demo script but encountered the following error; it cannot download the model checkpoints.
(mreserve) yueyang1@nlpgpu01:/nlp/data/yueyang/merlot_reserve/demo> CUDA_VISIBLE_DEVICES=1 python demo_video.py
Traceback (most recent call last):
File "demo_video.py", line 14, in <module>
model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)
File "/mnt/nlpgridio3/data/yueyang/merlot_reserve/demo/../mreserve/modeling.py", line 968, in from_pretrained
storage_client = storage.Client()
File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/storage/client.py", line 123, in __init__
super(Client, self).__init__(
File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 318, in __init__
_ClientProjectMixin.__init__(self, project=project, credentials=credentials)
File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 266, in __init__
project = self._determine_default(project)
File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 285, in _determine_default
return _determine_default_project(project)
File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/_helpers.py", line 186, in _determine_default_project
_, project = google.auth.default()
File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/auth/_default.py", line 488, in default
raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
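For reference, a workaround I'm considering (the key path below is hypothetical, and I haven't confirmed the bucket's permissions):

```python
import os

# Option 1: point google-cloud-storage at a service-account key before
# storage.Client() is constructed in modeling.py (hypothetical path).
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.expanduser(
    '~/keys/my-service-account.json')

# Option 2 (if the bucket is publicly readable): use an anonymous client
# instead of the default one, e.g.:
#   from google.cloud import storage
#   client = storage.Client.create_anonymous_client()
```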
Hope to get a solution, thank you!
Yue
Hi Rowan,
I also could not open this link for the TVQA data: https://storage.googleapis.com/merlotreserve/finetune_data/tvqa. Could you provide more details on how to access the TVQA data used in the paper? Thank you!
Best,
Dota
Hi,
Thanks for releasing your work.
I'm currently trying to run your data/process.py code on my own crawled videos, and everything works well except text_iterator().
I think this is because I couldn't make "txt.jsonl.zst", which is used as random_text for the pretraining batch.
So I was wondering if there is any reference code or sample data for making "text.jsonl.zst" on my own?
If that isn't possible, could you explain the role of "random_text" in the pretraining step?
(I couldn't understand how the "random text" aligns with the MERLOT Reserve pre-training objectives.)
Thank you,
Haena
Hi Rowan,
Thank you for this great resource! I'm trying to reproduce the finetuning results on TVQA. I can't seem to access the Google Storage link though, and it looks like the TVQA download only gives access to video frames. Would you mind letting me know where you got the audio, or whether there's anything not included in this link (once I get access)?
https://tvqa.cs.unc.edu/download_tvqa.html
Best,
Alex
Hi, I installed the package following your guidance. However, when I ran demo_video.py, it raised "ValueError: Unpack failed: incomplete input" at state = checkpoints.restore_checkpoint(ckpt_dir_path, target_state, step=step, prefix='ckpt_', parallel=True) on line 125 of mreserve/checkpoint.py. What should I do?
My flax version is 0.3.4, and the large_resadapt checkpoint was auto-downloaded.
Thank you for sharing this code.
I am trying to finetune on TVQA.
It seems that audio is not available on the TVQA homepage.
How can I download the TVQA audio?
The ASR transcripts in YT-Temporal 180M have a cleaned version. The cleaned transcripts have punctuation and are much more fluent than the original ASR. Does YT-Temporal 1B have such transcripts?
Hi,
Thank you for your excellent work!
I have noticed that you mention the limitations of the model in your paper: "Our model only learns from 40-second long videos". So I wonder whether this model can be applied to short video clips (e.g., 5 seconds). Is it feasible to reduce the time interval (to 5s) and the number of video segments (from 16)?
Best,
Fan
Hello dear author,
Could you please release the infilled questions, i.e. the questions transformed to statements with <|MASK|> using GPT-3? I would be especially interested in the statements for MSRVTT-QA and TVQA.
It would be very helpful to release them, so other researchers don't have to run and pay GPT-3 for the same task again.
Thanks for your consideration,
Simon
Hello,
I am trying to process a dataset for training using data/process.py. Could you please share some example inputs? For example, what is the format of the youtube_dump/{video_id}/{video_id}.v2.info.json.gz file (used in function load_video(), line 212)?
Thank you!
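For reference, here's how I'm currently guessing at the format (the '.info.json' naming follows youtube-dl's --write-info-json output, but the '.v2' schema is presumably this repo's own, so the fields below are hypothetical):

```python
import gzip
import json
import os

# Write and read back a dummy metadata file in the expected location; the
# 'title'/'duration' fields are placeholders for whatever load_video()
# actually reads.
video_id = 'demo123'
os.makedirs(f'youtube_dump/{video_id}', exist_ok=True)
path = f'youtube_dump/{video_id}/{video_id}.v2.info.json.gz'
with gzip.open(path, 'wt', encoding='utf-8') as f:
    json.dump({'id': video_id, 'title': 'example', 'duration': 40.0}, f)
with gzip.open(path, 'rt', encoding='utf-8') as f:
    info = json.load(f)
```

Which keys does load_video() actually require?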
Hi, could you tell me the total storage size of all the raw videos?
I want to check whether our server is big enough to fit all of them.
In merlot_reserve/demo/zero_shot_ek, both files require opening and parsing a CSV file located at 'data/epic-kitchens-100-annotations/EPIC_100_validation.csv', but this file is neither in the repository nor on the website.
Hi Rowan,
Really nice work and thanks for sharing the code!
In case I missed it, may I ask where the script to download all the YouTube videos is? I only found the processing script in the data/ folder.