
microsoft / xpretrain


Multi-modality pre-training

License: Other

Dockerfile 1.47% Shell 0.32% Python 98.22%
multimodal-learning pre-training multimedia computer-vision nlp

xpretrain's Introduction

XPretrain

This repo includes recent research work on multi-modality learning, especially pre-training methods, from the MSM group of Microsoft Research.

Multi-modality Learning

***** Video & Language *****

Dataset

HD-VILA-100M dataset: high-resolution and diversified video-language dataset

Pre-training model

HD-VILA (CVPR 2022): high-resolution and diversified video-language pre-training model

LF-VILA (NeurIPS 2022): long-form video-language pre-training model

CLIP-ViP (ICLR 2023): video-language pre-training model that adapts image-language pre-training to video-language pre-training

***** Image & Language *****

Pre-training model

Pixel-BERT: end-to-end image and language pre-training model

SOHO (CVPR 2021 oral): improved end-to-end image and language pre-training model with quantized visual tokens

VisualParsing (NeurIPS 2021): Transformer-based end-to-end image and language pre-training model

News

  • 😃 March, 2023: the code of CLIP-ViP and LF-VILA was released.
  • January, 2023: our paper CLIP-ViP, which adapts an image-language pre-trained model to video-language pre-training, was accepted by ICLR 2023.
  • September, 2022: our paper LF-VILA on long-form video-language pre-training was accepted by NeurIPS 2022.
  • September, 2022: the code of HD-VILA was released.
  • March, 2022: HD-VILA-100M dataset was released publicly.
  • March, 2022: HD-VILA was accepted by CVPR 2022.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Contact Information

For help or issues using the pre-trained models, please submit an issue. For other communications, please contact Bei Liu ([email protected]) and Jianlong Fu ([email protected]).

xpretrain's People

Contributors

bei21, dependabot[bot], jianlong-fu, microsoftopensource, tiankaihang, ycsun1972


xpretrain's Issues

About the zero-shot performance

Thanks for your interesting work.

I am curious about the zero-shot performance of your CLIP-ViP on MSR-VTT.

I find that models pre-trained on video-text pairs (e.g., VideoCLIP, SimVLP) perform less satisfactorily than their image-language counterparts (e.g., CLIP, BLIP) on zero-shot transfer to video retrieval tasks. How about the zero-shot performance of CLIP-ViP?

What do you think causes this phenomenon?

Hi, how to understand the LF-hdvila-8m?

Is each line in 'lfvila8m_clipid.jsonl' a video-clips/sentence pair? I see a varying number of video clips per row. How are the video clips in 'lfvila8m_clipid.jsonl' derived from the original 'hdvila_clip_text_100m.jsonl'? Apart from the selection of videos with more than 4 clips mentioned in the paper, are there any other details?

Captions for HD-ViLA-100M

Hi,

Firstly, thank you for your interesting work.

Could you please share more information on how the captions were generated for HD-VILA using ASR? The paper explains that ASR-generated captions are post-processed by an off-the-shelf punctuator, but if you could kindly provide access to the generated captions (as in CLIP-ViP) or more details on which ASR technology was used, that would be really helpful for using the dataset.

Thank you.

Where can I get the ASR text?

I have seen the contents of hdvila100m.zip, but I can't find where the ASR text is. Could you tell me where I can get it? Thank you~

Error in finetuning

When running the command inside the Docker image for fine-tuning LF-VILA, the following error is raised:
    root@8dccc81930c3:/LF-VILA# deepspeed src/tasks/run_video_classification.py --distributed --blob_mount_dir /blob_mount --config $CONFIG_PATH --deepspeed
    [2023-10-17 11:11:02,765] [WARNING] [runner.py:132:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    Traceback (most recent call last):
      File "/usr/local/bin/deepspeed", line 6, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/runner.py", line 308, in main
        raise RuntimeError("Unable to proceed, no GPU resources available")
    RuntimeError: Unable to proceed, no GPU resources available

Please note that my device has GPUs available, and CUDA and PyTorch are correctly installed.
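As a sanity check when DeepSpeed reports no GPU resources, here is a minimal sketch (independent of this repo) that confirms whether PyTorch inside the container actually sees the devices, which narrows the problem down to the container runtime versus the launcher:

    import torch

    # If this prints False / 0 inside the container, the container was likely
    # started without GPU access (e.g. docker run missing --gpus all),
    # rather than there being a DeepSpeed or LF-VILA problem.
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())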

About LF-VILA code in PatchEmbed3D of video encoder

The padding seems not right, or maybe I made a mistake:

    # padding
    _, _, D, H, W = x.size()
    if H % self.patch_size[0] != 0:
        x = F.pad(x, (0, 0, 0, self.patch_size[1] - H % self.patch_size[1]))
    if W % self.patch_size[1] != 0:
        x = F.pad(x, (0, 0, 0, 0, 0, self.patch_size[0] - D % self.patch_size[0]))

Given that patch_size = [1, 8, 8] in the implementation, where 8x8 is HxW, shouldn't the padding be applied along the H and W dimensions? The conditions H % self.patch_size[0] != 0 and W % self.patch_size[1] != 0 confuse me.
Thanks a lot!
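For reference, a sketch of what Video Swin-style PatchEmbed3D padding usually looks like when patch_size is ordered (D, H, W); this only illustrates the indexing the question points at and is not the repo's code:

    import torch
    import torch.nn.functional as F

    def pad_to_patch(x: torch.Tensor, patch_size=(1, 8, 8)) -> torch.Tensor:
        """Pad an (N, C, D, H, W) tensor so D, H and W are divisible by the patch size.

        F.pad consumes the pad tuple from the last dimension backwards, so it
        is ordered (W_left, W_right, H_left, H_right, D_left, D_right).
        """
        _, _, D, H, W = x.size()
        pd, ph, pw = patch_size
        if W % pw != 0:
            x = F.pad(x, (0, pw - W % pw))
        if H % ph != 0:
            x = F.pad(x, (0, 0, 0, ph - H % ph))
        if D % pd != 0:
            x = F.pad(x, (0, 0, 0, 0, 0, pd - D % pd))
        return x

    # e.g. 50x50 frames are padded to 56x56 with an 8x8 spatial patch size
    print(pad_to_patch(torch.randn(2, 3, 3, 50, 50)).shape)  # torch.Size([2, 3, 3, 56, 56])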

How to prepare pretrain data for LF-VILA?

What is the data format of datasets/hdvila100m/video_clip_3fps, datasets/lfvila_data/pretrain/train_db, and datasets/lfvila_data/pretrain/val.jsonl mentioned in src/configs/pretrain_stage1.yaml?
Can you provide specific reference examples or processes (for the long-form videos and the annotations, respectively)?

[CLS] token in CLIP-ViP

I was glad to read your paper; I learned a lot from it.

In the model figure, there are [CLS] tokens as the output of the text encoder. But if I understand the paper correctly, the text encoder is not a PLM like BERT but a Transformer encoder. In the code of CLIP, the simple tokenizer has two special tokens, as shown below.

vocab.extend(['<|startoftext|>', '<|endoftext|>'])

And in the CLIP4Clip paper, the researchers use 'the activations from the highest layer of the transformer at the [EOS] token' as the text embedding.
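For reference, a minimal sketch (modelled on the released OpenAI CLIP code, not on CLIP-ViP, whose code was unreleased when this was asked) of how CLIP pools the text feature at the <|endoftext|> position; that token has the highest id in CLIP's vocabulary, so an argmax over the token ids locates it:

    import torch

    def pool_eot(hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        """Select the hidden state at the <|endoftext|> position of each sequence.

        <|endoftext|> (id 49407) is the largest id in CLIP's vocabulary, so an
        argmax over the (zero-padded) token ids finds its position.
        """
        eot_pos = token_ids.argmax(dim=-1)  # (batch,)
        return hidden_states[torch.arange(hidden_states.size(0)), eot_pos]

    # toy example: batch of 2, context length 5, hidden size 4 (word ids are illustrative)
    hidden = torch.randn(2, 5, 4)
    ids = torch.tensor([[49406, 320, 1000, 49407, 0],
                        [49406, 320, 2000, 49407, 0]])
    print(pool_eot(hidden, ids).shape)  # torch.Size([2, 4])

In the OpenAI implementation this pooled vector is then multiplied by a learned text projection to produce the final text embedding.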

So I wonder what your text embedding is, exactly. Since the code hasn't been released yet, I'm asking here. Thanks for reading.

Asking for a simple script to get text and video features

First of all - Amazing work on this one.

I'm getting a bit lost in the repo; may I request a simple few-line script that does something like the following:

model = CLIPViP("pretrain_clipvip_base_32.pt")
text_features = model.encode_text("This is a very cute cat")
video_features = model.encode_video("vid_file.mp4")
cosine(text_features, video_features)

[Extra] Preferably, I wish to get the video features for a batch of mp4 files with different lengths.
The closest I found is CLIP-ViP/src/modeling/VidCLIP.py, but I couldn't find a usage of this script.
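As a stopgap for the last step of the wished-for script, a minimal sketch of the cosine-similarity computation on already-extracted features (the CLIPViP, encode_text and encode_video calls above are the requester's hypothetical API, not the repo's):

    import torch
    import torch.nn.functional as F

    # stand-ins for features produced by a text encoder and a video encoder
    text_features = torch.randn(1, 512)
    video_features = torch.randn(1, 512)

    # cosine similarity = dot product of the L2-normalised feature vectors
    similarity = F.cosine_similarity(text_features, video_features, dim=-1)
    print(similarity.item())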

Thank you :)

Video captions of the HD-VILA-100M dataset

Thank you for collecting and making public such a large video-text dataset.
Is the text description dataset for each video publicly available?
Where can we download the text captions of the videos?

CLIP-ViP OFA caption generation

Regarding using the OFA model to generate captions in the middle of the video: can you describe in detail which OFA model you use and how you speed up this process?

How long does CLIP-ViP pre-training take?

The paper states that "We train our model with 32 NVIDIA Tesla V100 GPUs in a batch size of 1024", but it doesn't say how long the pre-training takes in this setting. Could you tell me the pre-training cost?

Where is the MSRVTT json file in CLIP-ViP?

Hi, I found that in your msrvvt_config, train9k.jsonl and test1ka.jsonl are needed, but I don't find anything about them in your readme.md. Are they in hdvila_ofa_captions_db? If the jsonl files are in that folder, how should we open the 'data.db' inside it?

Questions about HD-VILA

Hi @bei21 @TiankaiHang , I would like to ask some questions about HD-VILA.

  1. How large are the HD-VILA video files and subtitle files, respectively? I guess they are at least 10 TB. Are you using SSDs to store them?
  2. It is mentioned in the paper that 64 V100 GPUs are required. How much time does it take for pre-training stage 1, pre-training stage 2, and the downstream tasks, respectively?
  3. Do you have a plan to share the code of HD-VILA?

Thank you very much!

About the zero-shot performance

Thanks for your interesting work.

Could you provide the zero-shot performance of your CLIP-ViP(B/16) on MSR-VTT?

I would be very grateful.

Model checkpoints

Excellent work!
I'm trying to deploy some attacks on your models, but I cannot fine-tune your pre-trained ones on my local server due to a VRAM shortage. Could you please provide the model checkpoints for better accessibility? Thank you.

Releasing code and pre-trained model

Hi, thank you for your interesting work. When will you release the code and pre-trained model in the CLIP-ViP repository?

Question regarding video proxy mechanism in CLIP-ViP

Congratulations on your paper's acceptance at ICLR 2023! Your work is insightful and achieves a significant performance improvement on the video-text retrieval task.

I want to ask about the implementation of the video proxy mechanism. In the paper, you mention that it is simply a learnable parameter of length M. However, when I look at the code, you define two separate parameters for the video proxy tokens: class_embedding and added_cls. class_embedding is a 1D vector of size hidden_size, while added_cls is a 2D matrix of dimension add_cls_num x hidden_size. CMIIW, but I cannot find any reference to this in the main paper.

I have checked the sample configuration for each dataset, and it turns out you set add_cls_num to 3. Does this correspond to the 4 video proxy tokens mentioned in the paper, i.e. add_cls_num x hidden_size plus hidden_size, with add_cls_num = 3? Can you explain the intuition for why this needs to be split into class_embedding and added_cls?
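For concreteness, a minimal sketch of how the two parameters described above could combine into M = 4 proxy tokens; the shapes and names follow this question's description (add_cls_num = 3) and are not verified against the repo:

    import torch
    import torch.nn as nn

    hidden_size, add_cls_num = 512, 3

    # shapes as described in the question above
    class_embedding = nn.Parameter(torch.randn(hidden_size))         # (hidden_size,)
    added_cls = nn.Parameter(torch.randn(add_cls_num, hidden_size))  # (add_cls_num, hidden_size)

    # 1 (class_embedding) + add_cls_num (added_cls) = 4 learnable video proxy tokens
    video_proxy = torch.cat([class_embedding.unsqueeze(0), added_cls], dim=0)
    print(video_proxy.shape)  # torch.Size([4, 512])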

Thank you

Dockerfile and requirements for CLIP-ViP

Thanks for your great work! I want to follow your work, but I met some problems with the Dockerfile: it seems the image nvidia/cuda:10.1-devel-ubuntu18.04 does not exist. Can you provide a requirements.txt file for running CLIP-ViP? Thank you very much!

Error on starting horovod

When we run
horovodrun -np 1 python src/pretrain/run_pretrain.py --config src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json
We get the following error:

    <stderr>:  File "src/pretrain/run_pretrain.py", line 22, in <module>
    <stderr>:    from transformers import CLIPTokenizerFast
    <stderr>:ImportError: cannot import name 'CLIPTokenizerFast' from 'transformers' (/usr/local/lib/python3.7/dist-packages/transformers/__init__.py)
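A quick way to diagnose this, as a sketch: the usual cause is a transformers release that predates CLIP support, so printing the installed version and upgrading transformers typically resolves the import.

    import transformers

    print(transformers.__version__)

    # CLIPTokenizerFast only ships with releases that include CLIP support;
    # on older versions this import raises the ImportError shown above.
    from transformers import CLIPTokenizerFast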

Video compression/decoding methods of each dataset in CLIP-ViP

Hi, I'm trying to reproduce the CLIP-ViP results. In the readme file, it is mentioned that the data preprocessing step follows HD-VILA. However, in the configuration files of the downstream tasks, the compression/decoding methods seem to differ. Are these video preprocessing methods correct:

  • MSR-VTT: compression, 6 FPS
  • LSMDC: no compression/decoding, use raw video as is
  • ActivityNet: decoding lr
  • DiDeMo: compression, X FPS (What is the number of X? Is it 6 too?)

Number of fine-tuning epochs for MSR-VTT

Can you help clarify the actual number of fine-tuning epochs used for the MSR-VTT dataset? The paper says 5 epochs (which is the common setting), however the config file here says 100?
