
microsoft / xpretrain


Multi-modality pre-training

License: Other

Dockerfile 1.47% Shell 0.32% Python 98.22%
multimodal-learning pre-training multimedia computer-vision nlp

xpretrain's Introduction

XPretrain

This repo includes recent research work on multi-modality learning, especially pre-training methods, from the MSM group of Microsoft Research.

Multi-modality Learning

***** Video & Language *****

Dataset

HD-VILA-100M dataset: high-resolution and diversified video-language dataset

Pre-training model

HD-VILA (CVPR 2022): high-resolution and diversified video-language pre-training model

LF-VILA (NeurIPS 2022): long-form video-language pre-training model

CLIP-ViP (ICLR 2023): video-language pre-training model that adapts image-language pre-training to video-language pre-training

***** Image & Language *****

Pre-training model

Pixel-BERT: end-to-end image and language pre-training model

SOHO (CVPR 2021 oral): improved end-to-end image and language pre-training model with quantized visual tokens

VisualParsing (NeurIPS 2021): Transformer-based end-to-end image and language pre-training model

News

  • 😃 March, 2023: the code of CLIP-ViP and LF-VILA was released.
  • January, 2023: our paper CLIP-ViP, which adapts an image-language pre-trained model to video-language pre-training, was accepted by ICLR 2023.
  • September, 2022: our paper LF-VILA on long-form video-language pre-training was accepted by NeurIPS 2022.
  • September, 2022: the code of HD-VILA was released.
  • March, 2022: HD-VILA-100M dataset was released publicly.
  • March, 2022: HD-VILA was accepted by CVPR 2022.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Contact Information

For help or issues using the pre-trained models, please submit an issue. For other communications, please contact Bei Liu ([email protected]) and Jianlong Fu ([email protected]).

xpretrain's People

Contributors

bei21, dependabot[bot], jianlong-fu, microsoftopensource, tiankaihang, ycsun1972


xpretrain's Issues

About the zero-shot performance

Thanks for your interesting work.

I am curious about the zero-shot performance of your CLIP-ViP on MSR-VTT.

I find that models pre-trained on video-text pairs (e.g., VideoCLIP, SimVLP) perform less satisfactorily than their image-language counterparts (e.g., CLIP, BLIP) on zero-shot transfer to video retrieval tasks. How about the zero-shot performance of CLIP-ViP?

What do you think causes this phenomenon?

Hi, how to understand the LF-hdvila-8m?

Is each line in 'lfvila8m_clipid.jsonl' a video-clips/sentence pair? I see a varying number of video clips per row. How are the video clips in 'lfvila8m_clipid.jsonl' derived from the original 'hdvila_clip_text_100m.jsonl'? Apart from the selection of videos with more than 4 clips mentioned in the paper, are there any other details?

Captions for HD-ViLA-100M

Hi,

Firstly, thank you for your interesting work.

Could you please share more information on how the captions were generated for HD-VILA using ASR? The paper explains that ASR-generated captions are post-processed by an off-the-shelf punctuator, but if you could kindly provide access to the generated captions (as in CLIP-ViP) or more details on which ASR technology was used, that would be really helpful for using the dataset.

Thank you.

Where can I get the ASR text?

I have seen the contents of hdvila100m.zip, but I can't find where the ASR text is. Could you tell me where I can get it? Thank you~

Error in finetuning

When running the command inside the Docker image for fine-tuning LF-VILA, the following error is raised:
    root@8dccc81930c3:/LF-VILA# deepspeed src/tasks/run_video_classification.py --distributed --blob_mount_dir /blob_mount --config $CONFIG_PATH --deepspeed
    [2023-10-17 11:11:02,765] [WARNING] [runner.py:132:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    Traceback (most recent call last):
      File "/usr/local/bin/deepspeed", line 6, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/runner.py", line 308, in main
        raise RuntimeError("Unable to proceed, no GPU resources available")
    RuntimeError: Unable to proceed, no GPU resources available

Please note that my device has GPUs available, and CUDA and PyTorch are correctly installed.
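As a sanity check when DeepSpeed reports no GPU resources, here is a minimal sketch (independent of this repo) that confirms whether PyTorch inside the container actually sees the devices, which narrows the problem down to the container runtime versus the launcher:

    import torch

    # If this prints False / 0 inside the container, the container was likely
    # started without GPU access (e.g. docker run missing --gpus all),
    # rather than there being a DeepSpeed or LF-VILA problem.
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())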

About LF-VILA code in PatchEmbed3D of video encoder

The padding seems not right, or maybe I made a mistake:

    # padding
    _, _, D, H, W = x.size()
    if H % self.patch_size[0] != 0:
        x = F.pad(x, (0, 0, 0, self.patch_size[1] - H % self.patch_size[1]))
    if W % self.patch_size[1] != 0:
        x = F.pad(x, (0, 0, 0, 0, 0, self.patch_size[0] - D % self.patch_size[0]))

Given that patch_size = [1, 8, 8] in the implementation, where 8x8 is HxW, shouldn't the padding be applied along the H and W dimensions? The conditions H % self.patch_size[0] != 0 and W % self.patch_size[1] != 0 confuse me.
Thanks a lot!
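For reference, a sketch of what Video Swin-style PatchEmbed3D padding usually looks like when patch_size is ordered (D, H, W); this only illustrates the indexing the question points at and is not the repo's code:

    import torch
    import torch.nn.functional as F

    def pad_to_patch(x: torch.Tensor, patch_size=(1, 8, 8)) -> torch.Tensor:
        """Pad an (N, C, D, H, W) tensor so D, H and W are divisible by the patch size.

        F.pad consumes the pad tuple from the last dimension backwards, so it
        is ordered (W_left, W_right, H_left, H_right, D_left, D_right).
        """
        _, _, D, H, W = x.size()
        pd, ph, pw = patch_size
        if W % pw != 0:
            x = F.pad(x, (0, pw - W % pw))
        if H % ph != 0:
            x = F.pad(x, (0, 0, 0, ph - H % ph))
        if D % pd != 0:
            x = F.pad(x, (0, 0, 0, 0, 0, pd - D % pd))
        return x

    # e.g. 50x50 frames are padded to 56x56 with an 8x8 spatial patch size
    print(pad_to_patch(torch.randn(2, 3, 3, 50, 50)).shape)  # torch.Size([2, 3, 3, 56, 56])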

How to prepare pretrain data for LF-VILA?

What is the data format of datasets/hdvila100m/video_clip_3fps, datasets/lfvila_data/pretrain/train_db, and datasets/lfvila_data/pretrain/val.jsonl mentioned in src/configs/pretrain_stage1.yaml?
Can you provide specific reference examples or processes (for the long-form videos and the annotations, respectively)?

[CLS] token in CLIP-ViP

I was glad to read your paper; I learned a lot from it.

In the model figure, there are [CLS] tokens as the output of the text encoder. But if I understand the paper correctly, the text encoder is not a PLM like BERT but a Transformer encoder. In the code of CLIP, the simple tokenizer has two special tokens, as shown below.

vocab.extend(['<|startoftext|>', '<|endoftext|>'])

And in the CLIP4Clip paper, the researchers use 'the activations from the highest layer of the transformer at the [EOS] token' as the text embedding.
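For reference, a minimal sketch (modelled on the released OpenAI CLIP code, not on CLIP-ViP, whose code was unreleased when this was asked) of how CLIP pools the text feature at the <|endoftext|> position; that token has the highest id in CLIP's vocabulary, so an argmax over the token ids locates it:

    import torch

    def pool_eot(hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        """Select the hidden state at the <|endoftext|> position of each sequence.

        <|endoftext|> (id 49407) is the largest id in CLIP's vocabulary, so an
        argmax over the (zero-padded) token ids finds its position.
        """
        eot_pos = token_ids.argmax(dim=-1)  # (batch,)
        return hidden_states[torch.arange(hidden_states.size(0)), eot_pos]

    # toy example: batch of 2, context length 5, hidden size 4 (word ids are illustrative)
    hidden = torch.randn(2, 5, 4)
    ids = torch.tensor([[49406, 320, 1000, 49407, 0],
                        [49406, 320, 2000, 49407, 0]])
    print(pool_eot(hidden, ids).shape)  # torch.Size([2, 4])

In the OpenAI implementation this pooled vector is then multiplied by a learned text projection to produce the final text embedding.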

So I wonder what your text embedding is, exactly. Since the code hasn't been released yet, I'm asking here. Thanks for reading.

Asking for a simple script to get text and video features

First of all - Amazing work on this one.

I'm getting a bit lost in the repo; may I request a simple few-line script that does something like the following:

model = CLIPViP("pretrain_clipvip_base_32.pt")
text_features = model.encode_text("This is a very cute cat")
video_features = model.encode_video("vid_file.mp4")
cosine(text_features, video_features)

[Extra] Preferably, I wish to get the video features for a batch of mp4 files with different lengths.
The closest I found is CLIP-ViP/src/modeling/VidCLIP.py, but I couldn't find a usage of this script.
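As a stopgap for the last step of the wished-for script, a minimal sketch of the cosine-similarity computation on already-extracted features (the CLIPViP, encode_text and encode_video calls above are the requester's hypothetical API, not the repo's):

    import torch
    import torch.nn.functional as F

    # stand-ins for features produced by a text encoder and a video encoder
    text_features = torch.randn(1, 512)
    video_features = torch.randn(1, 512)

    # cosine similarity = dot product of the L2-normalised feature vectors
    similarity = F.cosine_similarity(text_features, video_features, dim=-1)
    print(similarity.item())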

Thank you :)

Video captions of the HD-VILA-100M dataset

Thank you for collecting and making public such a large video-text dataset.
Is the text description dataset for each video publicly available?
Where can we download the text captions of the videos?

CLIP-ViP OFA caption generation

Regarding using the OFA model to generate captions in the middle of the video: can you describe in detail which OFA model you use and how you speed up this process?

How long does CLIP-ViP pre-training take?

The paper states that "We train our model with 32 NVIDIA Tesla V100 GPUs in a batch size of 1024", but it doesn't say how long the pre-training takes in this setting. Could you tell me the pre-training cost?

Where is the MSRVTT json file in CLIP-ViP?

Hi, I found that in your msrvvt_config, train9k.jsonl and test1ka.jsonl are needed, but I don't find anything about them in your readme.md. Are they in hdvila_ofa_captions_db? If the jsonl files are in that folder, how should we open the 'data.db' inside it?

Questions about HD-VILA

Hi @bei21 @TiankaiHang , I would like to ask some questions about HD-VILA.

  1. How large are the HD-VILA video files and subtitle files, respectively? I guess they are at least 10 TB. Are you using SSDs to store them?
  2. It is mentioned in the paper that 64 V100 GPUs are required. How much time does it take for pre-training stage 1, pre-training stage 2, and the downstream tasks, respectively?
  3. Do you have a plan to share the code of HD-VILA?

Thank you very much!

About the zero-shot performance

Thanks for your interesting work.

Could you provide the zero-shot performance of your CLIP-ViP(B/16) on MSR-VTT?

I would be very grateful.

Model checkpoints

Excellent work!
I'm trying to deploy some attacks on your models, but I cannot fine-tune your pre-trained ones on my local server due to a VRAM shortage. Could you please provide the model checkpoints for better accessibility? Thank you.

Releasing code and pre-trained model

Hi, thank you for your interesting work. When will you release the code and pre-trained model in the CLIP-ViP repository?

Question regarding video proxy mechanism in CLIP-ViP

Congratulations on your paper's acceptance at ICLR 2023! Your work is insightful and achieves a significant performance improvement on the video-text retrieval task.

I want to ask about the implementation of the video proxy mechanism. In the paper, you mention that it is simply a learnable parameter of length M. However, when I look at the code, you define two separate parameters for the video proxy tokens: class_embedding and added_cls. class_embedding is a 1D vector of size hidden_size, while added_cls is a 2D matrix of dimension add_cls_num x hidden_size. CMIIW, but I cannot find any reference to this in the main paper.

I have checked the sample configuration for each dataset, and it turns out you set add_cls_num to 3. Does this correspond to the 4 video proxy tokens mentioned in the paper, i.e. add_cls_num x hidden_size plus hidden_size, with add_cls_num = 3? Can you explain the intuition for why this needs to be split into class_embedding and added_cls?
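For concreteness, a minimal sketch of how the two parameters described above could combine into M = 4 proxy tokens; the shapes and names follow this question's description (add_cls_num = 3) and are not verified against the repo:

    import torch
    import torch.nn as nn

    hidden_size, add_cls_num = 512, 3

    # shapes as described in the question above
    class_embedding = nn.Parameter(torch.randn(hidden_size))         # (hidden_size,)
    added_cls = nn.Parameter(torch.randn(add_cls_num, hidden_size))  # (add_cls_num, hidden_size)

    # 1 (class_embedding) + add_cls_num (added_cls) = 4 learnable video proxy tokens
    video_proxy = torch.cat([class_embedding.unsqueeze(0), added_cls], dim=0)
    print(video_proxy.shape)  # torch.Size([4, 512])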

Thank you

Dockerfile and requirements for CLIP-ViP

Thanks for your great work! I want to follow your work, but I met some problems with the Dockerfile: it seems the image nvidia/cuda:10.1-devel-ubuntu18.04 does not exist. Can you provide a requirements.txt file for running CLIP-ViP? Thank you very much!

Error on starting horovod

When we run
horovodrun -np 1 python src/pretrain/run_pretrain.py --config src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json
We get the following error:

    <stderr>:  File "src/pretrain/run_pretrain.py", line 22, in <module>
    <stderr>:    from transformers import CLIPTokenizerFast
    <stderr>:ImportError: cannot import name 'CLIPTokenizerFast' from 'transformers' (/usr/local/lib/python3.7/dist-packages/transformers/__init__.py)
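A quick way to diagnose this, as a sketch: the usual cause is a transformers release that predates CLIP support, so printing the installed version and upgrading transformers typically resolves the import.

    import transformers

    print(transformers.__version__)

    # CLIPTokenizerFast only ships with releases that include CLIP support;
    # on older versions this import raises the ImportError shown above.
    from transformers import CLIPTokenizerFast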

Video compression/decoding methods of each dataset in CLIP-ViP

Hi, I'm trying to reproduce the CLIP-ViP results. In the readme file, it is mentioned that the data preprocessing step follows HD-VILA. However, in the configuration files of the downstream tasks, the compression/decoding methods seem to differ. Are these video preprocessing methods correct:

  • MSR-VTT: compression, 6 FPS
  • LSMDC: no compression/decoding, use raw video as is
  • ActivityNet: decoding lr
  • DiDeMo: compression, X FPS (What is the number of X? Is it 6 too?)

Number of fine-tuning epochs for MSR-VTT

Can you help clarify the actual number of fine-tuning epochs used for the MSR-VTT dataset? The paper says 5 epochs (which is the common setting), however the config file here says 100?
