x-plug / youku-mplug

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks

License: Apache License 2.0

benchmark chinese dataset mllm multimodal multimodal-large-language-models multimodal-pretraining video video-question-answering video-retrieval

youku-mplug's Introduction

Youku-mPLUG 10M Chinese Large-Scale Video Text Dataset

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Download Link HERE

Paper

What is Youku-mPLUG?

We release Youku-mPLUG, the largest public high-quality Chinese video-language dataset (10 million videos). It is collected from Youku, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality.

Examples of video clips and titles in the proposed Youku-mPLUG dataset.

We provide 3 downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The 3 tasks are:

Video Category Prediction: Given a video and its corresponding title, predict the category of the video.
Video-Text Retrieval: Given a set of videos and a set of texts, retrieve texts with a video query and videos with a text query.
Video Captioning: Given a video, describe its content.
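Retrieval in both directions is commonly scored with Recall@K. The sketch below is illustrative only: it assumes a precomputed similarity matrix in which video i and text i form the ground-truth pair. The function names and the toy scores are made up for this example and are not taken from the benchmark code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth match (same index) ranks in the top k."""
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns so that text queries become the rows."""
    return [list(column) for column in zip(*matrix)]


# Toy video-to-text similarity matrix: row i = video i, column j = text j.
sim_v2t = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

v2t_r1 = recall_at_k(sim_v2t, 1)             # video -> text retrieval
t2v_r1 = recall_at_k(transpose(sim_v2t), 1)  # text -> video retrieval
print(f"video->text R@1 = {v2t_r1:.3f}, text->video R@1 = {t2v_r1:.3f}")
```

Transposing the matrix reuses the same function for the text-to-video direction.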

Data statistics

The dataset contains 10 million high-quality videos in total, distributed across 20 super categories and 45 categories.

The distribution of categories in Youku-mPLUG dataset.

Zero-shot Capability

Download

You can download all the videos and annotation files through this link

Setup

Note: Due to a bug in megatron_util, after installing it you must replace conda/envs/youku/lib/python3.10/site-packages/megatron_util/initialize.py with the initialize.py in the current directory.

conda env create -f environment.yml
conda activate youku
pip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

# For caption evaluation
apt-get install default-jre
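The file replacement described in the note can also be scripted so that it does not depend on the exact conda path. The following is a minimal sketch, not part of the repository: replace_module_file is a hypothetical helper that looks up the installed package directory and keeps a backup of the original file.

```python
import importlib.util
import shutil
from pathlib import Path


def replace_module_file(package_name, file_name, patched_file):
    """Overwrite <package dir>/<file_name> with patched_file, keeping a .bak copy."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package_name} is not installed in this environment")
    target = Path(list(spec.submodule_search_locations)[0]) / file_name
    backup = target.with_name(target.name + ".bak")
    # Keep the first original only, so re-running never overwrites the backup.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(patched_file, target)
    return target


# Usage, run from the repository root inside the activated "youku" environment:
# replace_module_file("megatron_util", "initialize.py", "initialize.py")
```

Looking up the package through importlib means the script patches whichever environment is currently active, and the .bak file allows the change to be reverted.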

mPLUG-Video (1.3B / 2.7B)

Pre-train

First, download the GPT-3 1.3B and 2.7B checkpoints from ModelScope. The pre-trained models can be downloaded Here (1.3B) and Here (2.7B).

Run mPLUG-Video pre-training as follows:

exp_name='pretrain/gpt3_1.3B/pretrain_gpt3_freezeGPT_youku_v0'
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env run_pretrain_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --bf16 \
  2>&1 | tee ./output/${exp_name}/train.log

Benchmarking

To perform downstream fine-tuning, we take Video Category Prediction as an example:

exp_name='cls/cls_gpt3_1.3B_youku_v0_sharp_2'
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env downstream/run_cls_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --resume path/to/1_3B_mp_rank_00_model_states.pt \
  --bf16 \
  2>&1 | tee ./output/${exp_name}/train.log

Experimental results

Below we show the results on the validation sets for reference.

mPLUG-Video (BloomZ-7B)

We build the mPLUG-Video model on top of mPLUG-Owl. To use the model, first clone the mPLUG-Owl repo:

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl

The instruction-tuned checkpoint is available on HuggingFace. To fine-tune the model, refer to the mPLUG-Owl repo. To perform video inference, you can use the following code:

import torch
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from transformers import AutoTokenizer
from mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-youku-bloomz-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
    device_map={'': 0},
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

# We use a human/AI template to organize the context as a multi-turn conversation.
# <|video|> denotes a video placeholder.
# The Chinese question in the prompt means "What is the woman in the video doing?"
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <|video|>
Human: 视频中的女人在干什么？
AI: ''']

video_list = ['yoga.mp4']

# Generation kwargs (the same as in transformers) can be passed to model.generate().
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}
inputs = processor(text=prompts, videos=video_list, num_frames=4, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)
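The num_frames=4 argument above limits how many frames of the clip reach the model. How the processor chooses those frames is not stated here. As an illustration only, one common strategy is to split the clip into equal segments and take the middle frame of each. The function uniform_frame_indices below is a hypothetical helper for this sketch, not part of mplug_owl_video.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick num_frames indices spread evenly over a clip, one from the middle of each segment."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Clamp to the last valid index in case rounding overshoots.
    return [min(total_frames - 1, int(segment * i + segment / 2)) for i in range(num_frames)]


print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

With only a few frames per clip, short actions between the sampled positions can be missed, so questions about brief events may need a larger num_frames.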

Citing Youku-mPLUG

If you find this dataset useful for your research, please consider citing our paper.

@misc{xu2023youku_mplug,
    title={Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks},
    author={Haiyang Xu and Qinghao Ye and Xuan Wu and Ming Yan and Yuan Miao and Jiabo Ye and Guohai Xu and Anwen Hu and Yaya Shi and Chenliang Li and Qi Qian and Que Maofei and Ji Zhang and Xiao Zeng and Fei Huang},
    year={2023},
    eprint={2306.04362},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

youku-mplug's Issues

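The downloaded Youku-AliceMind file names differ from the names in the caption file. How do I match them?

oss2.exceptions.NoSuchKey when downloading Youku-mPLUG from ModelScope

When I use the modelscope library to download the Youku-mPLUG dataset, an oss2.exceptions.NoSuchKey exception is raised. Here is my error output: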

oss2.exceptions.NoSuchKey: {'status': 404, 'x-oss-request-id': '652D2F6BEEC74232305394E8', 'details': {'Code': 'NoSuchKey', 'Message': 'The specified key does not exist.', 'RequestId': '652D2F6BEEC74232305394E8', 'HostId': 'dataset-hub.oss-cn-hangzhou.aliyuncs.com', 'Key': 'public-zip/modelscope/Youku-AliceMind/master/videos/pretrain/14111B1211bJ4551C43BJJbb23Y---3A4Y1b17aE3C5a5JJ-aBY81aA-JE4838YbAF.mp4', 'EC': '0026-00000001', 'RecommendDoc': 'https://help.aliyun.com/zh/oss/support/0026-00000001'}}
Downloading data files:   0%|          | 0/1 [16:05<?, ?it/s]

This is the Python script I used:

from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode
api = HubApi()
sdk_token = ""  # Required; obtain it from your personal center on the ModelScope website
api.login(sdk_token)  # online
data = MsDataset.load(
    'Youku-AliceMind',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,    # if you need to clean the cache , please use it
    subset_name="pretrain",
    cache_dir="./data")
    
print(next(iter(data)))

# Slicing
len(data)
data_new = data[10:15]
for item in data_new:
    print(item)

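(1) How can I download the dataset successfully?
(2) Is resuming an interrupted download supported? I have already downloaded part of the data.
(3) Can files that cannot be found be skipped automatically so that the rest of the data keeps downloading?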

How to download from outside China

Hi, it looks like downloading the data requires registering on ModelScope, which requires a +86 phone number.
How can we download the data from outside China?

Thanks

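Has anyone successfully trained or fine-tuned the model?

While following the repository to run the code and download the models, I ran into some problems. Has anyone managed to reproduce the results?

I hope some kind person can help!

Has the mPLUG-video model not been open-sourced?

Hello, the paper says "Our dataset, code, model, and evaluation set are available at https://github.com/X-PLUG/Youku-mPLUG", but I have searched for a long time and still cannot find it. Has it not been open-sourced yet?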

how to download videos of pretrain dataset?

When I use the download code below, it returns the git-lfs.

data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    download_mode=DownloadMode.FORCE_REDOWNLOAD,  # use this if you need to clean the cache
    subset_name='pretrain',
    split='train',  # Options: train, test, validation
    use_streaming=True)

print(next(iter(data)))

How to download all 36TB video from git-lfs

What about the copyright of the videos?

Download problems

Why are fewer than 20 video-caption items downloaded locally from the dataset?

download pretrain data 404

Download problem

An error occurs while downloading the pretrain dataset with modelscope, as shown below:

2023-07-26 14:05:26,858 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-07-26 14:05:27,483 - modelscope - INFO - Loading done! Current index file version is 1.7.1, with md5 1a3c80f9923ff896da3e2a4786eadd0f and a total number of 861 components indexed
2023-07-26 14:05:47,880 - modelscope - INFO - Reusing cached meta-data file: /root/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/meta/8675a4d533a4241f99abcf63d2356b01
Overall progress:   0%|          | 0/10009370 [00:00<?, ?it/s]
2023-07-26 14:06:26,748 - modelscope - INFO - Reusing cached meta-data file: /root/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/meta/8675a4d533a4241f99abcf63d2356b01
2023-07-26 14:07:06,106 - modelscope - ERROR - 'DataDownloadConfig' object has no attribute 'storage_options'
Overall progress:   0%|          | 0/10009370 [00:39<?, ?it/s]
{'video_id:FILE': ['videos/pretrain/14111Y1211b-1134b18bAE55bFE7Jbb7135YE3aY54EaB14ba7CbAa1AbACB24527A.flv'], 'title': ['妈妈给宝宝听胎心，看看宝宝是怎么做的，太调皮了']}

How can I resolve this?

Failed to download the dataset.

Thank you for the great work, but I've encountered an issue while downloading. Could you please help me take a look?

I use Python 3.8 with modelscope 1.6.0 installed via pip. My machine is an Aliyun ECS instance using Aliyun's network services.

When I download the pretrain subset, I got the error below:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/modelscope/Youku-AliceMind/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ReadTimeoutError("HTTPConnectionPool(host='www.modelscope.cn', port=80): Read timed out. (read timeout=1800)"

When I download the classification or retrieval subset, I got the error below:

oss2.exceptions.NoSuchKey: {'status': 404, 'x-oss-request-id': '648164E2E81BB23635579DA3', 'details': {'Code': 'NoSuchKey', 'Message': 'The specified key does not exist.', 'RequestId': '648164E2E81BB23635579DA3', 'HostId': 'dataset-hub.oss-cn-hangzhou.aliyuncs.com', 'Key': 'public--zip/modelscope/Youku-AliceMind/master/videos/classification/14111B12117422YB5YYABB2FB3JCA1-b32488Ea75-3a5CY1a3CY8a54J8BA1AJ-7C.mp4', 'EC': '0026-00000001'}}

When I download the caption subset, I got the error below:

ValueError: subset_name caption not found. Available: dict_keys(['classification', 'retrieval', 'pretrain'])

The evaluation results as well as leaderboards look very different

The original Arxiv paper:

The GitHub Readme:

visual encoder

The paper says that TimeSformer is used as the visual encoder, but the code uses clip_vit_b16.pth. Can CLIP pre-trained weights be loaded into TimeSformer?

It is too slow to download the mp4 file, will you release the urls of the mp4s?

Download already finished, but on reuse...

The download has already finished, but when I use the dataset again it still shows
Downloading data files:
???

About the pre-trained CLIP model

The code loads the visual encoder from a CLIP model (clip-vit-b16.pth), but I did not find any mention of where this file comes from. I tried to load clip-vitb16 from OpenAI on HuggingFace, but the keys do not match when loading. Is OpenAI's CLIP the required one, or do you have your own trained CLIP?

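Model code is not open-sourced

I cannot find the models.distributed_gpt3 file. When will the model code be open-sourced?

Download problem

When I download the dataset with the recommended command, the download ends after only 5 videos. What is the cause? The code is as follows: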
from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset
from modelscope.utils.constant import DownloadMode
api = HubApi()
sdk_token = "" # Required, obtain from ModelScope WEB personal center
api.login(sdk_token) # online
data = MsDataset.load(
    'Youku-AliceMind',
    namespace='modelscope',
    # download_mode=DownloadMode.FORCE_REDOWNLOAD,  # use this if you need to clean the cache
    subset_name='retrieval',
    split='train',  # Options: train, test, validation
    use_streaming=True)

Downloaded video name is not consistent with the csv file.

According to the cached annotation csv file root/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/meta/8675a4d533a4241f99abcf63d2356b01, the name of the first video is 14111Y1211b-1134b18bAE55bFE7Jbb7135YE3aY54EaB14ba7CbAa1AbACB24527A.flv. However, the saved video name is ed40efc4c1601131b91170468233687ef93ad432e297c656882ed24e600c0880, which is not consistent. Is there another annotation file that correctly corresponds to the names of the downloaded videos?

2023-07-28 19:57:49,271 - modelscope - INFO - PyTorch version 1.13.1 Found.
2023-07-28 19:57:49,277 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-07-28 19:57:50,309 - modelscope - INFO - Loading done! Current index file version is 1.7.1, with md5 cd68fcaa5f94e738aeb76ca735422dcd and a total number of 861 components indexed
2023-07-28 19:57:53,083 - modelscope - INFO - Loading meta-data file ...
10009371it [02:31, 66111.32it/s]
Using custom data configuration modelscope
Overall progress:   0%|          | 0/10009370 [00:00<?, ?it/s]
2023-07-28 20:00:49,351 - modelscope - INFO - Reusing cached meta-data file: /root/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/meta/8675a4d533a4241f99abcf63d2356b01
100% {'video_id:FILE': ['/root/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/data_files/ed40efc4c1601131b91170468233687ef93ad432e297c656882ed24e600c0880'], 'title': ['妈妈给宝宝听胎心，看看宝宝是怎么做的，太调皮了']}
100% {'video_id:FILE': ['/root/.cache/modelscope/hub/datasets/modelscope/Youku-AliceMind/master/data_files/df671131d72f28677c61060a5890937b40dda2292a06714bd73143e0b5f49f71'], 'title': ['治愈系旋律来袭，周华健经典再现《朋友》，珍惜身边的人吧！']}

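How to retrain mPLUG

I would like to ask: what do I need to do if I want to train mPLUG from scratch?

Is the dataset downloaded by following the documentation only part of the 36 TB mentioned in the paper?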

Is there a progress bar when downloading?

Thank you for your great contribution to the video/text pre-training field and community.

I'm trying to download the data using my own sdk_token following the official code, but nothing shows up on the command line when running the "MsDataset.load", so I'm not sure if it's downloading properly and how long it will take for the download to complete. Is there a download progress bar?

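Can URLs be provided for downloading the 36 TB of pretrain data?

Single-threaded downloading is currently very slow, and a failed download can only restart from the beginning?
Since the dataset is open source, could you also publish the data URLs, or provide a faster and more stable way to download?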

The pretrain.csv file is empty, failed to download.

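Training cost

Hello authors, mPLUG is very capable. How many GPUs were used for pre-training and for fine-tuning, respectively, and how long did each take?

Question about video-text relevance in the dataset

After looking at the first few videos in the retrieval test set, my impression is that the text and the videos have essentially no correlation, not even a weak one.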

oss2.exceptions.NoSuchKey

oss2.exceptions.NoSuchKey: {'status': 404, 'x-oss-request-id': '664AB50BAB8D903237EA79DB', 'details': {'Code': 'NoSuchKey', 'Message': 'The specified key does not exist.', 'RequestId': '664AB50BAB8D903237EA79DB', 'HostId': 'dataset-hub.oss-cn-hangzhou.aliyuncs.com', 'Key': 'public-zip/modelscope/Youku-AliceMind/master/videos/pretrain/14111B121174Y8EbJ43JF-283-42187EFC3A83aBY1Ca55Y4a-Y27aY3Y8C1CCb4B7.mp4', 'EC': '0026-00000001', 'RecommendDoc': 'https://api.aliyun.com/troubleshoot?q=0026-00000001'}}

How to do Video-Text Retrieval with Youku-mPLUG?

How can I perform video-text retrieval with Youku-mPLUG? I could not find a demo.

How to load the video file in dataset?

Nice Job!
I have a question, This is the result of 'next(iter(data))'. What's the format of the video file 'b9bb81fd77d4930a889d17adadf83d95209f80d7eb6387933b5aacfad2c52fc7', or can you give an example of how to load the video file?

{'clip_name:FILE': '../modelscope/hub/datasets/modelscope/Youku-AliceMind/master/data_files/b9bb81fd77d4930a889d17adadf83d95209f80d7eb6387933b5aacfad2c52fc7', 'caption': '身穿黑色上衣戴着头盔的女子在路上骑着摩托车四周还停放了一些车'}
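The cached file in this record has no extension, while the annotation names elsewhere on this page end in .mp4 or .flv. A minimal sketch for checking which container such a file holds is to read its first bytes: FLV files begin with the ASCII signature "FLV", and MP4 files typically carry an "ftyp" box tag at byte offset 4. The helper name sniff_video_container is made up for this example, and the check deliberately covers only these two formats.

```python
def sniff_video_container(path):
    """Guess the container format of an extension-less video file from its header bytes."""
    with open(path, "rb") as f:
        header = f.read(12)
    if header[:3] == b"FLV":
        return "flv"   # FLV signature at the very start of the file
    if header[4:8] == b"ftyp":
        return "mp4"   # ISO base media file: 4-byte box size, then the 'ftyp' tag
    return "unknown"


# Usage with a path returned by next(iter(data)) (key name taken from the record above):
# record = next(iter(data))
# print(sniff_video_container(record['clip_name:FILE']))
```

Decoders built on FFmpeg usually probe the file content rather than the extension, so the cached path can often be passed to them directly. If a tool insists on an extension, renaming or symlinking the file to the detected type is one workaround.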

Downloading Issue

We are facing some downloading issues right now. We will fix it ASAP. Please watch our repo for notification.