pku-yuangroup / open-sora-plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.

License: MIT License

Python 98.04% Shell 1.43% C++ 0.20% Cuda 0.32%

open-sora-plan's Introduction

Open-Sora Plan

[Badges: Slack, WeChat, Twitter, Hugging Face Spaces, Replicate demo and cloud API, Open in Colab, license, contributors, commits/PRs, issues, closed issues, stars, forks, watchers, repo size]

We are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. We are training for higher-resolution (>1024) as well as longer-duration (>10s) videos; here is a preview of the next release. We show compressed .gifs on GitHub, which lose some quality.

Thanks to HUAWEI Ascend NPU Team for supporting us.

Inference is now supported on domestic AI chips (Huawei Ascend; we look forward to more domestic compute chips), and the next step is to support training on domestic hardware. For details, see the Ascend hw branch.

257×512×512 (10s): Time-lapse of a coastal landscape transitioning from sunrise to nightfall...
65×1024×1024 (2.7s): A quiet beach at dawn, the waves gently lapping at the shore and the sky painted in pastel hues...
65×1024×1024 (2.7s): Sunset over the sea.
65×512×512 (2.7s): A serene underwater scene featuring a sea turtle swimming...
65×512×512 (2.7s): Yellow and black tropical fish dart through the sea.
65×512×512 (2.7s): a dynamic interaction between the ocean and a large rock...
65×512×512 (2.7s): The dynamic movement of tall, wispy grasses swaying in the wind...
65×512×512 (2.7s): Slow pan upward of blazing oak fire in an indoor fireplace.
65×512×512 (2.7s): A serene waterfall cascading down moss-covered rocks...

💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, though we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome!

This project hopes to reproduce Sora through the power of the open-source community. It was jointly launched by the Peking University and Tuzhan AI (兔展) AIGC Joint Lab. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!

Project stages:

  • Primary
  1. Set up the codebase and train an unconditional model on a landscape dataset.
  2. Train models that boost resolution and duration.
  • Extensions
  1. Conduct text2video experiments on landscape dataset.
  2. Train the 1080p model on video2text dataset.
  3. Control model with more conditions.

📰 News

[2024.04.09] 🚀 Excited to share our latest exploration of metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos. Here is the training dataset (still being updated): Open-Sora-Dataset.

[2024.04.07] 🔥🔥🔥 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.

[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed videos in the demonstration below. The text-to-video model is on the way.

[2024.03.10] 🚀🚀🚀 This repo supports training with a latent size of 225×90×90 (t×h×w), which means we are able to train on 1 minute of 1080P video at 30 FPS (with 2× frame interpolation and 2× super resolution) under class conditioning.

[2024.03.08] We support the training code for text conditioning with 16 frames at 512×512. The code is mainly borrowed from Latte.

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.

[2024.03.05] See our latest todo, pull requests are welcome.

[2024.03.04] We re-organize and modularize our code to make it easier to contribute to the project; to get started, please see the Repo structure.

[2024.03.03] We open some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

✊ Todo

Set up the codebase and train an unconditional model on a landscape dataset

  • Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
  • Setup environment. 🤝 Thanks to @nameless1117
  • Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
  • Enable type hints for functions. 🤝 Thanks to @RuslanPeresy, 🙏 [Need your contribution]
  • Resume from checkpoint.
  • Add Video-VQVAE model, which is borrowed from VideoGPT.
  • Support training DiT with variable aspect ratios, resolutions, and durations.
  • Support Dynamic mask input inspired by FiT.
  • Add class-conditioning on embeddings.
  • Incorporate Latte as the main codebase.
  • Add VAE model, which is borrowed from Stable Diffusion.
  • Joint dynamic mask input with VAE.
  • Add VQVAE from VQGAN. 🙏 [Need your contribution]
  • Make the codebase ready for the cluster training. Add SLURM scripts. 🙏 [Need your contribution]
  • Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
  • Add sampling script.
  • Add DDP sampling script. ⌛ [WIP]
  • Use accelerate on multi-node. 🤝 Thanks to @sysuyy
  • Incorporate SiT. 🤝 Thanks to @khan-yin
  • Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f

Train models that boost resolution and duration

  • Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
  • Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
  • Compress KV according to PixArt-sigma.
  • Support deepspeed for videogpt training. 🤝 Thanks to @sennnnn
  • Train a low dimension Video-AE, whether it is VAE or VQVAE.
  • Extract offline feature.
  • Train with offline feature.
  • Add frame interpolation model. 🤝 Thanks to @yunyangge
  • Add super resolution model. 🤝 Thanks to @Linzy19
  • Add accelerate to automatically manage training.
  • Joint training with images.
  • Implement MaskDiT technique for fast training. 🙏 [Need your contribution]
  • Incorporate NaViT. 🙏 [Need your contribution]
  • Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on landscape dataset.

  • Load pretrained weights from Latte.
  • Implement PeRFlow for improving the sampling process. 🙏 [Need your contribution]
  • Finish data loading, pre-processing utils.
  • Add T5 support.
  • Add CLIP support. 🤝 Thanks to @Ytimed2020
  • Add text2image training script.
  • Add prompt captioner.
    • Collect training data.
      • Need video-text pairs with caption. 🙏 [Need your contribution]
      • Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
      • Extract video description by large video-language models. 🙏 [Need your contribution]
      • Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
    • Train a captioner to refine captions. 🚀 [Require more computation]

Train the 1080p model on video2text dataset

  • Looking for a suitable dataset, welcome to discuss and recommend. 🙏 [Need your contribution]
  • Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
  • Finish data loading, and pre-processing utils.
  • Support memory friendly training.
    • Add flash-attention2 from pytorch.
    • Add xformers. 🤝 Thanks to @jialin-zhao
    • Support mixed precision training.
    • Add gradient checkpoint.
    • Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
    • Train using the deepspeed engine. 🤝 Thanks to @sennnnn
  • Train with a text condition. Here we could conduct different experiments: 🚀 [Require more computation]
    • Train with T5 conditioning.
    • Train with CLIP conditioning.
    • Train with CLIP + T5 conditioning (probably costly during training and experiments; see the sketch below).
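As a rough illustration of the CLIP + T5 option, the two text encoders can be run separately and their sequence embeddings concatenated before being fed to the diffusion model as conditioning. This is only a hedged sketch, not the repo's implementation; the checkpoint names, the 120-token T5 length, and the projection layer are placeholder assumptions.

# Hedged sketch of CLIP + T5 text conditioning; not the repo's code.
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
proj = torch.nn.Linear(4096, 768)  # in practice a learned module of the diffusion model

def encode_prompt(prompts):
    with torch.no_grad():
        clip_ids = clip_tok(prompts, padding="max_length", truncation=True, return_tensors="pt").input_ids
        t5_ids = t5_tok(prompts, padding="max_length", max_length=120, truncation=True, return_tensors="pt").input_ids
        clip_emb = clip_enc(clip_ids).last_hidden_state   # (B, 77, 768)
        t5_emb = t5_enc(t5_ids).last_hidden_state         # (B, 120, 4096)
    # Project T5 to CLIP's width, then concatenate along the sequence axis.
    return torch.cat([clip_emb, proj(t5_emb)], dim=1)     # (B, 77 + 120, 768)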

Control model with more condition

  • Incorporating ControlNet. ⌛ [WIP] 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  2. Install required packages
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
  3. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  4. Install optional requirements such as static type checking:
pip install -e '.[dev]'

🗝️ Usage

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo with the following command. We also provide an online demo in Hugging Face Spaces.

🤝 Enjoy the Replicate demo and cloud API and the Colab notebook created by @camenduru, who generously supports our research!

python -m opensora.serve.gradio_web_server

CLI Inference

sh scripts/text_condition/sample_video.sh

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

Causal Video VAE

Reconstructing

Example:

python examples/rec_imvi_vae.py --video_path test_video.mp4 --rec_path output_video.mp4 --fps 24 --resolution 512 --crop_size 512 --num_frames 128 --sample_rate 1 --ae CausalVAEModel_4x8x8 --model_path pretrained_488_release --enable_tiling --enable_time_chunk

Parameter explanation:

  • --enable_tiling: This flag enables tiled convolution.

  • --enable_time_chunk: This flag enables time chunking, which splits the video into blocks along the temporal dimension and reconstructs the long video chunk by chunk. This operation is performed only in pixel (video) space, not in latent space, and cannot be used for training.
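The snippet below is a minimal sketch of the time-chunking idea, not the repo's implementation: the video is sliced into fixed-size windows along the temporal axis, each window is encoded and decoded independently, and the reconstructions are concatenated back together in pixel space. The vae object and the chunk size are assumptions.

# Minimal sketch of temporal chunking in pixel space (illustrative only).
# `vae` is assumed to expose encode()/decode() for (B, C, T, H, W) tensors.
import torch

def reconstruct_in_time_chunks(vae, video, chunk_size=17):
    chunks = []
    for start in range(0, video.shape[2], chunk_size):
        clip = video[:, :, start:start + chunk_size]   # slice along the time axis
        with torch.no_grad():
            rec = vae.decode(vae.encode(clip))         # per-chunk encode/decode
        chunks.append(rec)
    return torch.cat(chunks, dim=2)                    # stitch the chunks back along time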

Training and Eval

Please refer to the document CausalVideoVAE.

VideoGPT VQVAE

Please refer to the document VQVAE.

Video Diffusion Transformer

Training

sh scripts/text_condition/train_videoae_17x256x256.sh
sh scripts/text_condition/train_videoae_65x256x256.sh
sh scripts/text_condition/train_videoae_65x512x512.sh

🚀 Improved Training Performance

Compared to the original implementation, we implement a selection of training-speed and memory-saving features, including gradient checkpointing, mixed-precision training, pre-extracted features, xformers, and DeepSpeed. Some data points using a batch size of 1 on an A100:

64×32×32 (origin size: 256×256×256)

Each row corresponds to a combination of gradient checkpointing, mixed precision, xformers, feature pre-extraction, compress KV, and DeepSpeed config; measured training speed and memory:

deepspeed config    training speed    memory
-                   0.64 steps/sec    43G
Zero2               0.66 steps/sec    14G
Zero2               0.66 steps/sec    15G
Zero2 offload       0.33 steps/sec    11G
Zero2 offload       0.31 steps/sec    12G

128×64×64 (origin size: 512×512×512)

deepspeed config    training speed    memory
-                   0.08 steps/sec    77G
Zero2               0.08 steps/sec    41G
Zero2               0.09 steps/sec    36G
Zero2 offload       0.07 steps/sec    39G
Zero2 offload       0.07 steps/sec    33G
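For orientation, the sketch below shows how gradient checkpointing and mixed precision are commonly combined in plain PyTorch. It is a generic illustration under assumed placeholder names (model.blocks, compute_loss, batch["latents"]), not the repo's training scripts.

# Generic sketch of gradient checkpointing + bf16 autocast; not the repo's code.
import torch
from torch.utils.checkpoint import checkpoint

def training_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed precision
        hidden = batch["latents"]
        for block in model.blocks:                                  # placeholder block list
            # Recompute each block's activations in the backward pass to save memory.
            hidden = checkpoint(block, hidden, use_reentrant=False)
        loss = compute_loss(hidden, batch)
    loss.backward()
    optimizer.step()
    return loss.detach()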

💡 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

  • Latte: The main codebase we built upon; a wonderful video generation model.
  • PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • DiT: Scalable Diffusion Models with Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

✏️ Citing

BibTeX

@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
  author       = {PKU-Yuan Lab and Tuzhan AI etc.},
  title        = {Open-Sora-Plan},
  month        = apr,
  year         = 2024,
  publisher    = {GitHub},
  doi          = {10.5281/zenodo.10948109},
  url          = {https://doi.org/10.5281/zenodo.10948109}
}


🤝 Community contributors

open-sora-plan's People

Contributors

alonzoleeeooo, anapple-hub, chaojie, digger-yu, glgh, helios-fr, howardli1984, jason-fan20, jialin-zhao, jpthu17, junwuzhang19, kabachuha, khan-yin, linb203, linzy19, liuhanchen-github, luo3300612, mon-ius, nameless1117, qqingzheng, rain305f, ruslanperesy, samithuang, sennnnn, simonleegit, tzy010822, yanyang1024, ytimed2020, yuanli2333, yunyangge

open-sora-plan's Issues

Ask for training resource requirement

I want to know how many GPUs and how much GPU RAM are needed to run the demo training, as well as the training times for various training configurations.

Is there a WeChat group to join for fast iteration and optimization of the development of this project?

One more question: have you tested your code on outputs that include moving humans or animals? I understand the performance may not be very good given the limited computing resources, but I'd love to see a wider variety of output examples.

Integration with `huggingface_hub`

Hi there 👋

My name is Sayak, one of the maintainers of the diffusers library at Hugging Face. Thanks for kicking this off!

I was wondering if you'd be interested in integrating with the huggingface_hub library to make model loading and saving easier on the Hugging Face Hub platform. I am happy to draft a PR to showcase the possibility.

Failure in 'pip install -r requirements.txt'

INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 36) and tokenizers==0.10.3 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested tokenizers==0.10.3
transformers 4.32.0 depends on tokenizers!=0.11.3, <0.14 and >=0.11.1

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

It seems the install command fails due to a version conflict between transformers and tokenizers.

License

Hi,
Thank you for releasing this. I noticed you mentioned that this is an "open source" project, but the license is NC, which does not qualify as an open-source license. Is there any chance it could be changed, or does the technology this repo depends on also carry that license?
Thanks!

VQVAE or VAE?

Dear authors, thanks for your interesting work and plans. However, there is one question on my mind: why did you choose VQVAE instead of VAE?
As stated in both the DiT paper and Sora's official website, both use a VAE without quantization. So what drove you to choose VQVAE as your tokenizer?
Looking forward to your reply and hoping to contribute to your project.

Join You!

I would like to know how to join the program. Looking forward to your reply.

What's HW requirement to run this model?

I tried an A100 (40GB SXM4) with 30 vCPUs, 200 GiB RAM, and a 512 GiB SSD, but immediately hit CUDA out of memory.

Which card/config should I use? 8x A100 80GB? 1x H100 80GB? 8x H100 80GB?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation

(opensora) ubuntu@129-146-126-183:~/opensora-arizona/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=edea95d1-1e18-41c1-8b57-966749fb41ad
To: /home/ubuntu/opensora-arizona/Open-Sora-Plan/ucf101_stride4x4x4
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:05<00:00, 45.4MB/s]
sample_frames_len 500, only can sample 300 assets/origin_video_0.mp4 300
Traceback (most recent call last):
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 110, in
main(args)
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 92, in main
encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 38, in encode
h = self.pre_vq_conv(self.encoder(x))
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 241, in forward
h = self.res_stack(h)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 125, in forward
return x + self.block(x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 104, in forward
x = self.attn_w(x, x, x) + self.attn_h(x, x, x) + self.attn_t(x, x, x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 193, in forward
a = self.attn(q, k, v, decode_step, decode_idx)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 244, in forward
out = scaled_dot_product_attention(q, k, v, training=self.training)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 500, in scaled_dot_product_attention
attn = torch.matmul(q, k.transpose(-1, -2))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
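Not a guaranteed fix (a 40 GB card may simply be too small for this setting), but as the error message itself suggests, one low-effort thing to try is relaxing the CUDA caching allocator's split size before any CUDA work starts; a minimal sketch:

# Optional mitigation hinted at by the error message; must run before the first
# CUDA allocation. Whether the workload then fits in 40 GB is not guaranteed.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import torch (or at least touch CUDA) only after setting the variable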

Question about latent size

Hi, this project uses VQVAE to compress video into a small latent space, and the latent embedding dim is 512 or 256. But LDMs usually use a very small embedding dim such as 3 or 4 (SD uses 4). Will such a large latent dim make the diffusion training process too hard to learn, since the model has to predict high-dimensional noise?
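To put rough numbers on the comparison, here is back-of-the-envelope arithmetic for a 64-frame 256×256 clip, assuming a 4×4×4 spatio-temporal stride as in ucf101_stride4x4x4 (the stride is an assumption; the channel dims follow the question):

# Illustrative latent-size arithmetic only; the 4x4x4 stride is assumed.
def latent_numel(frames, height, width, channels, stride=(4, 4, 4)):
    t, h, w = frames // stride[0], height // stride[1], width // stride[2]
    return channels * t * h * w

print(latent_numel(64, 256, 256, 256))  # 16,777,216 latent values (C=256)
print(latent_numel(64, 256, 256, 4))    # 262,144 latent values (C=4, SD-style)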

[Question] Open Source Sora Model: Clarification and Contribution

Hi there,

I came across the Sora model replication project and am interested in contributing ideas for improvement. As I lack access to powerful hardware for testing, my focus would be on suggesting enhancements to address any existing shortcomings in the Sora model.

Could you clarify if the project aims to replicate Sora exactly or if there's a focus on improving its current performance?

Using the default op in xformers together with an attn_mask produces NaN

After waiting two days I finally saw the xformers-related content. In my own tests I also found that using the constructed attn mask in xformers produces NaN. I tried other attention operators and ran into the following:

[email protected]” is not supported because: attn_bias type is <class 'torch.Tensor'>

"tritonflashattF" is not supported because: attn_bias type is <class 'torch.Tensor'> operator wasn't built - see `python -m xformers.info` for more info triton is not available

"smallkF" is not supported because: max(query.shape[-1] != value.shape[-1]) > 32 dtype=torch.float16 (supported: {torch.float32}) bias with non-zero stride not supported unsupported embed per head: 72

So with the current way of constructing the attn mask, is there no way to use xformers + attn_mask for memory optimization?

Tokenizers library version issue

INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 36) and tokenizers==0.10.3 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested tokenizers==0.10.3
transformers 4.32.0 depends on tokenizers!=0.11.3, <0.14 and >=0.11.1

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Posted by mistake, sorry

          > There is someone called ganchengguang, and who knows what they are up to. Why did you give a thumbs-down reaction to everyone who supports this open-source project? Who knows what you really are; did one trip to Japan make you think of yourself as Japanese, or are you actually a Japanese spy?

Sorry, my finger slipped and I clicked the wrong reaction. I have changed it to a thumbs-up.

Okay, no problem. I hope it really was a slip.

cd VideoGPT: No such file or directory

Thanks for the great work!!!

I just found out that the repo directory structure has changed and VideoGPT/ has been moved to src/sora/modules/ae/vqvae/videogpt/, but the README still says cd VideoGPT.

Could you please update that as well? Many thanks!

bash: cd: VideoGPT: No such file or directory

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
cd VideoGPT --> cd src/sora/modules/ae/vqvae/videogpt/
pip install -e .
cd ..

python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1

curating high-quality video data

Hi team members,
I would attribute the success of Sora to the training data, much as OpenAI has done for GPT. Any ideas on curating high-quality video data?

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation

Hi,

I'm using an H100 (80GB), but the specified PyTorch version (torch==1.13.1+cu117) does not support the H100's CUDA capability sm_90.

Has anyone run into this H100 issue? How can it be fixed? Many thanks!

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

(opensora) ubuntu@209-20-158-49:~/opensora-utah/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=9a37ecfb-0c55-4e77-a418-9129ea8e4ba4
To: /home/ubuntu/opensora-utah/Open-Sora-Plan/ucf101_stride4x4x4
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:03<00:00, 83.7MB/s]
/home/ubuntu/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:155: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

Try using linear passthrough to train a model in dit?

One of the key ideas is that it works like "an online passthrough": a loop is applied over a module SuperClass that groups layers, so that their forward method is repeated in a loop. So, in theory, you can observe more intelligence in the same way as MegaDolphin 120b, Professor 155b, Venus 120b and other huge models, but use far less VRAM, because instead of cloning the weights we share them in VRAM.

https://huggingface.co/cognitivecomputations/dolphin-phi-2-kensho

Thumbs up, keep going. A few constructive suggestions.

A few suggestions for the Open-Sora-Plan team:

1) Rather than merely following, try to surpass along the way. Technical points where you could surpass: for example, on top of the base version, add detection and constraints (regularization) for large objects to avoid things appearing out of nowhere; build a keyword-and-attribute table; add object type labeling (rigid body, quasi-rigid body, fluid); add motion keypoint detection for rigid and quasi-rigid bodies to prevent, say, a hand passing through a body; and use regularization, RLHF and the like to increase physical plausibility as much as possible. (In the end, all generative models compete on how strongly they constrain consistency with the real world.) ...

2) For the conditioning part, your diagram shows that you focus on a few image attributes. My personal suggestion on priority: text (the richer the caption the better, including object/scene groups, motion, and adjectives) > images (original photos > processed images) > UI interaction data.

3) The data and compute requirements are huge, so consider setting up a sponsorship page; donations of money or GPU cloud resources would both help.
Data can be prepared in parallel, especially captions for massive amounts of video. There should be a lot of manual work here: on top of algorithmic captions, humans still need to check them. Plan ahead for the keywords users will really care about at generation time: camera position (essential: the shot the director has in mind, the camera trajectory, for example following a character from behind and then circling around to a close-up of the face; ordinary video captions do not contain this, and in my 3D work it is determined online after modeling), visual style, film terminology, lighting and materials, actions and interactions, ...

4) After Sora, Genie and the others came out, I did my own study and analysis the same day. For example:
https://github.com/yuedajiong/super-ai/blob/main/superai-20240216-sora.png
https://github.com/yuedajiong/super-ai
My main focus is on generating "stereoscopic, dynamic/interactive, photorealistic, cinema-grade, complex worlds".
Essentially, I care more about using an explicit 5D (dynamic, interactive) representation with strong constraints to do vision generation.
If any of this is technically useful, I would be happy to join and write code together.
For example, photorealism, in my understanding, includes category-level realism (a person looks like a person) and, even more, individual-level realism (Liu Yifei looks like Liu Yifei). When the text says "Liu Yifei is dancing" and a photo of Liu Yifei is provided, the generated 5D world, or Sora's 2D world, really should keep showing Liu Yifei's face throughout. For film and TV production (a star's IP), and even for video-fake scenarios (no examples needed, celebrities), this consistency constraint is extremely important.

5) Sharing one technical idea for going beyond Sora:
The user's text is entered in one shot, but what is generated is a video. When the description involves "motion", how should the "motion" text condition be decomposed? Nobody knows how Sora does it; based on my own experience and understanding, I would do some processing in text space first and then feed in the condition. I might do it like this:
For example:
input: A red sun is slowly rising, and partway through a plane flies across it.
condition-processor-network: ....
output: {beginning: processed text: A red sun is slowly rising; supplementary image: sunraise.jpg
middle: processed text: A red sun is slowly rising, and partway through a plane flies across it; supplementary image: plane.jpg
end: processed text: A red sun is slowly rising;}
In one sentence: much of human intelligence comes from abstracting the physical world and then doing a great deal of processing in symbol space, based on world simulation.
Whether or not Sora has this, building a good scene-refinement sub-network that is 90% text and 10% images is definitely a point on which one can surpass Sora. The user enters 100 characters and our scene-refinement module produces 10,000; generating them is so fast the time is negligible, and the constraints represented by this "9 parts text, 1 part image" can enforce consistency to a very high degree, even across a 2-hour video. (Think of a director's storyboards.) (In my stereoscopic world construction, my text expands into details such as: far scene: 2D image; middle scene: Gaussian splatting; near scene: 3D interactive model; lighting; and so on. Nothing ever appears out of nowhere. Sora-style weakly constrained 2D generation could also have similarly detailed constraints in symbol space.)

6) Another technical idea for going beyond Sora:
I do not know whether Sora has this; I am just throwing out my thoughts.
Suppose we want to generate a video of "Liu Yifei (with a photo provided) dancing the 'Subject Three' dance". Since only a single photo containing Liu Yifei's face is given as input, how do we guarantee that the Liu Yifei in every frame is still Liu Yifei?
I have no access to Sora to experiment, and this requires image conditioning, so I am not sure how well Sora supports it; my guess is that it probably does not do this well yet. Achieving "individual realism, 3D consistency, motion consistency, consistency under lighting changes, consistency across expressions" is somewhat easier to control when you have a 3D model, but none of it is easy.
How could it be done?
In a Sora-like algorithm, in the condition-input part, besides args such as resolution, one more could be added: face-high-fidelity. This would be extremely useful for film and TV production around a specific person's IP. Run face detection/segmentation on the input image, then use the facial features as a condition for the diffusion part, concatenating the original features at multiple steps. With a "face-high-fidelity" instruction, the strategy for constructing the diffusion condition might differ from the normal strategy.
If Sora does this poorly or cannot do it at all, such a face-high-fidelity feature would be the most important one for livestreaming, deep-face applications, and film and TV production.

7) Add controllable design.
I do not believe that any Gen-AI-based symbolic or visual system can achieve truly rigorous physical consistency, because unlike algorithms with explicit representations it cannot be precisely controlled point by point. Data and algorithm improvements will keep making it better, but it can never be fully controllable in the end.
So what do we do when commercial requirements really demand strong constraints? I think the answer is to make the generation process controllable.
First, engineering reproducibility: although the various random numbers are randomly generated, record them every time so that runs can be replayed. (Fifty lines of code can provide a generic, non-intrusive mechanism: various-random-make&save [or load] -> random-set&use.)
Second, building on reproducibility, add controllability to the diffusion stage, for example the ability to "re-process" or "post-process" away objects that appear out of nowhere or anything implausible, via text conditions or techniques such as ControlNet.

0) Some other resources:
Technical analysis: https://arxiv.org/abs/2402.17177
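A minimal sketch of the "record and replay randomness" idea in point 7 (illustrative only; the file name and the set of seeded libraries are arbitrary choices):

# Record-and-replay randomness: save a seed once, then restore it to reproduce a run.
import json, random
import numpy as np
import torch

def make_and_save_seed(path="run_seed.json"):
    seed = random.SystemRandom().randint(0, 2**31 - 1)
    with open(path, "w") as f:
        json.dump({"seed": seed}, f)
    return seed

def load_and_set_seed(path="run_seed.json"):
    with open(path) as f:
        seed = json.load(f)["seed"]
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    return seed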

Please stop this waste of public resources

Having read through today's various PR pieces, I have to say it: please stop this resource-wasting behavior!

You are doing this purely for your own academic reputation, without caring whether the work has any social or commercial value. The behavior and motive are basically the same as EMO's empty open-source project.

Please stop!

Thank you!

Community Integration: Making Sora cheaper, faster, and more efficient

Thank you for your outstanding contribution to Open-Sora-Plan!

AIGC, e.g., Sora, has recently risen to be one of the hottest topics in AI. We are happy to share a fantastic solution where the costs of Sora can be much cheaper!

The Colossal-AI team provides an optimized open-source Sora replication solution with a 46% cost reduction and sequence expansion to nearly a million. More details can be found on the blog (also available in Chinese).

Open-source code: https://github.com/hpcaitech/Open-Sora

We would appreciate it if we could build the integration with you and other users to benefit the community!

Thank you very much.

Integrating with nodeJS

Please, I would like to know whether it's possible to integrate the model with Node.js at all. Forgive my ignorance if this doesn't sound relevant.

I'm trying to see if I can build an npm package for it so that it can be installed in applications.

Videogpt 1.0 requires torch~=1.7.1

Hi, great work!

I have a version conflict between videogpt and torch.
I downloaded the code from VideoGPT and ran pip install -e ., but videogpt 1.0 requires torch~=1.7, which differs from the previously installed torch==1.13.1+cu117.
Looking forward to your reply! Thanks!

Todo list or discussion channel

Is there a public todo list of promising next steps, and also a place for open group members to join for brainstorming ideas?

My Architecture Overhaul Practical Roadmap for Faster and Less Resources T2V Generation

Hi there!

I have been watching and contributing to the text2video ecosystem for a long time now. Now that Sora is out, there is more attention on the subject, and I have been thinking about multimodal models too. However, while I have ideas in mind, some of them are so fundamental that they would require training the model from scratch.

And here is what needs to be done, in my mind.

  1. Switch the base to Latte/PixArt-a. It has a good, fast architecture and supports ControlNet-Transformer out of the box.

  2. Important! As you probably know, there's a push away from the vanilla Transformer architecture in the NLP community due to its quadratic costs. While the transition to Mamba will be too complicated, I propose to switch to the newer compromise of "Linear Transformers with Learnable Kernel Functions are Better In-Context Models" https://github.com/corl-team/rebased. (released this February)

  3. Training A. Diffusion from pure noise is becoming outdated. Instead we can use flow matching, which accelerates the process by approximating the direction the denoising trajectory should follow rather than simulating the whole diffusion process (see the sketch after this list).

  4. Training B. Learning a representation of the real world is a daunting task for AI. We can gain a lot by, again, not generating videos from total scratch, but instead making the model fill in (inpaint) the 3D-masked parts of existing videos. See Meta's Voicebox and V-JEPA models (available on GitHub) for more details.

  5. Use Temporal VAE from StableVideoDiffusion instead of VideoGPT. (simply better quality, and Latte uses it too)
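On point 3, a minimal flow-matching (rectified-flow) training objective looks roughly like the sketch below; model is an assumed velocity predictor with signature model(x_t, t, cond), not anything in this repo.

# Minimal sketch of a flow-matching / rectified-flow training loss (illustrative only).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    noise = torch.randn_like(x0)                       # prior sample
    t = torch.rand(x0.shape[0], device=x0.device)      # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t_) * x0 + t_ * noise                 # straight-line interpolation
    target_v = noise - x0                              # constant velocity along the path
    return F.mse_loss(model(x_t, t, cond), target_v)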

RuntimeError in DiT `Attention` class `forward` function due to dimension mismatch

Description:

While attempting to run the code from the repository, I encountered a runtime error in the forward function of the Attention class located in src/sora/modules/diffusion/dit/models.py. I suspect that the issue might be caused by a mismatch in the dimensions of the attention_mask.

Steps to Reproduce:

  1. Clone the repository and pull the latest code.
  2. Download dependency model ucf101_stride4x4x4 and dataset UCF-101 from https://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar.
  3. Run the model using the provided train.sh script.
  4. Observe the runtime error occurring in the forward function of the Attention class.
torchrun  --nproc_per_node=8 src/train.py \
  --model DiT-XL/122 \
  --vae ucf101_stride4x4x4 \
  --data-path UCF-101 --num-classes 101 \
  --sample-rate 2 --num-frames 8 --max-image-size 128 --clip-grad-norm 1 \
  --epochs 14000 --global-batch-size 256 --lr 1e-4 \
  --ckpt-every 1000 --log-every 1000 

Error Message:

RuntimeError: The expanded size of the tensor (384) must match the existing size (16) at non-singleton dimension 3. Target sizes: [32, 16, 384, 384]. Tensor sizes: [32, 2, 12, 16]

Expected Behavior:

The forward function should execute normally without any dimension mismatch errors.

Actual Behavior:

The execution of scaled_dot_product_attention results in a runtime error due to a dimension mismatch with the attention_mask.

Additional Information:

  • I am certain that I have not modified any other code or training scripts.
  • I attempted to print the dimensions of q, k, v, and attention_mask, which are as follows:
if self.fused_attn:
    print(q.shape, k.shape, v.shape, attention_mask.shape)
    x = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attention_mask,
        dropout_p=self.attn_drop.p if self.training else 0.,
    )
# Output: torch.Size([32, 16, 384, 72]) torch.Size([32, 16, 384, 72]) torch.Size([32, 16, 384, 72]) torch.Size([32, 2, 12, 16])

This indicates that the dimension of attention_mask does not match.
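Not the repo's fix, but for reference, F.scaled_dot_product_attention expects attn_mask to be broadcastable to (B, num_heads, L_query, L_key); a boolean key-padding mask of shape (B, L) is typically adapted by unsqueezing it, as in this standalone sketch:

# Generic illustration: adapting a (B, L) key-padding mask for scaled_dot_product_attention.
import torch
import torch.nn.functional as F

def sdpa_with_padding_mask(q, k, v, key_padding_mask):
    # q, k, v: (B, num_heads, L, head_dim); key_padding_mask: (B, L), True = keep token.
    attn_mask = key_padding_mask[:, None, None, :]      # (B, 1, 1, L_key), broadcastable
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

B, H, L, D = 2, 16, 384, 72
q = k = v = torch.randn(B, H, L, D)
mask = torch.ones(B, L, dtype=torch.bool)               # all tokens valid
out = sdpa_with_padding_mask(q, k, v, mask)             # (B, H, L, D)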

Environment Information:

  • Operating System: Linux
  • Python Version: 3.10
  • PyTorch Version: 2.1.1
  • CUDA Version: 12.3

How the spatial-temporal embedding is defined

While reviewing the source code I could not figure out one point and would like to ask about it. The spatial-embedding code in the source is:

pos_embed = get_2d_sincos_pos_embed(self.hidden_size, [num_patches_height, num_patches_width])
...
emb = np.concatenate([emb_h, emb_w], axis=1)

In other words, the pos_embed vector is split into two halves: the first half encodes the position along y, and the second half the position along x.

A temporal embedding is then introduced; in the source its variable name is pos_embed_1d:

pos_embed_1d = get_2d_sincos_pos_embed(self.hidden_size, [num_tubes_length, 1])

The final overall embedding is then roughly the sum of three parts:

emb = token_embed + pos_embed + temporal_embed

However, when a video is represented as patches, the spatial and temporal dimensions seem to be on an equal footing, so I think the spatial x, spatial y, and temporal t directions should be handled in the same way. In short, I would define the spatio-temporal embedding like this:

spatial_temporal_embed = np.concatenate([emb_h, emb_w, emb_t], axis=1)

emb_total = token_embed + spatial_temporal_embed

Here emb_t is the embedding component that encodes the temporal frame index. In this definition the x, y, and t directions are treated identically.

Clearly this differs slightly from how the embedding is handled in the source code. I do not know why the source's approach was chosen, or whether my idea has a flaw; any guidance would be appreciated, thanks!
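For concreteness, the proposal above amounts to something like the following standalone sketch (a generic 1-D sin-cos helper, not the repo's get_2d_sincos_pos_embed; assumes hidden_size is divisible by 6):

# Sketch of the symmetric x/y/t embedding proposed above; not the repo's code.
import numpy as np

def sincos_1d(dim, positions):
    # dim must be even; returns an array of shape (len(positions), dim).
    omega = 1.0 / 10000 ** (np.arange(dim // 2) / (dim // 2))
    angles = np.outer(positions, omega)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def spatial_temporal_embed(hidden_size, T, H, W):
    d = hidden_size // 3                                  # one third of the width per axis
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    emb_t = sincos_1d(d, t.reshape(-1))
    emb_h = sincos_1d(d, y.reshape(-1))
    emb_w = sincos_1d(d, x.reshape(-1))
    return np.concatenate([emb_h, emb_w, emb_t], axis=1)  # (T*H*W, hidden_size)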

Sample Error

Two days ago, I trained a DiT-XL with the following command:

torchrun --nproc_per_node=8 src/train.py \
  --model DiT-XL/122 \
  --vae ucf101_stride4x4x4 \
  --data-path ./UCF-101 --num-classes 101 \
  --sample-rate 2 --num-frames 8 --max-image-size 128 --clip-grad-norm 1 \
  --epochs 14000 --global-batch-size 64 --lr 1e-4 \
  --ckpt-every 4000 --log-every 1000 \
  --results-dir ./exp1

Today, I tried to sample a video with:

python opensora/sample/sample.py \
  --model DiT-XL/122 --ae ucf101_stride4x4x4 \
  --ckpt ./exp1/000-DiT-XL-122/checkpoints/0012000.pt --extras 1 \
  --fps 10 --num-frames 16 --image-size 256

However, I got:

    model.load_state_dict(state_dict)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiT:
        Unexpected key(s) in state_dict: "y_embedder.embedding_table.weight".

Thank you for taking the time to look into this issue. I look forward to your response.

Taking inspiration from Stable Diffusion 3

As you probably know, Stability AI today published the architecture details of SD3.

https://stability.ai/news/stable-diffusion-3-research-paper / https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

The key takeaways are:

  1. Rectified Flow (much faster than diffusion)
  2. Joint Transformer for both Text and Image embedding processing
  3. Improved text encoding/prompt-alignment by using mixture of CLIPs and T5
  4. Deduplication efforts
  5. Outperforms SOTA
  6. Scales to Text2Video too

I think these ideas could be of much help to the Open-Sora project.

Could we enable type hints for the project

Given that this is designed as an open-source project that is supposed to receive lots of contributions from different teams, maybe it's a good idea to enable type hints for functions so the code is more readable.

inference results

Thanks for your work. After training the model, can it infer from normal videos? Could you provide some video samples?

Could Claude 3 be used to speed things up?

OpenAI says Sora was developed on top of GPT and DALL-E 3, so in theory, to fully replicate Sora one would also need to crack GPT and DALL-E 3 in order to work out Sora's technical path. If Claude 3 has now made a breakthrough in imitating human thinking patterns, it might be able to offer the right technical approach for reproducing Sora.
