pku-yuangroup / open-sora-plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.

License: MIT License

Python 98.04% Shell 1.43% C++ 0.20% Cuda 0.32%

open-sora-plan's Introduction

Open-Sora Plan

[Badges: Slack, WeChat, Twitter, Hugging Face Spaces, Replicate demo and cloud API, Open in Colab, license, contributors, commits/PRs, issues, closed issues, stars, forks, watchers, repo size]

We are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. We are training for higher-resolution (>1024) as well as longer-duration (>10s) videos; here is a preview of the next release. We show compressed .gifs on GitHub, which lose some quality.

Thanks to HUAWEI Ascend NPU Team for supporting us.

Inference is now supported on domestic AI chips (Huawei Ascend; we look forward to more domestic compute chips), and the next step is to support training on domestic hardware. For details, see the Ascend hw branch.

257×512×512 (10s): Time-lapse of a coastal landscape transitioning from sunrise to nightfall...
65×1024×1024 (2.7s): A quiet beach at dawn, the waves gently lapping at the shore and the sky painted in pastel hues...
65×1024×1024 (2.7s): Sunset over the sea.
65×512×512 (2.7s): A serene underwater scene featuring a sea turtle swimming...
65×512×512 (2.7s): Yellow and black tropical fish dart through the sea.
65×512×512 (2.7s): a dynamic interaction between the ocean and a large rock...
65×512×512 (2.7s): The dynamic movement of tall, wispy grasses swaying in the wind...
65×512×512 (2.7s): Slow pan upward of blazing oak fire in an indoor fireplace.
65×512×512 (2.7s): A serene waterfall cascading down moss-covered rocks...

💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, though we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome!

This project hopes to reproduce Sora through the power of the open-source community. It was jointly launched by the Peking University and Tuzhan AI (兔展) AIGC Joint Lab. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!

Project stages:

  • Primary
  1. Set up the codebase and train an unconditional model on a landscape dataset.
  2. Train models that boost resolution and duration.
  • Extensions
  1. Conduct text2video experiments on landscape dataset.
  2. Train the 1080p model on video2text dataset.
  3. Control model with more conditions.

📰 News

[2024.04.09] 🚀 Excited to share our latest exploration of metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos. Here is the training dataset (still being updated): Open-Sora-Dataset.

[2024.04.07] 🔥🔥🔥 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.

[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed videos in the demonstration below. The text-to-video model is on the way.

[2024.03.10] 🚀🚀🚀 This repo supports training with a latent size of 225×90×90 (t×h×w), which means we are able to train on 1 minute of 1080P video at 30 FPS (with 2× frame interpolation and 2× super resolution) under class conditioning.

[2024.03.08] We support the training code for text conditioning with 16 frames at 512×512. The code is mainly borrowed from Latte.

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.

[2024.03.05] See our latest todo, pull requests are welcome.

[2024.03.04] We re-organize and modularize our code to make it easier to contribute to the project; to get started, please see the Repo structure.

[2024.03.03] We open some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

✊ Todo

Set up the codebase and train an unconditional model on a landscape dataset

  • Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
  • Setup environment. 🤝 Thanks to @nameless1117
  • Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
  • Enable type hints for functions. 🤝 Thanks to @RuslanPeresy, 🙏 [Need your contribution]
  • Resume from checkpoint.
  • Add Video-VQVAE model, which is borrowed from VideoGPT.
  • Support training DiT with variable aspect ratios, resolutions, and durations.
  • Support Dynamic mask input inspired by FiT.
  • Add class-conditioning on embeddings.
  • Incorporate Latte as the main codebase.
  • Add VAE model, which is borrowed from Stable Diffusion.
  • Joint dynamic mask input with VAE.
  • Add VQVAE from VQGAN. 🙏 [Need your contribution]
  • Make the codebase ready for the cluster training. Add SLURM scripts. 🙏 [Need your contribution]
  • Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
  • Add sampling script.
  • Add DDP sampling script. ⌛ [WIP]
  • Use accelerate on multi-node. 🤝 Thanks to @sysuyy
  • Incorporate SiT. 🤝 Thanks to @khan-yin
  • Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f

Train models that boost resolution and duration

  • Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
  • Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
  • Compress KV according to PixArt-sigma.
  • Support deepspeed for videogpt training. 🤝 Thanks to @sennnnn
  • Train a low dimension Video-AE, whether it is VAE or VQVAE.
  • Extract offline feature.
  • Train with offline feature.
  • Add frame interpolation model. 🤝 Thanks to @yunyangge
  • Add super resolution model. 🤝 Thanks to @Linzy19
  • Add accelerate to automatically manage training.
  • Joint training with images.
  • Implement MaskDiT technique for fast training. 🙏 [Need your contribution]
  • Incorporate NaViT. 🙏 [Need your contribution]
  • Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on landscape dataset.

  • Load pretrained weights from Latte.
  • Implement PeRFlow for improving the sampling process. 🙏 [Need your contribution]
  • Finish data loading, pre-processing utils.
  • Add T5 support.
  • Add CLIP support. 🤝 Thanks to @Ytimed2020
  • Add text2image training script.
  • Add prompt captioner.
    • Collect training data.
      • Need video-text pairs with caption. 🙏 [Need your contribution]
      • Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
      • Extract video description by large video-language models. 🙏 [Need your contribution]
      • Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
    • Train a captioner to refine captions. 🚀 [Require more computation]

Train the 1080p model on video2text dataset

  • Looking for a suitable dataset, welcome to discuss and recommend. 🙏 [Need your contribution]
  • Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
  • Finish data loading, and pre-processing utils.
  • Support memory friendly training.
    • Add flash-attention2 from pytorch.
    • Add xformers. 🤝 Thanks to @jialin-zhao
    • Support mixed precision training.
    • Add gradient checkpoint.
    • Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
    • Train using the deepspeed engine. 🤝 Thanks to @sennnnn
  • Train with a text condition. Here we could conduct different experiments: 🚀 [Require more computation]
    • Train with T5 conditioning.
    • Train with CLIP conditioning.
    • Train with CLIP + T5 conditioning (probably costly during training and experiments; see the sketch below).
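As a rough illustration of the CLIP + T5 option, the two text encoders can be run separately and their sequence embeddings concatenated before being fed to the diffusion model as conditioning. This is only a hedged sketch, not the repo's implementation; the checkpoint names, the 120-token T5 length, and the projection layer are placeholder assumptions.

# Hedged sketch of CLIP + T5 text conditioning; not the repo's code.
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
proj = torch.nn.Linear(4096, 768)  # in practice a learned module of the diffusion model

def encode_prompt(prompts):
    with torch.no_grad():
        clip_ids = clip_tok(prompts, padding="max_length", truncation=True, return_tensors="pt").input_ids
        t5_ids = t5_tok(prompts, padding="max_length", max_length=120, truncation=True, return_tensors="pt").input_ids
        clip_emb = clip_enc(clip_ids).last_hidden_state   # (B, 77, 768)
        t5_emb = t5_enc(t5_ids).last_hidden_state         # (B, 120, 4096)
    # Project T5 to CLIP's width, then concatenate along the sequence axis.
    return torch.cat([clip_emb, proj(t5_emb)], dim=1)     # (B, 77 + 120, 768)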

Control model with more condition

  • Incorporating ControlNet. ⌛ [WIP] 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  2. Install required packages
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
  3. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  4. Install optional requirements such as static type checking:
pip install -e '.[dev]'

🗝️ Usage

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo with the following command. We also provide an online demo in Hugging Face Spaces.

🤝 Enjoy the Replicate demo and cloud API and the Colab notebook created by @camenduru, who generously supports our research!

python -m opensora.serve.gradio_web_server

CLI Inference

sh scripts/text_condition/sample_video.sh

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

Causal Video VAE

Reconstructing

Example:

python examples/rec_imvi_vae.py --video_path test_video.mp4 --rec_path output_video.mp4 --fps 24 --resolution 512 --crop_size 512 --num_frames 128 --sample_rate 1 --ae CausalVAEModel_4x8x8 --model_path pretrained_488_release --enable_tiling --enable_time_chunk

Parameter explanation:

  • --enable_tiling: This flag enables tiled convolution.

  • --enable_time_chunk: This flag enables time chunking, which splits the video into blocks along the temporal dimension and reconstructs the long video chunk by chunk. This operation is performed only in pixel (video) space, not in latent space, and cannot be used for training.
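The snippet below is a minimal sketch of the time-chunking idea, not the repo's implementation: the video is sliced into fixed-size windows along the temporal axis, each window is encoded and decoded independently, and the reconstructions are concatenated back together in pixel space. The vae object and the chunk size are assumptions.

# Minimal sketch of temporal chunking in pixel space (illustrative only).
# `vae` is assumed to expose encode()/decode() for (B, C, T, H, W) tensors.
import torch

def reconstruct_in_time_chunks(vae, video, chunk_size=17):
    chunks = []
    for start in range(0, video.shape[2], chunk_size):
        clip = video[:, :, start:start + chunk_size]   # slice along the time axis
        with torch.no_grad():
            rec = vae.decode(vae.encode(clip))         # per-chunk encode/decode
        chunks.append(rec)
    return torch.cat(chunks, dim=2)                    # stitch the chunks back along time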

Training and Eval

Please refer to the document CausalVideoVAE.

VideoGPT VQVAE

Please refer to the document VQVAE.

Video Diffusion Transformer

Training

sh scripts/text_condition/train_videoae_17x256x256.sh
sh scripts/text_condition/train_videoae_65x256x256.sh
sh scripts/text_condition/train_videoae_65x512x512.sh

🚀 Improved Training Performance

Compared to the original implementation, we implement a selection of training-speed and memory-saving features, including gradient checkpointing, mixed-precision training, pre-extracted features, xformers, and DeepSpeed. Some data points using a batch size of 1 on an A100:

64×32×32 (origin size: 256×256×256)

Each row corresponds to a combination of gradient checkpointing, mixed precision, xformers, feature pre-extraction, compress KV, and DeepSpeed config; measured training speed and memory:

deepspeed config    training speed    memory
-                   0.64 steps/sec    43G
Zero2               0.66 steps/sec    14G
Zero2               0.66 steps/sec    15G
Zero2 offload       0.33 steps/sec    11G
Zero2 offload       0.31 steps/sec    12G

128×64×64 (origin size: 512×512×512)

deepspeed config    training speed    memory
-                   0.08 steps/sec    77G
Zero2               0.08 steps/sec    41G
Zero2               0.09 steps/sec    36G
Zero2 offload       0.07 steps/sec    39G
Zero2 offload       0.07 steps/sec    33G
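For orientation, the sketch below shows how gradient checkpointing and mixed precision are commonly combined in plain PyTorch. It is a generic illustration under assumed placeholder names (model.blocks, compute_loss, batch["latents"]), not the repo's training scripts.

# Generic sketch of gradient checkpointing + bf16 autocast; not the repo's code.
import torch
from torch.utils.checkpoint import checkpoint

def training_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed precision
        hidden = batch["latents"]
        for block in model.blocks:                                  # placeholder block list
            # Recompute each block's activations in the backward pass to save memory.
            hidden = checkpoint(block, hidden, use_reentrant=False)
        loss = compute_loss(hidden, batch)
    loss.backward()
    optimizer.step()
    return loss.detach()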

💡 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

  • Latte: The main codebase we built upon; a wonderful video generation model.
  • PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • DiT: Scalable Diffusion Models with Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

✏️ Citing

BibTeX

@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
  author       = {PKU-Yuan Lab and Tuzhan AI etc.},
  title        = {Open-Sora-Plan},
  month        = apr,
  year         = 2024,
  publisher    = {GitHub},
  doi          = {10.5281/zenodo.10948109},
  url          = {https://doi.org/10.5281/zenodo.10948109}
}


🤝 Community contributors

open-sora-plan's People

Contributors

alonzoleeeooo, anapple-hub, chaojie, digger-yu, glgh, helios-fr, howardli1984, jason-fan20, jialin-zhao, jpthu17, junwuzhang19, kabachuha, khan-yin, linb203, linzy19, liuhanchen-github, luo3300612, mon-ius, nameless1117, qqingzheng, rain305f, ruslanperesy, samithuang, sennnnn, simonleegit, tzy010822, yanyang1024, ytimed2020, yuanli2333, yunyangge

open-sora-plan's Issues

Ask for training resource requirement

I want to know how many GPUs and how much GPU RAM are needed to run the demo training, as well as the training times for various training configurations.

Is there a WeChat group to join for fast iteration and optimization of the development of this project?

One more question: have you tested your code on outputs that include moving humans or animals? I understand the performance may not be very good given the limited computing resources, but I'd love to see a wider variety of output examples.

Integration with `huggingface_hub`

Hi there 👋

My name is Sayak, one of the maintainers of the diffusers library at Hugging Face. Thanks for kicking this off!

I was wondering if you'd be interested in integrating with the huggingface_hub library to make model loading and saving easier on the Hugging Face Hub platform. I am happy to draft a PR to showcase the possibility.

Failure in 'pip install -r requirements.txt'

INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 36) and tokenizers==0.10.3 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested tokenizers==0.10.3
transformers 4.32.0 depends on tokenizers!=0.11.3, <0.14 and >=0.11.1

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

It seems the install command fails due to a version conflict between transformers and tokenizers.

License

Hi,
Thank you for releasing this. I noticed you mentioned that this is an "open source" project, but the license is NC, which does not qualify as an open-source license. Is there any chance it could be changed, or does the technology this repo depends on also carry that license?
Thanks!

VQVAE or VAE?

Dear authors, thanks for your interesting work and plans. However, there is one question on my mind: why did you choose VQVAE instead of VAE?
As stated in both the DiT paper and Sora's official website, both use a VAE without quantization. So what drove you to choose VQVAE as your tokenizer?
Looking forward to your reply and hoping to contribute to your project.

Join You!

I would like to know how to join the program. Looking forward to your reply.

What's HW requirement to run this model?

I tried an A100 (40GB SXM4) with 30 vCPUs, 200 GiB RAM, and a 512 GiB SSD, but immediately hit CUDA out of memory.

Which card/config should I use? 8x A100 80GB? 1x H100 80GB? 8x H100 80GB?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation

(opensora) ubuntu@129-146-126-183:~/opensora-arizona/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=edea95d1-1e18-41c1-8b57-966749fb41ad
To: /home/ubuntu/opensora-arizona/Open-Sora-Plan/ucf101_stride4x4x4
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:05<00:00, 45.4MB/s]
sample_frames_len 500, only can sample 300 assets/origin_video_0.mp4 300
Traceback (most recent call last):
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 110, in
main(args)
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 92, in main
encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 38, in encode
h = self.pre_vq_conv(self.encoder(x))
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 241, in forward
h = self.res_stack(h)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 125, in forward
return x + self.block(x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 104, in forward
x = self.attn_w(x, x, x) + self.attn_h(x, x, x) + self.attn_t(x, x, x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 193, in forward
a = self.attn(q, k, v, decode_step, decode_idx)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 244, in forward
out = scaled_dot_product_attention(q, k, v, training=self.training)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 500, in scaled_dot_product_attention
attn = torch.matmul(q, k.transpose(-1, -2))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
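Not a guaranteed fix (a 40 GB card may simply be too small for this setting), but as the error message itself suggests, one low-effort thing to try is relaxing the CUDA caching allocator's split size before any CUDA work starts; a minimal sketch:

# Optional mitigation hinted at by the error message; must run before the first
# CUDA allocation. Whether the workload then fits in 40 GB is not guaranteed.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import torch (or at least touch CUDA) only after setting the variable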

Question about latent size

Hi, this project uses VQVAE to compress video into a small latent space, and the latent embedding dim is 512 or 256. But LDMs usually use a very small embedding dim such as 3 or 4 (SD uses 4). Will such a large latent dim make the diffusion training process too hard to learn, since the model has to predict high-dimensional noise?
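To put rough numbers on the comparison, here is back-of-the-envelope arithmetic for a 64-frame 256×256 clip, assuming a 4×4×4 spatio-temporal stride as in ucf101_stride4x4x4 (the stride is an assumption; the channel dims follow the question):

# Illustrative latent-size arithmetic only; the 4x4x4 stride is assumed.
def latent_numel(frames, height, width, channels, stride=(4, 4, 4)):
    t, h, w = frames // stride[0], height // stride[1], width // stride[2]
    return channels * t * h * w

print(latent_numel(64, 256, 256, 256))  # 16,777,216 latent values (C=256)
print(latent_numel(64, 256, 256, 4))    # 262,144 latent values (C=4, SD-style)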

[Question] Open Source Sora Model: Clarification and Contribution

Hi there,

I came across the Sora model replication project and am interested in contributing ideas for improvement. As I lack access to powerful hardware for testing, my focus would be on suggesting enhancements to address any existing shortcomings in the Sora model.

Could you clarify if the project aims to replicate Sora exactly or if there's a focus on improving its current performance?

Using the default op in xformers together with an attn_mask produces NaN

After waiting two days I finally saw the xformers-related content. In my own tests I also found that using the constructed attn mask in xformers produces NaN. I tried other attention operators and ran into the following:

[email protected]” is not supported because: attn_bias type is <class 'torch.Tensor'>

"tritonflashattF" is not supported because: attn_bias type is <class 'torch.Tensor'> operator wasn't built - see `python -m xformers.info` for more info triton is not available

"smallkF" is not supported because: max(query.shape[-1] != value.shape[-1]) > 32 dtype=torch.float16 (supported: {torch.float32}) bias with non-zero stride not supported unsupported embed per head: 72

So with the current way of constructing the attn mask, is there no way to use xformers + attn_mask for memory optimization?

Tokenizers library version issue

INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 36) and tokenizers==0.10.3 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested tokenizers==0.10.3
transformers 4.32.0 depends on tokenizers!=0.11.3, <0.14 and >=0.11.1

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Posted by mistake, sorry

          > There is someone called ganchengguang, and who knows what they are up to. Why did you give a thumbs-down reaction to everyone who supports this open-source project? Who knows what you really are; did one trip to Japan make you think of yourself as Japanese, or are you actually a Japanese spy?

Sorry, my finger slipped and I clicked the wrong reaction. I have changed it to a thumbs-up.

Okay, no problem. I hope it really was a slip.

cd VideoGPT: No such file or directory

Thanks for the great work!!!

I just found out that the repo directory structure has changed and VideoGPT/ has been moved to src/sora/modules/ae/vqvae/videogpt/, but the README still says cd VideoGPT.

Could you please update that as well? Many thanks!

bash: cd: VideoGPT: No such file or directory

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
cd VideoGPT --> cd src/sora/modules/ae/vqvae/videogpt/
pip install -e .
cd ..

python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1

curating high-quality video data

Hi team members,
I would attribute the success of Sora to the training data, much as OpenAI has done for GPT. Any ideas on curating high-quality video data?

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation

Hi,

I'm using an H100 (80GB), but the specified PyTorch version (torch==1.13.1+cu117) does not support the H100's CUDA capability sm_90.

Has anyone run into this H100 issue? How can it be fixed? Many thanks!

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

(opensora) ubuntu@209-20-158-49:~/opensora-utah/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=9a37ecfb-0c55-4e77-a418-9129ea8e4ba4
To: /home/ubuntu/opensora-utah/Open-Sora-Plan/ucf101_stride4x4x4
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:03<00:00, 83.7MB/s]
/home/ubuntu/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:155: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

Try using linear passthrough to train a model in dit?

One of the key ideas is that it works like "an online passthrough": a loop is applied over a module SuperClass that groups layers, so that their forward method is repeated in a loop. So, in theory, you can observe more intelligence in the same way as MegaDolphin 120b, Professor 155b, Venus 120b and other huge models, but use far less VRAM, because instead of cloning the weights we share them in VRAM.

https://huggingface.co/cognitivecomputations/dolphin-phi-2-kensho

Thumbs up, keep going. A few constructive suggestions.

A few suggestions for the Open-Sora-Plan team:

1) Rather than merely following, try to surpass along the way. Technical points where you could surpass: for example, on top of the base version, add detection and constraints (regularization) for large objects to avoid things appearing out of nowhere; build a keyword-and-attribute table; add object type labeling (rigid body, quasi-rigid body, fluid); add motion keypoint detection for rigid and quasi-rigid bodies to prevent, say, a hand passing through a body; and use regularization, RLHF and the like to increase physical plausibility as much as possible. (In the end, all generative models compete on how strongly they constrain consistency with the real world.) ...

2) For the conditioning part, your diagram shows that you focus on a few image attributes. My personal suggestion on priority: text (the richer the caption the better, including object/scene groups, motion, and adjectives) > images (original photos > processed images) > UI interaction data.

3) The data and compute requirements are huge, so consider setting up a sponsorship page; donations of money or GPU cloud resources would both help.
Data can be prepared in parallel, especially captions for massive amounts of video. There should be a lot of manual work here: on top of algorithmic captions, humans still need to check them. Plan ahead for the keywords users will really care about at generation time: camera position (essential: the shot the director has in mind, the camera trajectory, for example following a character from behind and then circling around to a close-up of the face; ordinary video captions do not contain this, and in my 3D work it is determined online after modeling), visual style, film terminology, lighting and materials, actions and interactions, ...

4) After Sora, Genie and the others came out, I did my own study and analysis the same day. For example:
https://github.com/yuedajiong/super-ai/blob/main/superai-20240216-sora.png
https://github.com/yuedajiong/super-ai
My main focus is on generating "stereoscopic, dynamic/interactive, photorealistic, cinema-grade, complex worlds".
Essentially, I care more about using an explicit 5D (dynamic, interactive) representation with strong constraints to do vision generation.
If any of this is technically useful, I would be happy to join and write code together.
For example, photorealism, in my understanding, includes category-level realism (a person looks like a person) and, even more, individual-level realism (Liu Yifei looks like Liu Yifei). When the text says "Liu Yifei is dancing" and a photo of Liu Yifei is provided, the generated 5D world, or Sora's 2D world, really should keep showing Liu Yifei's face throughout. For film and TV production (a star's IP), and even for video-fake scenarios (no examples needed, celebrities), this consistency constraint is extremely important.

5) Sharing one technical idea for going beyond Sora:
The user's text is entered in one shot, but what is generated is a video. When the description involves "motion", how should the "motion" text condition be decomposed? Nobody knows how Sora does it; based on my own experience and understanding, I would do some processing in text space first and then feed in the condition. I might do it like this:
For example:
input: A red sun is slowly rising, and partway through a plane flies across it.
condition-processor-network: ....
output: {beginning: processed text: A red sun is slowly rising; supplementary image: sunraise.jpg
middle: processed text: A red sun is slowly rising, and partway through a plane flies across it; supplementary image: plane.jpg
end: processed text: A red sun is slowly rising;}
In one sentence: much of human intelligence comes from abstracting the physical world and then doing a great deal of processing in symbol space, based on world simulation.
Whether or not Sora has this, building a good scene-refinement sub-network that is 90% text and 10% images is definitely a point on which one can surpass Sora. The user enters 100 characters and our scene-refinement module produces 10,000; generating them is so fast the time is negligible, and the constraints represented by this "9 parts text, 1 part image" can enforce consistency to a very high degree, even across a 2-hour video. (Think of a director's storyboards.) (In my stereoscopic world construction, my text expands into details such as: far scene: 2D image; middle scene: Gaussian splatting; near scene: 3D interactive model; lighting; and so on. Nothing ever appears out of nowhere. Sora-style weakly constrained 2D generation could also have similarly detailed constraints in symbol space.)

6) Another technical idea for going beyond Sora:
I do not know whether Sora has this; I am just throwing out my thoughts.
Suppose we want to generate a video of "Liu Yifei (with a photo provided) dancing the 'Subject Three' dance". Since only a single photo containing Liu Yifei's face is given as input, how do we guarantee that the Liu Yifei in every frame is still Liu Yifei?
I have no access to Sora to experiment, and this requires image conditioning, so I am not sure how well Sora supports it; my guess is that it probably does not do this well yet. Achieving "individual realism, 3D consistency, motion consistency, consistency under lighting changes, consistency across expressions" is somewhat easier to control when you have a 3D model, but none of it is easy.
How could it be done?
In a Sora-like algorithm, in the condition-input part, besides args such as resolution, one more could be added: face-high-fidelity. This would be extremely useful for film and TV production around a specific person's IP. Run face detection/segmentation on the input image, then use the facial features as a condition for the diffusion part, concatenating the original features at multiple steps. With a "face-high-fidelity" instruction, the strategy for constructing the diffusion condition might differ from the normal strategy.
If Sora does this poorly or cannot do it at all, such a face-high-fidelity feature would be the most important one for livestreaming, deep-face applications, and film and TV production.

7) Add controllable design.
I do not believe that any Gen-AI-based symbolic or visual system can achieve truly rigorous physical consistency, because unlike algorithms with explicit representations it cannot be precisely controlled point by point. Data and algorithm improvements will keep making it better, but it can never be fully controllable in the end.
So what do we do when commercial requirements really demand strong constraints? I think the answer is to make the generation process controllable.
First, engineering reproducibility: although the various random numbers are randomly generated, record them every time so that runs can be replayed. (Fifty lines of code can provide a generic, non-intrusive mechanism: various-random-make&save [or load] -> random-set&use.)
Second, building on reproducibility, add controllability to the diffusion stage, for example the ability to "re-process" or "post-process" away objects that appear out of nowhere or anything implausible, via text conditions or techniques such as ControlNet.

0) Some other resources:
Technical analysis: https://arxiv.org/abs/2402.17177
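A minimal sketch of the "record and replay randomness" idea in point 7 (illustrative only; the file name and the set of seeded libraries are arbitrary choices):

# Record-and-replay randomness: save a seed once, then restore it to reproduce a run.
import json, random
import numpy as np
import torch

def make_and_save_seed(path="run_seed.json"):
    seed = random.SystemRandom().randint(0, 2**31 - 1)
    with open(path, "w") as f:
        json.dump({"seed": seed}, f)
    return seed

def load_and_set_seed(path="run_seed.json"):
    with open(path) as f:
        seed = json.load(f)["seed"]
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    return seed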

Please stop this waste of public resources

Having read through today's various PR pieces, I have to say it: please stop this resource-wasting behavior!

You are doing this purely for your own academic reputation, without caring whether the work has any social or commercial value. The behavior and motive are basically the same as EMO's empty open-source project.

Please stop!

Thank you!

Community Integration: Making Sora cheaper, faster, and more efficient

Thank you for your outstanding contribution to Open-Sora-Plan!

AIGC, e.g., Sora, has recently risen to be one of the hottest topics in AI. We are happy to share a fantastic solution where the costs of Sora can be much cheaper!

The Colossal-AI team provides an optimized open-source Sora replication solution with a 46% cost reduction and sequence expansion to nearly a million. More details can be found on the blog (also available in Chinese).

Open-source code: https://github.com/hpcaitech/Open-Sora

We would appreciate it if we could build the integration with you and other users to benefit the community!

Thank you very much.

Integrating with nodeJS

Please, I would like to know whether it's possible to integrate the model with Node.js at all. Forgive my ignorance if this doesn't sound relevant.

I'm trying to see if I can build an npm package for it so that it can be installed in applications.

Videogpt 1.0 requires torch~=1.7.1

Hi, great work!

I have a version conflict between videogpt and torch.
I downloaded the code from VideoGPT and ran pip install -e ., but videogpt 1.0 requires torch~=1.7, which differs from the previously installed torch==1.13.1+cu117.
Looking forward to your reply! Thanks!

Todo list or discussion channel

Is there a public todo list of promising next steps, and also a place for open group members to join for brainstorming ideas?

My Architecture Overhaul Practical Roadmap for Faster and Less Resources T2V Generation

Hi there!

I have been watching and contributing to the text2video ecosystem for a long time now. Now that Sora is out, there is more attention on the subject, and I have been thinking about multimodal models too. However, while I have ideas in mind, some of them are so fundamental that they would require training the model from scratch.

And here is what needs to be done, in my mind.

  1. Switch the base to Latte/PixArt-a. It has a good, fast architecture and supports ControlNet-Transformer out of the box.

  2. Important! As you probably know, there's a push away from the vanilla Transformer architecture in the NLP community due to its quadratic costs. While the transition to Mamba will be too complicated, I propose to switch to the newer compromise of "Linear Transformers with Learnable Kernel Functions are Better In-Context Models" https://github.com/corl-team/rebased. (released this February)

  3. Training A. Diffusion from pure noise is becoming outdated. Instead we can use flow matching, which accelerates the process by approximating the direction the denoising trajectory should follow rather than simulating the whole diffusion process (see the sketch after this list).

  4. Training B. Learning a representation of the real world is a daunting task for AI. We can gain a lot by, again, not generating videos from total scratch, but instead making the model fill in (inpaint) the 3D-masked parts of existing videos. See Meta's Voicebox and V-JEPA models (available on GitHub) for more details.

  5. Use Temporal VAE from StableVideoDiffusion instead of VideoGPT. (simply better quality, and Latte uses it too)
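On point 3, a minimal flow-matching (rectified-flow) training objective looks roughly like the sketch below; model is an assumed velocity predictor with signature model(x_t, t, cond), not anything in this repo.

# Minimal sketch of a flow-matching / rectified-flow training loss (illustrative only).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    noise = torch.randn_like(x0)                       # prior sample
    t = torch.rand(x0.shape[0], device=x0.device)      # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t_) * x0 + t_ * noise                 # straight-line interpolation
    target_v = noise - x0                              # constant velocity along the path
    return F.mse_loss(model(x_t, t, cond), target_v)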

RuntimeError in DiT `Attention` class `forward` function due to dimension mismatch

Description:

While attempting to run the code from the repository, I encountered a runtime error in the forward function of the Attention class located in src/sora/modules/diffusion/dit/models.py. I suspect that the issue might be caused by a mismatch in the dimensions of the attention_mask.

Steps to Reproduce:

  1. Clone the repository and pull the latest code.
  2. Download dependency model ucf101_stride4x4x4 and dataset UCF-101 from https://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar.
  3. Run the model using the provided train.sh script.
  4. Observe the runtime error occurring in the forward function of the Attention class.
torchrun  --nproc_per_node=8 src/train.py \
  --model DiT-XL/122 \
  --vae ucf101_stride4x4x4 \
  --data-path UCF-101 --num-classes 101 \
  --sample-rate 2 --num-frames 8 --max-image-size 128 --clip-grad-norm 1 \
  --epochs 14000 --global-batch-size 256 --lr 1e-4 \
  --ckpt-every 1000 --log-every 1000 

Error Message:

RuntimeError: The expanded size of the tensor (384) must match the existing size (16) at non-singleton dimension 3. Target sizes: [32, 16, 384, 384]. Tensor sizes: [32, 2, 12, 16]

Expected Behavior:

The forward function should execute normally without any dimension mismatch errors.

Actual Behavior:

The execution of scaled_dot_product_attention results in a runtime error due to a dimension mismatch with the attention_mask.

Additional Information:

  • I am certain that I have not modified any other code or training scripts.
  • I attempted to print the dimensions of q, k, v, and attention_mask, which are as follows:
if self.fused_attn:
    print(q.shape, k.shape, v.shape, attention_mask.shape)
    x = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attention_mask,
        dropout_p=self.attn_drop.p if self.training else 0.,
    )
# Output: torch.Size([32, 16, 384, 72]) torch.Size([32, 16, 384, 72]) torch.Size([32, 16, 384, 72]) torch.Size([32, 2, 12, 16])

This indicates that the dimension of attention_mask does not match.
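Not the repo's fix, but for reference, F.scaled_dot_product_attention expects attn_mask to be broadcastable to (B, num_heads, L_query, L_key); a boolean key-padding mask of shape (B, L) is typically adapted by unsqueezing it, as in this standalone sketch:

# Generic illustration: adapting a (B, L) key-padding mask for scaled_dot_product_attention.
import torch
import torch.nn.functional as F

def sdpa_with_padding_mask(q, k, v, key_padding_mask):
    # q, k, v: (B, num_heads, L, head_dim); key_padding_mask: (B, L), True = keep token.
    attn_mask = key_padding_mask[:, None, None, :]      # (B, 1, 1, L_key), broadcastable
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

B, H, L, D = 2, 16, 384, 72
q = k = v = torch.randn(B, H, L, D)
mask = torch.ones(B, L, dtype=torch.bool)               # all tokens valid
out = sdpa_with_padding_mask(q, k, v, mask)             # (B, H, L, D)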

Environment Information:

  • Operating System: Linux
  • Python Version: 3.10
  • PyTorch Version: 2.1.1
  • CUDA Version: 12.3

How the spatial-temporal embedding is defined

While reviewing the source code I could not figure out one point and would like to ask about it. The spatial-embedding code in the source is:

pos_embed = get_2d_sincos_pos_embed(self.hidden_size, [num_patches_height, num_patches_width])
...
emb = np.concatenate([emb_h, emb_w], axis=1)

In other words, the pos_embed vector is split into two halves: the first half encodes the position along y, and the second half the position along x.

A temporal embedding is then introduced; in the source its variable name is pos_embed_1d:

pos_embed_1d = get_2d_sincos_pos_embed(self.hidden_size, [num_tubes_length, 1])

The final overall embedding is then roughly the sum of three parts:

emb = token_embed + pos_embed + temporal_embed

However, when a video is represented as patches, the spatial and temporal dimensions seem to be on an equal footing, so I think the spatial x, spatial y, and temporal t directions should be handled in the same way. In short, I would define the spatio-temporal embedding like this:

spatial_temporal_embed = np.concatenate([emb_h, emb_w, emb_t], axis=1)

emb_total = token_embed + spatial_temporal_embed

Here emb_t is the embedding component that encodes the temporal frame index. In this definition the x, y, and t directions are treated identically.

Clearly this differs slightly from how the embedding is handled in the source code. I do not know why the source's approach was chosen, or whether my idea has a flaw; any guidance would be appreciated, thanks!
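For concreteness, the proposal above amounts to something like the following standalone sketch (a generic 1-D sin-cos helper, not the repo's get_2d_sincos_pos_embed; assumes hidden_size is divisible by 6):

# Sketch of the symmetric x/y/t embedding proposed above; not the repo's code.
import numpy as np

def sincos_1d(dim, positions):
    # dim must be even; returns an array of shape (len(positions), dim).
    omega = 1.0 / 10000 ** (np.arange(dim // 2) / (dim // 2))
    angles = np.outer(positions, omega)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def spatial_temporal_embed(hidden_size, T, H, W):
    d = hidden_size // 3                                  # one third of the width per axis
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    emb_t = sincos_1d(d, t.reshape(-1))
    emb_h = sincos_1d(d, y.reshape(-1))
    emb_w = sincos_1d(d, x.reshape(-1))
    return np.concatenate([emb_h, emb_w, emb_t], axis=1)  # (T*H*W, hidden_size)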

Sample Error

Two days ago, I trained a DiT-XL with the following command:

torchrun --nproc_per_node=8 src/train.py \
  --model DiT-XL/122 \
  --vae ucf101_stride4x4x4 \
  --data-path ./UCF-101 --num-classes 101 \
  --sample-rate 2 --num-frames 8 --max-image-size 128 --clip-grad-norm 1 \
  --epochs 14000 --global-batch-size 64 --lr 1e-4 \
  --ckpt-every 4000 --log-every 1000 \
  --results-dir ./exp1

Today, I tried to sample a video with:

python opensora/sample/sample.py \
  --model DiT-XL/122 --ae ucf101_stride4x4x4 \
  --ckpt ./exp1/000-DiT-XL-122/checkpoints/0012000.pt --extras 1 \
  --fps 10 --num-frames 16 --image-size 256

However, I got:

    model.load_state_dict(state_dict)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiT:
        Unexpected key(s) in state_dict: "y_embedder.embedding_table.weight".

Thank you for taking the time to look into this issue. I look forward to your response.

Taking inspiration from Stable Diffusion 3

As you probably know, Stability AI today published the architecture details of SD3.

https://stability.ai/news/stable-diffusion-3-research-paper / https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

The key takeaways are:

  1. Rectified Flow (much faster than diffusion)
  2. Joint Transformer for both Text and Image embedding processing
  3. Improved text encoding/prompt-alignment by using mixture of CLIPs and T5
  4. Deduplication efforts
  5. Outperforms SOTA
  6. Scales to Text2Video too

I think these ideas could be of much help to the Open-Sora project.

Could we enable type hints for the project

Given that this is designed as an open-source project that is supposed to receive lots of contributions from different teams, maybe it's a good idea to enable type hints for functions so the code is more readable.

inference results

Thanks for your work. After training the model, can it infer from normal videos? Could you provide some video samples?

Could Claude 3 be used to speed things up?

OpenAI says Sora was developed on top of GPT and DALL-E 3, so in theory, to fully replicate Sora one would also need to crack GPT and DALL-E 3 in order to work out Sora's technical path. If Claude 3 has now made a breakthrough in imitating human thinking patterns, it might be able to offer the right technical approach for reproducing Sora.
