hpcaitech / open-sora
Open-Sora: Democratizing Efficient Video Production for All
Home Page: https://hpcaitech.github.io/Open-Sora/
License: Apache License 2.0
Running the inference sample
python sample.py -m "DiT/XL-2" --text "a person is walking on the street" --ckpt /path/to/checkpoint --height 256 --width 256 --fps 10 --sec 5 --disable-cfg
I got the following error:
usage: sample.py [-h]
[-m {DiT-XL/2,DiT-XL/4,DiT-XL/8,DiT-L/2,DiT-L/4,DiT-L/8,DiT-B/2,DiT-B/4,DiT-B/8,DiT-S/2,DiT-S/4,DiT-S/8}]
[--text TEXT] [--cfg-scale CFG_SCALE] [--num-sampling-steps NUM_SAMPLING_STEPS]
[--seed SEED] --ckpt CKPT [-c {raw,vqvae,vae}] [--text_model TEXT_MODEL]
[--width WIDTH] [--height HEIGHT] [--fps FPS] [--sec SEC] [--disable-cfg]
sample.py: error: argument -m/--model: invalid choice: 'DiT/XL-2' (choose from 'DiT-XL/2', 'DiT-XL/4', 'DiT-XL/8', 'DiT-L/2', 'DiT-L/4', 'DiT-L/8', 'DiT-B/2', 'DiT-B/4', 'DiT-B/8', 'DiT-S/2', 'DiT-S/4', 'DiT-S/8')
Then, I changed it to
!python sample.py -m "DiT-XL/2" --text "a person is walking on the street" --ckpt pretrained_models/DiT-XL-2-256x256.pt --height 256 --width 256 --fps 10 --sec 5 --disable-cfg
But I got a different error:
Traceback (most recent call last):
File "/content/Open-Sora/sample.py", line 136, in
main(args)
File "/content/Open-Sora/sample.py", line 39, in main
model.load_state_dict(torch.load(args.ckpt))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiT:
Missing key(s) in state_dict: "video_embedder.proj.weight", "video_embedder.proj.bias", "blocks.0.attn.to_q.weight", "blocks.0.attn.to_q.bias", "blocks.0.attn.to_k.weight", "blocks.0.attn.to_k.bias", "blocks.0.attn.to_v.weight", "blocks.0.attn.to_v.bias", "blocks.0.attn.to_out.0.weight", "blocks.0.attn.to_out.0.bias", "blocks.1.attn.to_q.weight", "blocks.1.attn.to_q.bias", "blocks.1.attn.to_k.weight", "blocks.1.attn.to_k.bias", "blocks.1.attn.to_v.weight", "blocks.1.attn.to_v.bias", "blocks.1.attn.to_out.0.weight", "blocks.1.attn.to_out.0.bias", "blocks.2.attn.to_q.weight", "blocks.2.attn.to_q.bias", "blocks.2.attn.to_k.weight", "blocks.2.attn.to_k.bias", "blocks.2.attn.to_v.weight", "blocks.2.attn.to_v.bias", "blocks.2.attn.to_out.0.weight", "blocks.2.attn.to_out.0.bias", "blocks.3.attn.to_q.weight", "blocks.3.attn.to_q.bias", "blocks.3.attn.to_k.weight", "blocks.3.attn.to_k.bias", "blocks.3.attn.to_v.weight", "blocks.3.attn.to_v.bias", "blocks.3.attn.to_out.0.weight", "blocks.3.attn.to_out.0.bias", "blocks.4.attn.to_q.weight", "blocks.4.attn.to_q.bias", "blocks.4.attn.to_k.weight", "blocks.4.attn.to_k.bias", "blocks.4.attn.to_v.weight", "blocks.4.attn.to_v.bias", "blocks.4.attn.to_out.0.weight", "blocks.4.attn.to_out.0.bias", "blocks.5.attn.to_q.weight", "blocks.5.attn.to_q.bias", "blocks.5.attn.to_k.weight", "blocks.5.attn.to_k.bias", "blocks.5.attn.to_v.weight", "blocks.5.attn.to_v.bias", "blocks.5.attn.to_out.0.weight", "blocks.5.attn.to_out.0.bias", "blocks.6.attn.to_q.weight", "blocks.6.attn.to_q.bias", "blocks.6.attn.to_k.weight", "blocks.6.attn.to_k.bias", "blocks.6.attn.to_v.weight", "blocks.6.attn.to_v.bias", "blocks.6.attn.to_out.0.weight", "blocks.6.attn.to_out.0.bias", "blocks.7.attn.to_q.weight", "blocks.7.attn.to_q.bias", "blocks.7.attn.to_k.weight", "blocks.7.attn.to_k.bias", "blocks.7.attn.to_v.weight", "blocks.7.attn.to_v.bias", "blocks.7.attn.to_out.0.weight", "blocks.7.attn.to_out.0.bias", "blocks.8.attn.to_q.weight", "blocks.8.attn.to_q.bias", "blocks.8.attn.to_k.weight", "blocks.8.attn.to_k.bias", "blocks.8.attn.to_v.weight", "blocks.8.attn.to_v.bias", "blocks.8.attn.to_out.0.weight", "blocks.8.attn.to_out.0.bias", "blocks.9.attn.to_q.weight", "blocks.9.attn.to_q.bias", "blocks.9.attn.to_k.weight", "blocks.9.attn.to_k.bias", "blocks.9.attn.to_v.weight", "blocks.9.attn.to_v.bias", "blocks.9.attn.to_out.0.weight", "blocks.9.attn.to_out.0.bias", "blocks.10.attn.to_q.weight", "blocks.10.attn.to_q.bias", "blocks.10.attn.to_k.weight", "blocks.10.attn.to_k.bias", "blocks.10.attn.to_v.weight", "blocks.10.attn.to_v.bias", "blocks.10.attn.to_out.0.weight", "blocks.10.attn.to_out.0.bias", "blocks.11.attn.to_q.weight", "blocks.11.attn.to_q.bias", "blocks.11.attn.to_k.weight", "blocks.11.attn.to_k.bias", "blocks.11.attn.to_v.weight", "blocks.11.attn.to_v.bias", "blocks.11.attn.to_out.0.weight", "blocks.11.attn.to_out.0.bias", "blocks.12.attn.to_q.weight", "blocks.12.attn.to_q.bias", "blocks.12.attn.to_k.weight", "blocks.12.attn.to_k.bias", "blocks.12.attn.to_v.weight", "blocks.12.attn.to_v.bias", "blocks.12.attn.to_out.0.weight", "blocks.12.attn.to_out.0.bias", "blocks.13.attn.to_q.weight", "blocks.13.attn.to_q.bias", "blocks.13.attn.to_k.weight", "blocks.13.attn.to_k.bias", "blocks.13.attn.to_v.weight", "blocks.13.attn.to_v.bias", "blocks.13.attn.to_out.0.weight", "blocks.13.attn.to_out.0.bias", "blocks.14.attn.to_q.weight", "blocks.14.attn.to_q.bias", "blocks.14.attn.to_k.weight", "blocks.14.attn.to_k.bias", "blocks.14.attn.to_v.weight", "blocks.14.attn.to_v.bias", 
"blocks.14.attn.to_out.0.weight", "blocks.14.attn.to_out.0.bias", "blocks.15.attn.to_q.weight", "blocks.15.attn.to_q.bias", "blocks.15.attn.to_k.weight", "blocks.15.attn.to_k.bias", "blocks.15.attn.to_v.weight", "blocks.15.attn.to_v.bias", "blocks.15.attn.to_out.0.weight", "blocks.15.attn.to_out.0.bias", "blocks.16.attn.to_q.weight", "blocks.16.attn.to_q.bias", "blocks.16.attn.to_k.weight", "blocks.16.attn.to_k.bias", "blocks.16.attn.to_v.weight", "blocks.16.attn.to_v.bias", "blocks.16.attn.to_out.0.weight", "blocks.16.attn.to_out.0.bias", "blocks.17.attn.to_q.weight", "blocks.17.attn.to_q.bias", "blocks.17.attn.to_k.weight", "blocks.17.attn.to_k.bias", "blocks.17.attn.to_v.weight", "blocks.17.attn.to_v.bias", "blocks.17.attn.to_out.0.weight", "blocks.17.attn.to_out.0.bias", "blocks.18.attn.to_q.weight", "blocks.18.attn.to_q.bias", "blocks.18.attn.to_k.weight", "blocks.18.attn.to_k.bias", "blocks.18.attn.to_v.weight", "blocks.18.attn.to_v.bias", "blocks.18.attn.to_out.0.weight", "blocks.18.attn.to_out.0.bias", "blocks.19.attn.to_q.weight", "blocks.19.attn.to_q.bias", "blocks.19.attn.to_k.weight", "blocks.19.attn.to_k.bias", "blocks.19.attn.to_v.weight", "blocks.19.attn.to_v.bias", "blocks.19.attn.to_out.0.weight", "blocks.19.attn.to_out.0.bias", "blocks.20.attn.to_q.weight", "blocks.20.attn.to_q.bias", "blocks.20.attn.to_k.weight", "blocks.20.attn.to_k.bias", "blocks.20.attn.to_v.weight", "blocks.20.attn.to_v.bias", "blocks.20.attn.to_out.0.weight", "blocks.20.attn.to_out.0.bias", "blocks.21.attn.to_q.weight", "blocks.21.attn.to_q.bias", "blocks.21.attn.to_k.weight", "blocks.21.attn.to_k.bias", "blocks.21.attn.to_v.weight", "blocks.21.attn.to_v.bias", "blocks.21.attn.to_out.0.weight", "blocks.21.attn.to_out.0.bias", "blocks.22.attn.to_q.weight", "blocks.22.attn.to_q.bias", "blocks.22.attn.to_k.weight", "blocks.22.attn.to_k.bias", "blocks.22.attn.to_v.weight", "blocks.22.attn.to_v.bias", "blocks.22.attn.to_out.0.weight", "blocks.22.attn.to_out.0.bias", "blocks.23.attn.to_q.weight", "blocks.23.attn.to_q.bias", "blocks.23.attn.to_k.weight", "blocks.23.attn.to_k.bias", "blocks.23.attn.to_v.weight", "blocks.23.attn.to_v.bias", "blocks.23.attn.to_out.0.weight", "blocks.23.attn.to_out.0.bias", "blocks.24.attn.to_q.weight", "blocks.24.attn.to_q.bias", "blocks.24.attn.to_k.weight", "blocks.24.attn.to_k.bias", "blocks.24.attn.to_v.weight", "blocks.24.attn.to_v.bias", "blocks.24.attn.to_out.0.weight", "blocks.24.attn.to_out.0.bias", "blocks.25.attn.to_q.weight", "blocks.25.attn.to_q.bias", "blocks.25.attn.to_k.weight", "blocks.25.attn.to_k.bias", "blocks.25.attn.to_v.weight", "blocks.25.attn.to_v.bias", "blocks.25.attn.to_out.0.weight", "blocks.25.attn.to_out.0.bias", "blocks.26.attn.to_q.weight", "blocks.26.attn.to_q.bias", "blocks.26.attn.to_k.weight", "blocks.26.attn.to_k.bias", "blocks.26.attn.to_v.weight", "blocks.26.attn.to_v.bias", "blocks.26.attn.to_out.0.weight", "blocks.26.attn.to_out.0.bias", "blocks.27.attn.to_q.weight", "blocks.27.attn.to_q.bias", "blocks.27.attn.to_k.weight", "blocks.27.attn.to_k.bias", "blocks.27.attn.to_v.weight", "blocks.27.attn.to_v.bias", "blocks.27.attn.to_out.0.weight", "blocks.27.attn.to_out.0.bias".
Unexpected key(s) in state_dict: "y_embedder.embedding_table.weight", "x_embedder.proj.weight", "x_embedder.proj.bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.qkv.bias", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias", "blocks.1.attn.qkv.weight", "blocks.1.attn.qkv.bias", "blocks.1.attn.proj.weight", "blocks.1.attn.proj.bias", "blocks.2.attn.qkv.weight", "blocks.2.attn.qkv.bias", "blocks.2.attn.proj.weight", "blocks.2.attn.proj.bias", "blocks.3.attn.qkv.weight", "blocks.3.attn.qkv.bias", "blocks.3.attn.proj.weight", "blocks.3.attn.proj.bias", "blocks.4.attn.qkv.weight", "blocks.4.attn.qkv.bias", "blocks.4.attn.proj.weight", "blocks.4.attn.proj.bias", "blocks.5.attn.qkv.weight", "blocks.5.attn.qkv.bias", "blocks.5.attn.proj.weight", "blocks.5.attn.proj.bias", "blocks.6.attn.qkv.weight", "blocks.6.attn.qkv.bias", "blocks.6.attn.proj.weight", "blocks.6.attn.proj.bias", "blocks.7.attn.qkv.weight", "blocks.7.attn.qkv.bias", "blocks.7.attn.proj.weight", "blocks.7.attn.proj.bias", "blocks.8.attn.qkv.weight", "blocks.8.attn.qkv.bias", "blocks.8.attn.proj.weight", "blocks.8.attn.proj.bias", "blocks.9.attn.qkv.weight", "blocks.9.attn.qkv.bias", "blocks.9.attn.proj.weight", "blocks.9.attn.proj.bias", "blocks.10.attn.qkv.weight", "blocks.10.attn.qkv.bias", "blocks.10.attn.proj.weight", "blocks.10.attn.proj.bias", "blocks.11.attn.qkv.weight", "blocks.11.attn.qkv.bias", "blocks.11.attn.proj.weight", "blocks.11.attn.proj.bias", "blocks.12.attn.qkv.weight", "blocks.12.attn.qkv.bias", "blocks.12.attn.proj.weight", "blocks.12.attn.proj.bias", "blocks.13.attn.qkv.weight", "blocks.13.attn.qkv.bias", "blocks.13.attn.proj.weight", "blocks.13.attn.proj.bias", "blocks.14.attn.qkv.weight", "blocks.14.attn.qkv.bias", "blocks.14.attn.proj.weight", "blocks.14.attn.proj.bias", "blocks.15.attn.qkv.weight", "blocks.15.attn.qkv.bias", "blocks.15.attn.proj.weight", "blocks.15.attn.proj.bias", "blocks.16.attn.qkv.weight", "blocks.16.attn.qkv.bias", "blocks.16.attn.proj.weight", "blocks.16.attn.proj.bias", "blocks.17.attn.qkv.weight", "blocks.17.attn.qkv.bias", "blocks.17.attn.proj.weight", "blocks.17.attn.proj.bias", "blocks.18.attn.qkv.weight", "blocks.18.attn.qkv.bias", "blocks.18.attn.proj.weight", "blocks.18.attn.proj.bias", "blocks.19.attn.qkv.weight", "blocks.19.attn.qkv.bias", "blocks.19.attn.proj.weight", "blocks.19.attn.proj.bias", "blocks.20.attn.qkv.weight", "blocks.20.attn.qkv.bias", "blocks.20.attn.proj.weight", "blocks.20.attn.proj.bias", "blocks.21.attn.qkv.weight", "blocks.21.attn.qkv.bias", "blocks.21.attn.proj.weight", "blocks.21.attn.proj.bias", "blocks.22.attn.qkv.weight", "blocks.22.attn.qkv.bias", "blocks.22.attn.proj.weight", "blocks.22.attn.proj.bias", "blocks.23.attn.qkv.weight", "blocks.23.attn.qkv.bias", "blocks.23.attn.proj.weight", "blocks.23.attn.proj.bias", "blocks.24.attn.qkv.weight", "blocks.24.attn.qkv.bias", "blocks.24.attn.proj.weight", "blocks.24.attn.proj.bias", "blocks.25.attn.qkv.weight", "blocks.25.attn.qkv.bias", "blocks.25.attn.proj.weight", "blocks.25.attn.proj.bias", "blocks.26.attn.qkv.weight", "blocks.26.attn.qkv.bias", "blocks.26.attn.proj.weight", "blocks.26.attn.proj.bias", "blocks.27.attn.qkv.weight", "blocks.27.attn.qkv.bias", "blocks.27.attn.proj.weight", "blocks.27.attn.proj.bias".
size mismatch for final_layer.linear.weight: copying a param with shape torch.Size([32, 1152]) from checkpoint, the shape in current model is torch.Size([24, 1152]).
size mismatch for final_layer.linear.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([24]).
I would appreciate help solving this.
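For anyone debugging a similar mismatch, here is a small diagnostic sketch (generic PyTorch, path is a placeholder) that peeks at which parameter layout a checkpoint actually contains before calling load_state_dict; the missing "video_embedder.*" keys versus the unexpected "x_embedder.*"/"qkv" keys above suggest the checkpoint was saved from a different DiT variant than the one sample.py builds.

# Diagnostic sketch (placeholder path): inspect the checkpoint's parameter names
# and shapes before calling model.load_state_dict() on it.
import torch

state = torch.load("pretrained_models/DiT-XL-2-256x256.pt", map_location="cpu")
print(len(state), "tensors in checkpoint")
for name in sorted(state)[:10]:  # peek at the first few parameter names
    print(name, tuple(state[name].shape))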
巍峨的大秦岭
File "/data/Sora/Open-Sora-main/train.py", line 122, in main
ema = deepcopy(model)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 297, in _reconstruct
value = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 297, in _reconstruct
value = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 297, in _reconstruct
value = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/miniconda3/envs/opensora/lib/python3.10/copy.py", line 161, in deepcopy
rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
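For context, a minimal illustration of one common workaround pattern; make_ema and model_factory below are hypothetical helpers sketched for this issue, not Open-Sora's actual fix. deepcopy breaks because the model object keeps a handle to a torch.distributed ProcessGroup, and ProcessGroup objects cannot be pickled; copying only the weights into a freshly constructed model sidesteps that.

import torch.nn as nn

def make_ema(model: nn.Module, model_factory) -> nn.Module:
    # model_factory: hypothetical zero-argument callable that builds a fresh,
    # identically configured model (e.g. a lambda around the DiT constructor).
    ema = model_factory()
    ema.load_state_dict(model.state_dict())  # copies tensors only, no process groups
    for p in ema.parameters():
        p.requires_grad_(False)  # EMA weights are not trained directly
    return ema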
Does this project support training on one's own dataset from scratch?
When I run the script mentioned above, the following bug occurs.
It seems to be related to the code "colossalai.launch_from_torch({})" (from inference.py).
How can I solve it? Thanks!
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
(The same warning is repeated once per worker process.)
I am running the inference and this is what I am getting.
The command that I ran: python sample.py -m "DiT-XL/2" --text "a person is walking on the street" --ckpt /home/nlp/open_sora/Open-Sora/pretrained_models/DiT-XL-2-256x256.pt --height 256 --width 256 --fps 10 --sec 5 --disable-cfg
I downloaded the checkpoints using download.py in the given repo.
I ran the training scripts (using DiT-S/8 by default) successfully, and the loss curve is shown below.
However, the sampled results (also using the default sampling parameters, with DiT-XL/2 changed to DiT-S/8) are random noise.
Is that because the model is weak?
Could you please provide a recommended setting (hyper-parameters such as the model architecture, compression, etc.) that we should start with?
FreeInit is a method of improving temporal consistency with no extra training.
Project Page - https://tianxingwu.github.io/pages/FreeInit/
Code - https://github.com/TianxingWu/FreeInit
Demo - https://huggingface.co/spaces/TianxingWu/FreeInit
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
What is the expected run-time (in minutes or hours) for processing an image? Can it be done on CPU only, without a GPU?
Hi, will the checkpoints you trained be open-sourced?
Mainly, training requires GPUs and time. Could you provide a stable checkpoint so that we can run the demo examples day to day? Right now I have no idea what the results look like, so I would have to train on the data first and then check, which takes too long.
[03/07/24 14:50:30] INFO colossalai - colossalai - INFO: train.py:155 main
INFO colossalai - colossalai - INFO: Dataset contains 105060 samples
[03/07/24 14:52:00] INFO colossalai - colossalai - INFO: train.py:165 main
INFO colossalai - colossalai - INFO: Booster init max device memory: 1222.12 MB
Epoch 0: 0%| | 0/410 [00:00<?, ?it/s]/root/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
(The same use_reentrant warning is repeated once per rank.)
When will the VQVAE training pipeline be open-sourced?
ControlNet is pretty much a must-have feature for diffusion models, so it would be nice to have it implemented.
For example, PixArt-alpha has a ControlNet-Transformer module (adapted from the UNet version) that allows it to take various conditionings.
Additionally, the authors of AnimateDiff have released the SparseCtrl ControlNet modification specifically for text2video, enabling it to take conditions sparsely at given frames instead of requiring them to be duplicated/interpolated across all frames.
https://github.com/guoyww/AnimateDiff#202312-animatediff-v3-and-sparsectrl
[2024-03-07 12:26:19,748] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
/root/miniconda3/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
unable to import lightllm kernels
/root/miniconda3/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
warnings.warn("config is deprecated and will be removed soon.")
(The apex and lightllm messages above are repeated once per rank.)
Traceback (most recent call last):
File "/tmp/pycharm_project_146/train.py", line 267, in <module>
main(args)
File "/tmp/pycharm_project_146/train.py", line 96, in main
launch_from_torch({})
File "/root/miniconda3/lib/python3.10/site-packages/colossalai/initialize.py", line 173, in launch_from_torch
launch(
File "/root/miniconda3/lib/python3.10/site-packages/colossalai/initialize.py", line 61, in launch
cur_accelerator.set_device(local_rank)
File "/root/miniconda3/lib/python3.10/site-packages/colossalai/accelerator/cuda_accelerator.py", line 50, in set_device
torch.cuda.set_device(device)
File "/root/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(The same traceback is repeated for each of the remaining ranks.)
train.py FAILED
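For reference, "invalid device ordinal" generally means the local rank index is larger than the number of GPUs visible to the process, e.g. when --nproc_per_node or CUDA_VISIBLE_DEVICES does not match the machine. A quick generic sanity check (plain PyTorch, not an Open-Sora utility):

# Generic sanity check: each rank must map to a visible GPU.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))
visible = torch.cuda.device_count()
print(f"LOCAL_RANK={local_rank}, visible GPUs={visible}")
assert local_rank < visible, "launcher asks for more ranks than there are visible GPUs"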
Right now I only have a single A100 40G with 128 GB of RAM and a 1 TB NVMe drive. The official docs say training works on 8x A100 80G; with ZeRO-Infinity, my machine should also be able to train. Can this hardware support full-parameter training?
Also, does the project support LoRA or other PEFT fine-tuning methods?
Hi, hpcaitech!
I saw SD3 is on your todo list. While it hasn't been officially released yet, I made an unofficial MMDiT implementation based on their paper and OpenDiT (supporting joint CLIP and T5 embeddings as well).
I think it will be useful for you and save you some time.
SD3 will be released on 12th June, so it might be better to refer to their implementation
The MaskDiT project shows that it's possible to accelerate the training of a DiT by using masked transformers
Command: python sample.py -m "DiT/XL-2" --text "a person is walking on the street" --ckpt /path/to/checkpoint --height 256 --width 256 --fps 10 --sec 5 --disable-cfg
ERROR:
(open312) eduardo@eduardo-Creator-15M-A9SD:~/Documents/Open-Sora$ python sample.py -m "DiT/XL-2" --text "a person is walking on the street" --ckpt /path/to/checkpoint --height 256 --width 256 --fps 10 --sec 5 --disable-cfg
Traceback (most recent call last):
File "/home/eduardo/Documents/Open-Sora/sample.py", line 21, in
from open_sora.modeling import DiT_models
File "/home/eduardo/Documents/Open-Sora/open_sora/modeling/init.py", line 1, in
from .dit import DiT, DiT_models
File "/home/eduardo/Documents/Open-Sora/open_sora/modeling/dit/init.py", line 1, in
from .dit import SUPPORTED_SEQ_PARALLEL_MODES, DiT, DiT_models
File "/home/eduardo/Documents/Open-Sora/open_sora/modeling/dit/dit.py", line 22, in
from open_sora.utils.comm import gather_seq, split_seq
File "/home/eduardo/Documents/Open-Sora/open_sora/utils/comm.py", line 6, in
from colossalai.moe._operation import MoeInGradScaler, MoeOutGradScaler
ModuleNotFoundError: No module named 'colossalai.moe'
As you know from SD3's paper, they used Rectified Flow to make training and sampling faster. However, in the past month a new modification of Rectified Flow, named Piecewise Rectified Flow, was released.
Project page: https://piecewise-rectified-flow.github.io/
Github: https://github.com/magic-research/piecewise-rectified-flow/tree/main
It claims to be faster than normal Rectified Flow (used in PKU-YuanGroup/Open-Sora-Plan#43).
I believe it would be a huge quality/speed win compared to the vanilla diffusion pipeline that is used here at the moment.
Thanks for open-sourcing this incredible repo!
I found that when specifying 'vqvae' for the --compressor argument in train.py, it requires access to a pretrained model on Hugging Face. Could you please provide access to that model?
Best
Hi,
We have contributed the first dataset featuring 1.67 million unique text-to-video prompts and 6.69 million videos generated by 4 different state-of-the-art diffusion models. We hope it can help your Open-Sora plan.
Title: VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
arXiv: https://arxiv.org/abs/2403.06098
Hello, I was wondering why NaViT, or an architecture similar to it, was not used as the vision transformer architecture. NaViT natively supports multi-resolution training (hence "native resolution") as one of its defining features, and a similar architecture was used for OpenAI's Sora to allow good visual fidelity at differing resolutions. In the Latte paper, section 4.1 states that the model was trained only on square images/videos and would require resizing to process non-square images/videos.
I first used VQVAE for video compression and the code ran fine, but the loss dropped very slowly.
So I changed the AE to a VAE, and I got an OOM error, even though I set batch_size and accumulation_steps to 1.
Has anyone encountered this problem too?
The expand_mask_4d function makes a huge allocation for large tensor sizes. When the sequence length is ~200k, it tries to allocate 720 GB.
How is it possible to reach such high sequence lengths without going OOM when creating the masks?
Thanks
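For context, a back-of-the-envelope sketch of why a dense 4D attention mask blows up; the shape [batch, heads, seq_len, seq_len] and float32 dtype below are illustrative assumptions, not the exact tensors expand_mask_4d builds.

# Rough memory estimate for a dense 4D attention mask (illustrative assumption).
def dense_mask_bytes(batch: int, heads: int, seq_len: int, bytes_per_elem: int = 4) -> int:
    # Memory for a mask of shape [batch, heads, seq_len, seq_len].
    return batch * heads * seq_len * seq_len * bytes_per_elem

# Even with batch=1 and a single broadcast head, 200k tokens in float32 already
# needs ~149 GiB; every extra batch or head dimension multiplies that further,
# which is why long sequences usually rely on block-sparse or implicit masks.
print(dense_mask_bytes(1, 1, 200_000) / 2**30, "GiB")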
Hello!
As you probably know, there are developments proposing to switch away from the traditional transformer attention architecture due to its quadratic context cost. While approaches such as Mamba are rather exotic and may be too complicated for existing pipelines such as ControlNet-Transformer, other sub-quadratic alternatives have been proposed recently. An example is ReBased (Linear Transformers with Learnable Kernels, https://github.com/corl-team/rebased), which seems to fare better than Mamba.
It may also be worth taking a look at Large World Model's ring attention (https://github.com/lucidrains/ring-attention-pytorch), which extends the context window to millions of tokens while reliably passing the needle-in-a-haystack test.
Here's my implementation for Latte Vchitect/Latte#51
Hi, thanks for your great work! This is super useful. However, one minor issue: it seems that this framework only supports A100 nodes and gets stuck on H100 nodes. I wonder whether H100 support is in progress?
If I use 2 or more GPUs for inference, the following error occurs.
Traceback (most recent call last):
File "/hub_data1/minhyuk/diffusion/opensora/scripts/inference.py", line 114, in <module>
main()
File "/hub_data1/minhyuk/diffusion/opensora/scripts/inference.py", line 95, in main
samples = scheduler.sample(
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/__init__.py", line 72, in sample
samples = self.p_sample_loop(
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 434, in p_sample_loop
for sample in self.p_sample_loop_progressive(
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 485, in p_sample_loop_p
rogressive
out = self.p_sample(
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 388, in p_sample
out = self.p_mean_variance(
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 94, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 267, in p_mean_variance
model_output = model(x, t, **model_kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 127, in __call__
return self.model(x, new_ts, **kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/schedulers/iddpm/__init__.py", line 89, in forward_with_cfg
model_out = model.forward(combined, timestep, y, **kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 267, in forward
x = auto_grad_checkpoint(block, x, y, t0, y_lens, tpe)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/acceleration/checkpoint.py", line 24, in auto_grad_checkpoint
return module(*args, **kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 111, in forward
x = x + self.cross_attn(x, y, mask)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/minhyuk/.conda/envs/opensora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 313, in forward
kv = self.kv_linear(cond).view(B, -1, 2, self.num_heads, self.head_dim)
RuntimeError: shape '[4, -1, 2, 16, 72]' is invalid for input of size 105523
I tested on 2/3/4 GPUs, and all give the same error.
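For reference on why the .view() fails: the total element count must be divisible by the product of the fixed dimensions so the -1 dimension can be inferred, and here it is not, which suggests the conditioning tensor does not have the size the reshape expects in the multi-GPU setting. A tiny check using the numbers from the error message:

# The numbers come straight from the error message above.
total = 105523                 # elements in the cond tensor handed to kv_linear
fixed = 4 * 2 * 16 * 72        # B * 2 (k and v) * num_heads * head_dim
print(total % fixed)           # non-zero, so "-1" cannot be inferred and view() raises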
Dear Authors,
Thanks for your great work!
I've just read your report and have some questions regarding the choice of the VAE. You mentioned that VideoGPT yields poor performance, so you chose a 2D VAE because state-of-the-art 3D VAEs like MAGVIT-v1/v2 are not open-sourced.
My question is: have you ever tried other 3D VAE variants like TATS (Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer)?
Thanks in advance!
The earlier diffusers problem is gone, but a new one has appeared. I'd appreciate help with it.
Is there any restriction on the torch version? The version I'm using is torch==2.1.2+cu121.
The current error message:
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
[2024-03-06 10:13:06,886] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 52488) of binary: /root/miniconda3/bin/python
I suspect it is a torch version issue.
Edit: it seems we can't use the smaller models, so it would be handy to have a way to load the xxl model in 8-bit format for smaller-VRAM GPUs. It's doable for the PixArt image-generation models using the diffusers library.
I tried to use the google/t5-v1_1-large model as the text encoder instead of DeepFloyd/t5-v1_1-xxl, but encountered the following error.
RuntimeError: Error(s) in loading state_dict for STDiT:
size mismatch for y_embedder.y_embedding: copying a param with shape torch.Size([120, 4096]) from checkpoint, the shape in current model is torch.Size([120, 1024]).
size mismatch for y_embedder.y_proj.fc1.weight: copying a param with shape torch.Size([1152, 4096]) from checkpoint, the shape in current model is torch.Size([1152, 1024]).
It seems the output embedding dimension for the large model is 1024 and for xxl it is 4096, and the Open-Sora weights only accept embeddings from the xxl model, i.e. 4096-dim.
Is there any way we can use the t5-large model instead of the xxl model? I want to run inference on cloud GPUs, e.g. a T4 in Colab notebooks.
Here's the notebook, as a gist, that I used to run on Colab:
https://gist.github.com/sandeshrajbhandari/ac3857cd2aaae5e3a9de0d7c219ac351
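For what it's worth, here is a hedged sketch of loading the xxl text encoder in 8-bit via bitsandbytes; this is the generic transformers recipe, not something wired into the Open-Sora scripts, and the model id and tokenizer source are assumptions.

# Sketch: load the T5 xxl encoder in 8-bit to fit smaller-VRAM GPUs.
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, T5EncoderModel

model_id = "DeepFloyd/t5-v1_1-xxl"  # assumption: same checkpoint Open-Sora's t5.py loads

tokenizer = AutoTokenizer.from_pretrained(model_id)  # if tokenizer files are missing, try google/t5-v1_1-xxl
encoder = T5EncoderModel.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate place layers on the available GPU/CPU
)

with torch.no_grad():
    tokens = tokenizer("a person is walking on the street", return_tensors="pt").to(encoder.device)
    text_emb = encoder(**tokens).last_hidden_state  # shape [1, seq_len, 4096]
print(text_emb.shape)

An 8-bit encoder takes roughly a quarter of the fp32 footprint, so it has a chance of fitting a T4-class card while still producing the 4096-dim embeddings the released weights expect.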
When I run inference, there is an ImportError:
scripts/inference.py FAILED
python sample.py -m "DiT/XL-2" --text "a person is walking on the street" --ckpt /path/to/checkpoint --height 256 --width 256 --fps 10 --sec 5 --disable-cfg
What is the path to the checkpoint? Can you provide the weights or tell us what weights to use?
FileNotFoundError: [Errno 2] No such file or directory: '/root/miniconda3/lib/python3.10/site-packages/colossalai/kernel/extensions/csrc/cuda/cpu_adam.cpp'
Here is the server info: Linux #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
The module cannot be found: ModuleNotFoundError: No module named 'colossalai._C.cpu_adam_x86'
I trained with the default settings on 8 GPUs for two or three days on roughly 50k videos, but the outputs don't look like video at all, just patch-like mosaics. Is the training insufficient, or did something go wrong somewhere?
/data/anaconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py(145)__init__()
-> self.model = T5EncoderModel.from_pretrained(path, **t5_model_kwargs).eval()
(Pdb) n
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
/data/anaconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py(145)__init__()
-> self.model = T5EncoderModel.from_pretrained(path, **t5_model_kwargs).eval()
AdaptiveDetector
For improved results in scene detection, I recommend using the AdaptiveDetector instead of the ContentDetector. The AdaptiveDetector provides a more nuanced approach, especially for videos with varying lighting or content. Here's how you can use it in your project:
from scenedetect import AdaptiveDetector, detect, split_video_ffmpeg

# Path to the input video
video_path = 'your_video_path_here.mp4'
# Directory to save the output clips (must already exist)
video_dir = 'your_output_directory_here'

# Perform scene detection with the adaptive detector
scene_list = detect(
    video_path,
    AdaptiveDetector(
        luma_only=True,
        adaptive_threshold=1.5,
        min_scene_len=3,
    ),
)

# Split and save the detected scenes into separate clips
split_video_ffmpeg(
    video_path,
    scene_list,
    output_file_template=f'{video_dir}/clip_$SCENE_NUMBER.mp4',
)