mini-sora / minisora Goto Github PK

View Code? Open in Web Editor NEW

1.0K 1.0K 143.0 69.2 MB

MiniSora: A community aims to explore the implementation path and future development direction of Sora.

Home Page: https://github.com/mini-sora/minisora

License: Apache License 2.0

Python 97.86% Jupyter Notebook 1.00% Shell 0.96% CSS 0.18%

diffusion sora video-generation

minisora's Introduction

MiniSora Community

English | 简体中文

👋 join us on WeChat

The MiniSora open-source community is positioned as a community-driven initiative organized spontaneously by community members. The MiniSora community aims to explore the implementation path and future development direction of Sora.

Regular round-table discussions will be held with the Sora team and the community to explore possibilities.
We will delve into existing technological pathways for video generation.
Leading the replication of papers or research results related to Sora, such as DiT (MiniSora-DiT), etc.
Conducting a comprehensive review of Sora-related technologies and their implementations, i.e., "From DDPM to Sora: A Review of Video Generation Models Based on Diffusion Models".

Hot News

Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
MiniSora-DiT: Reproducing the DiT Paper with XTuner
Introduction of MiniSora and Latest Progress in Replicating Sora

Reproduction Group of MiniSora Community

Sora Reproduction Goals of MiniSora

GPU-Friendly: Ideally, it should have low requirements for GPU memory size and the number of GPUs, such as being trainable and inferable with compute power like 8 A100 80G cards, 8 A6000 48G cards, or RTX4090 24G.
Training-Efficiency: It should achieve good results without requiring extensive training time.
Inference-Efficiency: When generating videos during inference, there is no need for high length or resolution; acceptable parameters include 3-10 seconds in length and 480p resolution.

MiniSora-DiT: Reproducing the DiT Paper with XTuner

https://github.com/mini-sora/minisora-DiT

Requirements

We are recruiting MiniSora Community contributors to reproduce DiT using XTuner.

We hope the community member has the following characteristics:

Familiarity with the OpenMMLab MMEngine mechanism.
Familiarity with DiT.

Background

The author of DiT is the same as the author of Sora.
XTuner has the core technology to efficiently train sequences of length 1000K.

Support

Computational resources: 2*A100.
Strong supports from XTuner core developer P佬@pppppM.

Recent round-table Discussions

Paper Interpretation of Stable Diffusion 3 paper: MM-DiT

Speaker: MMagic Core Contributors

Live Streaming Time: 03/12 20:00

Highlights: MMagic core contributors will lead us in interpreting the Stable Diffusion 3 paper, discussing the architecture details and design principles of Stable Diffusion 3.

PPT: FeiShu Link

Highlights from Previous Discussions

Night Talk with Sora: Video Diffusion Overview

ZhiHu Notes: A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models

Paper Reading Program

Sora: Creating video from text
Technical Report: Video generation models as world simulators
Latte: Latte: Latent Diffusion Transformer for Video Generation
- Latte Paper Interpretation (zh-CN), ZhiHu(zh-CN)
DiT: Scalable Diffusion Models with Transformers
Stable Cascade (ICLR 24 Paper): Würstchen: An efficient architecture for large-scale text-to-image diffusion models
Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- SD3 Paper Interpretation (zh-CN), ZhiHu(zh-CN)
Updating...

Recruitment of Presenters

Related Work

01 Diffusion Model
02 Diffusion Transformer
03 Baseline Video Generation Models
04 Diffusion UNet
05 Video Generation
06 Dataset
- 6.1 Pubclic Datasets
- 6.2 Video Augmentation Methods
  - 6.2.1 Basic Transformations
  - 6.2.2 Feature Space
  - 6.2.3 GAN-based Augmentation
  - 6.2.4 Encoder/Decoder Based
  - 6.2.5 Simulation
07 Patchifying Methods
08 Long-context
09 Audio Related Resource
10 Consistency
11 Prompt Engineering
12 Security
13 World Model
14 Video Compression
15 Mamba
- 15.1 Theoretical Foundations and Model Architecture
- 15.2 Image Generation and Visual Applications
- 15.3 Video Processing and Understanding
- 15.4 Medical Image Processing
16 Existing high-quality resources
17 Efficient Training
- 17.1 Parallelism based Approach
  - 17.1.1 Data Parallelism (DP)
  - 17.1.2 Model Parallelism (MP)
  - 17.1.3 Pipeline Parallelism (PP)
  - 17.1.4 Generalized Parallelism (GP)
  - 17.1.5 ZeRO Parallelism (ZP)
- 17.2 Non-parallelism based Approach
  - 17.2.1 Reducing Activation Memory
  - 17.2.2 CPU-Offloading
  - 17.2.3 Memory Efficient Optimizer
- 17.3 Novel Structure
18 Efficient Inference
- 18.1 Reduce Sampling Steps
  - 18.1.1 Continuous Steps
  - 18.1.2 Fast Sampling
  - 18.1.3 Step distillation
- 18.2 Optimizing Inference
  - 18.2.1 Low-bit Quantization
  - 18.2.2 Parallel/Sparse inference

01 Diffusion Models
Paper	Link
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis	NeurIPS 21 Paper, GitHub
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models	CVPR 22 Paper, GitHub
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models	NeurIPS 22 Paper, GitHub
4) DDPM: Denoising Diffusion Probabilistic Models	NeurIPS 20 Paper, GitHub
5) DDIM: Denoising Diffusion Implicit Models	ICLR 21 Paper, GitHub
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations	ICLR 21 Paper, GitHub, Blog
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models	ICLR 24 Paper, GitHub, Blog
8) Diffusion Models in Vision: A Survey	TPAMI 23 Paper, GitHub
9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models	ICML 21 Paper, Github
10) Classifier-free diffusion guidance	NIPS 21 Paper
11) Glide: Towards photorealistic image generation and editing with text-guided diffusion models	Paper, Github
12) VQ-DDM: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation	CVPR 22 Paper, Github
13) Diffusion Models for Medical Anomaly Detection	Paper, Github
14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems	Paper
15) DiffusionDet: Diffusion Model for Object Detection	ICCV 23 Paper, Github
16) Label-efficient semantic segmentation with diffusion models	ICLR 22 Paper, Github, Project
02 Diffusion Transformer
Paper	Link
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models	CVPR 23 Paper, GitHub, ModelScope
2) DiT: Scalable Diffusion Models with Transformers	ICCV 23 Paper, GitHub, Project, ModelScope
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers	ArXiv 23, GitHub, ModelScope
4) FiT: Flexible Vision Transformer for Diffusion Model	ArXiv 24, GitHub
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers	ArXiv 24, GitHub
6) Large-DiT: Large Diffusion Transformer	GitHub
7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks	ArXiv 24, GitHub
8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis	Paper, Blog
9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation	ArXiv 24, Project
10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesiss	ArXiv 23, GitHub ModelScope
11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model	ArXiv 24,
03 Baseline Video Generation Models
Paper	Link
1) ViViT: A Video Vision Transformer	ICCV 21 Paper, GitHub
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models	CVPR 23 Paper
3) DiT: Scalable Diffusion Models with Transformers	ICCV 23 Paper, Github, Project, ModelScope
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators	ArXiv 23, GitHub
5) Latte: Latent Diffusion Transformer for Video Generation	ArXiv 24, GitHub, Project
04 Diffusion UNet ModelScope
Paper	Link
1) Taming Transformers for High-Resolution Image Synthesis	CVPR 21 Paper,GitHub ,Project
2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment	ArXiv 24 Github
05 Video Generation
Paper	Link
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning	ICLR 24 Paper, GitHub, ModelScope
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models	ArXiv 23, GitHub, ModelScope
3) Imagen Video: High Definition Video Generation with Diffusion Models	ArXiv 22
4) MoCoGAN: Decomposing Motion and Content for Video Generation	CVPR 18 Paper
5) Adversarial Video Generation on Complex Datasets	Paper
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models	ArXiv 23, Project
7) VideoGPT: Video Generation using VQ-VAE and Transformers	ArXiv 21, GitHub
8) Video Diffusion Models	ArXiv 22, GitHub, Project
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation	NeurIPS 22 Paper, GitHub, Project, Blog
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation	ArXiv 23, Project, Blog
11) MAGVIT: Masked Generative Video Transformer	CVPR 23 Paper, GitHub, Project, Colab
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions	ArXiv 24, GitHub, Project
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation	Paper, GitHub, Project
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, GitHub, Project
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	Paper, GitHub
16) ADD: Adversarial Diffusion Distillation	Paper, GitHub
17) GenTron: Diffusion Transformers for Image and Video Generation	CVPR 24 Paper, Project
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models	CVPR 23 Paper, GitHub
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models	ArXiv 23, GitHub
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation	Paper, GitHub
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation	ArXiv 23, GitHub
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models	ArXiv 24, GitHub
23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation	ArXiv 22, GitHub
24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models	ArXiv 23, GitHub ,Project
25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models	ICCV 23 Paper, Project
26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation	CVPR 23 Paper
06 Dataset
6.1 Public Datasets
Dataset Name - Paper	Link
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers `70M Clips, 720P, Downloadable`	CVPR 24 Paper, Github, Project, ModelScope
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation `10M Clips, 720P, Downloadable`	ArXiv 24, Github
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset `70K Clips, 720P, Downloadable`	CVPR 23 Paper, Github, Project
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation `130M Clips, 720P, Downloadable`	ArXiv 23, Github, Tool
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions `100M Clips, 720P, Downloadable`	CVPR 22 Paper, Github
6) VideoCC - Learning Audio-Video Modalities from Image Captions `10.3M Clips, 720P, Downloadable`	ECCV 22 Paper, Github
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models `180M Clips, 480P, Downloadable`	NeurIPS 21 Paper, Github, Project
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips `136M Clips, 240P, Downloadable`	ICCV 19 Paper, Github, Project
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild `13K Clips, 240P, Downloadable`	CVPR 12 Paper, Project
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation `122K Clips, 240P, Downloadable`	ACL 11 Paper, Project
11) Fashion-Text2Video - A human video dataset with rich label and text annotations `600 Videos, 480P, Downloadable`	ArXiv 23, Project
12) LAION-5B - A dataset of 5,85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M `5B Clips, Downloadable`	NeurIPS 22 Paper, Project
13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time `20k videos, Downloadable`	Arxiv 17 Paper, Project
14) MSR-VTT - A large-scale video benchmark for video understanding `10k Clips, Downloadable`	CVPR 16 Paper, Project
15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling `Downloadable`	Arxiv 16 Paper, Project
16) Youku-mPLUG - First open-source large-scale Chinese video text dataset `Downloadable`	ArXiv 23, Project, ModelScope
17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models `6.69M, Downloadable`	ArXiv 24, Github
18) Pixabay100 - A video dataset collected from Pixabay `Downloadable`	Github
19) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites `Long Durations and Structured Captions`	ArXiv 21, Project , ModelScope
20) MiraData(Mini-Sora Data): A Large-Scale Video Dataset with Long Durations and Structured Captions `10M video-text pairs`	Github, Project
6.2 Video Augmentation Methods
6.2.1 Basic Transformations
Three-stream CNNs for action recognition	PRL 17 Paper
Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks	EL 19 Paper
Intra-clip Aggregation for Video Person Re-identification	ICIP 20 Paper
VideoMix: Rethinking Data Augmentation for Video Classification	CVPR 20 Paper
mixup: Beyond Empirical Risk Minimization	ICLR 17 Paper
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features	ICCV 19 Paper
Video Salient Object Detection via Fully Convolutional Networks	ICIP 18 Paper
Illumination-Based Data Augmentation for Robust Background Subtraction	SKIMA 19 Paper
Image editing-based data augmentation for illumination-insensitive background subtraction	EIM 20 Paper
6.2.2 Feature Space
Feature Re-Learning with Data Augmentation for Content-based Video Recommendation	ACM 18 Paper
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer	Trans 21 Paper
6.2.3 GAN-based Augmentation
Deep Video-Based Performance Cloning	CVPR 18 Paper
Adversarial Action Data Augmentation for Similar Gesture Action Recognition	IJCNN 19 Paper
Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples	MM 20 Paper
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer	Trans 20 Paper
Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets	TPAMI 20 Paper
CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond	TPAMI 22 Paper
6.2.4 Encoder/Decoder Based
Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video	ECCV 20 Paper
Autoencoder-based Data Augmentation for Deepfake Detection	ACM 23 Paper
6.2.5 Simulation
A data augmentation methodology for training machine/deep learning gait recognition algorithms	CVPR 16 Paper
ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications	IEEE 21 Paper
Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights	CVPR 19 Paper
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models	IJCV 19 Paper
Using synthetic data for person tracking under adverse weather conditions	IVC 21 Paper
Unlimited Road-scene Synthetic Annotation (URSA) Dataset	ITSC 18 Paper
SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data	CVPR 21 Paper
Universal Semantic Segmentation for Fisheye Urban Driving Images	SMC 20 Paper
07 Patchifying Methods
Paper	Link
1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale	CVPR 21 Paper, Github
2) MAE: Masked Autoencoders Are Scalable Vision Learners	CVPR 22 Paper, Github
3) ViViT: A Video Vision Transformer (-)	ICCV 21 Paper, GitHub
4) DiT: Scalable Diffusion Models with Transformers (-)	ICCV 23 Paper, GitHub, Project, ModelScope
5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-)	CVPR 23 Paper, GitHub, ModelScope
6) FlexiViT: One Model for All Patch Sizes	Paper, Github
7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution	ArXiv 23, Github
8) VQ-VAE: Neural Discrete Representation Learning	Paper, Github
9) VQ-GAN: Neural Discrete Representation Learning	CVPR 21 Paper, Github
10) LVT: Latent Video Transformer	Paper, Github
11) VideoGPT: Video Generation using VQ-VAE and Transformers (-)	ArXiv 21, GitHub
12) Predicting Video with VQVAE	ArXiv 21
13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers	ICLR 23 Paper, Github
14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer	ECCV 22 Paper, Github
15) MAGVIT: Masked Generative Video Transformer (-)	CVPR 23 Paper, GitHub, Project, Colab
16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation	ICLR 24 Paper, Github
17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-)	ArXiv 23, Project, Blog
18) CLIP: Learning Transferable Visual Models From Natural Language Supervision	CVPR 21 Paper, Github
19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	ArXiv 22, Github
20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	ArXiv 23, Github
08 Long-context
Paper	Link
1) World Model on Million-Length Video And Language With RingAttention	ArXiv 24, GitHub
2) Ring Attention with Blockwise Transformers for Near-Infinite Context	ArXiv 23, GitHub
3) Extending LLMs' Context Window with 100 Samples	ArXiv 24, GitHub
4) Efficient Streaming Language Models with Attention Sinks	ICLR 24 Paper, GitHub
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey	Paper
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	CVPR 24 Paper, GitHub, Project
7) MemoryBank: Enhancing Large Language Models with Long-Term Memory	Paper, GitHub
09 Audio Related Resource
Paper	Link
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion	ArXiv 24, Github, Blog
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation	CVPR 23 Paper, GitHub
3) Pengi: An Audio Language Model for Audio Tasks	NeurIPS 23 Paper, GitHub
4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset	NeurlPS 23 Paper, GitHub
5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	ArXiv 23, GitHub
6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	TPAMI 24 Paper, GitHub
7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers	ICLR 24 Paper, GitHub
8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation	ArXiv 23, GitHub
9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation	TASLP 22 Paper
10) AudioGen: Textually Guided Audio Generation	ICLR 23 Paper, Project
11) AudioLDM: Text-to-audio generation with latent diffusion models	ICML 23 Paper, GitHub, Project, Huggingface
12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining	ArXiv 23, GitHub, Project, Huggingface
13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models	ICML 23 Paper, GitHub
14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation	ArXiv 23
15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model	ArXiv 23, GitHub, Project, Huggingface
16) AudioLM: a Language Modeling Approach to Audio Generation	ArXiv 22
17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head	ArXiv 23, GitHub
18) MusicGen: Simple and Controllable Music Generation	NeurIPS 23 Paper, GitHub
19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	ArXiv 23
20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners	CVPR 24 Paper
21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	EMNLP 23 Paper
22) Audio-Visual LLM for Video Understanding	ArXiv 23
23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-)	ArXiv 23, Project, Blog
10 Consistency
Paper	Link
1) Consistency Models	Paper, GitHub
2) Improved Techniques for Training Consistency Models	ArXiv 23
3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-)	ICLR 21 Paper, GitHub, Blog
4) Improved Techniques for Training Score-Based Generative Models	NIPS 20 Paper, GitHub
4) Generative Modeling by Estimating Gradients of the Data Distribution	NIPS 19 Paper, GitHub
5) Maximum Likelihood Training of Score-Based Diffusion Models	NIPS 21 Paper, GitHub
6) Layered Neural Atlases for Consistent Video Editing	TOG 21 Paper, GitHub, Project
7) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, GitHub, Project
8) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing	Paper, GitHub, Project
9) Sora Generates Videos with Stunning Geometrical Consistency	Paper, GitHub, Project
10) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency	ECCV 22 Paper, GitHub
11) Bootstrap Motion Forecasting With Self-Consistent Constraints	ICCV 23 Paper
12) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting	Paper
13) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment	CVPRW 23 Paper, GitHub
14) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing	ArXiv 21
15) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter	TCSVT 23 Paper
16) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking	CVPRW 19 Paper
17) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-)	ArXiv 23
18) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-)	ArXiv 24
19) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask	ArXiv 23
11 Prompt Engineering
Paper	Link
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models	ArXiv 24, GitHub, Project
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs	ArXiv 24, GitHub
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models	TMLR 23 Paper, GitHub
4) LLM BLUEPRINT: ENABLING TEXT-TO-IMAGE GEN-ERATION WITH COMPLEX AND DETAILED PROMPTS	ICLR 24 Paper, GitHub
5) Progressive Text-to-Image Diffusion with Soft Latent Direction	ArXiv 23
6) Self-correcting LLM-controlled Diffusion Models	CVPR 24 Paper, GitHub
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation	MM 23 Paper
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models	NeurIPS 23 Paper, GitHub
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition	ArXiv 24, GitHub
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions	ArXiv 23, GitHub
11) Controllable Text-to-Image Generation with GPT-4	ArXiv 23
12) LLM-grounded Video Diffusion Models	ICLR 24 Paper
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning	ArXiv 23
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax	ArXiv 23, Github, Project
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM	ArXiv 24
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator	NeurIPS 23 Paper
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models	ArXiv 23
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation	ArXiv 23
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning	ArXiv 23
20) Multimodal Procedural Planning via Dual Text-Image Prompting	ArXiv 23, Github
21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists	ICLR 24 Paper, Github
22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback	ArXiv 23
23) TaleCrafter: Interactive Story Visualization with Multiple Characters	SIGGRAPH Asia 23 Paper
24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis	ArXiv 23, Github
25) COLE: A Hierarchical Generation Framework for Graphic Design	ArXiv 23
26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision	ArXiv 23
27) Vlogger: Make Your Dream A Vlog	CVPR 24 Paper, Github
28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting	Paper
29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion	ArXiv 24
Recaption
Paper	Link
1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models	ArXiv 23, GitHub
2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation	ArXiv 23, GitHub
3) CoCa: Contrastive Captioners are Image-Text Foundation Models	ArXiv 22, Github
4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion	ArXiv 24
5) VideoChat: Chat-Centric Video Understanding	CVPR 24 Paper, Github
6) De-Diffusion Makes Text a Strong Cross-Modal Interface	ArXiv 23
7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	ArXiv 23
8) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data	ArXiv 24
9) LLMGA: Multimodal Large Language Model based Generation Assistant	ArXiv 23, Github
10) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment	ArXiv 24, Github
11) MyVLM: Personalizing VLMs for User-Specific Queries	ArXiv 24
12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation	ArXiv 23, Github
13) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs(-)	ArXiv 24, Github
14) FlexCap: Generating Rich, Localized, and Flexible Captions in Images	ArXiv 24
15) Video ReCap: Recursive Captioning of Hour-Long Videos	ArXiv 24, Github
16) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	ICML 22, Github
17) PromptCap: Prompt-Guided Task-Aware Image Captioning	ICCV 23, Github
18) CIC: A framework for Culturally-aware Image Captioning	ArXiv 24
19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion	ArXiv 24
20) FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions	WACV 24, Github
12 Security
Paper	Link
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset	NeurIPS 23 Paper, Github
2) LIMA: Less Is More for Alignment	NeurIPS 23 Paper
3) Jailbroken: How Does LLM Safety Training Fail?	NeurIPS 23 Paper
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models	CVPR 23 Paper
5) Stable Bias: Evaluating Societal Representations in Diffusion Models	NeurIPS 23 Paper
6) Ablating concepts in text-to-image diffusion models	ICCV 23 Paper
7) Diffusion art or digital forgery? investigating data replication in diffusion models	ICCV 23 Paper, Project
8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks	ICCV 20 Paper
9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks	ICML 20 Paper
10) A pilot study of query-free adversarial attack against stable diffusion	ICCV 23 Paper
11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models	ICCV 23 Paper
12) Erasing Concepts from Diffusion Models	ICCV 23 Paper, Project
13) Ablating Concepts in Text-to-Image Diffusion Models	ICCV 23 Paper, Project
14) BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset	NeurIPS 23 Paper, Project
15) LIMA: Less Is More for Alignment	NeurIPS 23 Paper
16) Stable Bias: Evaluating Societal Representations in Diffusion Models	NeurIPS 23 Paper
17) Threat Model-Agnostic Adversarial Defense using Diffusion Models	Paper
18) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?	Paper, Github
19) Differentially Private Diffusion Models Generate Useful Synthetic Images	Paper
20) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models	SIGSAC 23 Paper, Github
21) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models	Paper, Github
22) Unified Concept Editing in Diffusion Models	WACV 24 Paper, Project
23) Diffusion Model Alignment Using Direct Preference Optimization	ArXiv 23
24) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment	TMLR 23 Paper , Github
25) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation	Paper, Github, Project
13 World Model
Paper	Link
1) NExT-GPT: Any-to-Any Multimodal LLM	ArXiv 23, GitHub
14 Video Compression
Paper	Link
1) H.261: Video codec for audiovisual services at p x 64 kbit/s	Paper
2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video	Paper
3) H.263: Video coding for low bit rate communication	Paper
4) H.264: Overview of the H.264/AVC video coding standard	Paper
5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard	Paper
6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications	Paper
7) DVC: An End-to-end Deep Video Compression Framework	CVPR 19 Paper, GitHub
8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method	Paper, GitHub
9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement	CVPR 20 Paper, Github
10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model	J-STSP 21 Paper, Github
11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN	IJCAI 22 Paper, Github
12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction	T-CSVT 22 Paper, Github
13) DCVC: Deep Contextual Video Compression	NeurIPS 21 Paper, Github
14) DCVC-TCM: Temporal Context Mining for Learned Video Compression	TM 22 Paper, Github
15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression	MM 22 Paper, Github
16) DCVC-DC: Neural Video Compression with Diverse Contexts	CVPR 23 Paper, Github
17) DCVC-FM: Neural Video Compression with Feature Modulation	CVPR 24 Paper, Github
18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression	CVPR 20 Paper, Github
15 Mamba
15.1 Theoretical Foundations and Model Architecture
Paper	Link
1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces	ArXiv 23, Github
2) Efficiently Modeling Long Sequences with Structured State Spaces	ICLR 22 Paper, Github
3) Modeling Sequences with Structured State Spaces	Paper
4) Long Range Language Modeling via Gated State Spaces	ArXiv 22, GitHub
15.2 Image Generation and Visual Applications
Paper	Link
1) Diffusion Models Without Attention	ArXiv 23
2) Pan-Mamba: Effective Pan-Sharpening with State Space Model	ArXiv 24, Github
3) Pretraining Without Attention	ArXiv 22, Github
4) Block-State Transformers	NIPS 23 Paper
5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model	ArXiv 24, Github
6) VMamba: Visual State Space Model	ArXiv 24, Github
7) ZigMa: Zigzag Mamba Diffusion Model	ArXiv 24, Github
15.3 Video Processing and Understanding
Paper	Link
1) Long Movie Clip Classification with State-Space Video Models	ECCV 22 Paper, Github
2) Selective Structured State-Spaces for Long-Form Video Understanding	CVPR 23 Paper
3) Efficient Movie Scene Detection Using State-Space Transformers	CVPR 23 Paper, Github
4) VideoMamba: State Space Model for Efficient Video Understanding	Paper, Github
15.4 Medical Image Processing
Paper	Link
1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining	ArXiv 24, Github
2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model	ArXiv 24, Github
3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation	ArXiv 24, Github

16 Existing high-quality resources
Resources	Link
1) Datawhale - AI视频生成学习	Feishu doc
2) A Survey on Generative Diffusion Model	TKDE 24 Paper, GitHub
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models	ArXiv 23, GitHub
4) Awesome-Text-To-Video：A Survey on Text-to-Video Generation/Synthesis	GitHub
5) video-generation-survey: A reading list of video generation	GitHub
6) Awesome-Video-Diffusion	GitHub
7) Video Generation Task in Papers With Code	Task
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models	ArXiv 24, GitHub
9) Open-Sora-Plan (PKU-YuanGroup)	GitHub
10) State of the Art on Diffusion Models for Visual Computing	Paper
11) Diffusion Models: A Comprehensive Survey of Methods and Applications	CSUR 24 Paper, GitHub
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable	Paper
13) On the Design Fundamentals of Diffusion Models: A Survey	Paper
14) Efficient Diffusion Models for Vision: A Survey	Paper
15) Text-to-Image Diffusion Models in Generative AI: A Survey	Paper
16) Awesome-Diffusion-Transformers	GitHub, Project
17) Open-Sora (HPC-AI Tech)	GitHub, Blog
18) LAVIS - A Library for Language-Vision Intelligence	ACL 23 Paper, GitHub, Project
19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference	GitHub
20) Awesome-Long-Context	GitHub1, GitHub2
21) Lite-Sora	GitHub
22) Mira: A Mini-step Towards Sora-like Long Video Generation	GitHub, Project
17 Efficient Training
17.1 Parallelism based Approach
17.1.1 Data Parallelism (DP)
1) A bridging model for parallel computation	Paper
2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training	VLDB 20 Paper
17.1.2 Model Parallelism (MP)
1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism	ArXiv 19 Paper
2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models	PMLR 21 Paper
17.1.3 Pipeline Parallelism (PP)
1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism	NeurIPS 19 Paper
2) PipeDream: generalized pipeline parallelism for DNN training	SOSP 19 Paper
17.1.4 Generalized Parallelism (GP)
1) Mesh-TensorFlow: Deep Learning for Supercomputers	ArXiv 18 Paper
2) Beyond Data and Model Parallelism for Deep Neural Networks	MLSys 19 Paper
17.1.5 ZeRO Parallelism (ZP)
1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models	ArXiv 20
2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters	ACM 20 Paper
3) ZeRO-Offload: Democratizing Billion-Scale Model Training	ArXiv 21
4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel	ArXiv 23
17.2 Non-parallelism based Approach
17.2.1 Reducing Activation Memory
1) Gist: Efficient Data Encoding for Deep Neural Network Training	IEEE 18 Paper
2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization	MLSys 20 Paper
3) Training Deep Nets with Sublinear Memory Cost	ArXiv 16 Paper
4) Superneurons: dynamic GPU memory management for training deep neural networks	ACM 18 Paper
17.2.2 CPU-Offloading
1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm	ArXiv 20 Paper
2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design	IEEE 16 Paper
17.2.3 Memory Efficient Optimizer
1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost	PMLR 18 Paper
2) Memory-Efficient Adaptive Optimization for Large-Scale Learning	Paper
17.3 Novel Structure
1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment	ArXiv 24 Github
18 Efficient Inference
18.1 Reduce Sampling Steps
18.1.1 Continuous Steps
1) Generative Modeling by Estimating Gradients of the Data Distribution	NeurIPS 19 Paper
2) WaveGrad: Estimating Gradients for Waveform Generation	ArXiv 20
3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders	ICASSP 21 Paper
4) Noise Estimation for Generative Diffusion Models	ArXiv 21
18.1.2 Fast Sampling
1) Denoising Diffusion Implicit Models	ICLR 21 Paper
2) DiffWave: A Versatile Diffusion Model for Audio Synthesis	ICLR 21 Paper
3) On Fast Sampling of Diffusion Probabilistic Models	ArXiv 21
4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps	NeurIPS 22 Paper
5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models	ArXiv 22
6) Fast Sampling of Diffusion Models with Exponential Integrator	ICLR 22 Paper
18.1.3 Step distillation
1) On Distillation of Guided Diffusion Models	CVPR 23 Paper
2) Progressive Distillation for Fast Sampling of Diffusion Models	ICLR 22 Paper
3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds	NeurIPS 23 Paper
4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs	ICLR 22 Paper
18.2 Optimizing Inference
18.2.1 Low-bit Quantization
1) Q-Diffusion: Quantizing Diffusion Models	CVPR 23 Paper
2) Q-DM: An Efficient Low-bit Quantized Diffusion Model	NeurIPS 23 Paper
3) Temporal Dynamic Quantization for Diffusion Models	NeurIPS 23 Paper
18.2.2 Parallel/Sparse inference
1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models	CVPR 24 Paper
2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models	NeurIPS 22 Paper

Citation

If this project is helpful to your work, please cite it using the following format:

@misc{minisora,
    title={MiniSora},
    author={MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}

@misc{minisora,
    title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},
    author={Survey Paper Group of MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}

Minisora Community WeChat Group

Star History

How to Contribute to the Mini Sora Community

We greatly appreciate your contributions to the Mini Sora open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

Community contributors

minisora's People

Contributors

Stargazers

Watchers

Forkers

fanqino1 zhanghui-china jimmyma99 guoyifantastic drryanhuang ming-zch pommespeter zhangrui-wolf chg0901 junyaohu wenmengzhou yunyiliu matrixgame2018 lum1104 nobody-ml lymdlut ariafyy baiyu96 d-mahony-x rickymf4 axyzdong seifer08ms thomas-yanxin zhuxiongwei24 geoffreyfan rese1f hrain1016 xiaohu2015 chenglele buxianchen leonfrank coobiw w-sunmoon cominclip shoufachen tocasoft mr-harry ljtcxq hhhhwb ai-tianlong adammayor2018 goldwaterfall oahzxl a-new-b aplzem cavities proalize macroustc gsnice fab-liu qingtian5 xiechengmude rocket2q19 qingjingfei puellaquae cabraltattoo oxtapinear eltociear desksound-steenfr agentsolidhvidtoxyt halliele93cottonhope coolig8 stunnaani-e divisionclassy25 b-sleeventi bagotoxic-y generalweare-b 40versinda f-farerthebest l-audienche soccertary-98 truelasting-m trippindsto-c buffywideangeliche awhipped cephagne-o billite-jiggyough bozsyineedmoon buffar-m xrinairgi facubingessents openwheal-buferine chikktinnyschoncu bozdw79 waybeauty-r 791428954 evproductions folkevil asdlei99 saitaotechnology yanxg commentthug8lilll 0xchikdot wplayergy m-mckningou ox-precisepress truelasting-lowaller letsuchem-l tonyonst-t insidelifel

minisora's Issues

[Add] - file description in docs/README.md

Detailed Description

[Add] - file description in docs/README.md

Now it is empty.

add new instruction to the contribution method in [CONTRIBUTING.md]

Refer #72 to add new instruction to the contribution method in CONTRIBUTING.md for English and Chinese Version README files.

[docs] Move CONTRIBUTING_EN.md and CONTRIBUTING.md to docs folder

create a docs folder
move files
update links

[Add] Prompt Engineering Papers

Detailed Description

Content Name/Link: Prompt Engineering paper.

Current Status/Issue: Empty content

Add [ConferenceName Year] to Each Paper in the colomn of `Links` or `链接`

For example：

Diffusion Model
论文	链接
1) [ICCV 23] StableVideo: Text-driven Consistency-aware Diffusion Video Editing	Paper, Github, Project
2) [CVPR 24] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	Paper, Github, Project
3) DDPM: Denoising Diffusion Probabilistic Models	Paper, Github

This table in the above could be changed as the following:

Diffusion Model
论文	链接
1) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, Github, Project
2) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	CVPR 24 Paper, Github, Project
3) DDPM: Denoising Diffusion Probabilistic Models	NeurIPS 20 Paper, Github

Paper link is wrong

The link to the 8th paper leads to the 6th paper

[Add] new sora techrxiv preprint

Detailed Description

Content Name/Link: Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable

Current Status/Issue: The document lacks a comprehensive review focusing on OpenAI Sora, Stable Diffusion, and Lumiere.

Update Details: The document will be updated with a detailed review of the paper "Generate Impressive Videos with Text Instructions," which examines the architectures and models of the mentioned AI systems. The new paper will also address the challenges and implications of text-to-video AI, including trustworthiness, data transparency, and environmental sustainability.

Additional Information

Reason for Update: This update is necessary to provide the community with a thorough understanding of the current state and future potential of text-to-video AI technologies. It will help researchers, developers, and industry professionals to stay informed about the latest developments and their broader impacts.

Deadline (if any): ASAP

Is there a relationship between video encoding standards and video compression in video generation?

I have some questions regarding the relationship between video encoding standards and video compression in video generation. From my understanding, video encoding standards such as H.261, H.262, H.263, H.264, and H.265 are used to compress digital videos, reducing file size or lowering bandwidth requirements. However, I would like to delve deeper into how these encoding standards are related to video compression in the context of video generation.

I would greatly appreciate more detailed information to gain a better understanding of the relationship between video encoding standards and video compression. This knowledge will help me grasp the technical intricacies involved in video generation and processing.

Thank you for your assistance!

[Update] - Refering README.md to update README_CN.md

Detailed Description

I updated the README.md for better organization of contents, please follow the README.md to update the README_EN.md.

Additional Information

You need not to add the translated matrials to the English README_EN.md since it is meaningless to add Chinese translation matrials in a English Page.

[Add] - Improved DDPM to Diffusion model

Detailed Description

Content Name/Link: Improved Denoising Diffusion Probabilistic Models，github

Current Status/Issue: The README does not currently include the recent advancements in denoising diffusion probabilistic models, specifically the paper "Improved Denoising Diffusion Probabilistic Models" which introduces significant improvements to the field.

Update Details: The update will involve adding a new section or subsection within the Diffusion Models part of the README. This will include the title of the paper, and a link to the paper or its repository if available.

Additional Information

Reason for Update: This paper significantly enhances sample generation quality and efficiency through improved denoising diffusion probabilistic models and fosters further research and practical applications in the field by providing open-source code.

Deadline (if any): There is no strict deadline for this update; however, it is recommended to implement the changes as soon as possible to ensure the README remains up-to-date and relevant.

Upload the notes about the presentation of Video Diffussion Paper

add corresponding modification to readme_en.md with respect to PR 45

ref #45

[Update] translate the notes/README.md in English

Issue description

The current notes/README.md is in Chinese, please refer other pages to translate it in Chinese.

Steps：

cope notes/README.md as notes/README_CN.md
translate the notes/README.md in English
add Chinese ([English](./README.md) | 简体中文 ) and English links(English | [简体中文](./README_CN.md) ) to notes/README_CN.md and notes/README.md

详情

目前的notes/README.md是中文的，请参考其他页面翻译成中文。

步骤：

将 notes/README.md 处理为 notes/README_CN.md
将notes/README.md翻译成英文
1. 在notes/README_CN.md中添加中文链接([English](./README.md) | 简体中文)和英文链接(English | [简体中文](./README_CN.md)) 注释/README.md

Update PR template so its title has different prefix tags.

Is your feature request related to a problem? Please describe.

The current PR submission title is not standardized, it is best to unify the title format to facilitate the subsequent review and retrieval process.

Describe the solution you'd like
Create multiple PR templates. User should select the appropriate template when submitting, and automatically fill in prefix labels.

Describe alternatives you've considered
None.

Additional context
We can refer to openmmlab's template. And you can comment below to supplement your suggestions. I may finish this task tonight.

Update Latte.md with the markdown equation content failed to render

Please refer modification in PR #69 to update Latte.md with the markdown equation content failed to render successful

Add star history to English version README.md

Add star history to English version README.md.

不成熟想法：复现模拟一个小世界的可能性？

1、选定一部动漫(例如侠盗罗宾汉)或者电视剧(例如南来北往)，复现表现的世界。用来验证技术路径。
2、第一有对应小说文本支撑，数据量足够，场景不多比较固定

[Add] - Add a table of contents

Create a table of contents below Related Works for the different levels of headings in a document. such as

Related Works

Diffusion Models
Diffusion Transformer
Baseline Video Generation Models
.....

The link should also be added for English and Chinese README files [Note that, the links for them are different], for example.

In README.md, add https://github.com/mini-sora/minisora#diffusion-models

Diffusion Models

In README_zh-CN.md, add https://github.com/mini-sora/minisora/blob/main/README_zh-CN.md#diffusion-model

Diffusion Models

You can find the link when you check the websit in the front of <h3> tags

[Update] - Audio related resources

Detailed Description

Content Name/Link: Stable Audio Paper and GitHub Link

Current Status/Issue: The paper link and GitHub repository link for Stable Audio are currently missing, and the content name "NaturalSpeech" has incorrectly used Chinese brackets instead of English brackets.

Update Details:

Add the correct English paper link for Stable Audio.
Add the correct English GitHub link for Stable Audio.
Replace the Chinese brackets with English brackets in "NaturalSpeech".

Additional Information

Reason for Update: The update is necessary to provide accurate and accessible resources for users interested in Stable Audio. Correcting the brackets ensures consistency and readability for an international audience.

Deadline (if any): ASAP

[Update] - update papers of Audio Related Resource

Detailed Description

Content Name/Link: Make-An-Audio, AudioGPT, AudioLM, AudioGen, Audio-Visual LLM for Video Understanding, Macaw-LLM

Current Status/Issue: these papers is necessary about Audio Related Resource

Update Details:

Add the correct English paper link for Audio related resources paper.
Add the correct English GitHub link for Audio related resource paper.
Remove the extra commas for 'Layered Neural Atlases for Consistent Video Editing' in the en/zh ver.

Additional Information

Reason for Update: The update is necessary to provide accurate and accessible resources for users interested in Audio Related Resource. Correcting the commas ensures readability is necessary.

[Update] - Move two new survey papers into "最近更新", which should be located before '论文复现小组'

Detailed Description

Content Name/Link: State of the Art on Diffusion Models for Visual Computing

Current Status/Issue: The document may not include the latest advancements in diffusion models for visual computing.

Update Details: The "State of the Art on Diffusion Models for Visual Computing" paper will be incorporated.

Update

I think we could add these two new papers into "最近更新" to extract more attention for our SoraSurvey Team.

What's more, "最近更新" should also be moved to a place near the top part of README.md. For example, before '论文复现小组'

Additional Information

Reason for Update: This paper provides an intuitive starting point to explore video Diffusion model topic for researchers, artists, and practitioners alike.

README.md index to CONTRIBUTING.md is a error path

Describe the bug
can't jump into CONTRIBUTING.md from the README.md.

Update an issue template which is more appropriate for this repo

Is your feature request related to a problem? Please describe.
The current template is more inclined to the template of the development project, and a template more suitable for the style of this project is needed.

Describe the solution you'd like
Type of issue may include:

Request to add / update the code repo, arxiv website, project website, blog, demo of a paper...
Add / Fix features of this repo.
Typographical / link / spelling issues...
Other discussions...

Describe alternatives you've considered
None.

Additional context
We can refer to openmmlab's template. And you can comment below to supplement your suggestions. I may finish this task tonight.

License?

Contributors might not be sure what they're allowed to do in this project.

Can you add a license preferably an open source license so we can be sure of what we are allowed to do?

[Add] - ICML 23 Paper AudioLDM for t2a task

Detailed Description

Content Name/Link: AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.

Current Status/Issue: AudioLDM is missing in the section of audio related papers.

Update Details: Add the link of the paper , project, github , and etc.

Additional Information

Reason for Update: AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion.

Deadline (if any): ASAP

add corresponding modification to readme_en.md with respect to PR 43

ref #43

[Add] paper 'Taming Transformers for High-Resolution Image Synthesis' to 'Diffusion Transformer'

Add paper 'Taming Transformers for High-Resolution Image Synthesis' to 'Diffusion Transformer'

CVPR 21 paper: https://openaccess.thecvf.com/content/CVPR2021/papers/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.pdf

Github: https://github.com/CompVis/taming-transformers

Project: https://compvis.github.io/taming-transformers/

[Update] - add papers related to `PIXART-Σ`

Description

add papers related to PIXART-Σ

project: https://pixart-alpha.github.io/PixArt-sigma-project/

papers and links

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
https://arxiv.org/pdf/2403.04692.pdf

PIXART-α: FAST TRAINING OF DIFFUSION TRANSFORMER FOR PHOTOREALISTIC TEXT-TO-IMAGE SYNTHESIS
https://arxiv.org/pdf/2310.00426.pdf

PIXART-δ: FAST AND CONTROLLABLE IMAGE GENERATION WITH LATENT CONSISTENCY MODEL
https://arxiv.org/pdf/2401.05252.pdf

Create a pull request template and an issue template for the repository

[Add] Open-Sora Project to Repo

Add Open-Sora Project to Repo

Detailed Description

Content Name/Link: Open-Sora (https://github.com/hpcaitech/Open-Sora)

Current Status/Issue: Open-Sora is not currently listed in the repository.

Update Details: I propose to add Open-Sora to the repository as it is a high-performance open-source project that provides a development pipeline for Sora-like applications, powered by Colossal-AI. The project includes a complete architecture solution from data processing to training and deployment, supports dynamic resolution training, multiple model structures, various video compression methods, and multiple parallel training optimizations.

Additional Information

Reason for Update: Adding Open-Sora to the repository will benefit the community by providing access to a robust and versatile tool for developing and training multimodal AI models. It will also promote the use of Colossal-AI and contribute to the advancement of AI research and development in the field of video processing and multimodal learning.

Deadline (if any): There is no specific deadline for this update, but it would be beneficial to include it in the next repository update cycle.

Update equations in `./notes/SD3_zh-CN.md`

Some eqations could be interpreted by the Github, please check the reason and update it

Add dev-branch to minisora /codes/

copy codes from OpenDiT、SiT、W.A.L.T
check if the added codes could be updated
multi-branches development may be need for keep tracking updating source codes and adding our improvements to replicate Sora: add dev-branch

Sora有关论文复现小组人员招募

添加Sora有关论文复现小组微信二维码, 并在主页添加如下信息

复现论文主要有

DiT with OpenDiT
SiT
W.A.L.T

添加位置位于"近期圆桌讨论"上面

[Update] - Standardize the Labels of arXiv Papers [更新] - 标准化arXiv论文的标签

Detailed Description

Content Name/Link: The labels of arxiv papers

Current Status/Issue: The readme file currently lists arXiv papers with inconsistent types, some are labeled as "Paper" and others are formatted as "Arxiv YY".

Update Details: The update involves standardizing the type designation for all arXiv papers listed in the readme.

Additional Information

Reason for Update: The uniform categorization of paper types will improve the clarity and navigability of the readme file for users. It will make it easier for the community to identify the nature of each paper at a glance, thus enhancing the overall user experience and utility of the resource.

Deadline (if any): It is recommended to complete this formatting update before the next content refresh to ensure all users have access to the most current and accurate information.

详细描述

内容名称/链接：arXiv论文的标签

当前状态/问题：readme文件目前列出了类型不一致的arXiv论文，有些被标记为“Paper”，其他则格式化为“Arxiv YY”。

更新详情：更新涉及标准化readme中列出的所有arXiv论文的类型标识。

附加信息

更新原因：统一的论文类型分类将提高readme文件的清晰度和可导航性，方便用户使用。这将使社区更容易一眼识别每篇论文的性质，从而提升整体用户体验和资源的实用性。

截止日期（如果有）：建议在下次内容更新之前完成此格式更新，以确保所有用户都能获得最新和最准确的信息。

add corresponding modification to readme_en.md with respect to PR 44

ref #44

[Update] - add project and github link for paper FlowZero

FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

https://flowzero-video.github.io/

https://github.com/aniki-ly/FlowZero

Please check the reasearch works that just have paper links

[Update] - hot news and remove unnecessary para in the link of paper

Detailed Description

Content Name/Link: hot news and the link of sora techrxiv preprint

Current Status/Issue: The first Sora survey paper is currently missing from the hot news section. Additionally, the provided link to the Sora TechRxiv preprint includes an unnecessary commit parameter.

Update Details: Add the first sora paper into the list of hot news. And remove the commit para in the link of the preprint paper.

Additional Information

Reason for Update: To ensure that the latest and most relevant content is featured in the hot news section, and to provide a clean and direct link to the Sora TechRxiv preprint for easier access and citation purposes.

Deadline (if any): ASAP

[Update] - update papers of Audio Related Resource

Detailed Description

Content Name/Link: Diffsound, AudioLDM2, TANGO, MusicGen, LauraGPT

Current Status/Issue: these papers is necessary about Audio Related Resource

Update Details:

Add the correct English paper link for Audio related resources paper.
Add the correct English GitHub link for Audio related resource paper.

Additional Information

Reason for Update: The update is necessary to provide accurate and accessible resources for users interested in Audio Related Resource.

add corresponding modification to readme_en.md with respect to PR 42

ref #42

[Update] translate the `codes/README.md` in English

The current codes/README.md is in Chinese, please refer other pages to translate it in Chinese.

[Add] - ACL 2023 Paper Link to LAVIS project

Detailed Description

Content Name/Link: LAVIS - A Library for Language-Vision Intelligence

Current Status/Issue: The resource is missing the link to the associated paper.

Update Details: The paper titled "LAVIS - A Library for Language-Vision Intelligence" was presented at ACL 2023 and is currently available on the ACL Anthology. The link to the paper is https://aclanthology.org/2023.acl-demo.3.pdf. This link should be added to the resource page to provide direct access to the research for interested users.

Additional Information

Reason for Update: The update is necessary to ensure that users can easily access the full paper, which is a valuable resource for those interested in the field of language-vision intelligence. Providing the link will enhance the resource's utility and allow for better dissemination of the research findings.

Deadline (if any): There is no specific deadline mentioned for this update. However, it is recommended to perform the update as soon as possible to maintain the currency and relevance of the resource.

[Update] - update papers of Audio Related Resource

Detailed Description

Content Name/Link: Make-An-Audio 2, Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding.

Current Status/Issue: these papers are necessary about Audio Related Resource

Update Details:

Add the correct English paper link for Audio related resources paper.
Add the correct English GitHub link for Audio related resource paper.

Additional Information

Reason for Update: The update is necessary to provide accurate and accessible resources for users interested in Audio Related Resource.

[Update] Optimize English expression 优化英文表达

Issue Description

Optimize English expression (in README.md') for English page of README_CN.md's translation for each folder.

Requirements

Please ensure that the expression across all pages is consistent (use the same expression for the same meaning).
Aim to use scientific, concise, and professional vocabulary in your descriptions.
During the optimization process, feel free to modify the expressions in both Chinese and English simultaneously.
This task can be assigned to multiple people. When claiming the task, please specify which page you will optimize and leave a comment in the comment section.

问题描述

优化每个文件夹中 README_CN.md 的英文页面 (README.md) 的表达。

要求

请确保所有页面的表达一致（相同含义使用相同表达方式）。
尽量使用科学、简洁和专业的词汇描述。
在优化过程中，可以同时修改中文和英文的表达方式。
这个任务可以多人领取，请领取时，说明要优化哪个页面，在评论区评论

[Correct] - broken link to CONTRIBUTING_EN document in PR template and CONTRIBUTING File Name Update

New Issue Description

Change the CONTRIBUTING.md in Chinese and CONTRIBUTING_EN.md in English --> CONTRIBUTING.md in English and CONTRIBUTING_CN.md in Chinese

Detailed Description

Content Name/Link: CONTRIBUTING_EN.md

Current Status/Issue: The provided link for the CONTRIBUTING_EN document is not accessible or does not lead to the expected file.

Update Details: The issue needs to be resolved by either fixing the broken link or by providing the correct and functional link to the CONTRIBUTING_EN document.

Additional Information

Reason for Update: Ensuring that contributors can access the CONTRIBUTING_EN document is crucial for maintaining a clear and effective contribution process. A broken link can lead to confusion and hinder community engagement.

Deadline (if any): There is no specific deadline, but it would be beneficial to address this issue as soon as possible to minimize disruption to potential contributors.

[Update] - Synchronize the README_CN.md with the README.md. 同步中文页面README_CN.md内容到英文页面README.md

Task Announcement

Here, we are announcing tasks that we need assistance with. If you are interested, please let us know and become a part of our project's developer contributors.

The main task is to remove duplicates and synchronize content between the Chinese and English readme pages.
Next, we need to work on the Baseline Video Generation Models. This can be placed before Video Generation as a baseline model.
We need to include the current state-of-the-art papers and typical papers, and move less typical works to the Video Generation section.
Please mention this in the Chinese and English contributor's manual, and include the link to the contributor's manual in the PR template.

Some rules regarding the list format, which include the following points:

First, search to ensure that the literature is not already in the list to avoid duplication.
For typical papers or models, you can add an abbreviation before the paper's name.
For papers with a colon in the title, you can bold the model name before the colon.
For top conference papers and top journals, add the corresponding name in the Paper link, such as CVPR 23, and only bold the CVPR 23 ,-->[CVPR 23 paper], i.e., [**CVPR 23** paper] in markdown.

任务发布

这里发布下需要大家帮忙的任务, 感兴趣的可以提一下, 成为这个项目的开发者贡献者中的一员

主要是中英文readme页面的论文去重和内容同步,
再就是Baseline Video Generation Models, 这个可以放在 Video Generation前面, 作为baseline模型,
把目前soat的论文和典型论文放在里面, 不够典型的工作移动到Video Generation中
在中英文的贡献者手册中提一下, 并把贡献者手册的链接放在PR template中吧

规定下list格式问题, 包括以下要点

先搜索是否文献已经在list中, 不要重复,
典型论文或者模型, 可以在论文名前添加缩写名,
论文中有冒号的, 可以将冒号前的model名粗体
top会议论文和top期刊, 在Paper link中添加对应名字, 如 CVPR 23, 并只对CVPR 23 进行粗体表示 ,-->[CVPR 23 paper], 即用markdown语法表示为 [**CVPR 23** paper]

Linking to a wrong page.

It looks like this link is pointing to the wrong page. It points to the paper 'VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models'.

[Update] - update format of papers in section of `Prompt Engineering`

Detailed Description

add github and project page link for these papers
format modification refering other sections

Additional Information

Deadline (if any): Before Monday

add the two paper reading sources about Latte to the README_EN.md

please mention that, these two files about Latte in notes folder are written by Chinese.

Such as xx.pdf in Chinese

[Add] - Add VideoMamba

Detailed Description

Content Name/Link: VideoMamba: State Space Model for Efficient Video Understanding

Current Status/Issue: This is a new paper/project/resource that has not been previously included in the repository.

Update Details: The addition of this paper to the repository will provide a new benchmark for video understanding and contribute to the field's advancement. The code and models for VideoMamba are available on GitHub for easy access and further exploration. The repository can be found at: https://github.com/OpenGVLab/VideoMamba

Additional Information

Reason for Update: The inclusion of the VideoMamba paper is crucial as it presents a significant advancement in the field of video understanding. By adding this resource, the community will gain access to a state-of-the-art model that can enhance the efficiency and comprehensiveness of video analysis.

Deadline (if any): There is no specific deadline for this update, but it is recommended to add the paper as soon as possible to keep the repository current and relevant.

The contributing guide of mini sora community

In order to better contribute to the community, we should establish an open source contribution specification

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble

mini-sora / minisora Goto Github PK

minisora's Introduction

MiniSora Community

Hot News

Sora Reproduction Goals of MiniSora

MiniSora-DiT: Reproducing the DiT Paper with XTuner

Requirements

Background

Support

Recent round-table Discussions

Paper Interpretation of Stable Diffusion 3 paper: MM-DiT

Highlights from Previous Discussions

Recruitment of Presenters

Related Work

01 Diffusion Models

02 Diffusion Transformer

03 Baseline Video Generation Models

04 Diffusion UNet

05 Video Generation

06 Dataset

6.1 Public Datasets

6.2 Video Augmentation Methods

6.2.1 Basic Transformations

6.2.2 Feature Space

6.2.3 GAN-based Augmentation

6.2.4 Encoder/Decoder Based

6.2.5 Simulation

07 Patchifying Methods

08 Long-context

09 Audio Related Resource

10 Consistency

11 Prompt Engineering

Recaption

12 Security

13 World Model

14 Video Compression

15 Mamba

15.1 Theoretical Foundations and Model Architecture

15.2 Image Generation and Visual Applications

15.3 Video Processing and Understanding

15.4 Medical Image Processing

16 Existing high-quality resources

17 Efficient Training

17.1 Parallelism based Approach

17.1.1 Data Parallelism (DP)

17.1.2 Model Parallelism (MP)

17.1.3 Pipeline Parallelism (PP)

17.1.4 Generalized Parallelism (GP)

17.1.5 ZeRO Parallelism (ZP)

17.2 Non-parallelism based Approach

17.2.1 Reducing Activation Memory

17.2.2 CPU-Offloading

17.2.3 Memory Efficient Optimizer

17.3 Novel Structure

18 Efficient Inference

18.1 Reduce Sampling Steps

18.1.1 Continuous Steps

18.1.2 Fast Sampling

18.1.3 Step distillation

18.2 Optimizing Inference

18.2.1 Low-bit Quantization

18.2.2 Parallel/Sparse inference

Citation

Minisora Community WeChat Group

Star History

How to Contribute to the Mini Sora Community

Community contributors

minisora's People

Contributors

Stargazers

Watchers

Forkers

minisora's Issues

Detailed Description

Detailed Description

Diffusion Model

Diffusion Model

Detailed Description

Additional Information

Detailed Description