tiger-ai-lab / consisti2v

ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

Home Page: https://tiger-ai-lab.github.io/ConsistI2V/

License: MIT License

Languages: Python 100.00%
Topics: diffusion-models, image-to-video-generation, video-generation, video-synthesis

consisti2v's Introduction

ConsistI2V

🌐 Homepage | 📖 arXiv | 🤗 Model | 📊 I2V-Bench | 🤗 Space | 🎬 Replicate Demo

This repo contains the codebase for the paper "ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation".

We propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, and (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. These two approaches enable ConsistI2V to generate highly consistent videos.
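For intuition, the low-frequency noise initialization can be sketched in a few lines of PyTorch. The snippet below is only a minimal illustration under our own naming (the function name and the hard low-pass mask are assumptions; the actual implementation in this repo follows the paper and may use a different filter shape): it keeps the low spatial frequencies of the first-frame latent and the high spatial frequencies of fresh Gaussian noise, with a normalized cutoff analogous to frameinit_kwargs.filter_params.d_s.

# Illustrative sketch only; see the repository code for the exact filtering used.
import torch

def lowfreq_noise_init(first_frame_latent, noise, d_s=0.25):
    """first_frame_latent, noise: (B, C, T, H, W) latents; d_s: normalized spatial cutoff."""
    _, _, _, H, W = noise.shape
    # Normalized spatial frequency grid in [-1, 1), DC component at the center.
    fy = torch.fft.fftshift(torch.fft.fftfreq(H)) * 2
    fx = torch.fft.fftshift(torch.fft.fftfreq(W)) * 2
    lowpass = ((fy[:, None] ** 2 + fx[None, :] ** 2).sqrt() <= d_s).float()
    lowpass = lowpass.to(noise.device)[None, None, None]  # broadcast over B, C, T

    def fft2(x):   # centered 2D FFT over the spatial dimensions
        return torch.fft.fftshift(torch.fft.fft2(x, dim=(-2, -1)), dim=(-2, -1))

    def ifft2(x):  # inverse of the centered 2D FFT
        return torch.fft.ifft2(torch.fft.ifftshift(x, dim=(-2, -1)), dim=(-2, -1)).real

    # Low-frequency band from the (repeated) first-frame latent, high-frequency band from noise.
    return ifft2(fft2(first_frame_latent) * lowpass + fft2(noise) * (1 - lowpass))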

🔔News

  • [2024-03-26]: Try our Gradio Demo on Huggingface Space! Thanks @AK for the help.
  • [2024-03-21]: Add Gradio Demo. Run python app.py to launch the demo locally.
  • [2024-03-09]: Add Replicate Demo. Thanks @chenxwh for the effort!
  • [2024-02-26]: Release code and model for ConsistI2V.

Environment Setup

Prepare the codebase and Conda environment using the following commands:

git clone https://github.com/TIGER-AI-Lab/ConsistI2V
cd ConsistI2V

conda env create -f environment.yaml
conda activate consisti2v
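
After activating the environment, a quick sanity check (generic PyTorch, nothing specific to this repo) confirms that the GPU is visible:

# Generic check; the exact package versions come from environment.yaml.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))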

Inference

Our model is available for download on 🤗 Hugging Face. To generate videos with ConsistI2V, modify the inference configurations in configs/inference/inference.yaml and the input prompt file configs/prompts/default.yaml, and then run the sampling script with the following command:

python -m scripts.animate \
    --inference_config configs/inference/inference.yaml \
    --prompt_config configs/prompts/default.yaml \
    --format mp4

The inference script automatically downloads the model from Hugging Face when pretrained_model_path in configs/inference/inference.yaml is set to TIGER-Lab/ConsistI2V (the default configuration). If the script has trouble downloading the model, you can download it to local storage yourself and point pretrained_model_path to the local model path.
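
For example, one way to pre-download the weights is with the standard huggingface_hub client (the local directory below is only an example; afterwards, set pretrained_model_path in configs/inference/inference.yaml to that path):

# Download the ConsistI2V weights to local storage ahead of time.
# "./checkpoints/ConsistI2V" is an example path; use any directory you like.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TIGER-Lab/ConsistI2V",
    local_dir="./checkpoints/ConsistI2V",
)
print("Model downloaded to:", local_path)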

You can also explicitly specify the input text prompt, negative prompt, sampling seed, and first frame path:

python -m scripts.animate \
    --inference_config configs/inference/inference.yaml \
    --prompt "timelapse at the snow land with aurora in the sky." \
    --n_prompt "your negative prompt" \
    --seed 42 \
    --path_to_first_frame assets/example/example_01.png \
    --format mp4

To modify the inference configurations in configs/inference/inference.yaml from the command line, append extra arguments to the end of the inference command:

python -m scripts.animate \
    --inference_config configs/inference/inference.yaml \
    ... # additional arguments
    --format mp4 \
    sampling_kwargs.num_videos_per_prompt=4 \
    frameinit_kwargs.filter_params.d_s=0.5  # overwrite the configs in the config file
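
The dotted key=value syntax matches OmegaConf-style dotlist overrides (the acknowledged AnimateDiff codebase uses OmegaConf). As an assumption about how such overrides are merged, rather than a description of this script's exact internals, the pattern looks like:

# Hedged sketch: merging dotlist overrides into a YAML config with OmegaConf.
# The actual argument parsing in scripts/animate.py may differ.
from omegaconf import OmegaConf

base = OmegaConf.load("configs/inference/inference.yaml")
overrides = OmegaConf.from_dotlist([
    "sampling_kwargs.num_videos_per_prompt=4",
    "frameinit_kwargs.filter_params.d_s=0.5",
])
config = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(config))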

We also created a Gradio demo for easier use of ConsistI2V. The demo can be launched locally by running the following command:

conda activate consisti2v
python app.py

By default, the demo runs at localhost:7860.
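
If you need the demo on a different host or port, Gradio's launch() accepts server_name and server_port; the snippet below is a generic Gradio pattern (the real app.py wires the ConsistI2V pipeline into the interface and may expose these options differently):

# Generic Gradio pattern; the demo in app.py may be structured differently.
import gradio as gr

def generate(prompt, first_frame):
    # Placeholder: the real demo runs the ConsistI2V sampling pipeline here.
    return None

demo = gr.Interface(fn=generate, inputs=["text", "image"], outputs="video")
demo.launch(server_name="0.0.0.0", server_port=7860)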

Training

Modify the training configurations in configs/training/training.yaml and run the following command to train the model:

python -m torch.distributed.run \
    --nproc_per_node=${GPU_PER_NODE} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${NUM_NODES} \
    --node_rank=${NODE_RANK} \
    train.py \
    --config configs/training/training.yaml \
    -n consisti2v_training \
    --wandb

where GPU_PER_NODE, MASTER_ADDR, MASTER_PORT, NUM_NODES, and NODE_RANK should be set according to your training environment. The dataloader in our code assumes a root folder train_data.webvid_config.video_folder containing all videos and a JSONL file train_data.webvid_config.json_path containing relative video paths and captions, with each line in the following format:

{"text": "A man rolling a winter sled with a child sitting on it in the snow close-up", "time": "30.030", "file": "relative/path/to/video.mp4", "fps": 29.97002997002997}

Videos can be stored in multiple subdirectories. Alternatively, you can modify the dataloader to support your own dataset. As with inference, you can append additional arguments to the end of the training command to override the training configurations in configs/training/training.yaml.
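
As a concrete example, a small helper like the one below (hypothetical, not part of this repo) writes metadata in the expected JSONL format; point train_data.webvid_config.json_path to the resulting file:

# Hypothetical helper (not part of this repo): write one JSON record per video.
import json

records = [
    {
        "text": "A man rolling a winter sled with a child sitting on it in the snow close-up",
        "time": "30.030",                      # as in the example record above
        "file": "relative/path/to/video.mp4",  # relative to train_data.webvid_config.video_folder
        "fps": 29.97002997002997,
    },
]

with open("train_metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")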

Citation

Please cite our paper if you find our code, data, models, or results helpful.

@article{ren2024consisti2v,
  title={ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation},
  author={Ren, Weiming and Yang, Harry and Zhang, Ge and Wei, Cong and Du, Xinrun and Huang, Stephen and Chen, Wenhu},
  journal={arXiv preprint arXiv:2402.04324},
  year={2024}
}

Acknowledgements

Our codebase is built upon AnimateDiff, FreeInit, and 🤗 diffusers. Thanks to the authors for open-sourcing their work.

consisti2v's People

Contributors

chenxwh, wren93


consisti2v's Issues

Discussion on Computing Resources and Training Details

Hello, I am very interested in your work, and I am really impressed with your demo. I would like to ask how many GPUs were used to train the diffusion model and how long training took. Additionally, the dataset is sampled from WebVid-10M, and I noticed that you only sample 16 frames from each video. How do you ensure that the sampled clips are sufficiently dynamic, and is this 16-frame sampling a tradeoff? Looking forward to your response!

autoregressive doesn't work?

Hi there - thanks for this amazing project and releasing the code!

I'm trying to run autoregressive inference using the default YAML file inference_autoregress, but the resulting video ends up being the same length as with the regular inference config.

Any ideas what I might be doing wrong?

Where to download the Training Dataset

Hi authors,
Thanks for this awesome work! In the paper, ConsistI2V is trained on the WebVid-10M dataset. If I want to reproduce the training, which website should I use to download this dataset? Thanks!

Code Availability for ConsistI2V Project?

Hi there!

I'm excited about the ConsistI2V project's ability to generate videos that remain consistent with the source image. I noticed the code isn't currently available in the repository. While I understand it's still under development, I'm curious whether there's any information about a potential release timeframe.

I appreciate any insights you can share about the code's availability. Thanks for your time and the awesome project!

watermark problem

Hi,

I wonder why there is always a watermark-like pattern appearing in the generated videos. Any idea how to get rid of it?

The camera motion cannot be used

Hello, thanks for your nice work! I want to use the code to reproduce the camera motion results. When I simply set the camera motion (such as pan_left), a dimension mismatch causes problems in the z_T calculation. How should I use the code to get the camera motion results?

Issue with blurry results in fine-tuned Model

Hello, your work is really cool!

I have been fine-tuning your model on my own dataset of 25k videos, starting from your TIGER-Lab/ConsistI2V checkpoint. Due to limited resources, I used a batch size of 2 on 2 RTX 6000 GPUs, while keeping the rest of the configuration the same. However, I noticed that the geometry of the moving objects is blurry.

Is this an expected outcome since I cannot replicate the batch size of 192? Do the number of GPUs or the dataset size matter here? Did you observe this problem while training the model, and did it go away after training for longer?
