CameraCtrl

This repository is the official implementation of CameraCtrl.

CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, Ceyuan Yang

Todo List

  • Release inference code.
  • Release pretrained models on AnimateDiffV3.
  • Release training code.
  • Release Gradio Demo.
  • Release pretrained models on SVD.

Configurations

Environment

  • 64-bit Python 3.10 and PyTorch 1.13.0 or higher.
  • CUDA 11.7
  • Use the following commands to install the required packages:
conda env create -f environment.yaml
conda activate cameractrl
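
After activation, a quick sanity check (not part of the official setup steps) is to confirm that PyTorch and CUDA are visible:

import torch

print("PyTorch:", torch.__version__)                 # expect 1.13.0 or higher
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))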

Dataset

  • Download the camera trajectories and videos from RealEstate10K.
  • Run tools/gather_realestate.py to get all the clips for each video.
  • Run tools/get_realestate_clips.py to get the video clips from the original videos.
  • Use LAVIS or other captioning methods to generate a caption for each video clip. We provide our extracted captions on Google Drive.
  • Run tools/generate_realestate_json.py to generate the JSON files for training and testing. The validation JSON file can be constructed by randomly sampling some items from the training JSON file.
  • After the above steps, the dataset folder should look like this:
- RealEstate10k
  - annotations
    - test.json
    - train.json
    - validation.json
  - pose_files
    - 0000cc6d8b108390.txt
    - 00028da87cc5a4c4.txt
    - 0002b126b0a8a685.txt
    - 0003a9bce989e532.txt
    - 000465ebe46a98d2.txt
    - ...
  - video_clips
    - 00ccbtp2aSQ
    - 00rMZpGSeOI
    - 01bTY_glskw
    - 01PJ3skCZPo
    - 01uaDoluhzo
    - ...
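
The pose txt files follow the standard RealEstate10K camera format: the first line is the source video URL, and each subsequent line holds a frame timestamp (in microseconds), four normalized intrinsics (fx, fy, cx, cy), two zeros, and the 12 row-major entries of a 3x4 world-to-camera matrix. A minimal loading sketch under that assumption (not a script from this repo):

import numpy as np

def load_realestate_poses(path):
    with open(path, "r") as f:
        lines = f.read().strip().splitlines()
    video_url = lines[0]                         # first line: source video URL
    timestamps, intrinsics, extrinsics = [], [], []
    for line in lines[1:]:
        values = list(map(float, line.split()))
        timestamps.append(int(values[0]))        # frame timestamp in microseconds
        intrinsics.append(values[1:5])           # fx, fy, cx, cy (normalized)
        extrinsics.append(np.array(values[7:19]).reshape(3, 4))  # world-to-camera [R | t]
    return video_url, timestamps, np.array(intrinsics), np.array(extrinsics)

url, ts, K, E = load_realestate_poses("RealEstate10k/pose_files/0000cc6d8b108390.txt")
print(url, len(ts), K.shape, E.shape)            # (N, 4) intrinsics, (N, 3, 4) extrinsics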

Inference

Prepare Models

  • Download Stable Diffusion V1.5 (SD1.5) from HuggingFace.
  • Download the checkpoints of AnimatediffV3 (ADV3) adaptor and motion module from AnimateDiff.
  • Download the pretrained camera control model from HuggingFace.
  • Run tools/merge_lora2unet.py to merge the ADV3 adaptor weights into the SD1.5 unet and save the result to a new subfolder (e.g., unet_webvidlora_v3) under the SD1.5 folder; a conceptual sketch of this merge operation is shown after this list.
  • (Optional) Download the image LoRA model pretrained on the RealEstate10K dataset from HuggingFace to sample videos of indoor and outdoor estates.
  • (Optional) Download a personalized base model, such as Realistic Vision, from CivitAI.
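
Conceptually, merging LoRA weights into a base layer (as tools/merge_lora2unet.py does for the unet) amounts to adding the scaled low-rank product to the original matrix. A minimal, hypothetical sketch of that operation, not the repo's script:

import torch

def merge_lora_weight(base_weight, lora_down, lora_up, alpha=1.0):
    # base_weight: (out_dim, in_dim) unet weight
    # lora_down:   (rank, in_dim) LoRA down-projection
    # lora_up:     (out_dim, rank) LoRA up-projection
    return base_weight + alpha * (lora_up @ lora_down)

# Toy example with random tensors; a real checkpoint stores many such pairs keyed by layer name.
w = torch.randn(320, 320)
down, up = torch.randn(4, 320), torch.randn(320, 4)
print(merge_lora_weight(w, down, up, alpha=0.8).shape)   # torch.Size([320, 320])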

Prepare camera trajectory & prompts

  • Use tools/select_realestate_clips.py to prepare the trajectory txt file. Some example trajectories and their corresponding reference videos are provided in assets/pose_files and assets/reference_videos, respectively. The generated trajectories can be visualized with tools/visualize_trajectory.py (the idea is sketched after this list).
  • Prepare the prompts (together with negative prompts and specific seeds); one example is assets/cameractrl_prompts.json.
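
The idea behind the trajectory plots produced by tools/visualize_trajectory.py can be sketched in a few lines: recover each camera center from its world-to-camera matrix as -R^T t and plot the resulting path. A minimal sketch (not the repo's tool), reusing the pose loader shown above:

import numpy as np
import matplotlib.pyplot as plt

def camera_centers(w2c):
    # w2c: (N, 3, 4) world-to-camera [R | t]; returns (N, 3) camera centers -R^T t
    R, t = w2c[:, :, :3], w2c[:, :, 3]
    return -np.einsum("nij,ni->nj", R, t)

# Toy example: a straight dolly-forward trajectory of 16 identity-rotation poses.
poses = np.stack([np.hstack([np.eye(3), [[0.0], [0.0], [-0.1 * i]]]) for i in range(16)])
centers = camera_centers(poses)

ax = plt.figure().add_subplot(projection="3d")
ax.plot(centers[:, 0], centers[:, 1], centers[:, 2], marker="o")
ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
plt.show()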

Inference

  • Run inference.py to sample videos
python -m torch.distributed.launch --nproc_per_node=8 --master_port=25000 inference.py \
      --out_root ${OUTPUT_PATH} \
      --ori_model_path ${SD1.5_PATH} \
      --unet_subfolder ${SUBFOLDER_NAME} \
      --motion_module_ckpt ${ADV3_MM_CKPT} \
      --pose_adaptor_ckpt ${CAMERACTRL_CKPT} \
      --model_config configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml \
      --visualization_captions assets/cameractrl_prompts.json \
      --use_specific_seeds \
      --trajectory_file assets/pose_files/0f47577ab3441480.txt \
      --n_procs 8

where

  • OUTPUT_PATH refers to the path where the generated results are saved.
  • SD1.5_PATH refers to the root path of the downloaded SD1.5 model.
  • SUBFOLDER_NAME refers to the name of the unet subfolder under SD1.5_PATH; the default is unet. Here we adopt the name specified when running tools/merge_lora2unet.py.
  • ADV3_MM_CKPT refers to the path of the downloaded AnimateDiffV3 motion module checkpoint.
  • CAMERACTRL_CKPT refers to the path of the downloaded pretrained camera control (CameraCtrl) model checkpoint.

The above inference example generates videos in the original T2V model domain. The inference.py script also supports generating videos in other domains, either with image LoRAs (args.image_lora_rank and args.image_lora_ckpt), such as the RealEstate10K LoRA, or with personalized base models (args.personalized_base_model), such as Realistic Vision. Please refer to the code for details.

Results

  • Same text prompt with different camera trajectories
[Result grid: six camera-trajectory plots paired with videos generated from the same text prompt.]
  • Camera control on different domains' videos
[Result grid: camera-trajectory plots paired with videos generated by SD1.5, SD1.5 + RealEstate LoRA, Realistic Vision, and ToonYou.]

Note that each image paired with a video represents the camera trajectory. Each small tetrahedron in the image represents the position and orientation of the camera for one video frame: its vertex marks the camera location, while its base represents the camera's imaging plane. The red arrows indicate the movement of the camera position, and the camera rotation can be observed through the orientation of the tetrahedra.

Training

Step1 (RealEstate10K image LoRA)

Update the data and pretrained-model paths in the config configs/train_image_lora/realestate_lora.yaml:

pretrained_model_path: "[replace with SD1.5 root path]"
train_data:
  root_path: "[replace RealEstate10K root path]"

Other training parameters (lr, epochs, validation settings, etc.) are also included in the config files.

Then, launch the image LoRA training using slurm

./slurm_run.sh ${PARTITION} image_lora 8 configs/train_image_lora/realestate_lora.yaml train_image_lora.py

or PyTorch

./dist_run.sh configs/train_image_lora/realestate_lora.yaml 8 train_image_lora.py

We provide our pretrained checkpoint of the RealEstate10K LoRA model in HuggingFace.

Step2 (Camera control model)

Update the data and pretrained-model paths in the config configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml:

pretrained_model_path: "[replace with SD1.5 root path]"
train_data:
  root_path: "[replace RealEstate10K root path]"
validation_data:
  root_path:       "[replace RealEstate10K root path]"
lora_ckpt: "[Replace with RealEstate10k image LoRA ckpt]"
motion_module_ckpt: "[Replace with ADV3 motion module]"

Other training parameters (lr, epochs, validation settings, etc.) are also included in the config files.
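
Before launching, a quick way to double-check the edited paths is to load the config and print the fields you replaced (a hedged sketch, assuming OmegaConf is available in the environment):

from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml")
print(cfg.pretrained_model_path)        # SD1.5 root path
print(cfg.train_data.root_path)         # RealEstate10K root path
print(cfg.lora_ckpt)                    # RealEstate10K image LoRA checkpoint
print(cfg.motion_module_ckpt)           # ADV3 motion module checkpoint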

Then, launch the camera control model training using slurm

./slurm_run.sh ${PARTITION} cameractrl 8 configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml train_camera_control.py

or PyTorch

./dist_run.sh configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml 8 train_camera_control.py

Acknowledgement

We thank AnimateDiff for their amazing code and models.

BibTeX

@misc{he2024cameractrl,
      title={CameraCtrl: Enabling Camera Control for Text-to-Video Generation}, 
      author={Hao He and Yinghao Xu and Yuwei Guo and Gordon Wetzstein and Bo Dai and Hongsheng Li and Ceyuan Yang},
      year={2024},
      eprint={2404.02101},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
