exponentialml / text-to-video-finetuning Goto Github PK

View Code? Open in Web Editor NEW

619.0 18.0 103.0 1.86 MB

Finetune ModelScope's Text To Video model using Diffusers 🧨

License: MIT License

Python 100.00%

diffusers stable-diffusion text-to-video modelscope deep-learning diffusion-models pytorch text2video

text-to-video-finetuning's Introduction

Video Credit: dotsimulate

Model: Zeroscope XL

Text-To-Video-Finetuning

Finetune ModelScope's Text To Video model using Diffusers 🧨

Important Update 2023-12-14

First of all a note from me. Thank you guys for your support, feedback, and journey through discovering the nascent, innate potential of video Diffusion Models.

@damo-vilab Has released a repository for finetuning all things Video Diffusion Models, and I recommend their implementation over this repository. https://github.com/damo-vilab/i2vgen-xl

62e33a713e863650.mp4

This repository will no longer be updated, but will instead be archived for researchers & builders that wish to bootstrap their projects. I will be leaving the issues, pull requests, and all related things for posterity purposes.

Thanks again!

Updates

2023-7-12: You can now train a LoRA that is compatibile with the webui extension! See instructions here.
2023-4-17: You can now convert your trained models from diffusers to .ckpt format for A111 webui. Thanks @kabachuha!
2023-4-8: LoRA Training released! Checkout configs/v2/lora_training_config.yaml for instructions.
2023-4-8: Version 2 is released!
2023-3-29: Added gradient checkpointing support.
2023-3-27: Support for using Scaled Dot Product Attention for Torch 2.0 users.

Getting Started

Requirements & Installation

git clone https://github.com/ExponentialML/Text-To-Video-Finetuning.git
cd Text-To-Video-Finetuning
git lfs install
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope_diffusers/

Other Models

Alternatively, you can train starting from other models made by the community.

Contributer	Model Name	Link
cerspense	ZeroScope	https://huggingface.co/cerspense/zeroscope_v2_576w
cameduru	Potat1	https://huggingface.co/camenduru/potat1
strangeman3107	animov-512x	https://huggingface.co/strangeman3107/animov-512x

Create Conda Environment (Optional)

It is recommended to install Anaconda.

Windows Installation: https://docs.anaconda.com/anaconda/install/windows/

Linux Installation: https://docs.anaconda.com/anaconda/install/linux/

conda create -n text2video-finetune python=3.10
conda activate text2video-finetune

Python Requirements

pip install -r requirements.txt

Hardware

All code was tested on Python 3.10.9 & Torch version 1.13.1 & 2.0.

It is highly recommended to install >= Torch 2.0. This way, you don't have to install Xformers or worry about memory performance.

If you don't have Xformers enabled, you can follow the instructions here: https://github.com/facebookresearch/xformers

Recommended to use a RTX 3090, but you should be able to train on GPUs with <= 16GB ram with:

Validation turned off.
Xformers or Torch 2.0 Scaled Dot-Product Attention
Gradient checkpointing enabled.
Resolution of 256.
Hybrid LoRA training.
Training only using LoRA with ranks between 4-16.

Preprocessing your data

Using Captions

You can use caption files when training on images or video. Simply place them into a folder like so:

Images: /images/img.png /images/img.txt Videos: /videos/vid.mp4 | /videos/vid.txt

Then in your config, make sure to have -folder enabled, along with the root directory containing the files.

Process Automatically

You can automatically caption the videos using the Video-BLIP2-Preprocessor Script

Configuration

The configuration uses a YAML config borrowed from Tune-A-Video reposotories.

All configuration details are placed in configs/v2/train_config.yaml. Each parameter has a definition for what it does.

How would you recommend I proceed with making a config with my data?

I highly recommend (I did this myself) going to configs/v2/train_config.yaml. Then make a copy of it and name it whatever you wish my_train.yaml.

Then, follow each line and configure it for your specific use case.

The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.

Training a LoRA

Please read this section carefully if you are training a LoRA model

You can also train a LoRA that is compatible with the webui extension. By default it's set to 'cloneofsimo', which was the first LoRA implementation for Stable Diffusion.

This ('cloneofsimo') version you can use in the inference.py file in this repository. It is not compatible with the webui.

To train and use a LoRA with the webui, change the lora_version to "stable_lora" in your config if you already have one made.

This will train an A1111 webui extension compatibile LoRA. You can get started at configs/v2/stable_lora_config.yaml and everything is set by default in there. During and after training, LoRAs will be saved in your outputs directory with the prefix _webui.

If you do not choose this setting, you will not currently be able to use these in the webui. If you train a Stable LoRA file, you cannot currently use them in inference.py.

Continue training a LoRA

To continue training a LoRA, simply set your lora_path in your config to the directory that contains your LoRA file(s), not an individual file. Each specific LoRA should have _unet or _text_encoder in the file name respectively, or else it will not work.

You should then be able to resume training from a LoRA model, regardless of which method you use (as long as the trained LoRA matches the version in the config).

What you cannot do:

Use LoRA files that were made for SD image models in other trainers.
Use 'cloneofsimo' LoRAs in another project (unless you build it or create a PR)
Merge LoRA weights together (yet).

Finetune.

python train.py --config ./configs/v2/train_config.yaml

Training Results

With a lot of data, you can expect training results to show at roughly 2500 steps at a constant learning rate of 5e-6.

When finetuning on a single video, you should see results in half as many steps.

After training, you should see your results in your output directory.

By default, it should be placed at the script root under ./outputs/train_<date>

From my testing, I recommend:

Keep the number of sample frames between 4-16. Use long frame generation for inference, not training.
If you have a low VRAM system, you can try single frame training or just use n_sample_frames: 2.
Using a learning rate of about 5e-6 seems to work well in all cases.
The best quality will always come from training the text encoder. If you're limited on VRAM, disabling it can help.
Leave some memory to avoid OOM when saving models during training.

Running inference

The inference.py script can be used to render videos with trained checkpoints.

Example usage:

python inference.py \
  --model camenduru/potat1 \
  --prompt "a fast moving fancy sports car" \
  --num-frames 60 \
  --window-size 12 \
  --width 1024 \
  --height 576 \
  --sdp

> python inference.py --help

usage: inference.py [-h] -m MODEL -p PROMPT [-n NEGATIVE_PROMPT] [-o OUTPUT_DIR]
                    [-B BATCH_SIZE] [-W WIDTH] [-H HEIGHT] [-T NUM_FRAMES]
                    [-WS WINDOW_SIZE] [-VB VAE_BATCH_SIZE] [-s NUM_STEPS]
                    [-g GUIDANCE_SCALE] [-i INIT_VIDEO] [-iw INIT_WEIGHT] [-f FPS]
                    [-d DEVICE] [-x] [-S] [-lP LORA_PATH] [-lR LORA_RANK] [-rw]

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        HuggingFace repository or path to model checkpoint directory
  -p PROMPT, --prompt PROMPT
                        Text prompt to condition on
  -n NEGATIVE_PROMPT, --negative-prompt NEGATIVE_PROMPT
                        Text prompt to condition against
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory to save output video to
  -B BATCH_SIZE, --batch-size BATCH_SIZE
                        Batch size for inference
  -W WIDTH, --width WIDTH
                        Width of output video
  -H HEIGHT, --height HEIGHT
                        Height of output video
  -T NUM_FRAMES, --num-frames NUM_FRAMES
                        Total number of frames to generate
  -WS WINDOW_SIZE, --window-size WINDOW_SIZE
                        Number of frames to process at once (defaults to full
                        sequence). When less than num_frames, a round robin diffusion
                        process is used to denoise the full sequence iteratively one
                        window at a time. Must be divide num_frames exactly!
  -VB VAE_BATCH_SIZE, --vae-batch-size VAE_BATCH_SIZE
                        Batch size for VAE encoding/decoding to/from latents (higher
                        values = faster inference, but more memory usage).
  -s NUM_STEPS, --num-steps NUM_STEPS
                        Number of diffusion steps to run per frame.
  -g GUIDANCE_SCALE, --guidance-scale GUIDANCE_SCALE
                        Scale for guidance loss (higher values = more guidance, but
                        possibly more artifacts).
  -i INIT_VIDEO, --init-video INIT_VIDEO
                        Path to video to initialize diffusion from (will be resized to
                        the specified num_frames, height, and width).
  -iw INIT_WEIGHT, --init-weight INIT_WEIGHT
                        Strength of visual effect of init_video on the output (lower
                        values adhere more closely to the text prompt, but have a less
                        recognizable init_video).
  -f FPS, --fps FPS     FPS of output video
  -d DEVICE, --device DEVICE
                        Device to run inference on (defaults to cuda).
  -x, --xformers        Use XFormers attnetion, a memory-efficient attention
                        implementation (requires `pip install xformers`).
  -S, --sdp             Use SDP attention, PyTorch's built-in memory-efficient
                        attention implementation.
  -lP LORA_PATH, --lora_path LORA_PATH
                        Path to Low Rank Adaptation checkpoint file (defaults to empty
                        string, which uses no LoRA).
  -lR LORA_RANK, --lora_rank LORA_RANK
                        Size of the LoRA checkpoint's projection matrix (defaults to
                        64).
  -rw, --remove-watermark
                        Post-process the videos with LAMA to inpaint ModelScope's
                        common watermarks.

Developing

Please feel free to open a pull request if you have a feature implementation or suggesstion! I welcome all contributions.

I've tried to make the code fairly modular so you can hack away, see how the code works, and what the implementations do.

Deprecation

If you want to use the V1 repository, you can use the branch here.

Shoutouts

Showlab and bryandlee[https://github.com/bryandlee/Tune-A-Video] for their Tune-A-Video contribution that made this much easier.
lucidrains for their implementations around video diffusion.
cloneofsimo for their diffusers implementation of LoRA.
kabachuha for their conversion scripts, training ideas, and webui works.
JCBrouwer Inference implementations.
sergiobr Helpful ideas and bug fixes.

Citation

If you find this work interesting, consider citing the original ModelScope Text-to-Video Technical Report:

@article{ModelScopeT2V,
  title={ModelScope Text-to-Video Technical Report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}

text-to-video-finetuning's People

Contributors

Stargazers

Watchers

Forkers

treksis techthiyanes wpu93 sergiobr stjordanis spookybando hithereai gdtiti jcbrouwer dguo98 kabachuha brstar96 zero506 sigil-wen tthking nopeanuts jags111 kustomzone florinnichifiriuc akashc1 syntheticthinkers cluna80 genecyber dvschultz one-shot-finish cephdon yolandaw2021 ddaying 0x-maker zhouliang-yu camenduru commerceless ren-creater abdellahgoplatform xiaoya-li bruefire m5l14i11 delta-qin gaowudao womboai dreamingtulpa maximepeabody nofeetbird0321 bfasenfest cdalinghaus justinwking nahidalam myprivateclonelibrary shameforest sentient-22 791428954 pengge revanthraja yrdpplgngr xuzhouwang nbardy sohaib0399 philipmeier pixeli99 jaberwiki jacobyuan7 ssws3 basicsix zideliu shenhongdeng 5l1v3r1 liangzuan1983 samran-elahi yangbinb neko-149 dailingx xuweiyichen xunguo76 xfanac cuizhuyefei usghdfic julkaztwittera tyrannicawe wipwai steveefemsc qiudi0127 jinwook-shim kai0226 thucth-qt cjt222 luthandomaqondo 4yu5h-crtl joel-osebe leojc wenhui-ml arkboy1224 hyeonho99 cryptoholder-la zhuxiongwei24 qingshui szad670401 darkcloudn appimatesa black141312 thanhnm-cs

text-to-video-finetuning's Issues

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

After running the train.py I get this RuntimeError:

C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_diris deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Useproject_dir` instead.
warnings.warn(
03/24/2023 08:52:48 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
03/24/2023 08:52:50 - INFO - main - ***** Running training *****
03/24/2023 08:52:50 - INFO - main - Num examples = 1
03/24/2023 08:52:50 - INFO - main - Num Epochs = 1200
03/24/2023 08:52:50 - INFO - main - Instantaneous batch size per device = 1
03/24/2023 08:52:50 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 1
03/24/2023 08:52:50 - INFO - main - Gradient Accumulation steps = 1
03/24/2023 08:52:50 - INFO - main - Total optimization steps = 1200
Steps: 0%| | 0/1200 [00:00<?, ?it/s]Traceback (most recent call last):
File "D:\Art\Text-To-Video-Finetuning\train.py", line 498, in
main(**OmegaConf.load(args.config))
File "D:\Art\Text-To-Video-Finetuning\train.py", line 394, in main
loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
File "D:\Art\Text-To-Video-Finetuning\train.py", line 339, in finetune_unet
latents = tensor_to_vae_latent(pixel_values, vae)
File "D:\Art\Text-To-Video-Finetuning\train.py", line 157, in tensor_to_vae_latent
latents = vae.encode(t).latent_dist.sample()
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\autoencoder_kl.py", line 164, in encode
h = self.encoder(x)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\vae.py", line 109, in forward
sample = self.conv_in(sample)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
Steps: 0%| | 0/1200 [00:00<?, ?it/s]`

how to remove that watermark？

use a lot of data or with more layers unfrozen could make it ?

Accelerator 'function' object has no attribute 'func'

is there a specific version of accelerate that will work? I recently had to reinstall my requirements, and what worked before, doesn't work anymore. I think accelerate changed something on their end that caused this error message. I am using a fresh install at the moment, and everything works up until saving the first checkpoint....

Configuration saved in ./outputs\train_2023-07-02T07-13-31\checkpoint-100\model_index.json
Traceback (most recent call last):
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 994, in <module>
    main(**OmegaConf.load(args.config))
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 899, in main
    save_pipe(
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 506, in save_pipe
    unet, text_encoder = accelerator.prepare(unet, text_encoder)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1182, in prepare
    result = tuple(
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1308, in prepare_model
    model.forward = MethodType(torch.cuda.amp.autocast(dtype=torch.float16)(model.forward.__func__), model)
AttributeError: 'function' object has no attribute '__func__'

Question about finetuning multiple videos, each one with a different prompt, using the video_folder.yaml

So, i put all the .mp4 videos in a folder, and each video needs to be paired with a .txt file named like the video it was paired with, and that contains the prompt, in the same folder. Is this right?

error of "from accelerate.logging import get_logger"

Hi, I meet the following error when I run finetune training.
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 617, in main
create_logging(logging, logger, accelerator)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 64, in create_logging
logger.info(accelerator.state, main_process_only=False)
AttributeError: 'str' object has no attribute 'info'

I have tried different version of accelerate, but cannot slove this error.
the following of my main pip list:
Package Version

accelerate 0.20.3
tensorboard 2.10.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tokenizers 0.13.3
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
transformers 4.30.2
triton 2.0.0

About VideoLDM

Do you have any knowledge of VideoLDM, and is it possible to integrate its algorithms to further enhance the capabilities of current models, such as generating longer videos?

How much VRAM do I need for saving the weights?

I have 16GB available at first place. <12GB is used for training. Then I encounter OOM when saving the weights. This feels a bit insane...

about vid2vid in inference

How to use the vid2vid function? Do I only need to provide an initial video?

Main ways to avoid overfitting?

NameError: name 'glob' is not defined

After i run the script train_config.yaml i get this error below:

2023-04-09 13:40:38.702636: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 13:40:40 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/Text-To-Video-Finetuning/train.py:915 in │
│ │
│ 912 │ parser.add_argument("--config", type=str, default="./configs/my_co │
│ 913 │ args = parser.parse_args() │
│ 914 │ │
│ ❱ 915 │ main(**OmegaConf.load(args.config)) │
│ 916 │
│ │
│ /content/Text-To-Video-Finetuning/train.py:582 in main │
│ │
│ 579 │ ) │
│ 580 │ │
│ 581 │ # Get the training dataset based on types (json, single_video, ima │
│ ❱ 582 │ train_datasets = get_train_dataset(dataset_types, train_data, toke │
│ 583 │ │
│ 584 │ # Extend datasets that are less than the greatest one. This allows │
│ 585 │ attrs = ['train_data', 'frames', 'image_dir', 'video_files'] │
│ │
│ /content/Text-To-Video-Finetuning/train.py:86 in get_train_dataset │
│ │
│ 83 │ for DataSet in [VideoJsonDataset, SingleVideoDataset, ImageDataset │
│ 84 │ │ for dataset in dataset_types: │
│ 85 │ │ │ if dataset == DataSet.getname(): │
│ ❱ 86 │ │ │ │ train_datasets.append(DataSet(**train_data, tokenizer= │
│ 87 │ │
│ 88 │ if len(train_datasets) > 0: │
│ 89 │ │ return train_datasets │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:487 in init │
│ │
│ 484 │ │ │
│ 485 │ │ self.fallback_prompt = fallback_prompt │
│ 486 │ │ │
│ ❱ 487 │ │ self.video_files = glob(f"{path}/*.mp4") │
│ 488 │ │ │
│ 489 │ │ self.width = width │
│ 490 │ │ self.height = height │
╰──────────────────────────────────────────────────────────────────────────────╯
NameError: name 'glob' is not defined

What is the difference between using the video_folder.yaml and using the my_config.yaml?

I want to finetune the model using multiple videos, same prompt each video. Which .yaml file should i use?

First GPU occupies more VRAM in distributed training

link，
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load(self.cached_data_list[index], map_location=device)
Otherwise, in multi-GPU distributed training, the first GPU may occupy excessive VRAM compared to the other GPUs.

Is this the new version of the model?

There was a new version of modelscope released recently, it was trained for a month longer and it can generate better videos, is this repo using the new model or the old one?

TypeError: get_logger() got an unexpected keyword argument 'log_level'

Hi, I am trying to run your script but it always shows me this error.
Another thing is that it's not possible for me to install triton. It's like the repo doesn't exist anymore.

Error caught was: No module named 'triton'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\Life\Text-To-Video-Finetuning\train.py:43 in <module>                                   │
│                                                                                                  │
│    40 # Will error if the minimal version of diffusers is not installed. Remove at your own ri   │
│    41 check_min_version("0.10.0.dev0")                                                           │
│    42                                                                                            │
│ ❱  43 logger = get_logger(__name__, log_level="INFO")                                            │
│    44                                                                                            │
│    45 def create_logging(logging, logger, accelerator):                                          │
│    46 │   logging.basicConfig(                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: get_logger() got an unexpected keyword argument 'log_level'

Lora inference problem

When trying to run inference using --lora_path parameter, getting :

LoRA rank 64 is too large. setting to: 4
list index out of range
Couldn't inject LoRA's due to an error.

0%|          | 0/50 [00:00<?, ?it/s]
0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 194, in <module>
videos = inference(**args)
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 141, in inference
videos = pipeline(
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py", line 646, in __call__
noise_pred = self.unet(
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/models/unet_3d_condition.py", line 399, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 192, in forward
sample = self.linear_1(sample)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/utils/lora.py", line 60, in forward
+ self.dropout(self.lora_up(self.selector(self.lora_down(input))))
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x320 and 1280x16)

I'm running it on a Colab

Enabling Multi-GPU training

How to enable multi-GPU training? No matter how many GPUs I use, only one process starts.

Using a trained model on webui

Is it possible?
Is there information that I should refer to anywhere?

Does using "attentions" unfreeze all the layers?

Also, what layers should i unfreeze the get the best possible quality? Even if it consumes a ton of vram.

Watermark removal

I fine-tuned the model via single video fine tuning, but the watermark is still there. Would like to know fine-tune detail that can remove the watermark

Many thanks

How do you turn off validation?

finetune train error of "UnboundLocalError: local variable 'use_offset_noise' referenced before assignment"

after I comment the code about get_logger, I can run as the following output, but meet the other error.
————————————————————————————————————————————
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
{'variance_type', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
LoRA rank 16 is too large. setting to: 4
LoRA rank 16 is too large. setting to: 4
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Caching Latents.: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:14<00:00, 2.68it/s]
Steps: 0%| | 0/10000 [00:00<?, ?it/s]2628 params have been unfrozen for training.
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 908, in main
loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 810, in finetune_unet
use_offset_noise = use_offset_noise and not rescale_schedule
UnboundLocalError: local variable 'use_offset_noise' referenced before assignment
Steps: 0%| | 0/10000 [00:55<?, ?it/s]
————————————————————————————————————————————
Is there more guideline of finetune training, much thanks.

model inference of version2

After the fine-tuning of version 2 is completed, how to perform model inference? version1 is as the following:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

my_trained_model_path = "./trained_model_path/"
pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Your prompt based on train data"
video_frames = pipe(prompt, num_inference_steps=25).frames

out_file = "./my_video.mp4"
video_path = export_to_video(video_frames, out_file)

webui Lora Might be causing errors in checkpoint models.

Some weights of the model checkpoint were not used when initializing UNet3DConditionModel:
This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Has anyone else had similar issues. I believe it has to do with the Lora Training because I only notice such behavior on models created while also training the new webui lora. The most recent model did not use the Loras, and had no such issues.

How does n_sample_frames: work?

So for example, if i have set it to 4, it will finetune using only 4 frames of the video, no matter the length of the video?

Please keep updating and improving this repo, it works shockingly well.

I dont think there is anything like this out there, this has a ton of potential.

VRAM optimization?

I want to train it with n_sample_frames: 100. With 100 videos

I'm using a 3090 Ti and the max n_sample_frames is 24

Last question, Can Text-To-Video-Finetuning recreate this video (same amount of frames and mainteining the cameras)?

video.mp4

Is there a problem with attn3 here?

code，from the network of unet3d, attn3 will do nothing?

Folder for models?

In which folder to put git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b models?

How to test my model

Hello sir after training the model then how to test my model giving text as input please help me in this issue

Default model seems to output only noise or greenscreen

After several unsuccessful attempts at fine-tuning where the output was a still frame of noise or a green field, I followed instructions and skipped to the inference to test the base model. It reacted the same way.

Am I not pointing to the model directory correctly?

!cd /content/Text-To-Video-Finetuning && python inference.py --model /content/Text-To-Video-Finetuning/models/model_scope_diffusers --prompt "cat in a space suit"

Colab?

I am a lazy person.
Has anyone managed to run the finetune on Colab?

Abount offset_noise

What is the effect of the offset noise in training?

Error with inference

Get this using both my finetuned model and the original 1.7b model

│ /content/drive/MyDrive/Text-To-Video-Finetuning/inference.py:192 in <module> │
│                                                                              │
│   189 │   │   init = interpolate(init, size=(args["num_frames"], args["heigh │
│   190 │   │   args["init_video"] = init                                      │
│   191 │                                                                      │
│ ❱ 192 │   videos = inference(**args)                                         │
│   193 │                                                                      │
│   194 │   os.makedirs(output_dir, exist_ok=True)                             │
│   195 │   out_stem = f"{output_dir}/"                                        │
│                                                                              │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115 in    │
│ decorate_context                                                             │
│                                                                              │
│   112 │   @functools.wraps(func)                                             │
│   113 │   def decorate_context(*args, **kwargs):                             │
│   114 │   │   with ctx_factory():                                            │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                               │
│   116 │                                                                      │
│   117 │   return decorate_context                                            │
│   118                                                                        │
│                                                                              │
│ /content/drive/MyDrive/Text-To-Video-Finetuning/inference.py:120 in          │
│ inference                                                                    │
│                                                                              │
│   117 │   lora_rank=64                                                       │
│   118 ):                                                                     │
│   119 │   with torch.autocast(device, dtype=torch.half):                     │
│ ❱ 120 │   │   pipeline = initialize_pipeline(model, device, xformers, sdp)   │
│   121 │   │   inject_inferable_lora(pipeline, lora_path, r=lora_rank)        │
│   122 │   │   prompt = [prompt] * batch_size                                 │
│   123 │   │   negative_prompt = ([negative_prompt] * batch_size) if negative │
│                                                                              │
│ /content/drive/MyDrive/Text-To-Video-Finetuning/inference.py:33 in           │
│ initialize_pipeline                                                          │
│                                                                              │
│    30 │   │   unet=unet.to(device=device, dtype=torch.half),                 │
│    31 │   )                                                                  │
│    32 │   pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipel │
│ ❱  33 │   unet._set_gradient_checkpointing(value=False)                      │
│    34 │   handle_memory_attention(xformers, sdp, unet)                       │
│    35 │   vae.enable_slicing()                                               │
│    36 │   return pipeline                                                    │
╰──────────────────────────────────────────────────────────────────────────────╯
TypeError: UNet3DConditionModel._set_gradient_checkpointing() missing 1 required
positional argument: 'module'```

Transformer2D initializing

More of a question really, but do you know why the num_attention_heads and attention_head_dim are opposite when initialising Transformer2D blocks?

https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/79e13d17167f66f424a8acad88e83fc76d6d210d/models/unet_3d_blocks.py#L286C17-L286C35

It is opposite in unit_2d_blocks.py
https://github.com/huggingface/diffusers/blob/5439e917cacc885c0ac39dda1b8af12258e6e16d/src/diffusers/models/unet_2d_blocks.py#L872

About step_loss of version2

During the training of version2, the step loss easily becomes NaN, even if the learning rate is lowered. Have you encountered this issue before?

Refer to the official release of Diffusers?

It seems like the models finetuned for Diffusers are referring to the latest beta version and not the latest official release of Diffusers, making the models error out when trying to load them with the official version of Diffusers. Could it be change to official releases referred to instead?

training video

I want to train my own video model, please give me some help

How long should I cut each video into? How many frames per video? How many videos are needed?
After I have finished training, how to call the model in the webui?

Custom resolution?

Is there possibility of having custom resolution for training/inference ?

Generates wrong model_index.json

Generated on finetuned. Unet is null

{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    null,
    "UNet3DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

text-to-video-ms-1.7b correct

{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet3DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'

cloned this model git clone https://huggingface.co/camenduru/potat1

the command

python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48

(venv) F:\potat text to video>python check.py
CUDA is available on your system.
CUDA device count: 2
CUDA device name: NVIDIA GeForce RTX 3090 Ti


(venv) F:\potat text to video>cd Text-To-Video-Finetuning

(venv) F:\potat text to video\Text-To-Video-Finetuning>python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
Traceback (most recent call last):
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 194, in <module>
    videos = inference(**args)
  File "F:\potat text to video\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 122, in inference
    pipeline = initialize_pipeline(model, device, xformers, sdp)
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 24, in initialize_pipeline
    unet.disable_gradient_checkpointing()
  File "F:\potat text to video\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 216, in disable_gradient_checkpointing
    self.apply(partial(self._set_gradient_checkpointing, value=False))
  File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 884, in apply
    module.apply(fn)
  File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 885, in apply
    fn(self)
TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'

(venv) F:\potat text to video\Text-To-Video-Finetuning>

inference code generate only green noise

while the validation output during training seems to be good. Any bugs in the inference code ? Or it is due to different diffuser version?

Links to weights?

How about sharing text2video/fine-tuned weights here?

The two working weights I have so far found are these two:
damo-vilab/text-to-video-ms-1.7b
strangeman3107/animov-0.1.1

Encountered 2 file(s) that may not have been copied correctly on Windows

I see the following after Requirements and Installation step

Encountered 2 file(s) that may not have been copied correctly on Windows:
	unet/diffusion_pytorch_model.safetensors
	unet/diffusion_pytorch_model.bin

Could this break something during finetuning ?

The video dosent move.

After finetuning, the output video dosent move, it just stays still. It looks good but there is no movement.

multi-gpu training

will it be difficult to modify the code to support multi-gpu training?

Thanks，I will try it.> will it be difficult to modify the code to support multi-gpu training?

          > will it be difficult to modify the code to support multi-gpu training?

I've never tried multiple GPU training, but you may be able to do it naively with accelerate.

accelerate config

You should be prompted to configure your setup, including multiple GPU training.
Then it should be as simple as:

accelerate launch train.py --config ./configs/my_config_hq.yaml

Let us know how it goes if you decide to try! If it doesn't I could try to implement it, but I don't have multiple GPUs and would probably need to rent out a server to do so.

Originally posted by @ExponentialML in #14 (comment)

a link to a repository that we can use to generate videos with our new diffusion models, or a small example on how to do it with python or something like that.
a way to specify the frame rate of the sample videos. Everything seems to sample at 6-8 fps, so the default 24fps videos seem too fast to really see what the sample video looks like.
if we use a json file, do we also need to specify the video folder, or does the json's hyperlinks take care of that?

Thank you!

exponentialml / text-to-video-finetuning Goto Github PK

text-to-video-finetuning's Introduction

Text-To-Video-Finetuning

Finetune ModelScope's Text To Video model using Diffusers 🧨

Important Update 2023-12-14

Updates

Getting Started

Requirements & Installation

Other Models

Create Conda Environment (Optional)

Python Requirements

Hardware

Preprocessing your data

Using Captions

Process Automatically

Configuration

How would you recommend I proceed with making a config with my data?

Training a LoRA

Continue training a LoRA

What you cannot do:

Finetune.

Training Results

Running inference

Developing

Deprecation

Shoutouts

Citation

text-to-video-finetuning's People

Contributors

Stargazers

Watchers

Forkers

text-to-video-finetuning's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs