cloneofsimo / lora

Using Low-rank adaptation to quickly fine-tune diffusion models.

Home Page: https://arxiv.org/abs/2106.09685

License: Apache License 2.0

Python 0.87% Shell 0.02% Jupyter Notebook 99.12%
diffusion lora stable-diffusion dreambooth fine-tuning

lora's Introduction

Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning

Using LoRA to fine-tune on an illustration dataset: $W = W_0 + \alpha \Delta W$, where $\alpha$ is the merging ratio. The GIF above scales alpha from 0 to 1. Setting alpha to 0 is the same as using the original model, and setting alpha to 1 is the same as using the fully fine-tuned model.

SD 1.5 PTI on Kiriko, the game character, Various Prompts.

"baby lion in style of <s1><s2>", with disney-style LoRA model.

"superman, style of <s1><s2>", with pop-art style LoRA model.

Main Features

  • Fine-tune Stable Diffusion models twice as fast as the Dreambooth method, using Low-rank Adaptation
  • Get an insanely small end result (1 MB ~ 6 MB), easy to share and download
  • Compatible with diffusers
  • Support for inpainting
  • Sometimes even better performance than full fine-tuning (extensive comparisons are left as future work)
  • Merge checkpoints + build recipes by merging LoRAs together
  • Pipeline to fine-tune CLIP + UNet + token to gain better results
  • Out-of-the-box multi-vector pivotal tuning inversion

Web Demo

UPDATES & Notes

2023/02/06

  • Support for training inpainting on LoRA PTI. Use the flag --train-inpainting with an inpainting Stable Diffusion base model (see inpainting_example.sh).

2023/02/01

  • LoRA joining is now available with the --mode=ljl flag. Only three parameters are required: path_to_lora1, path_to_lora2, and path_to_save.

2023/01/29

  • Dataset pipelines
  • LoRA applied to ResNet as well; use --use_extended_lora to enable it.
  • SVD distillation now supports resnet-lora as well.
  • The CompVis-format conversion script now works with safetensors, and for PTI it will also return a Textual Inversion format file, so you can use it in the embeddings folder.
  • 🥳🥳, LoRA is now officially integrated into the amazing Hugging Face 🤗 diffusers library! Check out the Blog and examples! (NOTE: it currently uses a DIFFERENT FILE FORMAT)

2023/01/09

  • Pivotal Tuning Inversion with extended latent
  • Better textual inversion with Norm prior
  • Mask conditioned score estimation loss
  • safetensor support, xformers support (thanks to @hafriedlander)
  • Distill fully trained model to LoRA with SVD distillation CLI
  • Flexible dataset support

2022/12/22

  • Pivotal Tuning now available with run_lorpt.sh
  • More utilities added, such as datasets and a patch_pipe function to patch CLIP, UNet, and the token embeddings all at once.
  • Adjustable Ranks, Fine-tuning Feed-forward layers.
  • More example notebooks added.

2022/12/10

  • You can now fine-tune text_encoder as well! Enabled with a simple --train_text_encoder flag.
  • Converting to CKPT format for A1111's repo consumption! (Thanks to jachiam's conversion script)
  • Img2Img Examples added.
  • Please use a large learning rate! Around 1e-4 worked well for me, but certainly not around 1e-6, which will not be able to learn anything.

Lengthy Introduction

Thanks to the generous work of Stability AI and Huggingface, so many people have enjoyed fine-tuning stable diffusion models to fit their needs and generate higher fidelity images. However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results.

Also, the final result (a fully fine-tuned model) is very large. Some people instead work with textual inversion as an alternative. But clearly this is suboptimal: textual inversion only creates a small word embedding, and the final images are not as good as those from a fully fine-tuned model.

Well, what's the alternative? In the domain of LLMs, researchers have developed efficient fine-tuning methods. LoRA, especially, tackles the very problem the community currently has: end users with the open-sourced Stable Diffusion model want to try the various fine-tuned models created by the community, but each model is too large to download and use. LoRA instead fine-tunes the "residual" of the model rather than the entire model: i.e., it trains $\Delta W$ instead of $W$.

$$ W' = W + \Delta W $$

where we can further decompose $\Delta W$ into low-rank matrices: $\Delta W = A B^T$, with $A \in \mathbb{R}^{n \times d}$, $B \in \mathbb{R}^{m \times d}$, $d \ll n$. This is the key idea of LoRA. We can then fine-tune $A$ and $B$ instead of $W$. In the end, you get an insanely small model, as $A$ and $B$ are much smaller than $W$.

Also, not all of the parameters need tuning: they found that, often, tuning the $Q, K, V, O$ projections (i.e., the attention layers) of the transformer model is enough. (This is also why the end result is so small.) This repo follows the same idea.
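To make the decomposition concrete, below is a minimal sketch of a LoRA-injected linear layer in plain PyTorch. The class name and initialization choices here are illustrative rather than this repo's exact code, but the forward pass follows the same pattern as the repo's LoraInjectedLinear (self.linear(input) + self.lora_up(self.lora_down(input)) * self.scale).

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear W plus a trainable low-rank residual A B^T (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.linear = base
        self.linear.requires_grad_(False)  # freeze W; only the LoRA factors are trained
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)   # plays the role of B^T
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)    # plays the role of A
        self.scale = scale
        nn.init.normal_(self.lora_down.weight, std=1.0 / rank)
        nn.init.zeros_(self.lora_up.weight)  # so that Delta W = 0 at the start of training

    def forward(self, x):
        # W x + scale * A (B^T x), i.e. W' = W + alpha * Delta W
        return self.linear(x) + self.lora_up(self.lora_down(x)) * self.scale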

Now, how would we actually use this to update a diffusion model? First, we will use Stable Diffusion from Stability AI. Their model is nicely ported through the Hugging Face API, so this repo has built various fine-tuning methods around it. In detail, there are three subtle but important distinctions in methods to make this work out.

  1. Dreambooth

First, there is LoRA applied to Dreambooth. The idea is to use prior-preservation class images to regularize the training process, and to use low-occurring (rare) tokens. This keeps the model's generalization capability while maintaining high fidelity. If you turn off prior preservation and train the text encoder embedding as well, it becomes naive fine-tuning.

  2. Textual Inversion

Second, there is Textual Inversion. There is no room to apply LoRA here, but it is worth mentioning. The idea is to instantiate a new token and learn its embedding via gradient descent. This is a very powerful method, and it is worth trying out if your use case is focused not on fidelity but rather on inverting conceptual ideas.

  3. Pivotal Tuning

The last method (although originally proposed for GANs) takes the best of both worlds. Simply put, you first apply textual inversion to get a matching token embedding. Then, you use that token embedding together with prior-preserving class images to fine-tune the model. This two-fold nature makes it a strict generalization of both methods.

Enough of the lengthy introduction, let's get to the code.

Installation

pip install git+https://github.com/cloneofsimo/lora.git

Getting Started

1. Fine-tuning Stable diffusion with LoRA CLI

If you have over 12 GB of GPU memory, it is recommended to use the Pivotal Tuning Inversion CLI provided with the LoRA implementation. It has the best performance and will be updated many times in the future as well. These are parameters that worked for various datasets. ALL OF THE EXAMPLES ABOVE WERE TRAINED WITH THE PARAMETERS BELOW.

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="./data/data_disney"
export OUTPUT_DIR="./exps/output_dsn"

lora_pti \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --train_text_encoder \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --scale_lr \
  --learning_rate_unet=1e-4 \
  --learning_rate_text=1e-5 \
  --learning_rate_ti=5e-4 \
  --color_jitter \
  --lr_scheduler="linear" \
  --lr_warmup_steps=0 \
  --placeholder_tokens="<s1>|<s2>" \
  --use_template="style"\
  --save_steps=100 \
  --max_train_steps_ti=1000 \
  --max_train_steps_tuning=1000 \
  --perform_inversion=True \
  --clip_ti_decay \
  --weight_decay_ti=0.000 \
  --weight_decay_lora=0.001\
  --continue_inversion \
  --continue_inversion_lr=1e-4 \
  --device="cuda:0" \
  --lora_rank=1 \
#  --use_face_segmentation_condition\

Check here to see what these parameters mean.

2. Other Options

Basic usage is as follows: prepare sets of $A, B$ matrices in a UNet model, and fine-tune them.

from lora_diffusion import inject_trainable_lora, extract_lora_ups_down

...

unet = UNet2DConditionModel.from_pretrained(
    pretrained_model_name_or_path,
    subfolder="unet",
)
unet.requires_grad_(False)
unet_lora_params, train_names = inject_trainable_lora(unet)  # This will
# turn off all of the gradients of unet, except for the trainable LoRA params.
optimizer = optim.Adam(
    itertools.chain(*unet_lora_params, text_encoder.parameters()), lr=1e-4
)
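Continuing the snippet above, here is a rough sketch of a single training step once the LoRA parameters are injected. The vae, noise_scheduler, and the pixel_values / input_ids batch tensors are assumptions following the standard diffusers Dreambooth recipe, not this repo's exact training script; only the trainable parameters passed to the optimizer above (the injected LoRA factors and, optionally, the text encoder) receive gradient updates.

import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler.from_pretrained(
    pretrained_model_name_or_path, subfolder="scheduler"
)

# One training step; `pixel_values` and `input_ids` are assumed to come from your dataloader.
latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215
noise = torch.randn_like(latents)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
).long()
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

encoder_hidden_states = text_encoder(input_ids)[0]
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

loss = F.mse_loss(model_pred.float(), noise.float())
loss.backward()  # gradients flow only into the trainable (LoRA / text-encoder) parameters
optimizer.step()
optimizer.zero_grad()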

Another example of this, applied to Dreambooth, can be found in training_scripts/train_lora_dreambooth.py. Run this example with

training_scripts/run_lora_db.sh

Another Dreambooth example, with text_encoder training enabled, can be run with:

training_scripts/run_lora_db_w_text.sh

Loading, merging, and interpolating trained LoRAs with CLIs.

We've seen that people have been merging different checkpoints with different ratios, and this seems to be very useful to the community. LoRA is extremely easy to merge.

By the nature of LoRA, one can interpolate between different fine-tuned models by adding different $A, B$ matrices.

Currently, the LoRA CLI has three options: merge a full model with a LoRA, merge a LoRA with another LoRA, or merge a full model with a LoRA and convert it to ckpt format (the original CompVis format).

SYNOPSIS
    lora_add PATH_1 PATH_2 OUTPUT_PATH <flags>

POSITIONAL ARGUMENTS
    PATH_1
        Type: str
    PATH_2
        Type: str
    OUTPUT_PATH
        Type: str

FLAGS
    --alpha
        Type: float
        Default: 0.5
    --mode
        Type: Literal['upl', 'lpl', 'upl', 'upl-ckpt-v2']
        Default: 'lpl'
    --with_text_lora
        Type: bool
        Default: False

Merging full model with LoRA

$ lora_add PATH_TO_DIFFUSER_FORMAT_MODEL PATH_TO_LORA.safetensors OUTPUT_PATH ALPHA --mode upl

path_1 can be either a local path or a Hugging Face model name. When adding a LoRA to the unet, alpha is the constant below:

$$ W' = W + \alpha \Delta W $$

So, set alpha to 1.0 to fully add the LoRA. If the LoRA seems to have too much effect (i.e., it is overfitted), set alpha to a lower value. If the LoRA seems to have too little effect, set alpha higher than 1.0 (even values slightly above 1.0 can work). You can tune these values to your needs.

Example

$ lora_add runwayml/stable-diffusion-v1-5 ./example_loras/lora_krk.safetensors ./output_merged 0.8 --mode upl
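For intuition, the merge that upl performs on each targeted weight amounts to the $W' = W + \alpha \Delta W$ formula above. Here is a tiny, purely illustrative torch sketch; the tensors and the helper name are hypothetical, not the CLI's actual code.

import torch

def merge_lora_into_weight(weight, lora_up, lora_down, alpha=1.0):
    # W' = W + alpha * Delta W, with Delta W = A B^T (lora_up is A, lora_down is B^T)
    return weight + alpha * (lora_up @ lora_down)

W = torch.randn(320, 320)   # stand-in for an attention projection weight
A = torch.randn(320, 4)     # lora_up factor, rank 4
Bt = torch.randn(4, 320)    # lora_down factor
W_merged = merge_lora_into_weight(W, A, Bt, alpha=0.8)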

Merging full model with LoRA and changing to original CKPT format

Everything is the same as above, but with mode upl-ckpt-v2 instead of upl.

$ lora_add runwayml/stable-diffusion-v1-5 ./example_loras/lora_krk.safetensors ./output_merged.ckpt 0.7 --mode upl-ckpt-v2

Merging LoRA with LoRA

$ lora_add PATH_TO_LORA1.safetensors PATH_TO_LORA2.safetensors OUTPUT_PATH.safetensors ALPHA_1 ALPHA_2

alpha is the ratio of the first model to the second model. i.e.,

$$ \Delta W = (\alpha_1 A_1 + \alpha_2 A_2) (\alpha_1 B_1 + \alpha_2 B_2)^T $$

Set $\alpha_1 = \alpha_2 = 0.5$ to get the average of the two models. Set $\alpha_1$ close to 1.0 to get more effect of the first model, and set $\alpha_2$ close to 1.0 to get more effect of the second model.

Example

$ lora_add ./example_loras/analog_svd_rank4.safetensors ./example_loras/lora_krk.safetensors ./krk_analog.safetensors 2.0 0.7
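To sanity-check the joint-update formula above, here is a short torch sketch using random stand-in tensors for two LoRAs of the same rank. It is purely illustrative, not the CLI's implementation.

import torch

rank, n, m = 4, 320, 320
A1, B1 = torch.randn(n, rank), torch.randn(m, rank)  # first LoRA
A2, B2 = torch.randn(n, rank), torch.randn(m, rank)  # second LoRA
alpha_1, alpha_2 = 2.0, 0.7                          # same roles as ALPHA_1, ALPHA_2 above

# Joint update: still at most rank `rank`, so the result remains a single small LoRA.
delta_W = (alpha_1 * A1 + alpha_2 * A2) @ (alpha_1 * B1 + alpha_2 * B2).T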

Making Text2Img Inference with trained LoRA

Check out scripts/run_inference.ipynb for an example of how to make inference with LoRA.
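If you just want a quick script instead of the notebook, a minimal sketch looks like the following. The model path, LoRA file name, and prompt are placeholders; monkeypatch_lora and tune_lora_scale are the same lora_diffusion helpers that appear later on this page.

import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import monkeypatch_lora, tune_lora_scale

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

# Patch the UNet with the trained LoRA weights, then choose how strongly to apply them.
monkeypatch_lora(pipe.unet, torch.load("./lora_weight.pt"))
tune_lora_scale(pipe.unet, 1.0)

torch.manual_seed(0)
image = pipe("your prompt here", num_inference_steps=50, guidance_scale=7).images[0]
image.save("output.png")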

Making Img2Img Inference with LoRA

Check out scripts/run_img2img.ipynb for an example of how to make inference with LoRA.

Merging Lora with Lora, and making inference dynamically using monkeypatch_add_lora.

Check out scripts/merge_lora_with_lora.ipynb for an example of how to merge LoRA with LoRA, and make inference dynamically using monkeypatch_add_lora.

The above results are from merging lora_illust.pt with lora_kiriko.pt, with both weights set to 1.0 and $\alpha = 0.5$.

$$ W_{unet} \leftarrow W_{unet} + 0.5 (A_{kiriko} + A_{illust})(B_{kiriko} + B_{illust})^T $$

and

$$ W_{clip} \leftarrow W_{clip} + 0.5 A_{kiriko}B_{kiriko}^T $$


Tips and Discussions

Training tips in general

I'm curating a list of tips and discussions here. Feel free to add your own tips and discussions with a PR!

How long should you train?

The effect of fine-tuning (both UNet + CLIP) can be seen in the following image, where each image corresponds to another 500 training steps. Trained with 9 images, with an lr of 1e-4 for the UNet and 5e-5 for CLIP. (You can adjust these with --learning_rate=1e-4 and --learning_rate_text=5e-5.)

"female game character bnha, in a steampunk city, 4K render, trending on artstation, masterpiece". Visualization notebook can be found at scripts/lora_training_process_visualized.ipynb

You can see that with 2500 steps, you already get somewhat good results.

What is a good learning rate for LoRA?

People using Dreambooth are used to an lr of around 1e-6, but this is way too small for training LoRAs. I've tried 1e-4, and it works OK. I think these values should be explored more systematically.

What happens to Text Encoder LoRA and Unet LoRA?

Let's see: the following is only using Unet LoRA:

And the following is only using Text Encoder LoRA:

So they learned different aspects of the dataset, but they are not mutually exclusive. You can use both of them to get better results, and tune them separately to get even better results.

With the LoRA text encoder, UNet, all the schedulers, guidance scale, negative prompts, etc., you have a lot to play around with to get the best result you want. For example, with $\alpha_{unet} = 0.6$, $\alpha_{text} = 0.9$, you get a better result than with $\alpha_{unet} = 1.0$, $\alpha_{text} = 1.0$ (the default). Check out the comparison below:

Left: tuned $\alpha_{unet} = 0.6$, $\alpha_{text} = 0.9$; right: $\alpha_{unet} = 1.0$, $\alpha_{text} = 1.0$.
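A rough sketch of setting those two scales at inference time is below. It assumes pipe and prompt are already defined and that the pipeline has had both its UNet and text-encoder LoRAs patched in (for example via the patch_pipe helper mentioned in the updates above; check the repo for its exact signature), so only the scales are being adjusted here.

from lora_diffusion import tune_lora_scale

tune_lora_scale(pipe.unet, 0.6)          # alpha_unet
tune_lora_scale(pipe.text_encoder, 0.9)  # alpha_text

image = pipe(prompt, num_inference_steps=50, guidance_scale=7).images[0]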

Here is an extensive visualization of the effect of $\alpha_{unet}$ and $\alpha_{text}$, by @brian6091, from his analysis.

"a photo of (S*)", trained with 21 images, with rank 16 LoRA. More details can be found here


TODOS

  • Make this more user friendly for non-programmers
  • Write better documentation
  • Kronecker product, like LoRA [https://arxiv.org/abs/2106.04647]
  • Adaptor-guidance
  • Time-aware fine-tuning.

References

This work was heavily influenced by, and originated from, this awesome research. I'm just applying it here.

@article{roich2022pivotal,
  title={Pivotal tuning for latent-based editing of real images},
  author={Roich, Daniel and Mokady, Ron and Bermano, Amit H and Cohen-Or, Daniel},
  journal={ACM Transactions on Graphics (TOG)},
  volume={42},
  number={1},
  pages={1--13},
  year={2022},
  publisher={ACM New York, NY}
}
@article{ruiz2022dreambooth,
  title={Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation},
  author={Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir},
  journal={arXiv preprint arXiv:2208.12242},
  year={2022}
}
@article{gal2022image,
  title={An image is worth one word: Personalizing text-to-image generation using textual inversion},
  author={Gal, Rinon and Alaluf, Yuval and Atzmon, Yuval and Patashnik, Or and Bermano, Amit H and Chechik, Gal and Cohen-Or, Daniel},
  journal={arXiv preprint arXiv:2208.01618},
  year={2022}
}
@article{hu2021lora,
  title={Lora: Low-rank adaptation of large language models},
  author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  journal={arXiv preprint arXiv:2106.09685},
  year={2021}
}

lora's People

Contributors

2kpr, ak391, anotherjesse, brian6091, cloneofsimo, davidepaglieri, ethansmith2000, hafriedlander, hdeezy, hysts, jacklangerman, laksjdjf, levi, milyiyo, oscarnevarezleal, timh, wsxiaoys, zeke


lora's Issues

Apply to Checkpoint

Hi @cloneofsimo. Could you kindly provide an example of applying lora weights to a diffusers model and saving for conversion to a standard ckpt file?

Still having some troubles in this department.

RuntimeError: The size of tensor a (320) must match the size of tensor b (2560) at non-singleton dimension 2

The run_lorpt.ipynb notebook doesn't seem to work anymore; also, there's a required parameter --stochastic_attribute which I'm not sure how to use.

  0%|                                                    | 0/50 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 10
      7 tune_lora_scale(pipe.unet, 1.00)
      9 torch.manual_seed(0)
---> 10 image = pipe(prompt, num_inference_steps=50, guidance_scale=6).images[0]
     12 image  # Wow ok, now I might have to deal with a lawsuite for this.

File ~/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/miniconda3/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py:550, in StableDiffusionPipeline.__call__(self, prompt, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, output_type, return_dict, callback, callback_steps, **kwargs)
    547 latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
    549 # predict the noise residual
--> 550 noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    552 # perform guidance
    553 if do_classifier_free_guidance:

File ~/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py:341, in UNet2DConditionModel.forward(self, sample, timestep, encoder_hidden_states, class_labels, return_dict)
    339 for downsample_block in self.down_blocks:
    340     if hasattr(downsample_block, "attentions") and downsample_block.attentions is not None:
--> 341         sample, res_samples = downsample_block(
    342             hidden_states=sample,
    343             temb=emb,
    344             encoder_hidden_states=encoder_hidden_states,
    345         )
    346     else:
    347         sample, res_samples = downsample_block(hidden_states=sample, temb=emb)

File ~/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/lib/python3.9/site-packages/diffusers/models/unet_2d_blocks.py:644, in CrossAttnDownBlock2D.forward(self, hidden_states, temb, encoder_hidden_states)
    642     else:
    643         hidden_states = resnet(hidden_states, temb)
--> 644         hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
    646     output_states += (hidden_states,)
    648 if self.downsamplers is not None:

File ~/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/lib/python3.9/site-packages/diffusers/models/attention.py:219, in Transformer2DModel.forward(self, hidden_states, encoder_hidden_states, timestep, return_dict)
    217 # 2. Blocks
    218 for block in self.transformer_blocks:
--> 219     hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
    221 # 3. Output
    222 if self.is_input_continuous:

File ~/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/lib/python3.9/site-packages/diffusers/models/attention.py:483, in BasicTransformerBlock.forward(self, hidden_states, context, timestep)
    479 # 2. Cross-Attention
    480 norm_hidden_states = (
    481     self.norm2(hidden_states, timestep) if self.use_ada_layer_norm else self.norm2(hidden_states)
    482 )
--> 483 hidden_states = self.attn2(norm_hidden_states, context=context) + hidden_states
    485 # 3. Feed-forward
    486 hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states

File ~/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/lib/python3.9/site-packages/diffusers/models/attention.py:552, in CrossAttention.forward(self, hidden_states, context, mask)
    549 def forward(self, hidden_states, context=None, mask=None):
    550     batch_size, sequence_length, _ = hidden_states.shape
--> 552     query = self.to_q(hidden_states)
    553     context = context if context is not None else hidden_states
    554     key = self.to_k(context)

File ~/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/lib/python3.9/site-packages/lora_diffusion/lora.py:30, in LoraInjectedLinear.forward(self, input)
     29 def forward(self, input):
---> 30     return self.linear(input) + self.lora_up(self.lora_down(input)) * self.scale

RuntimeError: The size of tensor a (320) must match the size of tensor b (2560) at non-singleton dimension 2

Issue on Dev branch v0.0.8 RuntimeError: CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

My env:

Ubuntu 22
xformers 0.0.14.dev0
torch 1.12.1
diffusers 0.9.0
Python 3.9.12
Cuda 11.7
RTX 3090

Note that this doesn't occur when I uninstall/reinstall 0.0.7 lora_diffusion dev branch

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Traceback (most recent call last):
  File "/home/ian/miniconda3/bin/lora_pti", line 33, in <module>
    sys.exit(load_entry_point('lora-diffusion', 'console_scripts', 'lora_pti')())
  File "/home/ian/projs/lora/lora_diffusion/cli_lora_pti.py", line 738, in main
    fire.Fire(train)
  File "/home/ian/miniconda3/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ian/miniconda3/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ian/miniconda3/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ian/projs/lora/lora_diffusion/cli_lora_pti.py", line 642, in train
    train_inversion(
  File "/home/ian/projs/lora/lora_diffusion/cli_lora_pti.py", line 307, in train_inversion
    loss.backward()
  File "/home/ian/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ian/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ian/miniconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/ian/miniconda3/lib/python3.9/site-packages/xformers/ops.py", line 369, in backward
    ) = torch.ops.xformers.efficient_attention_backward_cutlass(
  File "/home/ian/miniconda3/lib/python3.9/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

training issue (CUDA out of memory)

When I try to start training it gives me this error. Note that I'm using a GTX 1070 8 GB.

***** Running training *****
Instance Images: 9
Class Images: 0
Total Examples: 9
Num batches each epoch = 9
Num Epochs = 100
Batch Size Per Device = 1
Gradient Accumulation steps = 1
Total train batch size (w. parallel, distributed & accumulation) = 9
Total optimization steps = 900
Total training steps = 900
Resuming from checkpoint: False
First resume epoch: 0
First resume step: 0
Lora: True, Adam: True, Prec: fp16
Gradient Checkpointing: True, Text Enc Steps: -1.0
EMA: False
LR: 2e-06)
Steps:   0%| | 0/900 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 904, in inner_loop
    accelerator.backward(loss)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.23 GiB already allocated; 0 bytes free; 7.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%| | 0/900 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\scripts\dreambooth.py", line 569, in start_training
    result = main(config, use_subdir=use_subdir, lora_model=lora_model_name,
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1024, in main
    return inner_loop()
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 84, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Training completed, reloading SD Model.
Restored system models.
Returning result: Exception training model: No executable batch size found, reached zero.

Preview while training

Hi ! Thank you very much for this implementation !

I'm able to train with a higher batch size than with Dreambooth, so that's great.

Do you have an easy solution to get a preview of generated images while training, to check whether it's going well and not overfitting?

XFormers massively increases performance and improves memory usage

Without xformers, I can only do a batch of 1 (and no prior preservation) when training a unet + text encoder on my 24GB 4090, even with all other tricks (fp16, 8 bit adam, can't use gradient checkpointing because it currently doesn't work correctly).

With xformers enabled, I can do a batch of 2 w/ prior preservation (so 4 images total per batch), and the performance is still the same as the batch of 1 without it.

So, the bad news: backwards support in xformers is bad. It mostly only works on A100s, on other cards it's variable. On my 4090 it works fine for SD2.1 models, but not on SD1.5 models (although I've got it working via a nasty hack
facebookresearch/xformers@27dadbf) - it also doesn't work on Colab.

Anyway, I'll add the option to turn it on in the dreambooth scripts after christmas, but there'll be plenty of issues to sort out. Just wanted to document this in the meantime.

Config.json not found

I got this error:

EntryNotFoundError(f"404 Client Error: Entry Not Found for url: {response.url}") transformers.utils.hub.EntryNotFoundError: 404 Client Error: Entry Not Found for url: https://huggingface.co/stabilityai/stable-diffusion-2-1-base/resolve/main/config.json

I have manually confirmed that the config.json file doesn't exist in the HuggingFace website.

My launch code

!accelerate launch lora/train_lora_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
  --instance_data_dir="/kaggle/input/xxxx" \
  --class_data_dir="/kaggle/input/class" \
  --output_dir="/kaggle/working/checkpoints" \
  --instance_prompt="a photo of xxxx" \
  --class_prompt="a photo of nnnnn" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --save_steps=2000 \
  --max_train_steps=10000

I would like to know how to solve this. Thx.

How To Do Stable Diffusion LORA Training By Using Web UI On Different Models - Tested SD 1.5, SD 2.1

I hope this video gets added to the FAQ, wiki and stickies.

Appreciate very much.

https://youtu.be/mfaqqL5yOO4

image

content of the video

0:00 Introduction speech
1:07 How to install the LoRA extension to the Stable Diffusion Web UI
2:36 Preparation of training set images by properly sized cropping
2:54 How to crop images using Paint .NET, an open-source image editing software
5:02 What is Low-Rank Adaptation (LoRA)
5:35 Starting preparation for training using the DreamBooth tab - LoRA
6:50 Explanation of all training parameters, settings, and options
8:27 How many training steps equal one epoch
9:09 Save checkpoints frequency
9:48 Save a preview of training images after certain steps or epochs
10:04 What is batch size in training settings
11:56 Where to set LoRA training in SD Web UI
13:45 Explanation of Concepts tab in training section of SD Web UI
14:00 How to set the path for training images
14:28 Classification Dataset Directory
15:22 Training prompt - how to set what to teach the model
15:55 What is Class and Sample Image Prompt in SD training
17:57 What is Image Generation settings and why we need classification image generation in SD training
19:40 Starting the training process
21:03 How and why to tune your Class Prompt (generating generic training images)
22:39 Why we generate regularization generic images by class prompt
23:27 Recap of the setting up process for training parameters, options, and settings
29:23 How much GPU, CPU, and RAM the class regularization image generation uses
29:57 Training process starts after class image generation has been completed
30:04 Displaying the generated class regularization images folder for SD 2.1
30:31 The speed of the training process - how many seconds per iteration on an RTX 3060 GPU
31:19 Where LoRA training checkpoints (weights) are saved
32:36 Where training preview images are saved and our first training preview image
33:10 When we will decide to stop training
34:09 How to resume training after training has crashed or you close it down
36:49 Lifetime vs. session training steps
37:54 After 30 epochs, resembling images start to appear in the preview folder
38:19 The command line printed messages are incorrect in some cases
39:05 Training step speed, a certain number of seconds per iteration (IT)
39:25 Results after 5600 steps (350 epochs) - it was sufficient for SD 2.1
39:44 How I'm picking a checkpoint to generate a full model .ckpt file
40:23 How to generate a full model .ckpt file from a LoRA checkpoint .pt file
41:17 Generated/saved file name is incorrect, but it is generated from the correct selected .pt file
42:01 Doing inference (generating new images) using the text2img tab with our newly trained and generated model
42:47 The results of SD 2.1 Version 768 pixel model after training with the LoRA method and teaching a human face
44:38 Setting up the training parameters/options for SD version 1.5 this time
48:35 Re-generating class regularization images since SD 1.5 uses 512 pixel resolution
49:11 Displaying the generated class regularization images folder for SD 1.5
50:16 Training of Stable Diffusion 1.5 using the LoRA methodology and teaching a face has been completed and the results are displayed
51:09 The inference (text2img) results with SD 1.5 training
51:19 You have to do more inference with LoRA since it has less precision than DreamBooth
51:39 How to give more attention/emphasis to certain keywords in the SD Web UI
52:51 How to generate more than 100 images using the script section of the Web UI
54:46 How to check PNG info to see used prompts and settings
55:24 How to upscale using AI models
56:12 Fixing face image quality, especially eyes, with GFPGAN visibility
56:32 How to batch post-process
57:00 Where batch-generated images are saved
57:18 Conclusion and ending speech

resuming

It would be good to be able to specify and resume from a partial training run.

Just curious why the license was changed from MIT?

Just wanted to see why the license was changed to Apache 2.0 from MIT, a particular reason?

For anyone wondering, all code/commits prior to this license change are still licensed under MIT, just wanted to let anyone know that fancies MIT over Apache 2.0.

And if I'm not mistaken (I've heard this from a few people) all current code contributors needed to agree on the license change, or remove the contributed code?

It's not really that big of a deal to me personally, but I'm just curious overall.

Perhaps a simple fix for a docker container inside runpod.io?

Hello,

I am running a docker container of SD 2.1, but cannot seem to run training for LORA.
Here is the error I get when I try to run the default shell script in bash.

root@51235cb091e3:/workspace/stable-diffusion-webui/lora# 

**bash run_lora_db_w_text.sh** 

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Before training: Unet First Layer lora up tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        ...,
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
Before training: Unet First Layer lora down tensor([[ 2.7575e-02,  4.9739e-05,  3.9807e-02,  ..., -7.6583e-02,
         -3.2650e-03,  8.8336e-02],
        [-1.3945e-02,  3.5099e-02, -1.7838e-02,  ...,  1.0271e-03,
          1.0573e-02,  5.9847e-02],
        [-9.5399e-03,  4.8160e-02, -7.8387e-02,  ..., -6.7026e-02,
         -4.9318e-02, -1.3817e-02],
        [ 6.4708e-02, -7.2586e-02,  2.8864e-02,  ..., -1.0646e-01,
          2.2544e-02,  2.0882e-03]])
Before training: text encoder First Layer lora up tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        ...,
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
Before training: text encoder First Layer lora down tensor([[-0.0550, -0.1452, -0.0761,  ...,  0.0092, -0.1626, -0.0285],
        [-0.0316, -0.0067,  0.0563,  ...,  0.0868, -0.0227,  0.0530],
        [ 0.0371,  0.0766, -0.0804,  ..., -0.0817,  0.0129,  0.0713],
        [-0.0288, -0.0431,  0.0423,  ..., -0.0268,  0.0986,  0.0533]])
/venv/lib/python3.10/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
***** Running training *****
  Num examples = 8
  Num batches each epoch = 8
  Num Epochs = 1250
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 10000
Steps:   0%|              | 4/10000 [00:17<11:40:33,  4.21s/it]Traceback (most recent call last):
  File "/workspace/stable-diffusion-webui/lora/train_lora_dreambooth.py", line 958, in <module>
    main(args)
  File "/workspace/stable-diffusion-webui/lora/train_lora_dreambooth.py", line 784, in main
    for step, batch in enumerate(train_dataloader):
  File "/venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 383, in __iter__
    next_batch = next(dataloader_iter)
  File "/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/venv/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
IsADirectoryError: Caught IsADirectoryError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/stable-diffusion-webui/lora/train_lora_dreambooth.py", line 110, in __getitem__
    instance_image = Image.open(
  File "/venv/lib/python3.10/site-packages/PIL/Image.py", line 3092, in open
    fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/workspace/stable-diffusion-webui/lora/input/.ipynb_checkpoints'

Steps:   0%|              | 4/10000 [00:18<12:31:18,  4.51s/it]
Traceback (most recent call last):
  File "/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/venv/bin/python', 'train_lora_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1-base', '--instance_data_dir=./input', '--output_dir=./output', '--instance_prompt=game character a22a', '--train_text_encoder', '--resolution=768', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=1e-4', '--learning_rate_text=5e-5', '--color_jitter', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=10000']' returned non-zero exit status 1.

wrong time calculation

Your calculation of time is based on steps, but you need to multiply steps by the number of gradient accumulations to get the correct ETA estimate.

So currently, with accumulation 2 and 1000 steps, it gives an ETA for 1000 steps, but the actual count is 2000.

Error while merging to LORA checkpoints

File "c:\users\---\documents\github\---\venv\lib\site-packages\lora_diffusion\cli_lora_add.py", line 51, in add
    torch.save(out_list, output_path)
UnboundLocalError: local variable 'out_list' referenced before assignment

produces this error when trying to merge

Can't pickle the collate_fn function

Useful part of the traceback seems to be:
AttributeError: Can't pickle local object 'main..collate_fn'

I'm running within the automatic1111 virtual env on windows if that helps.

Gradient checkpointing blocks LoRA weight updates

Running with gradient checkpointing prevents LoRA weight updates.

Description: Ubuntu 18.04.6 LTS
diffusers==0.10.2
lora-diffusion==0.0.3
torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.0%2Bcu116-cp38-cp38-linux_x86_64.whl
transformers==4.25.1
xformers @ https://github.com/brian6091/xformers-wheels/releases/download/0.0.15.dev0%2B4c06c79/xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl

Accelerate version: 0.15.0
Platform: Linux-5.10.133+-x86_64-with-glibc2.27
Python version: 3.8.16
Numpy version: 1.21.6
PyTorch version (GPU?): 1.13.0+cu116 (True)

With gradient checkpointing enabled,

!accelerate launch
--mixed_precision="fp16"
lora/train_lora_dreambooth.py
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5"
--instance_data_dir="$INSTANCE_DIR"
--output_dir="$OUTPUT_DIR"
--instance_prompt="$INSTANCE_PROMPT"
--train_text_encoder
--resolution=512
--use_8bit_adam
--seed=1234
--mixed_precision="fp16"
--train_batch_size=4
--gradient_accumulation_steps=1
--gradient_checkpointing
--learning_rate=1e-4
--lr_scheduler="constant"

produces

Before training: Unet First Layer lora up tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
...,
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
Before training: Unet First Layer lora down tensor([[-0.0125, 0.0331, 0.0198, ..., 0.0715, 0.0393, -0.1777],
[-0.0442, 0.0572, 0.0026, ..., 0.0876, 0.0085, 0.0050],
[ 0.0410, -0.0777, 0.0313, ..., -0.0613, -0.0111, -0.0451],
[ 0.0202, -0.0079, 0.1156, ..., -0.0167, 0.0915, 0.0737]])
Before training: text encoder First Layer lora up tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
...,
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
Before training: text encoder First Layer lora down tensor([[ 0.0175, 0.0458, 0.1019, ..., -0.1470, 0.1538, 0.0120],
[-0.0307, -0.1303, 0.0911, ..., 0.0317, 0.0829, 0.0084],
[-0.0016, 0.1495, -0.1105, ..., -0.0781, 0.0122, 0.0272],
[ 0.0182, -0.0064, -0.0268, ..., 0.0800, 0.0745, 0.0231]])

and after some iterations

First Unet Layer's Up Weight is now : tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
...,
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], device='cuda:0')
First Unet Layer's Down Weight is now : tensor([[-0.0125, 0.0331, 0.0198, ..., 0.0715, 0.0393, -0.1777],
[-0.0442, 0.0572, 0.0026, ..., 0.0876, 0.0085, 0.0050],
[ 0.0410, -0.0777, 0.0313, ..., -0.0613, -0.0111, -0.0451],
[ 0.0202, -0.0079, 0.1156, ..., -0.0167, 0.0915, 0.0737]],
device='cuda:0')
First Text Encoder Layer's Up Weight is now : tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
...,
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], device='cuda:0')
First Text Encoder Layer's Down Weight is now : tensor([[ 0.0175, 0.0458, 0.1019, ..., -0.1470, 0.1538, 0.0120],
[-0.0307, -0.1303, 0.0911, ..., 0.0317, 0.0829, 0.0084],
[-0.0016, 0.1495, -0.1105, ..., -0.0781, 0.0122, 0.0272],
[ 0.0182, -0.0064, -0.0268, ..., 0.0800, 0.0745, 0.0231]],
device='cuda:0')

Disabling gradient checkpointing seems to work fine

Before training: Unet First Layer lora up tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
...,
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
Before training: Unet First Layer lora down tensor([[-0.0125, 0.0331, 0.0198, ..., 0.0715, 0.0393, -0.1777],
[-0.0442, 0.0572, 0.0026, ..., 0.0876, 0.0085, 0.0050],
[ 0.0410, -0.0777, 0.0313, ..., -0.0613, -0.0111, -0.0451],
[ 0.0202, -0.0079, 0.1156, ..., -0.0167, 0.0915, 0.0737]])
Before training: text encoder First Layer lora up tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
...,
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
Before training: text encoder First Layer lora down tensor([[ 0.0175, 0.0458, 0.1019, ..., -0.1470, 0.1538, 0.0120],
[-0.0307, -0.1303, 0.0911, ..., 0.0317, 0.0829, 0.0084],
[-0.0016, 0.1495, -0.1105, ..., -0.0781, 0.0122, 0.0272],
[ 0.0182, -0.0064, -0.0268, ..., 0.0800, 0.0745, 0.0231]])

and after some iterations

First Unet Layer's Up Weight is now : tensor([[-2.6580e-03, 7.4147e-04, 1.7193e-03, 1.7760e-04],
[ 1.1531e-03, 3.4420e-04, 1.6359e-03, -2.6158e-05],
[-2.0090e-04, 9.7763e-04, 9.0458e-04, -1.2152e-03],
...,
[ 1.3022e-03, -1.6245e-03, 1.3225e-03, -2.2149e-03],
[-4.9904e-04, 7.6633e-04, -1.1046e-03, 8.2197e-04],
[ 2.1200e-03, -7.4285e-04, -2.7083e-03, 7.7677e-04]], device='cuda:0')
First Unet Layer's Down Weight is now : tensor([[-0.0120, 0.0336, 0.0196, ..., 0.0711, 0.0415, -0.1778],
[-0.0446, 0.0560, 0.0023, ..., 0.0877, 0.0085, 0.0037],
[ 0.0402, -0.0783, 0.0297, ..., -0.0606, -0.0117, -0.0448],
[ 0.0169, -0.0067, 0.1153, ..., -0.0172, 0.0923, 0.0734]],
device='cuda:0')
First Text Encoder Layer's Up Weight is now : tensor([[ 2.7144e-05, 4.5192e-05, -4.2374e-05, 3.5689e-05],
[ 1.5236e-04, 2.1131e-04, 2.3639e-04, 1.8105e-04],
[ 1.6095e-04, -1.8110e-04, -6.2436e-05, 1.2356e-04],
...,
[ 1.3739e-04, -1.1521e-04, -1.0960e-04, 1.2269e-04],
[ 1.6732e-05, -1.3146e-05, -2.5539e-04, 1.7016e-04],
[ 2.5715e-04, -3.0459e-04, -1.9317e-04, -2.3927e-04]], device='cuda:0')
First Text Encoder Layer's Down Weight is now : tensor([[ 0.0172, 0.0456, 0.1016, ..., -0.1473, 0.1535, 0.0117],
[-0.0304, -0.1304, 0.0914, ..., 0.0319, 0.0832, 0.0087],
[-0.0014, 0.1495, -0.1103, ..., -0.0783, 0.0124, 0.0275],
[ 0.0183, -0.0065, -0.0266, ..., 0.0801, 0.0748, 0.0234]],
device='cuda:0')

depth to image models?

Hi, do you think you could take a look at whether it would be possible to train depth2image models with LoRA?

Text encoder still not working correctly with LoRa Dreambooth training script

Hello, I am getting much better results using the --train_text_encoder flag with the Dreambooth script. However, the actual outputted LoRA .pt files from models trained with train_text_encoder give very bad results after using monkeypatch to generate images. I suspect that the text encoder's weights are still not saved properly. I tried to save the pipeline directly after each epoch from within the training script, but loading it using diffusers gives me strange errors about torch not being able to parse the linear layers. Does anyone have similar experiences with training the text encoder, or any idea why this is happening?

Images sampled from within the training loop (train_text_encoder enabled) :
2
2e
3

Images sampled after model was monkeypatch with the trained LoRa weights (train_text_encoder enabled) :
bad1
bad2
bad3

The images don't seem to correlate with the samples generated while training and have very little cohesiveness with the training images used.

RuntimeError: Input type (c10::Half) and bias type (float) should be the same

First of all, congrats on the great work!

I got this error in the middle of training on T4 in a colab:

***** Running training *****
  Num examples = 16
  Num batches each epoch = 16
  Num Epochs = 938
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 15000
Steps:  53% 8000/15000 [1:06:51<59:27,  1.96it/s, loss=0.215, lr=0.0001]
Fetching 12 files: 100% 12/12 [00:00<00:00, 9088.42it/s]
Steps:  53% 8000/15000 [1:06:54<59:27,  1.96it/s, loss=0.831, lr=0.0001]Traceback (most recent call last):
  File "train_lora_dreambooth.py", line 964, in <module>
    main(args)
  File "train_lora_dreambooth.py", line 864, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py", line 375, in forward
    sample = self.conv_in(sample)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Steps:  53% 8000/15000 [1:06:55<58:33,  1.99it/s, loss=0.831, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_lora_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1-base', '--instance_data_dir=./data_example', '--output_dir=./output_example', '--instance_prompt=ghblx style', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=15000', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_checkpointing']' returned non-zero exit status 1.

Works fine with fewer steps. Not sure why this is happening.

works but could use guidance

I seem to be one of the few to get it working

here are some thoughts and issues

  1. Might require bleeding-edge diffusers, transformers, and accelerate. I always have them at the current git HEAD; others are probably using older or much older versions.
  2. Working on 6 GB requires xformers and bitsandbytes (bnb).
  3. I had to downgrade xformers because the latest has an intermittent incompatibility (the xformers cutlass backward isn't compatible with 30xx GPUs, and maybe not 40xx GPUs, but current xformers sometimes defaults to cutlass for memory-efficient attention).
  4. The learning rate has to be massively higher than expected (1e-4!). People are used to learning rates of 1e-6 or so, at which it learns essentially nothing.
  5. People don't understand what LoRA is doing; you might want to provide a brief explanation of why it can do Dreambooth with such a small file difference.

Perhaps do a 1.4 example also since people are more familiar with it than 2.1.

You might want to do a tweaked example showing the corgi working (what I posted on reddit works, but isn't very good)

https://huggingface.co/docs/diffusers/training/dreambooth

https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ

You might want to have the end of the example turn it into a .ckpt file so people get something they are used to using.

Also I'd recommend an example trained on a person.

Need some small explanations if possible, unet, text encoder, how to use with existing model - for course on stable diffusion

Hello. I am working on a full course for Stable Diffusion. Of course, I should also make a lecture for LoRA.

Can you explain to me
unet vs text encoder?
they produce same file output?
what is the difference?
difference between Lora and dreambooth?

So dreambooth modifies the trained model weights and assigns them to the new prompt you generated. What does LoRA do?

and the most important part perhaps

Will LoRA be available in the automatic1111 web UI? My lecture will be centered on that.
Also, let's say someone generated a unet and a text encoder. They produce different outputs, right? So how can they use them in the web UI?

by the way this is my first tutorial video with dreambooth : https://www.youtube.com/watch?v=mnCY8uM7E50

image

`OrderedDict mutated during iteration` during inference

Hi @cloneofsimo ,

Thanks for sharing the great repo.

I got the following runtime error when changing the checkpoint.

It can be solved by restarting the Jupyter notebook. Why can't I run the same cell twice with the same code?

import torch
from lora_diffusion import monkeypatch_lora, tune_lora_scale

# `pipe` and `prompt` are defined in an earlier cell.
monkeypatch_lora(pipe.unet, torch.load("../lora_weight_e3499_s17500.pt"))
tune_lora_scale(pipe.unet, 1.00)

torch.manual_seed(1)
image = pipe(prompt, num_inference_steps=50, guidance_scale=7).images[0]
# image.save("../contents/disney_lora.jpg")
image
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_3708385/3204262117.py in <module>
      1 checkpoint = torch.load("..//output_example/lora_weight_e3499_s17500.pt")
----> 2 monkeypatch_lora(pipe.unet, checkpoint)
      3 tune_lora_scale(pipe.unet, 1.00)
      4 
      5 torch.manual_seed(1)

~/Documents/I2I/lora/lora_diffusion/lora.py in monkeypatch_lora(model, loras, target_replace_module)
    130     for _module in model.modules():
    131         if _module.__class__.__name__ in target_replace_module:
--> 132             for name, _child_module in _module.named_modules():
    133                 if _child_module.__class__.__name__ == "Linear":
    134 

~/anaconda3/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py in named_modules(self, memo, prefix, remove_duplicate)
   1879                 memo.add(self)
   1880             yield prefix, self
-> 1881             for name, module in self._modules.items():
   1882                 if module is None:
   1883                     continue

RuntimeError: OrderedDict mutated during iteration
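
For what it's worth, my reading (an assumption, not a confirmed diagnosis) is that the error only shows up when the cell patches an already-patched UNet: monkeypatch_lora swaps modules inside the model while it is still walking named_modules(), so applying it a second time trips over the mutated module dict. A minimal workaround sketch is to rebuild the pipeline so the UNet is unpatched before calling monkeypatch_lora again; the base model id and dtype below are placeholders for whatever was loaded originally:

    import torch
    from diffusers import StableDiffusionPipeline
    from lora_diffusion import monkeypatch_lora, tune_lora_scale

    # Re-create the pipeline so the UNet is back in its original, unpatched state.
    # "runwayml/stable-diffusion-v1-5" is only a placeholder for the base model
    # loaded in the earlier cell.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Patch once per freshly loaded pipeline, then adjust the merge scale as usual.
    monkeypatch_lora(pipe.unet, torch.load("../lora_weight_e3499_s17500.pt"))
    tune_lora_scale(pipe.unet, 1.00)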

AttributeError: Can't pickle local object 'main.<locals>.collate_fn'

***** Running training *****
Num examples = 6
Num batches each epoch = 6
Num Epochs = 834
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 5000
Steps: 0%| | 0/5000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "D:...\lora\train_lora_w_ti.py", line 1095, in <module>
    main(args)
  File "D:...\lora\train_lora_w_ti.py", line 901, in main
    for step, batch in enumerate(train_dataloader):
  File "D:...\venv_diffusers_sd_2\lib\site-packages\accelerate\data_loader.py", line 373, in __iter__
    dataloader_iter = super().__iter__()
  File "D:...\venv_diffusers_sd_2\lib\site-packages\torch\utils\data\dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "D:...\venv_diffusers_sd_2\lib\site-packages\torch\utils\data\dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "D:...\venv_diffusers_sd_2\lib\site-packages\torch\utils\data\dataloader.py", line 1077, in __init__
    w.start()
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.collate_fn'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Steps: 0%| | 0/5000 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\myusername\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:...\venv_diffusers_sd_2\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:...\venv_diffusers_sd_2\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:...\venv_diffusers_sd_2\lib\site-packages\accelerate\commands\launch.py", line 1069, in launch_command
    simple_launcher(args)
  File "D:...\venv_diffusers_sd_2\lib\site-packages\accelerate\commands\launch.py", line 551, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\...\venv_diffusers_sd_2\Scripts\python.exe', 'code\resources\lora\train_lora_w_ti.py', '--pretrained_model_name_or_path=models\SD-2-1-512', '--instance_data_dir=data/instance_images/myboy_512/subject_images', '--output_dir=models_out/lora_512/', '--train_text_encoder', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-5', '--learning_rate_text=5e-5', '--learning_rate_ti=5e-4', '--color_jitter', '--lr_scheduler=constant', '--lr_warmup_steps=100', '--max_train_steps=5000', '--save_steps=500', '--unfreeze_lora_step=2000', '--placeholder_token=', '--learnable_property=object', '--initializer_token=boy']' returned non-zero exit status 1.
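
This looks like a Windows multiprocessing quirk rather than a LoRA bug: DataLoader workers are spawned, so everything handed to them must be picklable, and a collate_fn defined inside main() is a local closure that cannot be pickled. Two generic workarounds, sketched under that assumption (not the repo's official fix; `train_dataset` and `collate_fn` stand in for the objects from the training script):

    from torch.utils.data import DataLoader

    # Option 1: no worker processes, so nothing has to be pickled.
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=0,
        collate_fn=collate_fn,
    )

    # Option 2: move the collate function to module level instead of defining it
    # inside main(), so Windows' spawn-based workers can pickle it.
    def collate_fn(examples):
        # build and return the batch dict exactly as the original local collate_fn does
        return examples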

not working with 1.4?

Worked fine with 2.1, but it is not working with 1.4, with current-head diffusers, transformers, accelerate, and xformers.


./run_lora_db.sh
Parameter containing:
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        ...,
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], requires_grad=True)
Parameter containing:
tensor([[ 0.0533,  0.0358, -0.0984,  ..., -0.0165, -0.1115, -0.0122],
        [ 0.0298, -0.0776,  0.0743,  ...,  0.0782, -0.1170,  0.0126],
        [-0.0119,  0.0010, -0.0604,  ...,  0.0374,  0.0687, -0.0075],
        [ 0.0462, -0.0886, -0.0334,  ...,  0.0006, -0.0117, -0.0383]],
       requires_grad=True)
***** Running training *****
  Num examples = 5
  Num batches each epoch = 5
  Num Epochs = 80
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 400
Steps:   0%| | 0/400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/Users/tommu/lora/train_lora_dreambooth.py", line 964, in <module>
    main(args)
  File "/mnt/c/Users/tommu/lora/train_lora_dreambooth.py", line 898, in main
    accelerator.backward(loss)
  File "/mnt/c/Users/tommu/accelerate/src/accelerate/accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(tensors, grad_tensors_, retain_graph, create_graph, inputs, allow_unreachable=True, accumulate_grad=True)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(tensors, grad_tensors_, retain_graph, create_graph, inputs, allow_unreachable=True, accumulate_grad=True)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/mnt/c/Users/tommu/xformers/xformers/ops/memory_efficient_attention.py", line 422, in backward
    ) = torch.ops.xformers.efficient_attention_backward_cutlass(
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Steps:   0%| | 0/400 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/home/username/anaconda3/envs/diffusers/bin/accelerate", line 33, in <module>
    sys.exit(load_entry_point('accelerate', 'console_scripts', 'accelerate')())
  File "/mnt/c/Users/tommu/accelerate/src/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/mnt/c/Users/tommu/accelerate/src/accelerate/commands/launch.py", line 1120, in launch_command
    simple_launcher(args)
  File "/mnt/c/Users/tommu/accelerate/src/accelerate/commands/launch.py", line 574, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/username/anaconda3/envs/diffusers/bin/python', 'train_lora_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=./data_example', '--output_dir=./output_example', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=400']' returned non-zero exit status 1.


Code license?

Hi, @cloneofsimo
Thank you for open-sourcing this repo. It seems no license file is included; what is the license of this repo? Could you add a LICENSE file to it?

fine-tune stable diffusion model on customized text-image dataset.

Hi @cloneofsimo ,

Thanks again for sharing the awesome work.

Would it be possible for you to share an example to fine-tune the model on customized datasets?

For example, the pokemon dataset https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions

Or any suggestions on how to modify this file for a custom text-image dataset (where each image has its own caption), instead of the single

--instance_prompt="game character bnha" \
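
Not an official answer, but one common pattern is to replace the single instance prompt with a dataset that yields (image, caption) pairs, e.g. one .txt caption file next to each image. A minimal sketch of such a dataset (the class name, folder layout, and transforms are my assumptions, not part of this repo):

    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import transforms


    class CaptionedImageDataset(Dataset):
        """Yields pixel values and a caption from a folder where every
        image foo.jpg has a matching caption file foo.txt."""

        def __init__(self, root, size=512):
            self.paths = sorted(
                p for p in Path(root).iterdir() if p.suffix.lower() in {".jpg", ".png"}
            )
            self.transform = transforms.Compose([
                transforms.Resize(size),
                transforms.CenterCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
            ])

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            path = self.paths[idx]
            image = Image.open(path).convert("RGB")
            caption = path.with_suffix(".txt").read_text().strip()
            return {"pixel_values": self.transform(image), "caption": caption}

The training loop would then tokenize each batch's captions instead of encoding the fixed instance prompt once.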

monkeypatch_lora does not correctly replace to_out Linear

monkeypatch_lora finds CrossAttention blocks and then looks for any Linear layers inside them.

It uses named_modules(), which finds all descendants (immediate children as well as deeper sub-children), but it then sets the replacement LoraInjectedLinear on the CrossAttention block directly.

CrossAttention has a ModuleList for to_out that contains a Linear.

Because of the above, the to_out Linear is not replaced correctly. Instead, an (unused) LoraInjectedLinear module named to_out.0 is set on the CrossAttention block (see the sketch after the lists below).

You can tell by looking at the module names on the CrossAttention block after patching.

Before patching:

to_q
to_k
to_v
to_out

After patching:

to_q
to_k
to_v
to_out
to_out.0
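
A minimal sketch of the kind of replacement that would handle the nested case (a hypothetical helper, not the repo's code): resolve the dotted name returned by named_modules() down to the direct parent, then setattr on that parent, so "to_out.0" replaces the Linear inside the ModuleList instead of adding a new "to_out.0" attribute on the CrossAttention block:

    import torch.nn as nn

    def replace_child(root: nn.Module, dotted_name: str, new_module: nn.Module):
        # Walk "to_out.0" down to root.to_out, then replace entry "0" on that
        # ModuleList, rather than setting a new "to_out.0" attribute on `root`.
        *parent_path, leaf = dotted_name.split(".")
        parent = root
        for part in parent_path:
            parent = getattr(parent, part)
        setattr(parent, leaf, new_module)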

pt in automatic 1111

I would like to know: is it possible to use the files produced with LoRA in the AUTOMATIC1111 web UI?

Change to or include Apache2 license

See #12

Just an FYI:
To stay completely legit, you should license this repo as Apache2 -OR- at least include the Apache2 license in the repo. This is because train_lora_dreambooth.py is a derivative work of https://github.com/huggingface/diffusers, which is licensed Apache2. Source under an Apache2 license requires distributing a copy of that license.

Moving from MIT to Apache2 doesn't change your rights, or any downstream users of this repo. Commercial use, etc, are still allowed. The only thing that changes is a requirement to distribute a copy of that license.

Installation problem

I installed LoRA in AUTOMATIC1111 via the Extensions tab. It says that it is installed, but the Dreambooth tab did not appear. What could be the problem?

Un-inject values from Unet

Right now, if we call save_pretrained with the UNet modified for LoRA training, the diffusion pipeline will refuse to load the UNet later because of some "Linear" values.

Is there a way to un-inject the LoRA values from the UNet before doing pipeline.save_pretrained?
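
Not an official API, but assuming the injected module keeps the original layer around as a .linear attribute (which is how LoraInjectedLinear appears to be structured), a rough un-inject sketch would collect the wrappers and swap the plain Linear layers back in before saving:

    import torch.nn as nn

    def undo_lora_injection(model: nn.Module):
        # Collect first, then replace, so the module dict is not mutated mid-iteration.
        # Assumes each LoraInjectedLinear stores the original layer as `.linear`.
        replacements = []
        for module in model.modules():
            for name, child in module.named_children():
                if child.__class__.__name__ == "LoraInjectedLinear":
                    replacements.append((module, name, child.linear))
        for module, name, original_linear in replacements:
            setattr(module, name, original_linear)

    # `pipe` is assumed to be the already-loaded diffusion pipeline.
    undo_lora_injection(pipe.unet)
    pipe.save_pretrained("./model_without_lora")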

Training textencoder

Currently only the UNet is trained with LoRA, but especially with DreamBooth it is most common to also train the text encoder at the same time, as it gives vastly better results. Is it possible to also apply the LoRA method to the text encoder to get the same benefits there?

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

Hello.
I am new to all this and I am not even sure if this is a bug or expected behavior... but I am getting this error
"RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'"
in

lora_diffusion/lora.py", line 348, in weight_apply_lora
    weight = weight + alpha * (up_weight @ down_weight).type(weight.dtype)

when merging the LoRA weight with runwayml/stable-diffusion-v1-5 on branch fp16 using mode upl-ckpt-v2. It works fine on branch main (I assume because it doesn't use half-precision floats?).
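
The error comes from doing the fp16 matmul on the CPU, which PyTorch does not implement for Half. A generic workaround sketch (not a patch from this repo) is to upcast the LoRA factors to float32 for the merge and cast the result back, or to run the merge on a CUDA device:

    # Hypothetical variant of the merge line quoted above: upcast the fp16 LoRA
    # factors to float32 so the CPU matmul is supported, then cast back.
    delta = up_weight.float() @ down_weight.float()
    weight = weight + alpha * delta.to(weight.dtype)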

Doesn't seem to work on Windows

Just wondering if this is another Linux-only version of low-VRAM DreamBooth. It seems to do what the other implementations do and just quits after 1 step or less. This log was pulled from running the training script directly, as accelerate complains that there is no GPU while there is in fact an RTX 3070 present.

I should add that this is native Windows, no WSL of any kind.

Steps: 0%| | 0/1000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "H:\Utilities\CUDA\Dreambooth\lora\train_lora_dreambooth.py", line 964, in <module>
    main(args)
  File "H:\Utilities\CUDA\Dreambooth\lora\train_lora_dreambooth.py", line 836, in main
    for step, batch in enumerate(train_dataloader):
  File "H:\Utilities\CUDA\Dreambooth\dreambooth\lib\site-packages\accelerate\data_loader.py", line 345, in __iter__
    dataloader_iter = super().__iter__()
  File "H:\Utilities\CUDA\Dreambooth\dreambooth\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "H:\Utilities\CUDA\Dreambooth\dreambooth\lib\site-packages\torch\utils\data\dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "H:\Utilities\CUDA\Dreambooth\dreambooth\lib\site-packages\torch\utils\data\dataloader.py", line 1034, in __init__
    w.start()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.collate_fn'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

EOFError when starting training, versioning problem?

I keep getting this error when starting training:

Steps: 0%| | 0/5000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "\Github\lora\train_lora_dreambooth.py", line 998, in <module>
    main(args)
  File "\Github\lora\train_lora_dreambooth.py", line 809, in main
    for step, batch in enumerate(train_dataloader):
  File "\Python\Python310\lib\site-packages\accelerate\data_loader.py", line 345, in __iter__
    dataloader_iter = super().__iter__()
  File "\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 1077, in __init__
    w.start()
  File "\Python\Python310\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "\Python\Python310\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "\Python\Python310\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "\Python\Python310\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "\Python\Python310\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'DreamBoothDataset.__init__.<locals>.<lambda>'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "****\Python\Python310\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Seems to be a versioning problem of /lib/multiprocessing?

Quick question about the code, re: porting to Flax

Dude, this is awesome! I've been messing with textual inversion for a while, but it's not as precise as I want, and this looks like the better way!
OK, so I'd like to help extend this to the Flax method, which runs way faster than torch on TPUs and even GPUs. But since I'm not familiar with the DreamBooth and AUTOMATIC1111 codebases, can you point me to the parts of the training script that you modified? Or I guess I can just try to diff the repos... which was the starting one you forked from? Any gotchas to watch for, or has anyone already started on this front?
Also, I noticed the PyTorch checkpoints have different weight/layer names; hoping anyone reading can point to how we can map across...

Data to reproduce the results

I can't get good results fine-tuning on faces; maybe there is a bug there. It would be good to have the data and your training settings to reproduce the results shown in the README.
