victorchall / everydream2trainer Goto Github PK

View Code? Open in Web Editor NEW

788.0 788.0 116.0 5.01 MB

License: Other

Batchfile 0.25% Python 86.96% Jupyter Notebook 11.87% Dockerfile 0.73% Shell 0.19% PowerShell 0.01%

everydream2trainer's People

Contributors

Stargazers

Watchers

Forkers

cian0 digburn tjennings cma42041 cohan8999 sagnik1511 ctorx jangerritsen plucked crackfoo fkunn1326 damian0815 qslug c00renut vicdanis nitsgarg janekm noprompt harshvardhan96 jloupdef archonoff pratik-behera ivg-design justinmerrell ianpinder yuan6785 platform-kit spaablauw xjdeng alleniver rakeshchowdhury victorhallsai juanigsrz gessyoo weiljimmer nseq treksis amirothman shreyazomato haqthat g2asell2019 idlebg mykeehu aigc-dao kgonia finesseus fuciuss mossbraker nachopicchu guillermojacquez ansonkao mooreliving777 davidpicsart qfaqir alex-gz indoorshark djgould yankawayu 1r00t tony109060581 amorporkian inevitable-2031 kelvie aznctryboi87 savosru mfalex by321 moonryul mooonservice pent a-l-e-x-d-s-9 luisgabrielroldan minienglish1 ciizamista4to1 teila angeloshredder stanwarp nawnie frangin2003 jackfrostforevergroup timlawrenz gobomart tangtc1981 erichocean reijerh fai247 ratribiana wipwai enixile hshallucinations jaaron-gunpot luffythefox kyung01 isurajboss ali-husain proxytype 2kpr xujinbao282 coralreefman py85252876

everydream2trainer's Issues

Validation no output

Hi!
Using the default validation config on google colab produces no output.
When an epoch finishes it will validate for a few steps but there is no loss/val output in the console.

Saving every N epochs worked once but has not worked for me since.

Checkpoint saving worked the very first time I ran this, but then doesn't occur since.
Not doing anything complex- only 3 images in "input" and the following args:

python train.py --resume_ckpt "sd_v1-5_vae" ^
--max_epochs 500 ^
--resolution 512 ^
--data_root "input" ^
--lr_scheduler constant ^
--project_name mfrkr_sd15 ^
--batch_size 3 ^
--sample_steps 25 ^
--lr 1e-6 ^
--shuffle_tags ^
--save_every_n_epochs 25 ^
--amp ^
--useadam8bit ^
--ed1_mode

It makes the sample every 25 OK, but doesn't save a checkpoint at epoch 25. Only at the end.

Epochs: 100%|██| 500/500 [15:15<00:00,  9.02s/it, img/s=5.14, loss/log_step=0.138, lr=1e-6, vram=20075/24564 MB gs:499] * Saving diffusers model to logs\mfrkr_sd15_20230131-154030/ckpts/last-mfrkr_sd15-ep499-gs00500
 * Saving SD model to .\last-mfrgx_sd15-ep499-gs00500.ckpt

And that's the only time the console prints anything about a checkpoint save.
What am I doing wrong please? :)

What is EveryDream2trainer?

I am reading readme but no info available

Is this a new method like DreamBooth or else? What is it

What is the advantage compared to DreamBooth?

ty for answers

Let's Work Together

Hey there! Developer of the dreambooth extension for Auto1111 (and soon to be a stand-alone application) here.

I was recently made aware of some of enhancements you've added to the Dreambooth training process, and would cordially like to invite you to a discussion regarding a collaboration between you, myself, and some of the other prominent developers working on Dreambooth/SD training, with an end goal of coming together to create one new uniform training method with all the bells and whistles.

If you're interested, I've got a discord over here where I'll be inviting some of the other devs to come and chat. It'd be awesome to talk with you and discuss how we can work together. ;)

https://discord.gg/q8dtpfRD5w

linux or docker instructions

Hi there.

Will there be and doc/guide on how to install on linux or make a docker container for the trainer? ( ideally the second )

I tried myself but I get errors on installing the packages and I couldn't get it to run.

Thank you.

Corrupted Model after training

I'm running a everydream trainer with configs and it seems to be giving corrupted models.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 97, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 777, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 282, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 103, in load_state_dict
    if f.read().startswith("version"):
  File "/usr/local/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/EveryDream2trainer/scripts/txt2img.py", line 101, in <module>
    main(args)
  File "/content/EveryDream2trainer/scripts/txt2img.py", line 68, in main
    unet = UNet2DConditionModel.from_pretrained(args.diffusers_path, subfolder="unet")
  File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 489, in from_pretrained
    state_dict = load_state_dict(model_file)
  File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 115, in load_state_dict
    raise OSError(
OSError: Unable to load weights from checkpoint file for '/content/EveryDream2trainer/logs/test-ep99-gs01400/ckpts/last-test-ep99-gs01400/unet/diffusion_pytorch_model.bin' at '/content/EveryDream2trainer/logs/test-ep99-gs01400/ckpts/last-test-ep99-gs01400/unet/diffusion_pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Training Command:

python train.py \
--resume_ckpt runwayml/stable-diffusion-v1-5 \
--gradient_checkpointing \
--amp \
--batch_size 6 \
--grad_accum 1 \
--cond_dropout 0.00 \
--data_root EveryDream2trainer/images \
--flip_p 0.00 \
--lr 1e-06 \
--lr_decay_steps 0 \
--lr_scheduler constant \
--lr_warmup_steps 0 \
--max_epochs 100 \
--project_name test \
--resolution 512 \
--sample_prompts sample_prompts.txt \
--sample_steps 300 \
--save_every_n_epoch 0 \
--seed 555 \
--shuffle_tags \
--useadam8bit

Model not training on runpod and colab

Runpod and Google Colab training is not working for the past few days.

I am not getting any errors but after training when I test the model it seems like it's not trained at all.
I didn't get my style/ person /object even by using caption tokens.
With the same setting and same data, it used to work on Colab and run pod.
I am using default settings only.

For eg. Earlier prompt "doctor in rdanimestyle" is used to give a doctored image in the trained style.
Earlier even "doctor" prompts were used to give outputs in a trained style.
Now the same prompts are just given to a normal outputs.

I used all naming methods, like

0001.jpg , 0001.txt
rdanime boy_.jpg , rdanime boy_.txt
using caption as the image name etc.

When testing using runpod and colab test section I am pretty sure the image results are coming from base SD 1.5v model only not the trained model.

I also downloaded last-....ckpt and tested it with other colab still results are like base mode. So I think the issue is with training, not the testing code.

Would really appreciate your help on this.

Converting to diffuser format a SD 1.5 based ckpt

Hello,

when i try to convert to diffuser format a ckpt file with the command:
python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
--original_config_file v1-inference.yaml ^
--image_size 512 ^
--checkpoint_path dreamshaper_331BakedVae.ckpt ^
--prediction_type epsilon ^
--upcast_attn False ^
--dump_path "ckpt_cache/dreamshaper_331BakedVae"

(dreamshaper_331BakedVae.ckpt is available on civitai and is based on SD 1,5)

I receive the following answer:

global_step key not found in model
load_state_dict from directly
Checkpoint dreamshaper_331BakedVae.ckpt has both EMA and non-EMA weights.
In this conversion only the non-EMA weights are extracted. If you want to instead extract the EMA weights (usually better for inference), please make sure to add the --extract_ema flag.
Traceback (most recent call last):
File "D:\download\AI\EveryDream2trainer\utils\convert_original_stable_diffusion_to_diffusers.py", line 963, in
unet.load_state_dict(converted_unet_checkpoint)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
Missing key(s) in state_dict: "up_blocks.0.upsamplers.0.conv.weight", "up_blocks.0.upsamplers.0.conv.bias", "up_blocks.1.upsamplers.0.conv.weight", "up_blocks.1.upsamplers.0.conv.bias", "up_blocks.2.upsamplers.0.conv.weight", "up_blocks.2.upsamplers.0.conv.bias".
Unexpected key(s) in state_dict: "up_blocks.0.attentions.2.conv.bias", "up_blocks.0.attentions.2.conv.weight".

Is there is a way to fix this ?

best learning rate

so I made some experiments changing the lr and I found out something strange.
The trainings seems to improve when the lr is as low as 1e-7
The loss decreases steadly and the sample images are more consistent.
Is it normal? Am I doing something wrong?

setup:
gpu rtx 4090 48vcpu, 124 gb ram
batch size: 10
lr_scheduler: cosine
lr: 1.2e-7
dataset size: 5k
captions type: tags

Also I'm about to train with an A100, 50k training set, and I'm not sure about the batch size and how can i finetune hyperparameters to take advantage of the entire gpu.

Training stops when generating samples

Everything seems to work fine up until the first sample generation and then it stops with this:

D:\EveryDream2trainer\venv\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Something went wrong, attempting to save model

Saving diffusers model to logs\Original_Inked_Artv15_20230218-184037/ckpts/errored-Original_Inked_Artv15-ep00-gs00091
Saving SD model to D:\TrainedCheckpoints\errored-Original_Inked_Artv15-ep00-gs00091.ckpt
Traceback (most recent call last):
File "D:\EveryDream2trainer\train.py", line 944, in
main(args)
File "D:\EveryDream2trainer\train.py", line 871, in main
raise ex
File "D:\EveryDream2trainer\train.py", line 813, in main
sample_generator.generate_samples(inference_pipe, global_step)
File "D:\EveryDream2trainer\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\EveryDream2trainer\utils\sample_generator.py", line 189, in generate_samples
images = pipe(prompt=prompts,
File "D:\EveryDream2trainer\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\EveryDream2trainer\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 494, in call
latents = self.prepare_latents(
File "D:\EveryDream2trainer\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 386, in prepare_latents
latents = torch.randn(shape, generator=generator, device=device, dtype=dtype)
TypeError: randn() received an invalid combination of arguments - got (tuple, generator=list, dtype=torch.dtype, device=torch.device), but expected one of:
(tuple of ints size, *, torch.Generator generator, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
(tuple of ints size, *, torch.Generator generator, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
(tuple of ints size, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
(tuple of ints size, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

This is my first time using EveryDream so it is likely it's something silly done on my end, but I am stuck. I tried to see if anyone else had reported the issue but didn't find anything to go off of. The first msg about "none of the inputs" seems to happen at every new epoch and doesn't seem related to everything else but I included it just in case...

First samples for SD 2.1 training on RunPod aren't correct

These sort of images appear for the first samples, but the following ones are perfectly fine. Training on SD 2.1 on RunPod. Config file is trainSD21.json and these extra params:

--max_epochs 90
--sample_steps 93
--save_every_n_epochs 15
--lr 1.2e-6
--lr_scheduler constant
--save_full_precision

suggestion: add "--dont_save_until_x_epochs" param

Sometimes you already know when you start training that the model won't be any good until many epochs have passed, so there's no point in saving the model until training has passed that point. But after those n epochs, you don't already know when overfitting will occur. So you want the "--save_every_n_epochs" param to be low. But if it's low, you end up with many unneeded saved models, and you have limited storage.

Adding a "--dont_save_until_x_epochs" param would fix that by not saving any of the first x-1 epochs, but saving x and afterwards saving every n epochs.

[BUG] LR set in config json does not seem to have any effect

These are two different runs, overlaid on top of each other - one with LR 4e-6, the other one with LR 7e-6:

I've done another one immeditely after at 1e-6 - exact same loss.
Is the optimizer.json LR overwriting the config file one by any chance?

[ERROR]

Runtime environment (please complete the following information):

OS: Linux
Is this your local computer or a cloud host? Google Colab, free tier

Additional context
When I run the training cell, I receive the following error:

[BUG] Negative loss/stabilize-train and loss/val

For some reason, i keep getting negative loss/stabilize-train and loss/val:

It starts off at 0, and goes into negative value each epoch.
I'm not sure why.

I'm training on 1065 images, 10% split off for validation.
Stabilize split proportion set to 0.10 too.
1 batch.

Use of "_" and "," to delineate different prompt areas

Hi, i am just preparing my dataset for use with Everydream, however i found some issues due to the unfortunate use of common characters like "," and "_" to delineate different sections of the training prompt.

For Tag shuffling "," is used to separate the first chunk (caption that is not shuffled) from all other chunks (tags that are shuffled). However this means that the caption can not contain a "," as this would declare everything after the "," to be tags and include it in shuffling.

A similar issue is the use of the underscore to separate the training promt from everything that is ignored at the end of the prompt. This means the tags can not contain a underscore as anything behind the first "_" would be ignored by the trainer. Unfortionately underscores are common in original booru-tags so those can not be used for training with everydream (without replacement of all underscores by spaces).

A suggestion would be to use more uncommon characters or escape-sequences to delineate the different sections of the training prompt (the "training section", the "tag section" and the "ignore section"). As an example you could use two points ".." to separate the caption section from the tags section and two underscores "__" to separate the tag section from the ignore section. Then the two issues mentioned above would not exist.

Move shuffle_tags to a property on ImageCaption class and shuffle on get instead of on epoch boundary

With multiply.txt, images can be repeated >1 in an epoch.

Currently shuffle_tags makes the captions shuffle every epoch.

It would be better to make shuffle_tags a bool in ImageCaption, and when the ImageCaption is read, shuffle the tags in the getter, so the tags are always shuffled, and it doesn't need to happen during the epoch shuffle.

Feature: Generate yaml file alongside SD ckpt

It would be benefitial for SD2 models if the generation of the SD dumps would include the copying of the v2-inference.yaml config file. This way users could directly point inference toolkits like A1111 directly to the ED2Trainer folder. Right now I'm manually copying yaml files everytime I want to test a model coming out of ED2Trainer.

## A1111 needs the yaml file to be the same name as the ckpt file
cp last-2301_MYPROJ_fp16_adam_3rd_run_200-ep199-gs03000.yaml last-2301_MYPROJ_fp16_4th_512_single_instance_image_200-ep199-gs00200.yaml

## A1111 then can load the model directly from the ED2Trainer folder
  python launch.py --ckpt /finetuning/EveryDream2trainer/last-2301_MYPROJ_fp16_4th_512_single_instance_image_200-ep199-gs00200.ckpt

Input/Output Error

During the Training using Google Colab version I'm getting the following error:

Traceback (most recent call last): File "/content/EveryDream2trainer/train.py", line 957, in <module> main(args) File "/content/EveryDream2trainer/train.py", line 883, in main raise ex File "/content/EveryDream2trainer/train.py", line 804, in main log_writer.add_scalar(tag="hyperparamater/lr", scalar_value=curr_lr, global_step=global_step) File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 391, in add_scalar self._get_file_writer().add_summary(summary, global_step, walltime) File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 113, in add_summary self.add_event(event, global_step, walltime) File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 98, in add_event self.event_writer.add_event(event) File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 117, in add_event self._async_writer.write(event.SerializeToString()) File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 171, in write self._check_worker_status() File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 212, in _check_worker_status raise exception File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244, in run self._run() File "/usr/local/lib/python3.10/ site-packages/tensorboard/summary/writer/event_file_writer.py", line 275, in _run self._record_writer.write(data) File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write self._writer.write(header + header_crc + data + footer_crc) File "/usr/local/lib/python3.10/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 773, in write self.fs.append(self.filename, file_content, self.binary_mode) File "/usr/local/lib/python3.10/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 167, in append self._write(filename, file_content, "ab" if binary_mode else "a") File "/usr/local/lib/python3.10/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 171, in _write with io.open(filename, mode, encoding=encoding) as f: OSError: [Errno 5] Input/output error

I tought change the name of the images to process will solve it but didnt works

Maybe in this pic you can see the error better:

Custom VAE support?

HI there 👋
A lot of models seem to require the following VAE: vae-ft-mse-840000-ema-pruned.ckpt.

Is it possible/required to specify a separate VAE when training using EveryDream in such cases?

Fatal Error loading image - unrecognized data stream contents when reading image file

looks like it was auto or notepad++ breaking some images during replace token process

I did redo all of the images and it is still giving this issue. I have cleared directories and reinstalled as well.

I am pre-processing in automatic then using BRU and N++ to change values to match folder and tags for subjects respectively.

It DID work with this.

What automation processes regarding theses steps have you all found to be most efficient and effective?

It DID work with four folders of subjects the first go so I know it can. I have the model and it's children that work. I just want it to hit harder.

Great work. I'm so happy to see 8bit on windows.

01/17/2023 05:37:49 AM ED1 mode: Overiding disable_xformers to True
01/17/2023 05:37:49 AM Seed: 555
01/17/2023 05:37:49 AM Logging to logs\01DSU90 - GMM&co_20230117-053749
01/17/2023 05:37:49 AM unet attention_head_dim: [8, 8, 8, 8]
01/17/2023 05:37:51 AM xformers not available or disabled
01/17/2023 05:37:51 AM * Using FP32 *
01/17/2023 05:37:53 AM �[36m * Training Text Encoder *�[0m
01/17/2023 05:37:53 AM �[36m * Using AdamW 8-bit Optimizer *�[0m
01/17/2023 05:37:53 AM �[36m * Optimizer: AdamW8bit *�[0m
01/17/2023 05:37:53 AM betas: (0.9, 0.999), epsilon: 1e-08 *�[0m
01/17/2023 05:37:53 AM * Creating new dataloader singleton
01/17/2023 05:37:53 AM * DLMA resolution 512, buckets: [[512, 512], [576, 448], [448, 576], [640, 384], [384, 640], [768, 320], [320, 768], [896, 256], [256, 896], [1024, 256], [256, 1024]]
01/17/2023 05:37:53 AM Preloading images...
01/17/2023 05:37:53 AM ** Trainer Set: 4, num_images: 38, batch_size: 10
01/17/2023 05:37:53 AM Pretraining GPU Memory: 11340 / 49140 MB
01/17/2023 05:37:53 AM saving ckpts every 1000000000.0 minutes
01/17/2023 05:37:53 AM saving ckpts every 100 epochs
01/17/2023 05:37:53 AM unet device: cuda:0, precision: torch.float32, training: True
01/17/2023 05:37:53 AM text_encoder device: cuda:0, precision: torch.float32, training: True
01/17/2023 05:37:53 AM vae device: cuda:0, precision: torch.float32, training: False
01/17/2023 05:37:53 AM scheduler: <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>
01/17/2023 05:37:53 AM �[32mProject name: �[0m�[92m01DSU90 - GMM&co�[0m
01/17/2023 05:37:53 AM �[32mgrad_accum: �[0m�[92m1�[0m
01/17/2023 05:37:53 AM �[32mbatch_size: �[0m�[92m10�[0m
01/17/2023 05:37:53 AM �[32mepoch_len: �[92m4�[0m
01/17/2023 05:38:22 AM Fatal Error loading image: C:\Users\TemporalLabSol\Desktop\EveryDream2trainer-main\EveryDream2trainer-main\SUBJECT\01DSU90\01DSU90 3-1-318632221_8171851539551764_21628442878249915_n.png:
01/17/2023 05:38:22 AM unrecognized data stream contents when reading image file

EL JSON
{
"amp": false,
"batch_size": 10,
"ckpt_every_n_minutes": null,
"clip_grad_norm": null,
"clip_skip": 0,
"cond_dropout": 0.04,
"data_root": "C:\Users\TemporalLabSol\Desktop\EveryDream2trainer-main\EveryDream2trainer-main\SUBJECT",
"disable_textenc_training": false,
"disable_xformers": false,
"ed1_mode": true,
"flip_p": 0.0,
"gpuid": 0,
"gradient_checkpointing": true,
"grad_accum": 1,
"logdir": "logs",
"log_step": 25,
"lowvram": false,
"lr": 3.5e-06,
"lr_decay_steps": 0,
"lr_scheduler": "constant",
"lr_warmup_steps": null,
"max_epochs": 90,
"notebook": false,
"project_name": "01DSU90 - GMM&co",
"resolution": 512,
"resume_ckpt": "sd_v1-5_vae",
"sample_prompts": "sample_prompts.txt",
"sample_steps": 300,
"save_ckpt_dir": null,
"save_every_n_epochs": 100,
"save_optimizer": false,
"scale_lr": false,
"seed": 555,
"shuffle_tags": false,
"useadam8bit": true,
"wandb": false,
"write_schedule": false,
"rated_dataset": false,
"rated_dataset_target_dropout_rate": 50,
"save_full_precision": false,
"disable_unet_training": false,
"mixed_precision": "fp32",
"rated_dataset_target_dropout_percent": 50,
"config": "train.json"
}
`

Manually Patch Token Limit

Manually Patch Token Limits

Hello 👋 I'm looking for a way to manually patch ED2 to support tokens > 75. Is this possible, and is it stable?

Thank you :)

windows_setup.cmd fails with fatal error

Installed python 3.10, ran windows_setup.cmd, downloaded requirements, ended with this:

` *** Applying bitsandbytes patch for windows ***
*** Already patched!

*** bitsandbytes windows patch applied, attempting import ***
Somethnig went wrong trying to patch bitsandbytes, aborting
make sure your venv is activated and try again
or if activated try:
pip install bitsandbytes==0.35.0
Traceback (most recent call last):
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 122, in main
import bitsandbytes
File "C:\repo\EveryDream2trainer\venv\lib\site-packages\bitsandbytes_init_.py", line 6, in
from .autograd.functions import (
File "C:\repo\EveryDream2trainer\venv\lib\site-packages\bitsandbytes\autograd_functions.py", line 4, in
import torch
File "C:\repo\EveryDream2trainer\venv\lib\site-packages\torch_init.py", line 219, in
raise ImportError(textwrap.dedent('''
ImportError: Failed to load PyTorch C extensions:
It appears that PyTorch has loaded the torch/_C folder
of the PyTorch repository rather than the C extensions which
are expected in the torch._C namespace. This can occur when
using the install workflow. e.g.
$ python setup.py install && python -c "import torch"`

This error can generally be solved using the `develop` workflow
    $ python setup.py develop && python -c "import torch"  # This should succeed
or by running Python from a different directory.

`During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 133, in
main()
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 125, in main
error()
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 94, in error
raise RuntimeError("** FATAL ERROR: unable to patch bitsandbytes for Windows env")
RuntimeError: ** FATAL ERROR: unable to patch bitsandbytes for Windows env
downloaded: v2-inference-v.yaml
downloaded: v2-inference.yaml
downloaded: v1-inference.yaml
SD1.x and SD2.x yamls downloaded

(venv) C:\repo\EveryDream2trainer>python --version
Python 3.10.10`

[ERROR] --shuffle_tags is broken (error occurs on runpod)

Attach log and cfg

Runtime environment:

OS: Some Linux distro
Is this your local computer or a cloud host? Runpod
GPU: RTX 3090

Additional context

--shuffle_tags consistently fails to work on runpod (i.e. training doesn't start).

The exact output of the error is:

  No model to save, something likely blew up on startup, not saving
Traceback (most recent call last):
  File "/workspace/EveryDream2trainer/train.py", line 999, in <module>
    main(args)
  File "/workspace/EveryDream2trainer/train.py", line 924, in main
    raise ex
  File "/workspace/EveryDream2trainer/train.py", line 809, in main
    for step, batch in enumerate(train_dataloader):
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/workspace/venv/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/EveryDream2trainer/data/every_dream.py", line 101, in __getitem__
    example["caption"] = train_item["caption"].get_shuffled_caption(self.seed)
  File "/workspace/EveryDream2trainer/data/image_train_item.py", line 70, in get_shuffled_caption
    max_target_tag_length = self.__max_target_length - len(self.__main_prompt)
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

Epochs:   0%|          | 0/150 [00:00<?, ?it/s, vram=5133/24576 MB gs:0]

Is it possibile to train .safetensors instead of .ckpt?

most models from civitai are .safetensors, and I'd need to train one. Is it possibile wth everyDream? How?

[ERROR] stuck at converting sd 15 ckpt

I followed your guide and wanted to start the conversion of the 1_5 model but that is not working, I converted the sd_2-1 768 no problem whatsoever, but the 15 is stuck.

Left it running over night has not done anything.

Conversion command as follows:

python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
--original_config_file v1-inference.yaml ^
--image_size 512 ^
--checkpoint_path sd_v1-5_vae.ckpt ^
--prediction_type epsilon ^
--upcast_attn False ^
--dump_path "ckpt_cache/sd_v1-5_vae"

Please add running docker locally instructions to the README.md

Something like
$ docker run -it -p 8888:8888 -p 6006:6006 --gpus all -e JUPYTER_PASSWORD=test1234 -t ghcr.io/victorchall/everydream2trainer:nightly

That should let someone run the docker container locally since it requires GPU access for CUDA and the env variable JUPYTER_PASSWORD set to run Jupyter and Tensor Board exposing both ports.

Error in model conversion on colab for SD 2.1

I think there is a small error in the colab-notebook. If you try to convert the 2.1 model, by default it's still using the 'epsilon' prediction type, which will result in pretty deep-fried results (see below).

😄 Thank you for the great repository! I'm having a lot of fun with it! 😄

!python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim  \
--original_config_file {inference_yaml} \
--image_size {img_size} \
--checkpoint_path {base_path} \
--prediction_type epsilon \ <-- This should change.
--upcast_attn False \
--dump_path {save_name}

Here's a sample of the results... 💀

Training crash on large dataset

I am attempting to train a model using 100k plus images.

I appear to be trying to use more VRAM then I have after about 4 hours, which does not make a lot of sense to me.

Stack Trace attached.
Any insight would be much appreciated!

error.txt

RuntimeError: CUDA error: invalid argument with 1.5 and amp

I get this error if I use SD 1.5 with amp
RuntimeError: CUDA error: invalid argument

Is amp compatible with 1.5 ?
Training setup
{
"amp": true,
"batch_size": 10,
"ckpt_every_n_minutes": null,
"clip_grad_norm": null,
"clip_skip": 0,
"cond_dropout": 0.04,
"data_root": "captioned",
"disable_textenc_training": false,
"disable_xformers": false,
"flip_p": 0.0,
"gpuid": 0,
"gradient_checkpointing": true,
"grad_accum": 1,
"logdir": "logs",
"log_step": 25,
"lowvram": false,
"lr": 1.5e-06,
"lr_decay_steps": 0,
"lr_scheduler": "constant",
"lr_warmup_steps": null,
"max_epochs": 60,
"notebook": false,
"project_name": "prjctname",
"resolution": 576,
"resume_ckpt": "sd_v1-5_vae",
"sample_prompts": "sample_prompts.txt",
"sample_steps": 300,
"save_ckpt_dir": null,
"save_ckpts_from_n_epochs": 0,
"save_every_n_epochs": 10,
"save_optimizer": false,
"scale_lr": false,
"seed": 555,
"shuffle_tags": false,
"useadam8bit": true,
"validation_config": "validation_default.json",
"wandb": true,
"write_schedule": false,
"rated_dataset": false,
"rated_dataset_target_dropout_percent": 50,
"zero_frequency_noise_ratio": 0.0
}

full traceback

Traceback (most recent call last):
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/train.py", line 1044, in
main(args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/train.py", line 971, in main
raise ex
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/train.py", line 860, in main
scaler.scale(loss).backward()
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 146, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/function.py", line 399, in wrapper
outputs = fn(ctx, *args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/xformers/ops/fmha/init.py", line 107, in backward
grads = _memory_efficient_attention_backward(
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/xformers/ops/fmha/init.py", line 380, in _memory_efficient_attention_backward
grads = op.apply(ctx, inp, grad)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 283, in apply
(grad_q, grad_k, grad_v, grad_bias) = cls.OPERATOR(
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/_ops.py", line 143, in call
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Convert Error

This trainer experiences errors when training most animate-style models, for example, derived from NAI Anime models.
EveryDream Version 1 is the same.
It cannot be used normally without modification of the code.

RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
Missing key(s) in state_dict: "text_model.embeddings.position_ids", "text_model.embeddings.token_embedding.weight", "text_model.embeddings.position_embedding.weight", "text_model.encoder.layers.0.self_attn.k_proj.weight", "text_model.encoder.layers.0.self_attn.k_proj.bias", "text_model.encoder.layers.0.self_attn.v_proj.weight", "text_model.encoder.layers.0.self_attn.v_proj.bias", "text_model.encoder.layers.0.self_attn.q_proj.weight", "text_model.encoder.layers.0.self_attn.q_proj.bias", "text_model.encoder.layers.0.self_attn.out_proj.weight", "text_model.encoder.layers.0.self_attn.out_proj.bias", "text_model.encoder.layers.0.layer_norm1.weight", "text_model.encoder.layers.0.layer_norm1.bias", "text_model.encoder.layers.0.mlp.fc1.weight", "text_model.encoder.layers.0.mlp.fc1.bias", "text_model.encoder.layers.0.mlp.fc2.weight", "text_model.encoder.layers.0.mlp.fc2.bias", "text_model.encoder.layers.0.layer_norm2.weight", "text_model.encoder.layers.0.layer_norm2.bias", "text_model.encoder.layers.1.self_attn.k_proj.weight", "text_model.encoder.layers.1.self_attn.k_proj.bias", "text_model.encoder.layers.1.self_attn.v_proj.weight", "text_model.encoder.layers.1.self_attn.v_proj.bias", "text_model.encoder.layers.1.self_attn.q_proj.weight", "text_model.encoder.layers.1.self_attn.q_proj.bias", "text_model.encoder.layers.1.self_attn.out_proj.weight", "text_model.encoder.layers.1.self_attn.out_proj.bias", "text_model.encoder.layers.1.layer_norm1.weight", "text_model.encoder.layers.1.layer_norm1.bias", "text_model.encoder.layers.1.mlp.fc1.weight", "text_model.encoder.layers.1.mlp.fc1.bias", "text_model.encoder.layers.1.mlp.fc2.weight", "text_model.encoder.layers.1.mlp.fc2.bias", "text_model.encoder.layers.1.layer_norm2.weight", "text_model.encoder.layers.1.layer_norm2.bias", "text_model.encoder.layers.2.self_attn.k_proj.weight", "text_model.encoder.layers.2.self_attn.k_proj.bias", "text_model.encoder.layers.2.self_attn.v_proj.weight", "text_model.encoder.layers.2.self_attn.v_proj.bias", "text_model.encoder.layers.2.self_attn.q_proj.weight", "text_model.encoder.layers.2.self_attn.q_proj.bias", "text_model.encoder.layers.2.self_attn.out_proj.weight", "text_model.encoder.layers.2.self_attn.out_proj.bias", "text_model.encoder.layers.2.layer_norm1.weight", "text_model.encoder.layers.2.layer_norm1.bias", "text_model.encoder.layers.2.mlp.fc1.weight", "text_model.encoder.layers.2.mlp.fc1.bias", "text_model.encoder.layers.2.mlp.fc2.weight", "text_model.encoder.layers.2.mlp.fc2.bias", "text_model.encoder.layers.2.layer_norm2.weight", "text_model.encoder.layers.2.layer_norm2.bias", "text_model.encoder.layers.3.self_attn.k_proj.weight", "text_model.encoder.layers.3.self_attn.k_proj.bias", "text_model.encoder.layers.3.self_attn.v_proj.weight", "text_model.encoder.layers.3.self_attn.v_proj.bias", "text_model.encoder.layers.3.self_attn.q_proj.weight", "text_model.encoder.layers.3.self_attn.q_proj.bias", "text_model.encoder.layers.3.self_attn.out_proj.weight", "text_model.encoder.layers.3.self_attn.out_proj.bias", "text_model.encoder.layers.3.layer_norm1.weight", "text_model.encoder.layers.3.layer_norm1.bias", "text_model.encoder.layers.3.mlp.fc1.weight", "text_model.encoder.layers.3.mlp.fc1.bias", "text_model.encoder.layers.3.mlp.fc2.weight", "text_model.encoder.layers.3.mlp.fc2.bias", "text_model.encoder.layers.3.layer_norm2.weight", "text_model.encoder.layers.3.layer_norm2.bias", "text_model.encoder.layers.4.self_attn.k_proj.weight", "text_model.encoder.layers.4.self_attn.k_proj.bias", "text_model.encoder.layers.4.self_attn.v_proj.weight", "text_model.encoder.layers.4.self_attn.v_proj.bias", "text_model.encoder.layers.4.self_attn.q_proj.weight", "text_model.encoder.layers.4.self_attn.q_proj.bias", "text_model.encoder.layers.4.self_attn.out_proj.weight", "text_model.encoder.layers.4.self_attn.out_proj.bias", "text_model.encoder.layers.4.layer_norm1.weight", "text_model.encoder.layers.4.layer_norm1.bias", "text_model.encoder.layers.4.mlp.fc1.weight", "text_model.encoder.layers.4.mlp.fc1.bias", "text_model.encoder.layers.4.mlp.fc2.weight", "text_model.encoder.layers.4.mlp.fc2.bias", "text_model.encoder.layers.4.layer_norm2.weight", "text_model.encoder.layers.4.layer_norm2.bias", "text_model.encoder.layers.5.self_attn.k_proj.weight", "text_model.encoder.layers.5.self_attn.k_proj.bias", "text_model.encoder.layers.5.self_attn.v_proj.weight", "text_model.encoder.layers.5.self_attn.v_proj.bias", "text_model.encoder.layers.5.self_attn.q_proj.weight", "text_model.encoder.layers.5.self_attn.q_proj.bias", "text_model.encoder.layers.5.self_attn.out_proj.weight", "text_model.encoder.layers.5.self_attn.out_proj.bias", "text_model.encoder.layers.5.layer_norm1.weight", "text_model.encoder.layers.5.layer_norm1.bias", "text_model.encoder.layers.5.mlp.fc1.weight", "text_model.encoder.layers.5.mlp.fc1.bias", "text_model.encoder.layers.5.mlp.fc2.weight", "text_model.encoder.layers.5.mlp.fc2.bias", "text_model.encoder.layers.5.layer_norm2.weight", "text_model.encoder.layers.5.layer_norm2.bias", "text_model.encoder.layers.6.self_attn.k_proj.weight", "text_model.encoder.layers.6.self_attn.k_proj.bias", "text_model.encoder.layers.6.self_attn.v_proj.weight", "text_model.encoder.layers.6.self_attn.v_proj.bias", "text_model.encoder.layers.6.self_attn.q_proj.weight", "text_model.encoder.layers.6.self_attn.q_proj.bias", "text_model.encoder.layers.6.self_attn.out_proj.weight", "text_model.encoder.layers.6.self_attn.out_proj.bias", "text_model.encoder.layers.6.layer_norm1.weight", "text_model.encoder.layers.6.layer_norm1.bias", "text_model.encoder.layers.6.mlp.fc1.weight", "text_model.encoder.layers.6.mlp.fc1.bias", "text_model.encoder.layers.6.mlp.fc2.weight", "text_model.encoder.layers.6.mlp.fc2.bias", "text_model.encoder.layers.6.layer_norm2.weight", "text_model.encoder.layers.6.layer_norm2.bias", "text_model.encoder.layers.7.self_attn.k_proj.weight", "text_model.encoder.layers.7.self_attn.k_proj.bias", "text_model.encoder.layers.7.self_attn.v_proj.weight", "text_model.encoder.layers.7.self_attn.v_proj.bias", "text_model.encoder.layers.7.self_attn.q_proj.weight", "text_model.encoder.layers.7.self_attn.q_proj.bias", "text_model.encoder.layers.7.self_attn.out_proj.weight", "text_model.encoder.layers.7.self_attn.out_proj.bias", "text_model.encoder.layers.7.layer_norm1.weight", "text_model.encoder.layers.7.layer_norm1.bias", "text_model.encoder.layers.7.mlp.fc1.weight", "text_model.encoder.layers.7.mlp.fc1.bias", "text_model.encoder.layers.7.mlp.fc2.weight", "text_model.encoder.layers.7.mlp.fc2.bias", "text_model.encoder.layers.7.layer_norm2.weight", "text_model.encoder.layers.7.layer_norm2.bias", "text_model.encoder.layers.8.self_attn.k_proj.weight", "text_model.encoder.layers.8.self_attn.k_proj.bias", "text_model.encoder.layers.8.self_attn.v_proj.weight", "text_model.encoder.layers.8.self_attn.v_proj.bias", "text_model.encoder.layers.8.self_attn.q_proj.weight", "text_model.encoder.layers.8.self_attn.q_proj.bias", "text_model.encoder.layers.8.self_attn.out_proj.weight", "text_model.encoder.layers.8.self_attn.out_proj.bias", "text_model.encoder.layers.8.layer_norm1.weight", "text_model.encoder.layers.8.layer_norm1.bias", "text_model.encoder.layers.8.mlp.fc1.weight", "text_model.encoder.layers.8.mlp.fc1.bias", "text_model.encoder.layers.8.mlp.fc2.weight", "text_model.encoder.layers.8.mlp.fc2.bias", "text_model.encoder.layers.8.layer_norm2.weight", "text_model.encoder.layers.8.layer_norm2.bias", "text_model.encoder.layers.9.self_attn.k_proj.weight", "text_model.encoder.layers.9.self_attn.k_proj.bias", "text_model.encoder.layers.9.self_attn.v_proj.weight", "text_model.encoder.layers.9.self_attn.v_proj.bias", "text_model.encoder.layers.9.self_attn.q_proj.weight", "text_model.encoder.layers.9.self_attn.q_proj.bias", "text_model.encoder.layers.9.self_attn.out_proj.weight", "text_model.encoder.layers.9.self_attn.out_proj.bias", "text_model.encoder.layers.9.layer_norm1.weight", "text_model.encoder.layers.9.layer_norm1.bias", "text_model.encoder.layers.9.mlp.fc1.weight", "text_model.encoder.layers.9.mlp.fc1.bias", "text_model.encoder.layers.9.mlp.fc2.weight", "text_model.encoder.layers.9.mlp.fc2.bias", "text_model.encoder.layers.9.layer_norm2.weight", "text_model.encoder.layers.9.layer_norm2.bias", "text_model.encoder.layers.10.self_attn.k_proj.weight", "text_model.encoder.layers.10.self_attn.k_proj.bias", "text_model.encoder.layers.10.self_attn.v_proj.weight", "text_model.encoder.layers.10.self_attn.v_proj.bias", "text_model.encoder.layers.10.self_attn.q_proj.weight", "text_model.encoder.layers.10.self_attn.q_proj.bias", "text_model.encoder.layers.10.self_attn.out_proj.weight", "text_model.encoder.layers.10.self_attn.out_proj.bias", "text_model.encoder.layers.10.layer_norm1.weight", "text_model.encoder.layers.10.layer_norm1.bias", "text_model.encoder.layers.10.mlp.fc1.weight", "text_model.encoder.layers.10.mlp.fc1.bias", "text_model.encoder.layers.10.mlp.fc2.weight", "text_model.encoder.layers.10.mlp.fc2.bias", "text_model.encoder.layers.10.layer_norm2.weight", "text_model.encoder.layers.10.layer_norm2.bias", "text_model.encoder.layers.11.self_attn.k_proj.weight", "text_model.encoder.layers.11.self_attn.k_proj.bias", "text_model.encoder.layers.11.self_attn.v_proj.weight", "text_model.encoder.layers.11.self_attn.v_proj.bias", "text_model.encoder.layers.11.self_attn.q_proj.weight", "text_model.encoder.layers.11.self_attn.q_proj.bias", "text_model.encoder.layers.11.self_attn.out_proj.weight", "text_model.encoder.layers.11.self_attn.out_proj.bias", "text_model.encoder.layers.11.layer_norm1.weight", "text_model.encoder.layers.11.layer_norm1.bias", "text_model.encoder.layers.11.mlp.fc1.weight", "text_model.encoder.layers.11.mlp.fc1.bias", "text_model.encoder.layers.11.mlp.fc2.weight", "text_model.encoder.layers.11.mlp.fc2.bias", "text_model.encoder.layers.11.layer_norm2.weight", "text_model.encoder.layers.11.layer_norm2.bias", "text_model.final_layer_norm.weight", "text_model.final_layer_norm.bias".
Unexpected key(s) in state_dict: "embeddings.position_ids", "embeddings.token_embedding.weight", "embeddings.position_embedding.weight", "encoder.layers.0.self_attn.k_proj.weight", "encoder.layers.0.self_attn.k_proj.bias", "encoder.layers.0.self_attn.v_proj.weight", "encoder.layers.0.self_attn.v_proj.bias", "encoder.layers.0.self_attn.q_proj.weight", "encoder.layers.0.self_attn.q_proj.bias", "encoder.layers.0.self_attn.out_proj.weight", "encoder.layers.0.self_attn.out_proj.bias", "encoder.layers.0.layer_norm1.weight", "encoder.layers.0.layer_norm1.bias", "encoder.layers.0.mlp.fc1.weight", "encoder.layers.0.mlp.fc1.bias", "encoder.layers.0.mlp.fc2.weight", "encoder.layers.0.mlp.fc2.bias", "encoder.layers.0.layer_norm2.weight", "encoder.layers.0.layer_norm2.bias", "encoder.layers.1.self_attn.k_proj.weight", "encoder.layers.1.self_attn.k_proj.bias", "encoder.layers.1.self_attn.v_proj.weight", "encoder.layers.1.self_attn.v_proj.bias", "encoder.layers.1.self_attn.q_proj.weight", "encoder.layers.1.self_attn.q_proj.bias", "encoder.layers.1.self_attn.out_proj.weight", "encoder.layers.1.self_attn.out_proj.bias", "encoder.layers.1.layer_norm1.weight", "encoder.layers.1.layer_norm1.bias", "encoder.layers.1.mlp.fc1.weight", "encoder.layers.1.mlp.fc1.bias", "encoder.layers.1.mlp.fc2.weight", "encoder.layers.1.mlp.fc2.bias", "encoder.layers.1.layer_norm2.weight", "encoder.layers.1.layer_norm2.bias", "encoder.layers.2.self_attn.k_proj.weight", "encoder.layers.2.self_attn.k_proj.bias", "encoder.layers.2.self_attn.v_proj.weight", "encoder.layers.2.self_attn.v_proj.bias", "encoder.layers.2.self_attn.q_proj.weight", "encoder.layers.2.self_attn.q_proj.bias", "encoder.layers.2.self_attn.out_proj.weight", "encoder.layers.2.self_attn.out_proj.bias", "encoder.layers.2.layer_norm1.weight", "encoder.layers.2.layer_norm1.bias", "encoder.layers.2.mlp.fc1.weight", "encoder.layers.2.mlp.fc1.bias", "encoder.layers.2.mlp.fc2.weight", "encoder.layers.2.mlp.fc2.bias", "encoder.layers.2.layer_norm2.weight", "encoder.layers.2.layer_norm2.bias", "encoder.layers.3.self_attn.k_proj.weight", "encoder.layers.3.self_attn.k_proj.bias", "encoder.layers.3.self_attn.v_proj.weight", "encoder.layers.3.self_attn.v_proj.bias", "encoder.layers.3.self_attn.q_proj.weight", "encoder.layers.3.self_attn.q_proj.bias", "encoder.layers.3.self_attn.out_proj.weight", "encoder.layers.3.self_attn.out_proj.bias", "encoder.layers.3.layer_norm1.weight", "encoder.layers.3.layer_norm1.bias", "encoder.layers.3.mlp.fc1.weight", "encoder.layers.3.mlp.fc1.bias", "encoder.layers.3.mlp.fc2.weight", "encoder.layers.3.mlp.fc2.bias", "encoder.layers.3.layer_norm2.weight", "encoder.layers.3.layer_norm2.bias", "encoder.layers.4.self_attn.k_proj.weight", "encoder.layers.4.self_attn.k_proj.bias", "encoder.layers.4.self_attn.v_proj.weight", "encoder.layers.4.self_attn.v_proj.bias", "encoder.layers.4.self_attn.q_proj.weight", "encoder.layers.4.self_attn.q_proj.bias", "encoder.layers.4.self_attn.out_proj.weight", "encoder.layers.4.self_attn.out_proj.bias", "encoder.layers.4.layer_norm1.weight", "encoder.layers.4.layer_norm1.bias", "encoder.layers.4.mlp.fc1.weight", "encoder.layers.4.mlp.fc1.bias", "encoder.layers.4.mlp.fc2.weight", "encoder.layers.4.mlp.fc2.bias", "encoder.layers.4.layer_norm2.weight", "encoder.layers.4.layer_norm2.bias", "encoder.layers.5.self_attn.k_proj.weight", "encoder.layers.5.self_attn.k_proj.bias", "encoder.layers.5.self_attn.v_proj.weight", "encoder.layers.5.self_attn.v_proj.bias", "encoder.layers.5.self_attn.q_proj.weight", "encoder.layers.5.self_attn.q_proj.bias", "encoder.layers.5.self_attn.out_proj.weight", "encoder.layers.5.self_attn.out_proj.bias", "encoder.layers.5.layer_norm1.weight", "encoder.layers.5.layer_norm1.bias", "encoder.layers.5.mlp.fc1.weight", "encoder.layers.5.mlp.fc1.bias", "encoder.layers.5.mlp.fc2.weight", "encoder.layers.5.mlp.fc2.bias", "encoder.layers.5.layer_norm2.weight", "encoder.layers.5.layer_norm2.bias", "encoder.layers.6.self_attn.k_proj.weight", "encoder.layers.6.self_attn.k_proj.bias", "encoder.layers.6.self_attn.v_proj.weight", "encoder.layers.6.self_attn.v_proj.bias", "encoder.layers.6.self_attn.q_proj.weight", "encoder.layers.6.self_attn.q_proj.bias", "encoder.layers.6.self_attn.out_proj.weight", "encoder.layers.6.self_attn.out_proj.bias", "encoder.layers.6.layer_norm1.weight", "encoder.layers.6.layer_norm1.bias", "encoder.layers.6.mlp.fc1.weight", "encoder.layers.6.mlp.fc1.bias", "encoder.layers.6.mlp.fc2.weight", "encoder.layers.6.mlp.fc2.bias", "encoder.layers.6.layer_norm2.weight", "encoder.layers.6.layer_norm2.bias", "encoder.layers.7.self_attn.k_proj.weight", "encoder.layers.7.self_attn.k_proj.bias", "encoder.layers.7.self_attn.v_proj.weight", "encoder.layers.7.self_attn.v_proj.bias", "encoder.layers.7.self_attn.q_proj.weight", "encoder.layers.7.self_attn.q_proj.bias", "encoder.layers.7.self_attn.out_proj.weight", "encoder.layers.7.self_attn.out_proj.bias", "encoder.layers.7.layer_norm1.weight", "encoder.layers.7.layer_norm1.bias", "encoder.layers.7.mlp.fc1.weight", "encoder.layers.7.mlp.fc1.bias", "encoder.layers.7.mlp.fc2.weight", "encoder.layers.7.mlp.fc2.bias", "encoder.layers.7.layer_norm2.weight", "encoder.layers.7.layer_norm2.bias", "encoder.layers.8.self_attn.k_proj.weight", "encoder.layers.8.self_attn.k_proj.bias", "encoder.layers.8.self_attn.v_proj.weight", "encoder.layers.8.self_attn.v_proj.bias", "encoder.layers.8.self_attn.q_proj.weight", "encoder.layers.8.self_attn.q_proj.bias", "encoder.layers.8.self_attn.out_proj.weight", "encoder.layers.8.self_attn.out_proj.bias", "encoder.layers.8.layer_norm1.weight", "encoder.layers.8.layer_norm1.bias", "encoder.layers.8.mlp.fc1.weight", "encoder.layers.8.mlp.fc1.bias", "encoder.layers.8.mlp.fc2.weight", "encoder.layers.8.mlp.fc2.bias", "encoder.layers.8.layer_norm2.weight", "encoder.layers.8.layer_norm2.bias", "encoder.layers.9.self_attn.k_proj.weight", "encoder.layers.9.self_attn.k_proj.bias", "encoder.layers.9.self_attn.v_proj.weight", "encoder.layers.9.self_attn.v_proj.bias", "encoder.layers.9.self_attn.q_proj.weight", "encoder.layers.9.self_attn.q_proj.bias", "encoder.layers.9.self_attn.out_proj.weight", "encoder.layers.9.self_attn.out_proj.bias", "encoder.layers.9.layer_norm1.weight", "encoder.layers.9.layer_norm1.bias", "encoder.layers.9.mlp.fc1.weight", "encoder.layers.9.mlp.fc1.bias", "encoder.layers.9.mlp.fc2.weight", "encoder.layers.9.mlp.fc2.bias", "encoder.layers.9.layer_norm2.weight", "encoder.layers.9.layer_norm2.bias", "encoder.layers.10.self_attn.k_proj.weight", "encoder.layers.10.self_attn.k_proj.bias", "encoder.layers.10.self_attn.v_proj.weight", "encoder.layers.10.self_attn.v_proj.bias", "encoder.layers.10.self_attn.q_proj.weight", "encoder.layers.10.self_attn.q_proj.bias", "encoder.layers.10.self_attn.out_proj.weight", "encoder.layers.10.self_attn.out_proj.bias", "encoder.layers.10.layer_norm1.weight", "encoder.layers.10.layer_norm1.bias", "encoder.layers.10.mlp.fc1.weight", "encoder.layers.10.mlp.fc1.bias", "encoder.layers.10.mlp.fc2.weight", "encoder.layers.10.mlp.fc2.bias", "encoder.layers.10.layer_norm2.weight", "encoder.layers.10.layer_norm2.bias", "encoder.layers.11.self_attn.k_proj.weight", "encoder.layers.11.self_attn.k_proj.bias", "encoder.layers.11.self_attn.v_proj.weight", "encoder.layers.11.self_attn.v_proj.bias", "encoder.layers.11.self_attn.q_proj.weight", "encoder.layers.11.self_attn.q_proj.bias", "encoder.layers.11.self_attn.out_proj.weight", "encoder.layers.11.self_attn.out_proj.bias", "encoder.layers.11.layer_norm1.weight", "encoder.layers.11.layer_norm1.bias", "encoder.layers.11.mlp.fc1.weight", "encoder.layers.11.mlp.fc1.bias", "encoder.layers.11.mlp.fc2.weight", "encoder.layers.11.mlp.fc2.bias", "encoder.layers.11.layer_norm2.weight", "encoder.layers.11.layer_norm2.bias", "final_layer_norm.weight", "final_layer_norm.bias".

What is difference between EveryDream and DreamBooth?

I may be wrong, but I look through docs but there's no details about the theory of EveryDream.
DreamBooth uses one prompt to describe the whole training dataset.
EveryDream uses a fully-labelled dataset creating by BLIP.
Is the above thought correct?

Memory allocation

My graphic card has only 6GB of memory and when i try to run a training session, i receive the following answer:

omething went wrong, attempting to save model | 0/12 [00:00<?, ?it/s]
No model to save, something likely blew up on startup, not saving
Traceback (most recent call last):
File "D:\download\AI\EveryDream2trainer\train.py", line 1050, in
main(args)
File "D:\download\AI\EveryDream2trainer\train.py", line 977, in main
raise ex
File "D:\download\AI\EveryDream2trainer\train.py", line 854, in main
model_pred, target = get_model_prediction_and_target(batch["image"], batch["tokens"], args.zero_frequency_noise_ratio)
File "D:\download\AI\EveryDream2trainer\train.py", line 779, in get_model_prediction_and_target
latents = vae.encode(pixel_values, return_dict=False)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\vae.py", line 566, in encode
h = self.encoder(x)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\vae.py", line 134, in forward
sample = down_block(sample)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 755, in forward
hidden_states = resnet(hidden_states, temb=None)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\resnet.py", line 450, in forward
hidden_states = self.norm1(hidden_states)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\normalization.py", line 272, in forward
return F.group_norm(
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\functional.py", line 2516, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA out of memory. Tried to allocate 720.00 MiB (GPU 0; 6.00 GiB total capacity; 4.71 GiB already allocated; 0 bytes free; 4.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there any way to fix the problem ?
What solution do i have ?

GPU ID 0 is hardcoded in some places

In utils/gpu.py the GPU 0 is hardcoded:

gpu_used_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['used'])
gpu_total_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['total'])

Also in train.py:

def get_gpu_memory(nvsmi):
    """
    returns the gpu memory usage
    """
    gpu_query = nvsmi.DeviceQuery('memory.used, memory.total')
    gpu_used_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['used'])
    gpu_total_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['total'])
    return gpu_used_mem, gpu_total_mem

So if you run the trainer on gpuid : 1, the logs will show the memory for the incorrect GPU.

Even fixing these issues and setting gpuid:1 in train.json, I'm still having trouble getting it to run correctly on GPU 1. Starting the trainer seems to immediately max out memory on GPU 0 and throw CUDA memory error.

LR Defaults to 1e-6

I am not sure if I missed a ReadMe or push note, but the latest repo as of today keeps defaulting the LR to 1e-6. I went back to c8c658d and it works perfectly. Not sure if this is a glitch on my end and I need a fresh install. Thought I'd mention it, incase it ends up being real.

Edit: I found the note about setting the LR in optimizer.json etc. I was setting LR directly from the CLI and my train.json was also not 1e-6; therefore my LR was not being considered.

Colab error - Get a base model

Hi,

When I try to download an another model in th "Get a base model", I have this error :

Stable UnCLIP - training

Just finished training directly on https://huggingface.co/stabilityai/stable-diffusion-2-1-unclip
(directly clone the diffusers)

but it exports it with v2-inference-v.yaml

Automatic should support it soon too
AUTOMATIC1111/stable-diffusion-webui@8a34671

config_unclip = os.path.join(sd_repo_configs_path, "v2-1-stable-unclip-l-inference.yaml")
config_unopenclip = os.path.join(sd_repo_configs_path, "v2-1-stable-unclip-h-inference.yaml")

Just a head up if we can fix this and be ahead of others ;)

ModuleNotFoundError: No module named 'debug'

Fresh install of new version 2, followed the instructions and all was well until I tried to start some training:

Traceback (most recent call last):
File "G:\ed2\train.py", line 53, in
import debug
ModuleNotFoundError: No module named 'debug'

Initial checkpoint conversion error

I get this when installing via YT tutorial. All other steps were followed. I have the correct files in the root dir.

(venv) D:\EveryDream2\EveryDream2trainer>python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
More? --original_config_file v1-inference.yaml ^
More? --image_size 512 ^
More? --checkpoint_path sd_v1-5_vae.ckpt ^
More? --prediction_type epsilon ^
More? --upcast_attn False ^
More? --dump_path "ckpt_cache/sd_v1-5_vae"
Traceback (most recent call last):
File "D:\EveryDream2\EveryDream2trainer\utils\convert_original_stable_diffusion_to_diffusers.py", line 854, in
checkpoint = torch.load(args.checkpoint_path)
File "D:\EveryDream2\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "D:\EveryDream2\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

[ERROR] Error when the training reaches the generation of samples

Attach log and cfg
project-20230326-183311.log
project-20230326-183311_cfg.txt

Runtime environment (please complete the following information):

OS: Some Linux distro
Is this your local computer or a cloud host? Vast
GPU A5000

Additional context
In the beginning of my train, I receive the following warning:

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.0.0+cu118 with CUDA 1108 (you have 1.13.1+cu117)
Python 3.10.10 (you have 3.10.6)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

I am getting an error when the training reaches the generation of samples:

NotImplementedError Traceback (most recent call last)
File /workspace/EveryDream2trainer/train.py:1010
1008 print(f" Args:")
1009 pprint.pprint(vars(args))
-> 1010 main(args)

File /workspace/EveryDream2trainer/train.py:936, in main(args)
934 save_path = os.path.join(f"{log_folder}/ckpts/errored-{args.project_name}-ep{epoch:02}-gs{global_step:05}")
935 __save_model(save_path, unet, text_encoder, tokenizer, noise_scheduler, vae, args.save_ckpt_dir, yaml, args.save_full_precision)
--> 936 raise ex
938 logging.info(f"{Fore.LIGHTWHITE_EX} ***************************{Style.RESET_ALL}")
939 logging.info(f"{Fore.LIGHTWHITE_EX} **** Finished training ****{Style.RESET_ALL}")

File /workspace/EveryDream2trainer/train.py:882, in main(args)
879 torch.cuda.empty_cache()
881 if (global_step + 1) % sample_generator.sample_steps == 0:
--> 882 generate_samples(global_step=global_step, batch=batch)
884 min_since_last_ckpt = (time.time() - last_epoch_saved_time) / 60
886 if args.ckpt_every_n_minutes is not None and (min_since_last_ckpt > args.ckpt_every_n_minutes):

File /workspace/EveryDream2trainer/train.py:787, in main..generate_samples(global_step, batch)
784 print(f" * SampleGenerator config changed, now generating images samples every " +
785 f"{sample_generator.sample_steps} training steps (next={next_sample_step})")
786 sample_generator.update_random_captions(batch["captions"])
--> 787 inference_pipe = sample_generator.create_inference_pipe(unet=unet,
788 text_encoder=text_encoder,
789 tokenizer=tokenizer,
790 vae=vae,
791 diffusers_scheduler_config=reference_scheduler.config
792 ).to(device)
793 sample_generator.generate_samples(inference_pipe, global_step)
795 del inference_pipe

File /workspace/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.call..decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)

File /workspace/EveryDream2trainer/utils/sample_generator.py:294, in SampleGenerator.create_inference_pipe(self, unet, text_encoder, tokenizer, vae, diffusers_scheduler_config)
283 pipe = StableDiffusionPipeline(
284 vae=vae,
285 text_encoder=text_encoder,
(...)
291 feature_extractor=None, # must be None if no safety checker
292 )
293 if self.use_xformers:
--> 294 pipe.enable_xformers_memory_efficient_attention()
295 return pipe

File /workspace/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:1080, in DiffusionPipeline.enable_xformers_memory_efficient_attention(self, attention_op)
1050 def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None):
1051 r"""
1052 Enable memory efficient attention as implemented in xformers.
1053
(...)
1078 ```
1079 """
-> 1080 self.set_use_memory_efficient_attention_xformers(True, attention_op)

File /workspace/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:1105, in DiffusionPipeline.set_use_memory_efficient_attention_xformers(self, valid, attention_op)
1103 module = getattr(self, module_name)
1104 if isinstance(module, torch.nn.Module):
-> 1105 fn_recursive_set_mem_eff(module)

File /workspace/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:1096, in DiffusionPipeline.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
1094 def fn_recursive_set_mem_eff(module: torch.nn.Module):
1095 if hasattr(module, "set_use_memory_efficient_attention_xformers"):
-> 1096 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
1098 for child in module.children():
1099 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:219, in ModelMixin.set_use_memory_efficient_attention_xformers(self, valid, attention_op)
217 for module in self.children():
218 if isinstance(module, torch.nn.Module):
--> 219 fn_recursive_set_mem_eff(module)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:212, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
210 def fn_recursive_set_mem_eff(module: torch.nn.Module):
211 if hasattr(module, "set_use_memory_efficient_attention_xformers"):
--> 212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:219, in ModelMixin.set_use_memory_efficient_attention_xformers(self, valid, attention_op)
217 for module in self.children():
218 if isinstance(module, torch.nn.Module):
--> 219 fn_recursive_set_mem_eff(module)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:212, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
210 def fn_recursive_set_mem_eff(module: torch.nn.Module):
211 if hasattr(module, "set_use_memory_efficient_attention_xformers"):
--> 212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
215 fn_recursive_set_mem_eff(child)

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/cross_attention.py:146, in CrossAttention.set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers, attention_op)
140 _ = xformers.ops.memory_efficient_attention(
141 torch.randn((1, 2, 40), device="cuda"),
142 torch.randn((1, 2, 40), device="cuda"),
143 torch.randn((1, 2, 40), device="cuda"),
144 )
145 except Exception as e:
--> 146 raise e
148 if is_lora:
149 processor = LoRAXFormersCrossAttnProcessor(
150 hidden_size=self.processor.hidden_size,
151 cross_attention_dim=self.processor.cross_attention_dim,
152 rank=self.processor.rank,
153 attention_op=attention_op,
154 )

File /workspace/venv/lib/python3.10/site-packages/diffusers/models/cross_attention.py:140, in CrossAttention.set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers, attention_op)
137 else:
138 try:
139 # Make sure we can run the memory efficient attention
--> 140 _ = xformers.ops.memory_efficient_attention(
141 torch.randn((1, 2, 40), device="cuda"),
142 torch.randn((1, 2, 40), device="cuda"),
143 torch.randn((1, 2, 40), device="cuda"),
144 )
145 except Exception as e:
146 raise e

File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/init.py:196, in memory_efficient_attention(query, key, value, attn_bias, p, scale, op)
115 def memory_efficient_attention(
116 query: torch.Tensor,
117 key: torch.Tensor,
(...)
123 op: Optional[AttentionOp] = None,
124 ) -> torch.Tensor:
125 """Implements the memory-efficient attention mechanism following
126 "Self-Attention Does Not Need O(n^2) Memory" <[http://arxiv.org/abs/2112.05682>](http://arxiv.org/abs/2112.05682%3E%60).
127
(...)
194 :return: multi-head attention Tensor with shape [B, Mq, H, Kv]
195 """
--> 196 return _memory_efficient_attention(
197 Inputs(
198 query=query, key=key, value=value, p=p, attn_bias=attn_bias, scale=scale
199 ),
200 op=op,
201 )

File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/init.py:294, in _memory_efficient_attention(inp, op)
289 def _memory_efficient_attention(
290 inp: Inputs, op: Optional[AttentionOp] = None
291 ) -> torch.Tensor:
292 # fast-path that doesn't require computing the logsumexp for backward computation
293 if all(x.requires_grad is False for x in [inp.query, inp.key, inp.value]):
--> 294 return _memory_efficient_attention_forward(
295 inp, op=op[0] if op is not None else None
296 )
298 output_shape = inp.normalize_bmhk()
299 return _fMHA.apply(
300 op, inp.query, inp.key, inp.value, inp.attn_bias, inp.p, inp.scale
301 ).reshape(output_shape)

File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/init.py:310, in _memory_efficient_attention_forward(inp, op)
308 output_shape = inp.normalize_bmhk()
309 if op is None:
--> 310 op = _dispatch_fw(inp)
311 else:
312 _ensure_op_supports_or_raise(ValueError, "memory_efficient_attention", op, inp)

File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:98, in _dispatch_fw(inp)
96 priority_list_ops.remove(triton.FwOp)
97 priority_list_ops.insert(0, triton.FwOp)
---> 98 return _run_priority_list(
99 "memory_efficient_attention_forward", priority_list_ops, inp
100 )

File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:73, in _run_priority_list(name, priority_list, inp)
71 for op, not_supported in zip(priority_list, not_supported_reasons):
72 msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 73 raise NotImplementedError(msg)

NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(1, 2, 1, 40) (torch.float32)
key : shape=(1, 2, 1, 40) (torch.float32)
value : shape=(1, 2, 1, 40) (torch.float32)
attn_bias : <class 'NoneType'>
p : 0.0
flshattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
tritonflashattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
requires A100 GPU
cutlassF is not supported because:
xFormers wasn't build with CUDA support
smallkF is not supported because:
xFormers wasn't build with CUDA support
max(query.shape[-1] != value.shape[-1]) > 32
unsupported embed per head: 40

Holding `CTRL+ALT+PAGEUP` does not kick off sampling

As the title says, holding CTRL+ALT+PAGEUP does not kick off sampling, even after pressing this shortcut for over 30 seconds.

Just to make it evidently clear I'm pressing (and holding) these buttons:
https://i.imgur.com/3ka60uF.png

I am running EveryDream2trainer on Windows 10 in the command prompt.

I cant get it to run on 3060 12GB :(

Anything I should be doing differently to get it to run? Able to run SD_extension and shivam repo.

python train.py --resume_ckpt "sd_v1-5_vae" --data_root "input" --max_epochs 10 --lr_scheduler constant --project_name testo --batch_size 1 --sample_steps 20 --lr 1e-6 --resolution 512 --clip_grad_norm 1 --ckpt_every_n_minutes 30 --useadam8bit --amp --mixed_precision fp16

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total capacity; 11.26 GiB already allocated; 0 bytes free; 11.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epochs: 0%| | 0/10 [00:56<?, ?it/s, vram=4867/12288 MB gs:0]
Steps: 1%|█▎

Really looking forward to using your repo, it looks awesome. Thanks for all your hard work!

Cloud Guide for Google Vertex AI Workbench

Would love to give this a try in the Google Cloud Platform ecosystem (with logging).

Sample generated images are always identical at every training step

While I'm training the model, the sample generated images are always identical (at a pixel level) at every step.
It seems that they are generated by using the same weights instead from the updated model.

Does this work with 12GB

I am getting Cuda out of memory :(

FileNotFoundError: [Errno 2] No such file or directory: 'v2-1_768-nonema-pruned.ckpt'

After going through the install, I had some issues setting up. The directories that you have in the image aren't there yet, so I decided to input the commands as you have them typed out and it looks like there's an issue with that directory not yet existing. I'm most interested in training SD2.1

(venv) C:\Users\micha\Documents\EveryDream2trainer>python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
More? --original_config_file v2-inference-v.yaml ^
More? --image_size 768 ^
More? --checkpoint_path v2-1_768-nonema-pruned.ckpt ^
More? --prediction_type v_prediction ^
More? --upcast_attn False ^
More? --dump_path "ckpt_cache/v2-1_768-nonema-pruned"
Traceback (most recent call last):
File "C:\Users\micha\Documents\EveryDream2trainer\utils\convert_original_stable_diffusion_to_diffusers.py", line 855, in
checkpoint = torch.load(args.checkpoint_path)
File "C:\Users\micha\Documents\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "C:\Users\micha\Documents\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "C:\Users\micha\Documents\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 211, in init
super(_open_file, self).init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'v2-1_768-nonema-pruned.ckpt'

Resulting model can't be loaded with an inpainting pipeline due to shape mismatch

when loading a fine tuned model with diffusers

device = "cuda"
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    '/finetuning/EveryDream2trainer/logs/2301_MYPROJ_fp16_adam_nonema_20230116-155427/ckpts/2301_MYPROJ_fp16_adam_nonema-ep300-gs04201' ,
    revision="fp16",
    safety_checker=None,
    requires_safety_checker=False,
    feature_extractor=None,
    torch_dtype=torch.float16,
    use_auth_token=False
).to(device)

with autocast("cuda"):
    images = pipe(prompt="my amazing prompt", num_inference_steps=100, negative_prompt="negativity", image=init_image, mask_image=mask)["images"]

I get a size mismatch

incorrect configuration settings! The config of `pipeline.unet`: FrozenDict([...]) expects 4 but received `num_channels_latents`: 4 + `num_channels_mask`: 1 + `num_channels_masked_image`: 4 = 9. Please verify the config of `pipeline.unet` or your `mask_image` or `image` input.

So far I have been able to use inpainting with my previous models and understood the inpainting models are just "better" at inpainting (for instance I can always use inpainting inside A1111, no matter if my model is from ).

Black sample images

I'm running my first training run with ED2Trainer. I auto-captioned my instance images, used the replace tool to introduce a custom class word (a photo of a jet -> a photo of a mtVdDv jet) and started the training. I have only 20 images and therefore train for 2000 epochs.

I have added 3 prompts (one each line) into sample_prompts.txt and now started training. The trainer produces sample images, but they all turn out complelety black, only cfg scale is visible.

The current training is yet only at 195 Epochs, but the images should not be black right?

EDIT: my images are 100% SFW

training command


python3.10 train.py --resume_ckpt "v2-1_768-nonema-pruned" \
--data_root "/finetuning/instances/myproject/w-captions/" \
--max_epochs 2000 \
--lr_scheduler cosine \
--lr_decay_steps 1500 \
--lr_warmup_steps 20 \
--project_name myproject \
--batch_size 2 \
--sample_steps 100 \
--lr 1.5e-6 \
--save_every_n_epochs 500 \
--useadam8bit

How are batches/epoch calculated?

Coming from Koyha's trainer, it seems that a batch size of 4 at 131 epochs does 17 batches/epoch versus with Kohya, I was getting 13 batches/epoch.

This leads to more training than I might want so I'd be interested how it works under the hood. I see you are doing a len calculation on line 655 in train.py so if you could elaborate on the parameters and how that affects batches/epoch, I would greatly appreciate it.

Also thanks for making a better trainer, this blows Kohya's out of the water with his accelerate bug.

Runpod issue

You gotta take a look at your runpod .ipynb file. It won't run with the 2.1.10 docker you specified.
There are couple of issues with it, but your docs are incomplete (firstly your hf_downloader downloads at the wrong place, then it is missing config files, etc.)
Please fix it.