victorchall / everydream2trainer
License: Other
Hi!
Using the default validation config on Google Colab produces no output.
When an epoch finishes it validates for a few steps, but there is no loss/val output in the console.
Checkpoint saving worked the very first time I ran this, but hasn't occurred since.
Nothing complex: only 3 images in "input" and the following args:
python train.py --resume_ckpt "sd_v1-5_vae" ^
--max_epochs 500 ^
--resolution 512 ^
--data_root "input" ^
--lr_scheduler constant ^
--project_name mfrkr_sd15 ^
--batch_size 3 ^
--sample_steps 25 ^
--lr 1e-6 ^
--shuffle_tags ^
--save_every_n_epochs 25 ^
--amp ^
--useadam8bit ^
--ed1_mode
It generates the samples every 25 steps fine, but doesn't save a checkpoint at epoch 25, only at the end.
Epochs: 100%|██| 500/500 [15:15<00:00, 9.02s/it, img/s=5.14, loss/log_step=0.138, lr=1e-6, vram=20075/24564 MB gs:499] * Saving diffusers model to logs\mfrkr_sd15_20230131-154030/ckpts/last-mfrkr_sd15-ep499-gs00500
* Saving SD model to .\last-mfrgx_sd15-ep499-gs00500.ckpt
And that's the only time the console prints anything about a checkpoint save.
What am I doing wrong please? :)
I am reading the readme, but there is no info available.
Is this a new method like DreamBooth, or something else? What is it?
What is the advantage compared to DreamBooth?
Thanks for any answers.
Hey there! Developer of the dreambooth extension for Auto1111 (and soon to be a stand-alone application) here.
I was recently made aware of some of the enhancements you've added to the Dreambooth training process, and would cordially like to invite you to a discussion regarding a collaboration between you, myself, and some of the other prominent developers working on Dreambooth/SD training, with an end goal of coming together to create one new uniform training method with all the bells and whistles.
If you're interested, I've got a discord over here where I'll be inviting some of the other devs to come and chat. It'd be awesome to talk with you and discuss how we can work together. ;)
Hi there.
Will there be any doc/guide on how to install on Linux, or a Docker container for the trainer? (ideally the latter)
I tried myself, but I get errors installing the packages and I couldn't get it to run.
Thank you.
I'm running an EveryDream trainer with configs and it seems to be producing corrupted models.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 97, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 777, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 282, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 103, in load_state_dict
if f.read().startswith("version"):
File "/usr/local/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/EveryDream2trainer/scripts/txt2img.py", line 101, in <module>
main(args)
File "/content/EveryDream2trainer/scripts/txt2img.py", line 68, in main
unet = UNet2DConditionModel.from_pretrained(args.diffusers_path, subfolder="unet")
File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 489, in from_pretrained
state_dict = load_state_dict(model_file)
File "/usr/local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 115, in load_state_dict
raise OSError(
OSError: Unable to load weights from checkpoint file for '/content/EveryDream2trainer/logs/test-ep99-gs01400/ckpts/last-test-ep99-gs01400/unet/diffusion_pytorch_model.bin' at '/content/EveryDream2trainer/logs/test-ep99-gs01400/ckpts/last-test-ep99-gs01400/unet/diffusion_pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Training Command:
python train.py \
--resume_ckpt runwayml/stable-diffusion-v1-5 \
--gradient_checkpointing \
--amp \
--batch_size 6 \
--grad_accum 1 \
--cond_dropout 0.00 \
--data_root EveryDream2trainer/images \
--flip_p 0.00 \
--lr 1e-06 \
--lr_decay_steps 0 \
--lr_scheduler constant \
--lr_warmup_steps 0 \
--max_epochs 100 \
--project_name test \
--resolution 512 \
--sample_prompts sample_prompts.txt \
--sample_steps 300 \
--save_every_n_epoch 0 \
--seed 555 \
--shuffle_tags \
--useadam8bit
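The OSError above usually means the saved unet `diffusion_pytorch_model.bin` is a truncated or corrupt archive (PyTorch >=1.6 checkpoints are zip files, and "failed finding central directory" is the classic truncated-zip error). A minimal stdlib sketch for checking a checkpoint file's integrity before pointing diffusers at it (the function name is my own, not part of the trainer):

```python
import zipfile

def check_ckpt(path: str) -> bool:
    """Return True if `path` looks like an intact PyTorch zip checkpoint.

    A "failed finding central directory" error from torch.load means the
    zip is truncated, typically because the save was interrupted or the
    disk filled up mid-write.
    """
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # verifies the CRC of every member
    except zipfile.BadZipFile:
        return False
```

If this returns False for the .bin, the save itself failed (worth checking free disk space on the Colab instance) and that checkpoint cannot be recovered; an earlier epoch's save may still be intact.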
RunPod and Google Colab training has not been working for the past few days.
I am not getting any errors, but after training, when I test the model, it seems like it's not trained at all.
I didn't get my style/person/object even by using caption tokens.
With the same settings and same data, it used to work on Colab and RunPod.
I am using default settings only.
For example, earlier the prompt "doctor in rdanimestyle" used to give a doctor image in the trained style.
Earlier even plain "doctor" prompts used to give outputs in the trained style.
Now the same prompts just give normal outputs.
I tried all the naming methods.
When testing using the RunPod and Colab test sections, I am pretty sure the image results are coming from the base SD 1.5 model only, not the trained model.
I also downloaded the last-....ckpt and tested it with another colab; the results are still like the base model. So I think the issue is with training, not the testing code.
Would really appreciate your help on this.
Hello,
when I try to convert a ckpt file to diffusers format with the command:
python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
--original_config_file v1-inference.yaml ^
--image_size 512 ^
--checkpoint_path dreamshaper_331BakedVae.ckpt ^
--prediction_type epsilon ^
--upcast_attn False ^
--dump_path "ckpt_cache/dreamshaper_331BakedVae"
(dreamshaper_331BakedVae.ckpt is available on civitai and is based on SD 1.5)
I receive the following output:
global_step key not found in model
load_state_dict from directly
Checkpoint dreamshaper_331BakedVae.ckpt has both EMA and non-EMA weights.
In this conversion only the non-EMA weights are extracted. If you want to instead extract the EMA weights (usually better for inference), please make sure to add the --extract_ema flag.
Traceback (most recent call last):
File "D:\download\AI\EveryDream2trainer\utils\convert_original_stable_diffusion_to_diffusers.py", line 963, in <module>
unet.load_state_dict(converted_unet_checkpoint)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
Missing key(s) in state_dict: "up_blocks.0.upsamplers.0.conv.weight", "up_blocks.0.upsamplers.0.conv.bias", "up_blocks.1.upsamplers.0.conv.weight", "up_blocks.1.upsamplers.0.conv.bias", "up_blocks.2.upsamplers.0.conv.weight", "up_blocks.2.upsamplers.0.conv.bias".
Unexpected key(s) in state_dict: "up_blocks.0.attentions.2.conv.bias", "up_blocks.0.attentions.2.conv.weight".
Is there a way to fix this?
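The Missing/Unexpected key lists suggest this merged civitai checkpoint's UNet layout doesn't line up with the stock diffusers UNet2DConditionModel config. A small helper, assuming nothing beyond two iterables of parameter names, to see the full diff before the failing load_state_dict call:

```python
def diff_keys(expected, actual):
    """Compare two collections of state_dict parameter names and report
    which names are missing from `actual` and which are unexpected extras."""
    expected, actual = set(expected), set(actual)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    return missing, unexpected
```

For example, `diff_keys(unet.state_dict().keys(), converted_unet_checkpoint.keys())` just before the `unet.load_state_dict(...)` line in the conversion script shows the whole mismatch at once. `load_state_dict(..., strict=False)` would load while ignoring the mismatch, but the missing upsampler weights would then stay randomly initialized, so that is a diagnostic step rather than a fix.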
So I made some experiments changing the lr, and I found something strange.
The training seems to improve when the lr is as low as 1e-7.
The loss decreases steadily and the sample images are more consistent.
Is this normal? Am I doing something wrong?
setup:
gpu rtx 4090 48vcpu, 124 gb ram
batch size: 10
lr_scheduler: cosine
lr: 1.2e-7
dataset size: 5k
captions type: tags
Also, I'm about to train with an A100 on a 50k training set, and I'm not sure about the batch size and how I can tune the hyperparameters to take advantage of the entire GPU.
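One common (but not universal) heuristic for the batch-size question is the linear LR scaling rule: keep lr divided by effective batch size roughly constant when you change batch_size or grad_accum. A sketch with illustrative numbers only; this is a rule of thumb from large-batch training practice, not EveryDream guidance:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: grow the learning rate in proportion to the
    effective batch size (batch_size * grad_accum)."""
    return base_lr * (new_batch / base_batch)

# e.g. moving from batch 10 at lr 1.2e-7 to batch 40 on the A100
# would suggest sweeping around scaled_lr(1.2e-7, 10, 40)
```

At these very small fine-tuning LRs, treat the scaled value as a starting point to sweep around rather than a prescription, and note that a cosine schedule changes the effective LR over the run anyway.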
Everything seems to work fine up until the first sample generation and then it stops with this:
D:\EveryDream2trainer\venv\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Something went wrong, attempting to save model
This is my first time using EveryDream, so it is likely something silly on my end, but I am stuck. I tried to see if anyone else had reported the issue but didn't find anything to go off of. The first message about "none of the inputs" seems to happen at every new epoch and doesn't seem related to the rest, but I included it just in case.
Sometimes you already know when you start training that the model won't be any good until many epochs have passed, so there's no point in saving the model before that point. But beyond those n epochs, you don't know in advance when overfitting will occur, so you want the "--save_every_n_epochs" param to be low. But if it's low, you end up with many unneeded saved models, and storage is limited.
Adding a "--dont_save_until_x_epochs" param would fix that: don't save during any of the first x-1 epochs, then save at epoch x and every n epochs afterwards.
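The proposed gating could be sketched like this (the parameter names follow the suggestion above and are not existing flags):

```python
def should_save(epoch: int, save_every_n_epochs: int, dont_save_until: int = 0) -> bool:
    """Save every n epochs, but skip every save before epoch
    `dont_save_until` (the proposed new flag; 0 keeps current behavior)."""
    if epoch < dont_save_until:
        return False
    # epoch is 0-indexed, so epoch+1 is the number of completed epochs
    return save_every_n_epochs > 0 and (epoch + 1) % save_every_n_epochs == 0
```

With `dont_save_until=0` this reduces to the plain every-n behavior, so the new flag would be backward compatible.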
Hi, I am just preparing my dataset for use with EveryDream; however, I found some issues due to the unfortunate use of common characters like "," and "_" to delineate different sections of the training prompt.
For tag shuffling, "," is used to separate the first chunk (the caption, which is not shuffled) from all other chunks (the tags, which are shuffled). However, this means the caption cannot contain a ",", as that would declare everything after the "," to be tags and include it in shuffling.
A similar issue is the use of the underscore to separate the training prompt from everything that is ignored at the end of the prompt. This means the tags cannot contain an underscore, as anything after the first "_" would be ignored by the trainer. Unfortunately, underscores are common in original booru tags, so those cannot be used for training with EveryDream (without replacing all underscores with spaces).
A suggestion would be to use more uncommon characters or escape sequences to delineate the different sections of the training prompt (the "training section", the "tag section", and the "ignore section"). For example, you could use two periods ".." to separate the caption section from the tag section, and two underscores "__" to separate the tag section from the ignore section. Then the two issues mentioned above would not exist.
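For reference, the delimiter behavior described above amounts to something like this simplified sketch (not the trainer's actual parsing code):

```python
import random

def parse_caption(raw: str, seed=None):
    """Split a raw caption using the current delimiters as described:
    - everything after the first '_' is ignored,
    - the chunk before the first ',' is the fixed caption,
    - the remaining comma-separated chunks are shuffleable tags."""
    kept = raw.split("_", 1)[0]          # drop the "ignore section"
    chunks = [c.strip() for c in kept.split(",")]
    caption, tags = chunks[0], chunks[1:]
    random.Random(seed).shuffle(tags)    # tags shuffle; caption stays fixed
    return caption, tags
```

This also makes the booru problem concrete: a tag like `long_hair` would silently truncate the whole caption at `long`.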
With multiply.txt, images can be repeated >1 in an epoch.
Currently, shuffle_tags makes the captions shuffle every epoch.
It would be better to make shuffle_tags a bool in ImageCaption and, when the ImageCaption is read, shuffle the tags in the getter, so the tags are always shuffled and it doesn't need to happen during the epoch shuffle.
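A sketch of that suggestion (the field and method names here are illustrative, not the actual ImageCaption API):

```python
import random

class ImageCaption:
    """Caption that re-shuffles its tags on every read, so shuffling no
    longer has to happen during the epoch shuffle."""

    def __init__(self, main_prompt: str, tags: list, shuffle_tags: bool = False):
        self.main_prompt = main_prompt
        self.tags = tags
        self.shuffle_tags = shuffle_tags

    def get_caption(self) -> str:
        tags = list(self.tags)  # copy so the stored order stays untouched
        if self.shuffle_tags:
            random.shuffle(tags)
        return ", ".join([self.main_prompt] + tags) if tags else self.main_prompt
```

This interacts nicely with multiply.txt: each repeated occurrence of an image within an epoch would then see an independently shuffled tag order.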
It would be beneficial for SD2 models if the generation of the SD dumps included copying the v2-inference.yaml config file. That way users could point inference toolkits like A1111 directly at the ED2Trainer folder. Right now I'm manually copying yaml files every time I want to test a model coming out of ED2Trainer.
## A1111 needs the yaml file to be the same name as the ckpt file
cp last-2301_MYPROJ_fp16_adam_3rd_run_200-ep199-gs03000.yaml last-2301_MYPROJ_fp16_4th_512_single_instance_image_200-ep199-gs00200.yaml
## A1111 then can load the model directly from the ED2Trainer folder
python launch.py --ckpt /finetuning/EveryDream2trainer/last-2301_MYPROJ_fp16_4th_512_single_instance_image_200-ep199-gs00200.ckpt
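Until that's built in, the manual copy step above can be automated; a small sketch that drops a same-named yaml next to every ckpt in a folder (the function name and default yaml filename are my own choices):

```python
import shutil
from pathlib import Path

def copy_yaml_for_ckpts(ckpt_dir: str, yaml_src: str = "v2-inference.yaml") -> None:
    """Place a .yaml with the same basename next to every .ckpt in
    `ckpt_dir`, so A1111 picks up the SD2 config automatically."""
    for ckpt in Path(ckpt_dir).glob("*.ckpt"):
        target = ckpt.with_suffix(".yaml")
        if not target.exists():  # don't clobber an existing config
            shutil.copyfile(yaml_src, target)
```

Run it once over the ED2Trainer output folder after each training run; note that v-prediction SD2 models would need v2-inference-v.yaml as the source instead.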
During training using the Google Colab version I'm getting the following error:
Traceback (most recent call last):
File "/content/EveryDream2trainer/train.py", line 957, in <module>
main(args)
File "/content/EveryDream2trainer/train.py", line 883, in main
raise ex
File "/content/EveryDream2trainer/train.py", line 804, in main
log_writer.add_scalar(tag="hyperparamater/lr", scalar_value=curr_lr, global_step=global_step)
File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 391, in add_scalar
self._get_file_writer().add_summary(summary, global_step, walltime)
File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 113, in add_summary
self.add_event(event, global_step, walltime)
File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 98, in add_event
self.event_writer.add_event(event)
File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 117, in add_event
self._async_writer.write(event.SerializeToString())
File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 171, in write
self._check_worker_status()
File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 212, in _check_worker_status
raise exception
File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244, in run
self._run()
File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 275, in _run
self._record_writer.write(data)
File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
self._writer.write(header + header_crc + data + footer_crc)
File "/usr/local/lib/python3.10/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 773, in write
self.fs.append(self.filename, file_content, self.binary_mode)
File "/usr/local/lib/python3.10/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 167, in append
self._write(filename, file_content, "ab" if binary_mode else "a")
File "/usr/local/lib/python3.10/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 171, in _write
with io.open(filename, mode, encoding=encoding) as f:
OSError: [Errno 5] Input/output error
I thought changing the names of the images to process would solve it, but it didn't work.
HI there 👋
A lot of models seem to require the following VAE: vae-ft-mse-840000-ema-pruned.ckpt.
Is it possible/required to specify a separate VAE when training using EveryDream in such cases?
Looks like it was Auto or Notepad++ breaking some images during the replace-token process.
I redid all of the images and it is still giving this issue. I have cleared directories and reinstalled as well.
I am pre-processing in Automatic, then using BRU and N++ to change values to match folders and tags for subjects, respectively.
It DID work with this.
What automation processes for these steps have you all found to be most efficient and effective?
It DID work with four folders of subjects the first go, so I know it can. I have the model and its children that work. I just want it to hit harder.
Great work. I'm so happy to see 8bit on windows.
01/17/2023 05:37:49 AM ED1 mode: Overiding disable_xformers to True
01/17/2023 05:37:49 AM Seed: 555
01/17/2023 05:37:49 AM Logging to logs\01DSU90 - GMM&co_20230117-053749
01/17/2023 05:37:49 AM unet attention_head_dim: [8, 8, 8, 8]
01/17/2023 05:37:51 AM xformers not available or disabled
01/17/2023 05:37:51 AM * Using FP32 *
01/17/2023 05:37:53 AM * Training Text Encoder *
01/17/2023 05:37:53 AM * Using AdamW 8-bit Optimizer *
01/17/2023 05:37:53 AM * Optimizer: AdamW8bit *
01/17/2023 05:37:53 AM betas: (0.9, 0.999), epsilon: 1e-08 *
01/17/2023 05:37:53 AM * Creating new dataloader singleton
01/17/2023 05:37:53 AM * DLMA resolution 512, buckets: [[512, 512], [576, 448], [448, 576], [640, 384], [384, 640], [768, 320], [320, 768], [896, 256], [256, 896], [1024, 256], [256, 1024]]
01/17/2023 05:37:53 AM Preloading images...
01/17/2023 05:37:53 AM ** Trainer Set: 4, num_images: 38, batch_size: 10
01/17/2023 05:37:53 AM Pretraining GPU Memory: 11340 / 49140 MB
01/17/2023 05:37:53 AM saving ckpts every 1000000000.0 minutes
01/17/2023 05:37:53 AM saving ckpts every 100 epochs
01/17/2023 05:37:53 AM unet device: cuda:0, precision: torch.float32, training: True
01/17/2023 05:37:53 AM text_encoder device: cuda:0, precision: torch.float32, training: True
01/17/2023 05:37:53 AM vae device: cuda:0, precision: torch.float32, training: False
01/17/2023 05:37:53 AM scheduler: <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>
01/17/2023 05:37:53 AM Project name: 01DSU90 - GMM&co
01/17/2023 05:37:53 AM grad_accum: 1
01/17/2023 05:37:53 AM batch_size: 10
01/17/2023 05:37:53 AM epoch_len: 4
01/17/2023 05:38:22 AM Fatal Error loading image: C:\Users\TemporalLabSol\Desktop\EveryDream2trainer-main\EveryDream2trainer-main\SUBJECT\01DSU90\01DSU90 3-1-318632221_8171851539551764_21628442878249915_n.png:
01/17/2023 05:38:22 AM unrecognized data stream contents when reading image file
The JSON:
{
"amp": false,
"batch_size": 10,
"ckpt_every_n_minutes": null,
"clip_grad_norm": null,
"clip_skip": 0,
"cond_dropout": 0.04,
"data_root": "C:\Users\TemporalLabSol\Desktop\EveryDream2trainer-main\EveryDream2trainer-main\SUBJECT",
"disable_textenc_training": false,
"disable_xformers": false,
"ed1_mode": true,
"flip_p": 0.0,
"gpuid": 0,
"gradient_checkpointing": true,
"grad_accum": 1,
"logdir": "logs",
"log_step": 25,
"lowvram": false,
"lr": 3.5e-06,
"lr_decay_steps": 0,
"lr_scheduler": "constant",
"lr_warmup_steps": null,
"max_epochs": 90,
"notebook": false,
"project_name": "01DSU90 - GMM&co",
"resolution": 512,
"resume_ckpt": "sd_v1-5_vae",
"sample_prompts": "sample_prompts.txt",
"sample_steps": 300,
"save_ckpt_dir": null,
"save_every_n_epochs": 100,
"save_optimizer": false,
"scale_lr": false,
"seed": 555,
"shuffle_tags": false,
"useadam8bit": true,
"wandb": false,
"write_schedule": false,
"rated_dataset": false,
"rated_dataset_target_dropout_rate": 50,
"save_full_precision": false,
"disable_unet_training": false,
"mixed_precision": "fp32",
"rated_dataset_target_dropout_percent": 50,
"config": "train.json"
}
Hello 👋 I'm looking for a way to manually patch ED2 to support tokens > 75. Is this possible, and is it stable?
Thank you :)
Installed python 3.10, ran windows_setup.cmd, downloaded requirements, ended with this:
*** Applying bitsandbytes patch for windows ***
*** Already patched!
*** bitsandbytes windows patch applied, attempting import ***
Somethnig went wrong trying to patch bitsandbytes, aborting
make sure your venv is activated and try again
or if activated try:
pip install bitsandbytes==0.35.0
Traceback (most recent call last):
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 122, in main
import bitsandbytes
File "C:\repo\EveryDream2trainer\venv\lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
from .autograd.functions import (
File "C:\repo\EveryDream2trainer\venv\lib\site-packages\bitsandbytes\autograd\_functions.py", line 4, in <module>
import torch
File "C:\repo\EveryDream2trainer\venv\lib\site-packages\torch\__init__.py", line 219, in <module>
raise ImportError(textwrap.dedent('''
ImportError: Failed to load PyTorch C extensions:
It appears that PyTorch has loaded the `torch/_C` folder
of the PyTorch repository rather than the C extensions which
are expected in the `torch._C` namespace. This can occur when
using the `install` workflow. e.g.
$ python setup.py install && python -c "import torch"
This error can generally be solved using the `develop` workflow
$ python setup.py develop && python -c "import torch" # This should succeed
or by running Python from a different directory.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 133, in <module>
main()
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 125, in main
error()
File "C:\repo\EveryDream2trainer\utils\patch_bnb.py", line 94, in error
raise RuntimeError("** FATAL ERROR: unable to patch bitsandbytes for Windows env")
RuntimeError: ** FATAL ERROR: unable to patch bitsandbytes for Windows env
downloaded: v2-inference-v.yaml
downloaded: v2-inference.yaml
downloaded: v1-inference.yaml
SD1.x and SD2.x yamls downloaded
(venv) C:\repo\EveryDream2trainer>python --version
Python 3.10.10
--shuffle_tags consistently fails to work on runpod (i.e. training doesn't start).
The exact output of the error is:
No model to save, something likely blew up on startup, not saving
Traceback (most recent call last):
File "/workspace/EveryDream2trainer/train.py", line 999, in <module>
main(args)
File "/workspace/EveryDream2trainer/train.py", line 924, in main
raise ex
File "/workspace/EveryDream2trainer/train.py", line 809, in main
for step, batch in enumerate(train_dataloader):
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/workspace/venv/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/workspace/EveryDream2trainer/data/every_dream.py", line 101, in __getitem__
example["caption"] = train_item["caption"].get_shuffled_caption(self.seed)
File "/workspace/EveryDream2trainer/data/image_train_item.py", line 70, in get_shuffled_caption
max_target_tag_length = self.__max_target_length - len(self.__main_prompt)
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
Epochs: 0%| | 0/150 [00:00<?, ?it/s, vram=5133/24576 MB gs:0]
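The failing line subtracts `len(self.__main_prompt)` from `self.__max_target_length`, and the TypeError shows the latter is None, so the bug is an unset length on the ImageCaption rather than the tags themselves. A guarded version of that arithmetic, purely as an illustration (the default value here is hypothetical, not the project's):

```python
def caption_tag_budget(max_target_length, main_prompt: str,
                       default_max_length: int = 2048) -> int:
    """Compute how many characters remain for shuffled tags, falling back
    to a default when max_target_length was never set (the None that
    triggered the TypeError above)."""
    if max_target_length is None:
        max_target_length = default_max_length
    return max_target_length - len(main_prompt)
```

The real fix belongs wherever ImageCaption is constructed for the shuffle_tags path, so the field is always initialized; the guard above just shows the shape of a defensive workaround.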
Most models from civitai are .safetensors, and I'd need to train on one. Is it possible with EveryDream? How?
I followed your guide and wanted to start the conversion of the 1.5 model, but it is not working. I converted the sd_2-1 768 with no problem whatsoever, but the 1.5 is stuck.
I left it running overnight and it has not done anything.
Conversion command as follows:
python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
--original_config_file v1-inference.yaml ^
--image_size 512 ^
--checkpoint_path sd_v1-5_vae.ckpt ^
--prediction_type epsilon ^
--upcast_attn False ^
--dump_path "ckpt_cache/sd_v1-5_vae"
Something like
$ docker run -it -p 8888:8888 -p 6006:6006 --gpus all -e JUPYTER_PASSWORD=test1234 -t ghcr.io/victorchall/everydream2trainer:nightly
That should let someone run the Docker container locally: it requires GPU access for CUDA and the env variable JUPYTER_PASSWORD set to run Jupyter and TensorBoard, exposing both ports.
I think there is a small error in the Colab notebook. If you try to convert the 2.1 model, by default it still uses the 'epsilon' prediction type, which results in pretty deep-fried outputs (see below).
😄 Thank you for the great repository! I'm having a lot of fun with it! 😄
!python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim \
--original_config_file {inference_yaml} \
--image_size {img_size} \
--checkpoint_path {base_path} \
--prediction_type epsilon \ <-- This should change (the 768 2.1 model uses v_prediction).
--upcast_attn False \
--dump_path {save_name}
I am attempting to train a model using 100k plus images.
I appear to be running out of VRAM after about 4 hours, which does not make a lot of sense to me.
Stack Trace attached.
Any insight would be much appreciated!
I get this error if I use SD 1.5 with amp:
RuntimeError: CUDA error: invalid argument
Is amp compatible with 1.5?
Training setup
{
"amp": true,
"batch_size": 10,
"ckpt_every_n_minutes": null,
"clip_grad_norm": null,
"clip_skip": 0,
"cond_dropout": 0.04,
"data_root": "captioned",
"disable_textenc_training": false,
"disable_xformers": false,
"flip_p": 0.0,
"gpuid": 0,
"gradient_checkpointing": true,
"grad_accum": 1,
"logdir": "logs",
"log_step": 25,
"lowvram": false,
"lr": 1.5e-06,
"lr_decay_steps": 0,
"lr_scheduler": "constant",
"lr_warmup_steps": null,
"max_epochs": 60,
"notebook": false,
"project_name": "prjctname",
"resolution": 576,
"resume_ckpt": "sd_v1-5_vae",
"sample_prompts": "sample_prompts.txt",
"sample_steps": 300,
"save_ckpt_dir": null,
"save_ckpts_from_n_epochs": 0,
"save_every_n_epochs": 10,
"save_optimizer": false,
"scale_lr": false,
"seed": 555,
"shuffle_tags": false,
"useadam8bit": true,
"validation_config": "validation_default.json",
"wandb": true,
"write_schedule": false,
"rated_dataset": false,
"rated_dataset_target_dropout_percent": 50,
"zero_frequency_noise_ratio": 0.0
}
full traceback
Traceback (most recent call last):
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/train.py", line 1044, in <module>
main(args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/train.py", line 971, in main
raise ex
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/train.py", line 860, in main
scaler.scale(loss).backward()
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 146, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/autograd/function.py", line 399, in wrapper
outputs = fn(ctx, *args)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 107, in backward
grads = _memory_efficient_attention_backward(
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 380, in _memory_efficient_attention_backward
grads = op.apply(ctx, inp, grad)
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 283, in apply
(grad_q, grad_k, grad_v, grad_bias) = cls.OPERATOR(
File "/media/chris/Elements/ML_LAB/Everydream2.0/EveryDream2trainer/env/lib/python3.10/site-packages/torch/_ops.py", line 143, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
This trainer experiences errors when training most anime-style models, for example those derived from the NAI anime models.
EveryDream Version 1 behaves the same.
It cannot be used normally without modifying the code.
RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
Missing key(s) in state_dict: "text_model.embeddings.position_ids", "text_model.embeddings.token_embedding.weight", "text_model.embeddings.position_embedding.weight"; for every encoder layer N = 0–11: "text_model.encoder.layers.N.self_attn.{k_proj,v_proj,q_proj,out_proj}.{weight,bias}", "text_model.encoder.layers.N.layer_norm{1,2}.{weight,bias}", "text_model.encoder.layers.N.mlp.{fc1,fc2}.{weight,bias}"; plus "text_model.final_layer_norm.weight", "text_model.final_layer_norm.bias".
Unexpected key(s) in state_dict: the same keys as the missing list above, but without the "text_model." prefix: "embeddings.position_ids", "embeddings.token_embedding.weight", "embeddings.position_embedding.weight", the per-layer "encoder.layers.N.*" attention/MLP/LayerNorm weights and biases for N = 0–11, and "final_layer_norm.weight", "final_layer_norm.bias".
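The missing and unexpected key lists differ only by the "text_model." prefix, i.e. the checkpoint's text encoder was saved without the wrapper module the loader expects. A minimal sketch of a workaround, assuming you can intercept the dict before `load_state_dict` (the function name here is mine, not the repo's):

```python
def add_text_model_prefix(state_dict, prefix="text_model."):
    """Re-key a CLIP text-encoder state_dict whose keys lack the
    expected 'text_model.' prefix; already-prefixed keys pass through."""
    return {
        (k if k.startswith(prefix) else prefix + k): v
        for k, v in state_dict.items()
    }

# Hypothetical usage before loading:
# text_encoder.load_state_dict(add_text_model_prefix(checkpoint_state_dict))
```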
I may be wrong, but I've looked through the docs and there are no details about the theory behind EveryDream.
DreamBooth uses one prompt to describe the whole training dataset.
EveryDream uses a fully captioned dataset, created with BLIP.
Is the above thought correct?
My graphics card has only 6 GB of memory, and when I try to run a training session I receive the following error:
Something went wrong, attempting to save model | 0/12 [00:00<?, ?it/s]
No model to save, something likely blew up on startup, not saving
Traceback (most recent call last):
File "D:\download\AI\EveryDream2trainer\train.py", line 1050, in
main(args)
File "D:\download\AI\EveryDream2trainer\train.py", line 977, in main
raise ex
File "D:\download\AI\EveryDream2trainer\train.py", line 854, in main
model_pred, target = get_model_prediction_and_target(batch["image"], batch["tokens"], args.zero_frequency_noise_ratio)
File "D:\download\AI\EveryDream2trainer\train.py", line 779, in get_model_prediction_and_target
latents = vae.encode(pixel_values, return_dict=False)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\vae.py", line 566, in encode
h = self.encoder(x)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\vae.py", line 134, in forward
sample = down_block(sample)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 755, in forward
hidden_states = resnet(hidden_states, temb=None)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\diffusers\models\resnet.py", line 450, in forward
hidden_states = self.norm1(hidden_states)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\normalization.py", line 272, in forward
return F.group_norm(
File "D:\download\AI\EveryDream2trainer\venv\lib\site-packages\torch\nn\functional.py", line 2516, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA out of memory. Tried to allocate 720.00 MiB (GPU 0; 6.00 GiB total capacity; 4.71 GiB already allocated; 0 bytes free; 4.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there any way to fix this problem?
What options do I have?
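One thing worth trying before giving up on a 6 GB card is the allocator setting the error message itself suggests. A sketch (the variable must be set before CUDA is initialized, and 128 is a tunable guess, not a recommended value):

```python
import os

# Must be set before torch initializes CUDA; caps the size of cached
# allocator blocks to reduce fragmentation (value is in MiB).
# setdefault leaves any value you already exported untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

# import torch  # import torch only after the variable is set
```

Lowering --resolution and --batch_size helps too, though 6 GB may simply be below what this trainer needs.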
In utils/gpu.py, GPU 0 is hardcoded:
gpu_used_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['used'])
gpu_total_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['total'])
Also in train.py:
def get_gpu_memory(nvsmi):
    """
    returns the gpu memory usage
    """
    gpu_query = nvsmi.DeviceQuery('memory.used, memory.total')
    gpu_used_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['used'])
    gpu_total_mem = int(gpu_query['gpu'][0]['fb_memory_usage']['total'])
    return gpu_used_mem, gpu_total_mem
So if you run the trainer with gpuid: 1, the logs will show memory usage for the wrong GPU.
Even after fixing these spots and setting gpuid: 1 in train.json, I'm still having trouble getting it to run correctly on GPU 1. Starting the trainer seems to immediately max out memory on GPU 0 and throw a CUDA out-of-memory error.
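A minimal sketch of a device-aware version, assuming the query returns one entry per GPU as in the snippet above (`gpu_id` is a new parameter, not an existing one):

```python
def get_gpu_memory(nvsmi, gpu_id=0):
    """Return (used, total) framebuffer memory in MB for the given GPU
    index instead of always reading index 0."""
    gpu_query = nvsmi.DeviceQuery('memory.used, memory.total')
    gpu = gpu_query['gpu'][gpu_id]
    return (int(gpu['fb_memory_usage']['used']),
            int(gpu['fb_memory_usage']['total']))
```

For the process still touching GPU 0, restricting visibility with CUDA_VISIBLE_DEVICES=1 before launch is often more reliable than an in-app gpuid setting, since every CUDA allocation then lands on the intended card.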
I am not sure if I missed a ReadMe or push note, but the latest repo as of today keeps defaulting the LR to 1e-6. I went back to c8c658d and it works perfectly. Not sure if this is a glitch on my end and I need a fresh install. Thought I'd mention it, in case it ends up being real.
Edit: I found the note about setting the LR in optimizer.json etc. I was setting LR directly from the CLI and my train.json was also not 1e-6; therefore my LR was not being considered.
Just finished training directly on https://huggingface.co/stabilityai/stable-diffusion-2-1-unclip
(I cloned the diffusers repo directly)
but it exports with v2-inference-v.yaml
Automatic should support it soon too
AUTOMATIC1111/stable-diffusion-webui@8a34671
config_unclip = os.path.join(sd_repo_configs_path, "v2-1-stable-unclip-l-inference.yaml")
config_unopenclip = os.path.join(sd_repo_configs_path, "v2-1-stable-unclip-h-inference.yaml")
Just a heads up, in case we can fix this and be ahead of others ;)
Fresh install of new version 2, followed the instructions and all was well until I tried to start some training:
Traceback (most recent call last):
File "G:\ed2\train.py", line 53, in
import debug
ModuleNotFoundError: No module named 'debug'
I get this when installing via the YT tutorial. All other steps were followed, and I have the correct files in the root dir.
(venv) D:\EveryDream2\EveryDream2trainer>python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
More? --original_config_file v1-inference.yaml ^
More? --image_size 512 ^
More? --checkpoint_path sd_v1-5_vae.ckpt ^
More? --prediction_type epsilon ^
More? --upcast_attn False ^
More? --dump_path "ckpt_cache/sd_v1-5_vae"
Traceback (most recent call last):
File "D:\EveryDream2\EveryDream2trainer\utils\convert_original_stable_diffusion_to_diffusers.py", line 854, in
checkpoint = torch.load(args.checkpoint_path)
File "D:\EveryDream2\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "D:\EveryDream2\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
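An `invalid load key, '<'` from `torch.load` means the file begins with a `<` character, which almost always indicates the "checkpoint" is actually an HTML page saved by a failed or login-gated download rather than a real .ckpt. A quick check (a sketch; the path is the one from the command above):

```python
def looks_like_html(path, probe=64):
    """Return True if the file starts like an HTML document, which would
    explain pickle's "invalid load key, '<'" error."""
    with open(path, "rb") as f:
        head = f.read(probe).lstrip()
    return head.startswith(b"<")

# looks_like_html("sd_v1-5_vae.ckpt")  # True means re-download the checkpoint
```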
Attached log and cfg:
project-20230326-183311.log
project-20230326-183311_cfg.txt
At the beginning of my training run, I receive the following warning:
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.0.0+cu118 with CUDA 1108 (you have 1.13.1+cu117)
Python 3.10.10 (you have 3.10.6)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
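A quick way to confirm the mismatch the warning describes is to compare the installed package versions (a stdlib-only sketch):

```python
from importlib import metadata

def pkg_version(name):
    """Installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

# The warning fires when xformers was built against a newer torch/CUDA
# than the torch actually installed (here 2.0.0+cu118 vs 1.13.1+cu117).
print("torch:", pkg_version("torch"))
print("xformers:", pkg_version("xformers"))
```

Reinstalling an xformers wheel built for your exact torch version, per the link in the warning, should make it go away.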
I am getting an error when the training reaches the generation of samples:
NotImplementedError Traceback (most recent call last)
File /workspace/EveryDream2trainer/train.py:1010
1008 print(f" Args:")
1009 pprint.pprint(vars(args))
-> 1010 main(args)
File /workspace/EveryDream2trainer/train.py:936, in main(args)
934 save_path = os.path.join(f"{log_folder}/ckpts/errored-{args.project_name}-ep{epoch:02}-gs{global_step:05}")
935 __save_model(save_path, unet, text_encoder, tokenizer, noise_scheduler, vae, args.save_ckpt_dir, yaml, args.save_full_precision)
--> 936 raise ex
938 logging.info(f"{Fore.LIGHTWHITE_EX} ***************************{Style.RESET_ALL}")
939 logging.info(f"{Fore.LIGHTWHITE_EX} **** Finished training ****{Style.RESET_ALL}")
File /workspace/EveryDream2trainer/train.py:882, in main(args)
879 torch.cuda.empty_cache()
881 if (global_step + 1) % sample_generator.sample_steps == 0:
--> 882 generate_samples(global_step=global_step, batch=batch)
884 min_since_last_ckpt = (time.time() - last_epoch_saved_time) / 60
886 if args.ckpt_every_n_minutes is not None and (min_since_last_ckpt > args.ckpt_every_n_minutes):
File /workspace/EveryDream2trainer/train.py:787, in main..generate_samples(global_step, batch)
784 print(f" * SampleGenerator config changed, now generating images samples every " +
785 f"{sample_generator.sample_steps} training steps (next={next_sample_step})")
786 sample_generator.update_random_captions(batch["captions"])
--> 787 inference_pipe = sample_generator.create_inference_pipe(unet=unet,
788 text_encoder=text_encoder,
789 tokenizer=tokenizer,
790 vae=vae,
791 diffusers_scheduler_config=reference_scheduler.config
792 ).to(device)
793 sample_generator.generate_samples(inference_pipe, global_step)
795 del inference_pipe
File /workspace/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.call..decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File /workspace/EveryDream2trainer/utils/sample_generator.py:294, in SampleGenerator.create_inference_pipe(self, unet, text_encoder, tokenizer, vae, diffusers_scheduler_config)
283 pipe = StableDiffusionPipeline(
284 vae=vae,
285 text_encoder=text_encoder,
(...)
291 feature_extractor=None, # must be None if no safety checker
292 )
293 if self.use_xformers:
--> 294 pipe.enable_xformers_memory_efficient_attention()
295 return pipe
File /workspace/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:1080, in DiffusionPipeline.enable_xformers_memory_efficient_attention(self, attention_op)
1050 def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None):
1051 r"""
1052 Enable memory efficient attention as implemented in xformers.
1053
(...)
1078 ```
1079 """
-> 1080 self.set_use_memory_efficient_attention_xformers(True, attention_op)
File /workspace/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:1105, in DiffusionPipeline.set_use_memory_efficient_attention_xformers(self, valid, attention_op)
1103 module = getattr(self, module_name)
1104 if isinstance(module, torch.nn.Module):
-> 1105 fn_recursive_set_mem_eff(module)
File /workspace/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:1096, in DiffusionPipeline.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
1094 def fn_recursive_set_mem_eff(module: torch.nn.Module):
1095 if hasattr(module, "set_use_memory_efficient_attention_xformers"):
-> 1096 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
1098 for child in module.children():
1099 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:219, in ModelMixin.set_use_memory_efficient_attention_xformers(self, valid, attention_op)
217 for module in self.children():
218 if isinstance(module, torch.nn.Module):
--> 219 fn_recursive_set_mem_eff(module)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:212, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
210 def fn_recursive_set_mem_eff(module: torch.nn.Module):
211 if hasattr(module, "set_use_memory_efficient_attention_xformers"):
--> 212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:219, in ModelMixin.set_use_memory_efficient_attention_xformers(self, valid, attention_op)
217 for module in self.children():
218 if isinstance(module, torch.nn.Module):
--> 219 fn_recursive_set_mem_eff(module)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:215, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
--> 215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py:212, in ModelMixin.set_use_memory_efficient_attention_xformers..fn_recursive_set_mem_eff(module)
210 def fn_recursive_set_mem_eff(module: torch.nn.Module):
211 if hasattr(module, "set_use_memory_efficient_attention_xformers"):
--> 212 module.set_use_memory_efficient_attention_xformers(valid, attention_op)
214 for child in module.children():
215 fn_recursive_set_mem_eff(child)
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/cross_attention.py:146, in CrossAttention.set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers, attention_op)
140 _ = xformers.ops.memory_efficient_attention(
141 torch.randn((1, 2, 40), device="cuda"),
142 torch.randn((1, 2, 40), device="cuda"),
143 torch.randn((1, 2, 40), device="cuda"),
144 )
145 except Exception as e:
--> 146 raise e
148 if is_lora:
149 processor = LoRAXFormersCrossAttnProcessor(
150 hidden_size=self.processor.hidden_size,
151 cross_attention_dim=self.processor.cross_attention_dim,
152 rank=self.processor.rank,
153 attention_op=attention_op,
154 )
File /workspace/venv/lib/python3.10/site-packages/diffusers/models/cross_attention.py:140, in CrossAttention.set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers, attention_op)
137 else:
138 try:
139 # Make sure we can run the memory efficient attention
--> 140 _ = xformers.ops.memory_efficient_attention(
141 torch.randn((1, 2, 40), device="cuda"),
142 torch.randn((1, 2, 40), device="cuda"),
143 torch.randn((1, 2, 40), device="cuda"),
144 )
145 except Exception as e:
146 raise e
File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/init.py:196, in memory_efficient_attention(query, key, value, attn_bias, p, scale, op)
115 def memory_efficient_attention(
116 query: torch.Tensor,
117 key: torch.Tensor,
(...)
123 op: Optional[AttentionOp] = None,
124 ) -> torch.Tensor:
125 """Implements the memory-efficient attention mechanism following
126 "Self-Attention Does Not Need O(n^2) Memory" <[http://arxiv.org/abs/2112.05682>](http://arxiv.org/abs/2112.05682%3E%60).
127
(...)
194 :return: multi-head attention Tensor with shape [B, Mq, H, Kv]
195 """
--> 196 return _memory_efficient_attention(
197 Inputs(
198 query=query, key=key, value=value, p=p, attn_bias=attn_bias, scale=scale
199 ),
200 op=op,
201 )
File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/init.py:294, in _memory_efficient_attention(inp, op)
289 def _memory_efficient_attention(
290 inp: Inputs, op: Optional[AttentionOp] = None
291 ) -> torch.Tensor:
292 # fast-path that doesn't require computing the logsumexp for backward computation
293 if all(x.requires_grad is False for x in [inp.query, inp.key, inp.value]):
--> 294 return _memory_efficient_attention_forward(
295 inp, op=op[0] if op is not None else None
296 )
298 output_shape = inp.normalize_bmhk()
299 return _fMHA.apply(
300 op, inp.query, inp.key, inp.value, inp.attn_bias, inp.p, inp.scale
301 ).reshape(output_shape)
File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/init.py:310, in _memory_efficient_attention_forward(inp, op)
308 output_shape = inp.normalize_bmhk()
309 if op is None:
--> 310 op = _dispatch_fw(inp)
311 else:
312 _ensure_op_supports_or_raise(ValueError, "memory_efficient_attention", op, inp)
File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:98, in _dispatch_fw(inp)
96 priority_list_ops.remove(triton.FwOp)
97 priority_list_ops.insert(0, triton.FwOp)
---> 98 return _run_priority_list(
99 "memory_efficient_attention_forward", priority_list_ops, inp
100 )
File /workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:73, in _run_priority_list(name, priority_list, inp)
71 for op, not_supported in zip(priority_list, not_supported_reasons):
72 msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 73 raise NotImplementedError(msg)
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(1, 2, 1, 40) (torch.float32)
key : shape=(1, 2, 1, 40) (torch.float32)
value : shape=(1, 2, 1, 40) (torch.float32)
attn_bias : <class 'NoneType'>
p : 0.0
flshattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
tritonflashattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
requires A100 GPU
cutlassF is not supported because:
xFormers wasn't build with CUDA support
smallkF is not supported because:
xFormers wasn't build with CUDA support
max(query.shape[-1] != value.shape[-1]) > 32
unsupported embed per head: 40
As the title says, holding CTRL+ALT+PAGEUP does not kick off sampling, even after holding this shortcut for over 30 seconds.
Just to make it evidently clear I'm pressing (and holding) these buttons:
https://i.imgur.com/3ka60uF.png
I am running EveryDream2trainer on Windows 10 in the command prompt.
Is there anything I should be doing differently to get it to run? I am able to run the SD extension and the Shivam repo.
python train.py --resume_ckpt "sd_v1-5_vae" --data_root "input" --max_epochs 10 --lr_scheduler constant --project_name testo --batch_size 1 --sample_steps 20 --lr 1e-6 --resolution 512 --clip_grad_norm 1 --ckpt_every_n_minutes 30 --useadam8bit --amp --mixed_precision fp16
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total capacity; 11.26 GiB already allocated; 0 bytes free; 11.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epochs: 0%| | 0/10 [00:56<?, ?it/s, vram=4867/12288 MB gs:0]
Steps: 1%|█▎
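The OOM message itself points at one mitigation: setting max_split_size_mb through the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce fragmentation. A minimal sketch — the 128 MB value is illustrative, not a tuned recommendation:

```python
import os

# The CUDA caching allocator reads this variable at initialization, so it
# must be set before torch allocates anything (i.e. before train.py starts,
# or as an environment variable in the shell that launches it).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

On a 12 GiB card, lowering --resolution or keeping --batch_size at 1 is usually more decisive than allocator tuning, and gradient checkpointing (if the trainer exposes it) trades speed for a large VRAM saving.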
Really looking forward to using your repo, it looks awesome. Thanks for all your hard work!
Would love to give this a try in the Google Cloud Platform ecosystem (with logging).
While I'm training the model, the generated sample images are always identical (at a pixel level) at every step.
It seems that they are generated using the same weights rather than the updated model.
I am getting Cuda out of memory :(
After going through the install, I had some issues setting up. The directories shown in the image aren't there yet, so I entered the commands exactly as you have them typed out, and it looks like there's an issue with that directory not yet existing. I'm most interested in training SD2.1.
(venv) C:\Users\micha\Documents\EveryDream2trainer>python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim ^
More? --original_config_file v2-inference-v.yaml ^
More? --image_size 768 ^
More? --checkpoint_path v2-1_768-nonema-pruned.ckpt ^
More? --prediction_type v_prediction ^
More? --upcast_attn False ^
More? --dump_path "ckpt_cache/v2-1_768-nonema-pruned"
Traceback (most recent call last):
File "C:\Users\micha\Documents\EveryDream2trainer\utils\convert_original_stable_diffusion_to_diffusers.py", line 855, in <module>
checkpoint = torch.load(args.checkpoint_path)
File "C:\Users\micha\Documents\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "C:\Users\micha\Documents\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "C:\Users\micha\Documents\EveryDream2trainer\venv\lib\site-packages\torch\serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'v2-1_768-nonema-pruned.ckpt'
when loading a fine tuned model with diffusers
device = "cuda"
pipe = StableDiffusionInpaintPipeline.from_pretrained(
'/finetuning/EveryDream2trainer/logs/2301_MYPROJ_fp16_adam_nonema_20230116-155427/ckpts/2301_MYPROJ_fp16_adam_nonema-ep300-gs04201' ,
revision="fp16",
safety_checker=None,
requires_safety_checker=False,
feature_extractor=None,
torch_dtype=torch.float16,
use_auth_token=False
).to(device)
with autocast("cuda"):
images = pipe(prompt="my amazing prompt", num_inference_steps=100, negative_prompt="negativity", image=init_image, mask_image=mask)["images"]
I get a size mismatch
incorrect configuration settings! The config of `pipeline.unet`: FrozenDict([...]) expects 4 but received `num_channels_latents`: 4 + `num_channels_mask`: 1 + `num_channels_masked_image`: 4 = 9. Please verify the config of `pipeline.unet` or your `mask_image` or `image` input.
So far I have been able to use inpainting with my previous models, and my understanding was that dedicated inpainting models are just "better" at inpainting (for instance I can always use inpainting inside A1111, no matter what my model is based on).
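The arithmetic in the error explains the mismatch: StableDiffusionInpaintPipeline concatenates 4 latent channels + 1 mask channel + 4 masked-image latent channels = 9 UNet input channels, while a model fine-tuned from a standard txt2img checkpoint has a UNet that accepts only 4. A sketch of that compatibility check (the helper name is hypothetical):

```python
def is_inpaint_unet(unet_in_channels: int) -> bool:
    """StableDiffusionInpaintPipeline feeds the UNet
    latents (4) + mask (1) + masked-image latents (4) = 9 channels;
    a txt2img-derived fine-tune only accepts the 4 latent channels."""
    return unet_in_channels == 4 + 1 + 4

assert not is_inpaint_unet(4)  # this issue's fine-tuned model
assert is_inpaint_unet(9)      # a dedicated inpainting UNet
```

A1111 can inpaint with ordinary models because it does a masked img2img-style denoise rather than using a 9-channel UNet; with diffusers, the legacy inpaint pipeline or StableDiffusionImg2ImgPipeline may be the closer equivalent for a 4-channel model.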
I'm running my first training run with ED2Trainer. I auto-captioned my instance images, used the replace tool to introduce a custom class word (a photo of a jet -> a photo of a mtVdDv jet) and started the training. I have only 20 images and am therefore training for 2000 epochs.
I added 3 prompts (one per line) to sample_prompts.txt and started training. The trainer produces sample images, but they all turn out completely black; only the CFG scale label is visible.
The training is only at 195 epochs so far, but the images should not be black, right?
EDIT: my images are 100% SFW
python3.10 train.py --resume_ckpt "v2-1_768-nonema-pruned" \
--data_root "/finetuning/instances/myproject/w-captions/" \
--max_epochs 2000 \
--lr_scheduler cosine \
--lr_decay_steps 1500 \
--lr_warmup_steps 20 \
--project_name myproject \
--batch_size 2 \
--sample_steps 100 \
--lr 1.5e-6 \
--save_every_n_epochs 500 \
--useadam8bit
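Not a confirmed diagnosis for this issue, but all-black samples on SD2.x checkpoints are frequently caused by fp16 overflow producing NaN latents, which decode to a uniform image. A trivial guard one could drop into a sampling loop (illustrative helper, plain Python for clarity):

```python
import math

def has_nan(values) -> bool:
    """NaNs anywhere in the latents decode to a uniform (black) image,
    a common failure when running SD2.x attention in fp16 without upcasting."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)

assert has_nan([0.1, float("nan"), 0.3])
assert not has_nan([0.1, 0.2, 0.3])
```

If NaNs appear, running sampling (or the whole attention path) in fp32/bf16, or enabling attention upcasting where available, is the usual remedy for SD2.1 768-v models.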
Coming from Kohya's trainer: at a batch size of 4 and 131 epochs, this does 17 batches/epoch, whereas with Kohya I was getting 13 batches/epoch.
This leads to more training than I might want, so I'd be interested in how it works under the hood. I see you are doing a len
calculation on line 655 in train.py,
so if you could elaborate on the parameters and how they affect batches/epoch, I would greatly appreciate it.
Also, thanks for making a better trainer; this blows Kohya's out of the water, given his accelerate bug.
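As a sketch of the arithmetic, assuming an epoch is ceil(images × repeats / batch_size) — the actual calculation in train.py may also account for aspect-ratio bucketing, which can pad buckets into extra batches; the function and its repeat parameter are illustrative, not the trainer's real names:

```python
import math

def batches_per_epoch(num_images: int, batch_size: int, repeats: int = 1) -> int:
    # ceil, because a final partial batch still costs one optimizer step
    return math.ceil(num_images * repeats / batch_size)

# 17 batches at batch_size 4 implies roughly 65-68 images seen per epoch,
# consistent with repeated images or padded aspect buckets:
assert batches_per_epoch(68, 4) == 17
assert batches_per_epoch(52, 4) == 13  # the 13 batches/epoch seen with Kohya
```

Comparing the two trainers' effective images-per-epoch (batches × batch size) is a quick way to see whether one of them is repeating or dropping images.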
You've got to take a look at your RunPod .ipynb file. It won't run with the 2.1.10 Docker image you specified.
There are a couple of issues with it, and the docs are incomplete (first, hf_downloader downloads to the wrong place; then it is missing config files; etc.).
Please fix it.