I keep getting out of memory exceptions no matter how I try to set PYTORCH_CUDA_ALLOC_CONF
This is the error:
File "/opt/saturncloud/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 13.90 GiB already allocated; 14.75 MiB free; 14.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
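For anyone else hitting this: one common pitfall (a sketch, not a guaranteed fix) is that the CUDA caching allocator reads `PYTORCH_CUDA_ALLOC_CONF` only once, when PyTorch first initializes CUDA, so setting it after `import torch` has no effect. It must be in the environment before the import (or exported in the shell that launches the script):

```python
import os

# The allocator reads this variable when torch first initializes CUDA,
# so it must be set BEFORE `import torch` (or exported in the shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# import torch  # only import torch after the variable is in place
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Note that in the traceback above 13.90 GiB of 14.76 GiB is already allocated, so fragmentation settings may only buy a little headroom; reducing batch size or model precision may be needed as well.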
I'm trying to run training on Google Colab, and I'm running into dependency issues.
It would be greatly helpful if you could provide a working notebook.
Thank you!
Hi ZichengDuan!
Can you please support a negative prompt? It may help convergence toward the wanted character, e.g. toward a real human being rather than a CGI one.
An additional question: among the known models, only stable-diffusion-xl-base-1.0 has tokenizer_2 and text_encoder_2:
How can I modify the code to work with other diffusion models?
Hi. Thank you for your implementation of the paper.
While looking at your code, I couldn't tell whether the model being trained in each loop is the one trained in the previous loop, or whether a vanilla SDXL model is instantiated every loop.
Can you tell me which is the case, and where the relevant code is?
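The two structures the question distinguishes can be sketched as follows (a toy illustration only; `load_vanilla_sdxl` and `train` are placeholder stubs, not the repo's functions):

```python
def load_vanilla_sdxl():
    # Stub standing in for loading a fresh pretrained SDXL model.
    return {"steps_trained": 0}

def train(model):
    # Stub standing in for one loop of fine-tuning.
    model = dict(model)
    model["steps_trained"] += 1
    return model

def iterative_refinement(num_loops):
    model = load_vanilla_sdxl()          # loaded once...
    for _ in range(num_loops):
        model = train(model)             # ...each loop continues from the last
    return model

def restart_each_loop(num_loops):
    for _ in range(num_loops):
        model = train(load_vanilla_sdxl())  # fresh vanilla SDXL every loop
    return model

print(iterative_refinement(3)["steps_trained"])  # 3
print(restart_each_loop(3)["steps_trained"])     # 1
```

Checking where the model/pipeline constructor is called relative to the outer loop in `main.py` should answer which pattern the repo uses.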
Hello!
When I'm training with the config "resume_from_checkpoint: latest" and load the LoRA checkpoint, I get the error: "No inf checks were recorded for this optimizer." However, if I train without a checkpoint, I get a different error saying bf16 is not supported (sorry, I forget the full message).
Thank you for your work. Checking the official repository, I couldn't locate diffusers version 0.24.0.dev0, so I cannot use the text_encoder_lora_state_dict API. What should I do? Are there alternative APIs available? Currently I am using diffusers version 0.23.1.
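For context: `.dev0` versions are never published to PyPI; they denote the in-development main branch, which is installed from source (e.g. `pip install git+https://github.com/huggingface/diffusers`). A small sketch for checking which version is actually installed before picking a code path (`importlib.metadata` is standard library):

```python
from importlib.metadata import version, PackageNotFoundError

# Query the installed diffusers version; .dev0 versions only come from
# a source install of the git main branch, never from PyPI.
try:
    installed = version("diffusers")
except PackageNotFoundError:
    installed = None  # diffusers is not installed in this environment

print("installed diffusers:", installed)
```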
Hi, is there a way I can train the model to produce multiple consistent characters using just one model, under the condition that I only train once per character?
when I try to run python3 main.py
I get the following error
ImportError: cannot import name 'text_encoder_lora_state_dict' from 'diffusers.models.lora' (/usr/local/lib/python3.11/site-packages/diffusers/models/lora.py)
I had to change a couple of things to get to this point
I changed the requirements file to this in order to make it work
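For others hitting the same ImportError: besides pinning the requirements, a guarded import can keep one codebase working across diffusers versions where the helper was moved or removed (a sketch; the fallback behavior is up to you):

```python
# text_encoder_lora_state_dict existed in diffusers.models.lora in some
# releases and is absent in others, so guard the import rather than
# letting the whole script crash.
try:
    from diffusers.models.lora import text_encoder_lora_state_dict
except ImportError:  # also covers diffusers not being installed at all
    text_encoder_lora_state_dict = None

if text_encoder_lora_state_dict is None:
    print("helper unavailable; pin a diffusers version that provides it")
```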
Hi there. Wonderful work! I am now running the training process, but it seems to take a lot of time. What would be a normal duration for the training process on a single V100, say using the default config provided in the config file?
I did not modify any code; I simply ran the training and inference programs directly, and the results were very poor. I'm not sure what the problem is. As the loop count increases, the results get worse. Is there a problem somewhere?
loop=0:
loop=1: