Comments (11)
I've started to test the generation speed and realized that even though we are loading fp16 weights, all the computations run in fp32. This results in a ~2x slowdown compared to the Python implementation I'm testing against (AUTOMATIC1111), which is how I noticed.
I've added my fp16 hacks back and now the speed is comparable to the Python one. Basically, I replaced every Kind::Float with Kind::Half in attention.rs, embeddings.rs, unet_2d.rs and clip.rs. In clip.rs, however, I need to keep the last Float, and I use mask.fill_(f32::MIN as f64).triu_(1).unsqueeze(1).to_kind(Kind::Half) instead.
In the pipeline I call vs.half(); after loading the weights (and the equivalent for vs_ae and vs_unet). And finally, when generating the random latents I also use Kind::Half.
This is really just a hack that forces Kind::Half everywhere; it should be configurable, which is why I'm explaining the hack instead of opening a pull request. Any preferences on how to implement this the right way?
With the hack in place and sliced attention of size 1, I've compared it to AUTOMATIC1111 with the xformers attention. In this case we only lose ~25% of performance, or maybe even less. We also lose a little bit of maximum image size due to the higher memory needs. In any case, I'm very satisfied with the sliced attention performance, even if supporting xformers would be a very welcome optimization.
from diffusers-rs.
Thanks for all the details, that's very interesting and it would be great to support GPUs with less memory. I have an 8GB GPU and was only able to run the code on the cpu.
Based on your details and links, I was able to get the code to run using the fp16 weights for unet (available here) and making a couple of changes to run only the unet on the GPU:
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 \
cargo run --example stable-diffusion --features clap -- --cpu-for-vae --unet-weights data/unet-fp16.ot
Fwiw, autocast is also available in tch, but somehow it did not seem to help much; I probably messed something up and will have to dig further.
8-bit optimization might help as well: https://github.com/TimDettmers/bitsandbytes
Thanks, that's great. I got it working yesterday in fp16 but my solution had a few hacks, your solution is much cleaner.
I can also confirm that it works for me with the vae on the GPU (using its fp16 version), and that clip works in fp16 too, also on the GPU. I had to force clip onto the GPU because you didn't include the option; is there a reason for only allowing it on the CPU?
So it works for me with all fp16 and on my GPU (2070 Super with 8GB).
I'm using everything from v1.5 of stable diffusion (https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/fp16), including the clip model from text_encoder. It's probably better to suggest sourcing the clip model from there in the example too.
The main reason I put the clip model on the cpu by default is that it runs very quickly compared to the rest of the pipeline, so I didn't see much of an advantage to running it on the gpu, where it uses a bit of memory. It also seemed a bit painful for users who want to run everything on the cpu to have to specify 3 flags. Anyway, I've tweaked the cpu flag a bit so that users can set it to -cpu all, or -cpu vae -cpu clip, etc.
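The semantics of such a repeatable -cpu flag can be sketched as follows; this is a std-only illustration with hypothetical names, not the actual flag handling in the example (which uses clap):

```rust
use std::collections::HashSet;

/// Decide whether a pipeline component should run on the cpu, given the
/// values passed via repeated `-cpu` flags ("all", "vae", "clip", "unet").
fn runs_on_cpu(cpu_flags: &[&str], component: &str) -> bool {
    let set: HashSet<&str> = cpu_flags.iter().copied().collect();
    set.contains("all") || set.contains(component)
}

fn main() {
    // `-cpu vae -cpu clip`: vae and clip on the cpu, unet on the gpu.
    let flags = ["vae", "clip"];
    for c in ["vae", "clip", "unet"] {
        println!("{c}: cpu = {}", runs_on_cpu(&flags, c));
    }
}
```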
Right, it would be better to use fp16 all the way. I actually mentioned earlier giving autocast a try, which is probably what the python version does too, but somehow the generated images look bad there. We should probably try to get to the bottom of this, as autocast is fairly nice and should ensure that fp16 is used where appropriate (while fp32 is still used on some small bits that require more precision). You can see an example of how to use autocast in this test snippet.
I just saw your comment; autocast seems great, it's probably what we need, I'll have a look tomorrow.
In the meantime, I've created this draft PR that includes my changes for Kind::Half as well as other things for img2img, in case anyone wants to try before we get autocast working: #6
Based on your details and links, I was able to get the code to run using the fp16 weights for unet (available here) and making a couple of changes to run only the unet on the GPU.
Unfortunately those weights are for Stable Diffusion v1.5 and I think they are not compatible with v2.x. I just tried (I have an RTX 2070 Super too) and I get this error at runtime:
Building the Clip transformer.
Building the autoencoder.
Building the unet.
Error: cannot find the tensor named up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias in data/unet_v1.5_fp16.ot
Any hint on getting it running, or am I sadly stuck with the CPU version? 😢
Have you tried the weights for the 2.1 version? I would guess they are here on huggingface, though I haven't tried.
Thanks @LaurentMazare!
Just tried it now, converted from the original weights, but I get an out-of-memory error:
CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 7.78 GiB total capacity; 3.52 GiB already allocated; 2.62 GiB free; 3.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is this the fp16 version?
I think so, as it's in the fp16 branch of the huggingface repo. You can probably check by loading it in python and looking at the dtype reported by torch.