Comments (11)

siriux commented on May 29, 2024

I've started to test the generation speed and I've realized that even though we are loading fp16 weights, all the computations are done in fp32. This results in a ~2x slowdown with respect to the Python implementation I'm testing against (Automatic1111), which is how I noticed.

I've added my fp16 hacks back and now the speed is comparable to the Python one. Basically, what I've done is replace all Kind::Float with Kind::Half in attention.rs, embeddings.rs, unet_2d.rs and clip.rs. But in clip.rs I need to keep the last Float, and I use mask.fill_(f32::MIN as f64).triu_(1).unsqueeze(1).to_kind(Kind::Half) instead.
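
For reference, a minimal sketch of that clip.rs change: the causal mask is created and filled in fp32 and only cast to fp16 at the end. The function name and mask shape here are illustrative, not the exact diffusers-rs code:

    use tch::{Device, Kind, Tensor};

    // Sketch of the causal mask described above: build and fill the mask in
    // fp32 (so f32::MIN can be used as the masked-out value), then cast the
    // finished mask to Kind::Half for the rest of the fp16 model.
    fn causal_attention_mask(bsz: i64, seq_len: i64, device: Device) -> Tensor {
        let mut mask = Tensor::ones(&[bsz, seq_len, seq_len], (Kind::Float, device));
        mask.fill_(f32::MIN as f64)
            .triu_(1)      // keep only the strictly upper triangle
            .unsqueeze(1)  // add a head dimension
            .to_kind(Kind::Half)
    }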

In the pipeline I call vs.half() after loading the weights (and the equivalent for vs_ae and vs_unet). And finally, when generating the random latents I also use Kind::Half.
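
Concretely, the pipeline-side part of the hack looks roughly like this (a sketch; the variable names and the latent shape are illustrative):

    use tch::{nn, Device, Kind, Tensor};

    // Cast the loaded weights to fp16 and sample fp16 latents so all kinds
    // match inside the fp16 model.
    fn main() {
        let device = Device::cuda_if_available();
        let mut vs = nn::VarStore::new(device);      // clip weights
        let mut vs_ae = nn::VarStore::new(device);   // autoencoder weights
        let mut vs_unet = nn::VarStore::new(device); // unet weights
        // ... build the models and load the .ot weight files here ...
        vs.half();
        vs_ae.half();
        vs_unet.half();

        // The random latents also need to be fp16.
        let _latents = Tensor::randn(&[1, 4, 64, 64], (Kind::Half, device));
    }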

This is really just a hack that forces Kind::Half everywhere; we should do it in a configurable way, which is why I'm just explaining the hack instead of creating a pull request. Any preferences on how to implement this the right way?

With the hack in place and using sliced attention with a slice size of 1, I've compared it to Automatic1111 with the xformers attention. In this case we only lose ~25% of performance, or maybe even less. We also lose a little bit of maximum image size due to the higher memory needs. But in any case, I'm very satisfied with the sliced attention performance, even if supporting xformers would be a very welcome optimization.
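
For anyone unfamiliar with the idea, here is a rough illustrative sketch of attention computed slice by slice over the batch*heads dimension. This is not the actual diffusers-rs implementation; the input shape (batch * heads, seq_len, head_dim) and names are assumptions:

    use tch::Tensor;

    // Only one slice of attention weights is materialized at a time, trading
    // some speed for a much smaller peak memory footprint.
    fn sliced_attention(query: &Tensor, key: &Tensor, value: &Tensor, slice_size: i64) -> Tensor {
        let (batch_heads, _seq_len, head_dim) = query.size3().unwrap();
        let scale = 1.0 / (head_dim as f64).sqrt();
        let mut outputs = Vec::new();
        let mut start = 0;
        while start < batch_heads {
            let len = slice_size.min(batch_heads - start);
            let q = query.narrow(0, start, len);
            let k = key.narrow(0, start, len);
            let v = value.narrow(0, start, len);
            let attn = (q.matmul(&k.transpose(-1, -2)) * scale).softmax(-1, q.kind());
            outputs.push(attn.matmul(&v));
            start += len;
        }
        Tensor::cat(&outputs, 0)
    }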

LaurentMazare commented on May 29, 2024

Thanks for all the details, that's very interesting and it would be great to support GPUs with less memory. I have an 8GB GPU and was only able to run the code on the CPU.
Based on your details and links, I was able to get the code to run using the fp16 weights for the unet (available here) and making a couple of changes so that only the unet runs on the GPU:

PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 \
    cargo run --example stable-diffusion --features clap -- --cpu-for-vae --unet-weights data/unet-fp16.ot

Fwiw, autocast is also available in tch, but somehow it did not seem to help much; I probably messed something up and will have to dig further.

geocine commented on May 29, 2024

8-bit optimization could help as well: https://github.com/TimDettmers/bitsandbytes

siriux commented on May 29, 2024

Thanks, that's great. I got it working yesterday in fp16, but my solution had a few hacks; your solution is much cleaner.

I can also confirm that it works for me with the vae on the GPU (using its fp16 version), and that clip works in fp16 too, also on the GPU. I had to force clip onto the GPU because you didn't include the option; is there a reason for only allowing it on the CPU?

So it works for me with all fp16 and on my GPU (2070 Super with 8GB).

I'm using everything from v1.5 of stable diffusion (https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/fp16), including the clip model from text_encoder. It's probably better to suggest sourcing the clip model from there in the example too.

LaurentMazare commented on May 29, 2024

The main reason I've put the clip model on the cpu by default is that I find it runs very quickly compared to the rest of the pipeline, so I didn't see much of an advantage to running it on the gpu, where it would use a bit of memory. It also seemed a bit painful for users who want to run everything on the cpu to have to specify 3 flags. Anyway, I've tweaked the cpu flag a bit so that users can set it to -cpu all, or -cpu vae -cpu clip, etc.
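
For illustration, a repeatable cpu flag along those lines could be wired up with clap roughly as follows. This is a sketch assuming clap's derive API; the flag name, accepted values and helper are illustrative and may not match the actual example code:

    use clap::Parser;
    use tch::Device;

    #[derive(Parser)]
    struct Args {
        /// Components to run on the CPU: "all", "clip", "vae", "unet".
        #[arg(long, value_name = "COMPONENT")]
        cpu: Vec<String>,
    }

    // Pick the device for a given pipeline component based on the cpu flags.
    fn device_for(args: &Args, component: &str) -> Device {
        if args.cpu.iter().any(|c| c == "all" || c == component) {
            Device::Cpu
        } else {
            Device::cuda_if_available()
        }
    }

    fn main() {
        let args = Args::parse();
        println!("clip runs on {:?}", device_for(&args, "clip"));
    }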

LaurentMazare commented on May 29, 2024

Right, it would be better to use fp16 all the way. I actually mentioned earlier giving autocast a try, which is probably what the Python version does too, but somehow the generated images look bad there. We should probably try to get to the bottom of this, as autocast is fairly nice and should ensure that fp16 is used where appropriate (and fp32 is still used on some small bits that require more precision).
You can see an example of how to use autocast in this test snippet.
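
For reference, a minimal hedged sketch in the spirit of that test: the ops inside the closure run under autocast, so on CUDA the matmul may run in fp16 even though the inputs are fp32 (exact behaviour and the autocast signature can differ between tch versions):

    use tch::{Device, Kind, Tensor};

    fn main() {
        let device = Device::cuda_if_available();
        let xs = Tensor::randn(&[4, 4], (Kind::Float, device));
        // Everything inside the closure runs with autocast enabled.
        let ys = tch::autocast(true, || xs.matmul(&xs));
        println!("input kind: {:?}, output kind: {:?}", xs.kind(), ys.kind());
    }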

siriux commented on May 29, 2024

I just saw your comment; autocast seems great, it's probably what we need. I'll have a look tomorrow.

In the meantime, I've created this draft PR that includes my changes for Kind::Half as well as other things for img2img, just in case anyone wants to try before we get autocast working: #6

Emulator000 commented on May 29, 2024

Based on your details and links, I was able to get the code to run using the fp16 weights for the unet (available here) and making a couple of changes so that only the unet runs on the GPU.

Unfortunately those weights are for Stable Diffusion v1.5 and I think they are not compatible with v2.x. I just tried (I have an RTX 2070 Super too) and I get this error at runtime:

Building the Clip transformer.
Building the autoencoder.
Building the unet.
Error: cannot find the tensor named up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias in data/unet_v1.5_fp16.ot

Any hint on how to get it running, or can I sadly only use the CPU version? 😢

LaurentMazare commented on May 29, 2024

Have you tried the weights for the 2.1 version? I would guess they are here on huggingface, though I haven't tried.

Emulator000 commented on May 29, 2024

Thanks @LaurentMazare!

I tried just now, converting from the original weights, but now I get an out-of-memory error:

CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 7.78 GiB total capacity; 3.52 GiB already allocated; 2.62 GiB free; 3.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is this the fp16 version?

LaurentMazare commented on May 29, 2024

I think so, as it's in the fp16 branch of the huggingface repo. You can probably check by loading it in Python and looking at the dtype reported by torch.
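
Alternatively, the kinds stored in an .ot file can be inspected directly with tch rather than in Python; a small sketch (the file path here is illustrative):

    use tch::Tensor;

    // Print the dtype of every tensor stored in an .ot weight file.
    fn main() -> Result<(), tch::TchError> {
        for (name, tensor) in Tensor::load_multi("data/unet-fp16.ot")? {
            println!("{name}: {:?}", tensor.kind());
        }
        Ok(())
    }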
