laurentmazare / diffusers-rs Goto Github PK

View Code? Open in Web Editor NEW

487.0 487.0 56.0 1.45 MB

An implementation of the diffusers api in Rust

License: Apache License 2.0

Rust 96.84% Shell 1.44% Python 1.72%

diffusers-rs's People

Contributors

Stargazers

Watchers

diffusers-rs's Issues

Implement GPU memory optimization

When stable diffusion was released, the most requested features were always about reducing GPU requirements as this makes it available to more users with cheaper GPUs. I think this is one important feature to implement in diffusers-rs.

Here it's a list of the diffusers library optimizations: https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx

The main memory reduction feature I think would be allowing half precision models (fp16), that AFAIK is not yet implemented (but I might have missed the conversion somewhere).

I don't know if there is a better option to set the fp16 requirement in advance, but otherwise just loading the VarStore in a CPU device, calling half() on it, and moving it to the GPU using set_device should do the trick. This can be done in the example without touching the library I think.

Then, the next important thing would be Sliced Attention. This should be really straight forward, as we can see here it's just splitting the normal attention into slices and computing each slice at a time in a loop: https://github.com/huggingface/diffusers/blob/08a6dc8a5840e0cc09e65e71e9647321ab9bb254/src/diffusers/models/attention.py#L526

Then, is just a mater of exposing an slice_size configuration element on CrossAttention, and add an attention_slice_size to anyone that uses it.

Supporting Flash Attention from https://github.com/HazyResearch/flash-attention would be nice, but it's much more complicated as needs to be compiled for Cuda and it only works with Nvidia Cards. But this optimization is the main one used by the xformers library related to attention and provides a very good speedup in many cases.

Finally, the last important thing missing would be to allow move some models to the CPU when not in use (see huggingface/diffusers#850), or even run them on the CPU as needed only leaving the unet on the GPU (see huggingface/diffusers#537).

They use the accelerate library, but I think this can be implemented directly on tch. I think just providing a set_device method on all the models would be enough, as everything else can be handled directly on the example or the user code.

What do you think? I'm planning to play a little bit with the diffusers-rs library and stable diffusion the next weeks and I can try to implement a few optimizations, but my knowledge on ML is still very basic.

Regarding Inference Time

I have tried with rust implementation of Stable diffusion v2 on A100 gpu with 40gb of ram. Normal stable diffusion pipeline from huggingface takes around 7-8s to generate an image whereas rust implementation takes around 12-13s. It will be really helpful if someone can explain that why is huggingface taking less time compared to rust implementation or am I missing something while running rust implementation?

Thanks!!

[Question] How do I enable CUDA?

Thanks so much for your Rust implementation, and especially for the Stable Diffusion example.

I have a noob question: how do I enable CUDA?

My problem:

Even though my machine (on Windows):

has RTX 3090 graphic cards,
has CUDA installed,

when running the Stable Diffusion example without the arg --cpu it still resorts to my CPU instead.

Would someone please kindly help?

Integration with Stable Diffusion XL 1.0 ?

Hello,

That might be a silly question, but will there be future integrations with SDXL in a near future? It seems to me that the XL version is pretty different so I figured we couldn't use it as is.

Thanks for your help!

m1 mac gpu

Is there a way to specify gpu use for m1 macs? It's using my cpu when generating but I have more than 8GB memory, I'm using the command:
cargo run --example stable-diffusion --features clap -- --prompt "A very rusty robot holding a fire torch." --cpu all --sd-version v1-5

How to load a parameter file in safetensors format?

Many model files are saved in safesensor format, how to load such files?

for example: https://civitai.com/models/5041/cheese-daddys-landscapes-mix

And safetensors also have rust api: https://crates.io/crates/safetensors

PytorchStreamReader failed reading zip archive

When trying

cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."

I got

Error: Internal torch error: PytorchStreamReader failed reading zip archive: failed finding central directory
Exception raised from valid at /Users/runner/work/pytorch/pytorch/pytorch/caffe2/serialize/inline_container.cc:183 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x10818f992 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x10818e0aa in libc10.dylib)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 128 (0x11806ddc0 in libtorch_cpu.dylib)
frame #3: caffe2::serialize::PyTorchStreamReader::init() + 315 (0x11806d1cb in libtorch_cpu.dylib)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 261 (0x11806cfe5 in libtorch_cpu.dylib)
frame #5: torch::jit::import_ir_module(std::__1::shared_ptr<torch::jit::CompilationUnit>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, c10::optional<c10::Device>, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&) + 519 (0x1194ed787 in libtorch_cpu.dylib)
frame #6: torch::jit::import_ir_module(std::__1::shared_ptr<torch::jit::CompilationUnit>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, c10::optional<c10::Device>) + 69 (0x1194ed4c5 in libtorch_cpu.dylib)
frame #7: torch::jit::load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, c10::optional<c10::Device>) + 238 (0x1194ee5ce in libtorch_cpu.dylib)
frame #8: at_load_callback_with_device + 148 (0x10790e9f4 in stable-diffusion)
frame #9: tch::wrappers::tensor::Tensor::load_multi_with_device::h76a379da6fb3bc44 + 420 (0x10771b7c4 in stable-diffusion)
frame #10: tch::nn::var_store::VarStore::named_tensors::haabc631e8482797f + 246 (0x10771c826 in stable-diffusion)
frame #11: tch::nn::var_store::VarStore::load_internal::h57d18ea780681827 + 69 (0x10771bff5 in stable-diffusion)
frame #12: tch::nn::var_store::VarStore::load::h7744a82ce429d62d + 214 (0x10771cbd6 in stable-diffusion)
frame #13: diffusers::pipelines::stable_diffusion::StableDiffusionConfig::build_clip_transformer::h6195c2a8233c895b + 239 (0x10773b7af in stable-diffusion)
frame #14: stable_diffusion::run::h8a551142c83d2c48 + 4025 (0x107703c09 in stable-diffusion)
frame #15: stable_diffusion::main::hd0cfa2a2c29f23e0 + 121 (0x107705b19 in stable-diffusion)
frame #16: core::ops::function::FnOnce::call_once::h4135dd5f1b217b3c + 14 (0x1076f3ace in stable-diffusion)
frame #17: std::sys_common::backtrace::__rust_begin_short_backtrace::h8da8413ad3c77f72 + 17 (0x1077016f1 in stable-diffusion)
frame #18: std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h65c09f346b6d0be1 + 20 (0x1076e9584 in stable-diffusion)
frame #19: std::rt::lang_start_internal::haf0419567751b65f + 708 (0x107a8d744 in stable-diffusion)
frame #20: std::rt::lang_start::he551d2857570b3e6 + 55 (0x1076e9557 in stable-diffusion)
frame #21: main + 24 (0x10770a4b8 in stable-diffusion)
frame #22: start + 1 (0x7fff6abcacc9 in libdyld.dylib)

State

This is what my data folder look like and it run just fine for --sd-version v1-5 but get an above error 👆 for 2.1

bpe_simple_vocab_16e6.txt
clip_v2.1.ot
pytorch_model.ot
unet.ot
unet_v2.1.ot
vae.ot
vae_v2.1.ot

Not sure what files or flag I miss for 2.1?

Support for premade checkpoints

Would it be possible to use models based on the CompVis style used by stabilityai and supported in HF diffusers? My personal goals are:

Support for 1.5 and 2.1 ckpt models (converted over to .ot)
Support for third party merges and training checkpoints (based on 1.5 or 2.1so probably would work the same). Analog Diffusion 1.0

I tried the following to convert the file over, and got the names of the tensors using the tensor tools. Maybe these can be extracted and compiled back together?

import numpy as np
import torch

model = torch.load("./data/analog-diffusion-1.0.ckpt")
x = {k: v.numpy() for k, v in model["state_dict"].items()}
np.savez("./data/analog-diffusion-1.0.npz", **x)

cargo run --release --example tensor-tools cp ./data/analog-diffusion-1.0.npz ./data/analog-diffusion-1.0.ot

cargo run --release --example tensor-tools ls ./data/analog-diffusion-1.0.ot

./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.0.0.weight Tensor[[320, 4, 3, 3], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.0.0.bias Tensor[[320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.0.weight Tensor[[1280, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.0.bias Tensor[[1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.2.weight Tensor[[1280, 1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.2.bias Tensor[[1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.norm.weight Tensor[[320], Half]
...
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.out_proj.weight Tensor[[768, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.out_proj.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm1.weight Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm1.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc1.weight Tensor[[3072, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc1.bias Tensor[[3072], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc2.weight Tensor[[768, 3072], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc2.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm2.weight Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm2.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.final_layer_norm.weight Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.final_layer_norm.bias Tensor[[768], Half]

Full list analog-diffusion-1.0.ot.log

Thanks!

Google Colab Notebook to run diffusion experiment on the GPU

Hi,
since my NVidia GPU doesn't have enough memory, I resorted to Google Colab and wrote a notebook to run diffusion experiments on the 16GB GPU available there. Here you can access it. Would you be interested in it, e.g. instructing on how to use it in the README ?

Automate weights setup

Hello @LaurentMazare and thanks for this awesome Rust implementation!
I'm playing around with it but I find the overall setup a little boring as it requires to switch from bash to python a couple of times and to place things in the right directory.

Therefore, I wrote a shell script to automate the whole process. Might you be interested in such a contribution (e.g. under a scripts folder in the root of the repository) ?
Cheers 😃

Support DPM MultiStep Scheduler

Would be great to have support for the DPM MultiStep Scheduler.

Diffusers recommends using it as it's the fastest scheduler at the moment: https://huggingface.co/docs/diffusers/v0.9.0/en/api/pipelines/stable_diffusion_2#available-checkpoints

Link to diffusers python implementation:

https://github.com/huggingface/diffusers/blob/v0.9.0/src/diffusers/schedulers/scheduling_dpmsolver_multistep.py#L57

Feature Request: Negative prompts

Stable Diffusion 2.x is a lot more dependent on negative prompts for good looking outputs than SD 1.5 was.

So it would be very useful if support for negative prompts would be added.

Embed the examples logic into the pipeline

Hi @LaurentMazare,
now that this repository has expanded more and more, I was thinking we could move most of the code of the examples inside a StableDiffusion pipeline, as HF does.

This would also allow us to support more features, such as negative prompts and then support more diffusion pipelines, such as latent diffusion and so on.
Do you think this is a good idea ? 😄

CUDA out of memory on 12GB GPU

My GPU has 12GB memory (11GB free) but I still get CUDA out of memory.

 nvidia-smi
Thu May  4 18:32:15 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN Xp                 Off| 00000000:2B:00.0  On |                  N/A |
| 24%   40C    P0               64W / 250W|   1079MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

 cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."                                                
    Finished dev [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/examples/stable-diffusion --prompt 'A rusty robot holding a fire torch.'
Cuda available: true
Cudnn available: true
MPS available: false
Running with prompt "hello".
Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/30
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 11.87 GiB total capacity; 8.27 GiB already allocated; 1.68 GiB free; 8.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:936 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faacc05a6bb in /home/jarrod/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x2f176 (0x7faacba2f176 in /home/jarrod/libtorch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2fc12 (0x7faacba2fc12 in /home/jarrod/libtorch/lib/libc10_cuda.so)
...
frame #34: <unknown function> + 0x23790 (0x7faacbe3c790 in /usr/lib/libc.so.6)
frame #35: __libc_start_main + 0x8a (0x7faacbe3c84a in /usr/lib/libc.so.6)
frame #36: <unknown function> + 0x4f0f5 (0x55e9bd45e0f5 in ./target/release/examples/stable-diffusion)
")', /home/jroddev/.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.11.0/src/wrappers/tensor_generated.rs:15578:36

Is there a setting/flag that I'm missing? Should 12GB be sufficient to run this?

Add support for Inpainting

This is a feature request to support the inpainting pipeline. I will be probably work on this in the following days, unless somebody comes up with an implementation first.

Stable Diffusion 2.0 support

Will this support Stable Diffusion 2.0?

https://stability.ai/blog/stable-diffusion-v2-release

Example of inpaint doesn't work for Stable Diffusion 2.1

I'm trying the same example for the 2.1 configuration, downloaded the appropriate CLIP, UNET and VAE and converted them correctly but it does not seems to work.

Command:

cargo run --example stable-diffusion-inpaint --features clap -- --sd-version="v2-1" --prompt "Face of a yellow cat, high resolution, sitting on a park bench." --input-image="temp/dog.png" --mask-image="temp/dog_mask.png" --width=512 --height=512

This is the output that i get with the dog/cat example that works perfectly with Stable Diffusion 1.5:

It seems that the inpainted area doensn't correctly populate.

Any possible reason for that? Should the example/code be adapted for some additional steps for the 2.1 version?

Where are the in-progress image files stored?

The readme mentions this:

At each step, some sd_*.png image is generated, the last one sd_30.png should be the least noisy one.

I can't find those files anywhere though. I only see the final file called sd_final.png.

Feature Request: Automatically count how long it takes to generate an image

Currently, it prints out something like this when generating an image:

Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/5
Timestep 1/5
Timestep 2/5
Timestep 3/5
Timestep 4/5
Generating the final image for sample 1/1.

It would be nice if it would print out this instead:

Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/5 | 10 seconds
Timestep 1/5 | 9 seconds
Timestep 2/5 | 10 seconds
Timestep 3/5 | 11 seconds
Timestep 4/5 | 10 seconds
Generating the final image for sample 1/1
Total Elapsed Time: 52 seconds

That would make it easier to compare different settings, like different schedulers, different SD version, CPU vs GPU, etc, and see how much speed difference there is.

Implement (main) missing schedulers

This issue aims at tracking the main schedulers missing in this Rust implementation:

Euler Discrete (Python version)
Euler Ancestral Discrete (Python version)
LMS Discrete (Python version)
Heun Discrete (Python version)
K-DPM-2 Discrete (Python Version)
K-DPM-2 Ancestral Discrete (Python Version)
PNDM ¹ (Python version)

I will be working on the first four in the following days/weeks, if you agree. Just wanted to discuss with you the following aspect: they all require a linear interpolator. The python version uses np.interpolate(.). Do you prefer to introduce an extra dependency using a third-party interpolation lib or to implement "manually" a linear interpolator and to place it, for instance, at package level in mod.rs ?

Feel free to update this issue adding any other scheduler you want include😄.

only the relevant part for stable diffusion, i e. skipping Runge-Kutta steps. ↩

Implement Img2Img

I'm going to start understanding the Img2Img pipeline and what is needed to port it.

Also, are you interested in keep growing the SD example or you prefer to keep it simple. I'm asking because for my use case of building an SD based image editor I'm going to need to create a separate library tailed for my needs, but maybe is interesting too also include some features like Img2Img in the SD example.

Error: The system cannot find the file specified. (os error 2)

     Running `target\debug\examples\stable-diffusion.exe --prompt "A very rusty robot holding a fire torch." --cpu all`
Cuda available: false
Cudnn available: false
MPS available: false
Running with prompt "A very rusty robot holding a fire torch.".
Building the Clip transformer.
Error: The system cannot find the file specified. (os error 2)
error: process didn't exit successfully: `target\debug\examples\stable-diffusion.exe --prompt "A very rusty robot holding a fire torch." --cpu all` (exit code: 1)

This is the output I get after running cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch." in cmd. I am on windows. I ran scripts/download_weights.sh in WSL and moved the /scripts/data directory to /data beforehand as well. I don't know what file it's referring to by "Error: The system cannot find the file specified. (os error 2)", help would be appreciated!

Add support for padding_mode config option in the models

As we can see here huggingface/diffusers#556, adding support for padding_mode allows to generate images that can be seamlessly tiled.

Ideally, we would like to be able to reconfigure the padding_mode of all the conv networks without needed to recreate the pipeline.

Fatal error LNK1120: 2 unresolved externals

I'm trying to run this again, but it does not compile. I have cloned this repo with
git clone https://github.com/LaurentMazare/diffusers-rs.git
and then run
cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."
like the readme says. It then compiles everything, until it reaches
Compiling diffusers v0.2.0
and then fails with
error: linking with `link.exe` failed: exit code: 1120
and the actual issue it complains about is this:

  = note:    Creating library C:\Other\StableDiffusion\Rust\Testing\fromgithub\diffusers-rs\target\debug\examples\stable_diffusion.lib and object C:\Other\StableDiffusion\Rust\Testing\fromgithub\diffusers-rs\target\debug\examples\stable_diffusion.exp
          libtorch_sys-56f4e8e25c10d9d7.rlib(torch_api.o) : error LNK2001: unresolved external symbol __imp___tls_index_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
          libtorch_sys-56f4e8e25c10d9d7.rlib(torch_api.o) : error LNK2001: unresolved external symbol __imp___tls_offset_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
          C:\Other\StableDiffusion\Rust\Testing\fromgithub\diffusers-rs\target\debug\examples\stable_diffusion.exe : fatal error LNK1120: 2 unresolved externals

Am I doing something wrong, or is the current main branch maybe just not compiling? I know I was able to run this fine in the past.

Feature Request: Add flag for how many CPU threads should be used

When running this on the CPU, it would be nice to have a flag for setting how many threads it should be allowed to use. Sometimes it might make sense to not use all available threads, and sometimes using more than the available threads might also make sense.

Currently, I see that if I run this on my CPU (Ryzen 3950X), I get a consistent 58% CPU usage, so it appears to use too few threads for me by default. It would be nice to be able to bring it up to 100%.

Add Scheduler trait/enum

Right now we are adding the schedulers, but it is difficult to work with since swapping the scheduler doesn't work well. This also slows down testing and evaluation of the schedulers, as a separate script needs to be made each time to test the samplers. I also was implementing these into an application and swapping the schedulers wasn't working (due to different types at runtime).

I experimented some in adding a trait, so we can use impl Scheduler. Came up with the following, but causes some points of contention.

steps need a &mut self for some schedulers
steps needs a timestep of different number types (f64 and usize) currently
add_noise doesn't have a noise input on each one
timesteps returns a slice of usize or f64 (or maybe other ones?)

pub trait Scheduler {
    fn step<T: SomeTraitThatWouldTakef64AndUsize>(&mut self, model_output: &Tensor, timestep: T, sample: &Tensor) -> Tensor;
    fn timesteps(&self) -> &[usize];
    fn add_noise(&self, original_samples: &Tensor, noise: Tensor, timestep: usize) -> Tensor;
    fn init_noise_sigma(&self) -> f64;
    fn scale_model_input(&self, sample: Tensor, timestep: usize) -> Tensor;
}

And then I could do the following (I'm still learning traits):

sample<T: Scheduler>(
    ...,
    mut scheduler: T
)

And/or we could also do a Scheduler enum.

enum SamplerScheduler {
    Dpmpp2m(dpmsolver_multistep::DPMSolverMultistepScheduler),
    Dpmpp2s(dpmsolver_singlestep::DPMSolverSinglestepScheduler),
    Ddim(ddim::DDIMScheduler),
    Ddpm(ddpm::DDPMScheduler),
    EulerDiscrete(euler_discrete::EulerDiscreteScheduler),
}

I'm not 100% sure what's the best approach.

CUDA/GPU Not Working.

I've been testing out the default test script and every time is says:

Cuda available: false
Cudnn available: false
MPS available: false

Even though I've testing in pytorch and tensorflow that my gpu is available and shown.

ControlNet support?

I think it would be nice to have ControlNet support.
Not sure how hard it gonna take for this task. 🤔

STATUS_DLL_NOT_FOUND

Hello, I'm a window user trying to enable CUDA in my otherwise functionning app, I have installed locally PyTorch 2.0.0 + the NVIDIA packages, and I can check in a python terminal that CUDA is enabled (torch.cuda.is_available() give me True). My current problem has to do with Windows 11 it seems, I get this error when trying to launch a cargo run :
exit code: 0xc0000135, STATUS_DLL_NOT_FOUND

The program doesn't even launch, so the main isn't called and I can't debug anything beyond that stage. I'm trying to read about this error but I could use some help, I understand that I have to download and install missing DLL, but which one to target? Did anyone have the same problem and could give me a hint about how to solve this issue ?

Thanks!

WebAssembly inference example

Hi, is there an example of using stable diffusion in WebAssembly run time using Wagi / Spin / WasmEdge etc. using GPU and serving modules over https requests? If not, how can I achieve such use case? any help would be appreciated, thanks.

Bad distorted picture using the in-painting example provided

I noticed the exactly identical issue to this one in runwayml and I'm asking if there is something that we are missing.

I also tried to adapt the Python code from here mentioned also here but without any luck.

I just receive some weird runtime errors like:

Given groups=1, weight of size [320, 9, 3, 3], expected input[2, 4, 64, 64] to have 9 channels, but got 4 channels instead

Someone has some suggestion on how can I use the in-painting correctly?

Tracking issue for SD ecosystem feature parity

The intention for this issue is to provide a comprehensive outline of all the core features and capabilities other distributions of Stable Diffusion (primarily A1111) provide. It's a big list, but not all are nearly as high priority as others. Some items in this outline will be turned into GitHub issues for discussing and tracking progress on implementation. Please comment on this issue to suggest additions, clarifications, and sub-features and I'll aim to keep the outline up to date.

Generation methods

Txt2img
Img2img
- In/outpainting
  - Choice of starting with existing image, smeared surrounding colors, latent noise, and latent nothing
- Denoising strength (this is already implemented?)
Depth2Img (via txt2img and img2img)
Regional prompts/latent couple/two shot diffusion (a unique prompt per grid area, like the left half and right half of the image)

Generation parameters

Viewing the image generation progress as it runs (this is very high priority for Graphite)
Negative prompts
CFG scale (is this already implemented?)
Non-square multiple-of-64 resolutions
- Widths and heights as multiples of 8 instead of 64
Infinite prompt token length
Multiple prompts (like space ship (sci-fi) vs. space AND ship (sailing ship in space))
Prompt token weighting (like (beautiful:1.5) tree (with autumn leaves:0.8))
Seed resize (pin a seed and its resolution, then generate at a different resolution or aspect ratio and keep mostly the same image)

Model support

Stylization

ControlNet

Some features are described at https://github.com/Mikubill/sd-webui-controlnet but I don't currently have time to make a list of them. Help with such a list would be appreciated.

This method of promptless inpainting: https://www.reddit.com/r/StableDiffusion/comments/13w28bi/controlnet_and_a1111_devs_discussing_new_inpaint/

Optimizations

VRAM reduction strategies, things like xformers and floating point precision? I don't understand this stuff enough to really get it. Also other methods will remove certain parts of the pipeline from VRAM after that stage has been completed which trades time for VRAM requirements. I'll need help creating a list of out this.

Upscaling

Some upscalers are entirely separate models and are thus likely out of scope. Other upscalers, I think, are part of the SD pipeline. Some are scripts, but I think others are actual models which require being implemented in the actual pipeline? Those ones should probably be included here, but I need help creating a list.

Sampling methods

Other models

Upscaling (ESRGAN, etc.)
CLIP interrogator
Restore faces (GFPGAN, CodeFormer)
(probably more?)

Did I miss something? Probably! Hopefully the community can help me keep this list updated so it's as comprehensive as possible. Thanks ❤️.

Ideally these capabilities would be modular, allowing for composability and opting in and out of specific features at will for any desired image generation pipeline. In our use case with Graphite, we want to put different options into nodes within a node graph so they are user-configurable. (I should also mention that keeping the MIT/Apache 2.0 license is important for Graphite, since our project is also Apache 2.0, so I'd humbly request that some care be taken to not copy from copyleft code which would force this library to change its license, thanks 😃).

Error: No such file or directory

Running

cargo run --example stable-diffusion --features clap -- --prompt "A very rusty robot holding a fire torch." --cpu all

Yields

Error: No such file or directory (os error 2)

But doesn't tell me what directory is missing. I assume it can't find the model weights but have no idea how to fix. Can this error me made more clear?

Loading of text embeddings in pt format?

I was wondering on how is possible to load embeddings in pt format?

If I try to load it as Clip I get an error:

Expected Tensor but got GenericDict

Also it would be nice to support more than one external text embedding.

Benchmarks?

Hi @LaurentMazare,
have you benchmarked this against huggingface's python diffusers? Should I expected it to be any faster? If yes, can you give me some intuition on the reason?

An inference step number which evenly divides the training step number causes out of bounds indexing

DDIMScheduler's first time step is effectively inference_steps * train_timesteps / inference_steps, which if there is no truncation, puts it exactly at train_timesteps, just out of bounds.

thread '<unnamed>' panicked at 'index out of bounds: the len is 1000 but the index is 1000'
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\panicking.rs:584
   1: core::panicking::panic_fmt
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library\core\src\panicking.rs:142
   2: core::panicking::panic_bounds_check
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library\core\src\panicking.rs:84
   3: core::slice::index::impl$2::index<f64>
             at /rustc/897e37553bba8b42751c67658967889d11ecd120\library\core\src\slice\index.rs:250
   4: core::slice::index::impl$0::index
             at /rustc/897e37553bba8b42751c67658967889d11ecd120\library\core\src\slice\index.rs:18
   5: alloc::vec::impl$16::index<f64,usize,alloc::alloc::Global>
             at /rustc/897e37553bba8b42751c67658967889d11ecd120\library\alloc\src\vec\mod.rs:2628
   6: diffusers::schedulers::ddim::DDIMScheduler::step
             at diffusers-0.1.2\src\schedulers\ddim.rs:74

CUDNN_STATUS_NOT_INITIALIZED

Building the autoencoder.
Building the unet.
Timestep 0/30
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("cuDNN error: CUDNN_STATUS_NOT_INITIALIZED\nException raised from createCuDNNHandle at /build/python-pytorch/src/pytorch-1.13.0-cuda/aten/src/ATen/cudnn/Handle.cpp:9 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x92 (0x7fdb34705bf2 in /usr/lib/libc10.so)\nframe #1: <unknown function> + 0xd89413 (0x7fdaeab89413 in /usr/lib/libtorch_cuda.so)\nframe #2: at::native::getCudnnHandle() + 0x7b8 (0x7fdaeaeded18 in /usr/lib/libtorch_cuda.so)\nframe #3: <unknown function> + 0x1055f89 (0x7fdaeae55f89 in /usr/lib/libtorch_cuda.so)\nframe #4: <unknown function> + 0x10505f4 (0x7fdaeae505f4 in /usr/lib/libtorch_cuda.so)\nframe #5: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0xad (0x7fdaeae50a2d in /usr/lib/libtorch_cuda.so)\nframe #6: <unknown function> + 0x3882bf4 (0x7fdaed682bf4 in /usr/lib/libtorch_cuda.so)\nframe #7: <unknown function> + 0x3882cad (0x7fdaed682cad in /usr/lib/libtorch_cuda.so)\nframe #8: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x226 (0x7fdae0416e46 in /usr/lib/libtorch_cpu.so)\nframe #9: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1097 (0x7fdadf821ff7 in /usr/lib/libtorch_cpu.so)\nframe #10: <unknown function> + 0x258518e (0x7fdae078518e in /usr/lib/libtorch_cpu.so)\nframe #11: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x299 (0x7fdadff96759 in /usr/lib/libtorch_cpu.so)\nframe #12: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x111 (0x7fdadf814bd1 in /usr/lib/libtorch_cpu.so)\nframe #13: <unknown function> + 0x2584c3e (0x7fdae0784c3e in /usr/lib/libtorch_cpu.so)\nframe #14: at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x15d (0x7fdadff43b3d in /usr/lib/libtorch_cpu.so)\nframe #15: <unknown function> + 0x43b0226 (0x7fdae25b0226 in /usr/lib/libtorch_cpu.so)\nframe #16: <unknown function> + 0x43b1060 (0x7fdae25b1060 in /usr/lib/libtorch_cpu.so)\nframe #17: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x247 (0x7fdadff95a27 in /usr/lib/libtorch_cpu.so)\nframe #18: at::native::conv2d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x20d (0x7fdadf81908d in /usr/lib/libtorch_cpu.so)\nframe #19: <unknown function> + 0x273d3f6 (0x7fdae093d3f6 in /usr/lib/libtorch_cpu.so)\nframe #20: at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x202 (0x7fdae053fad2 in /usr/lib/libtorch_cpu.so)\nframe #21: <unknown function> + 0x2f359e (0x55ad31d2559e in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #22: <unknown function> + 0x2fe5da (0x55ad31d305da in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #23: <unknown function> + 0x2b79bd (0x55ad31ce99bd in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #24: <unknown function> + 0x2bd561 (0x55ad31cef561 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #25: <unknown function> + 0x2e1ad0 (0x55ad31d13ad0 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #26: <unknown function> + 0x2c05f1 (0x55ad31cf25f1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #27: <unknown function> + 0xd90f5 (0x55ad31b0b0f5 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #28: <unknown function> + 0x96438 (0x55ad31ac8438 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #29: <unknown function> + 0x975a1 (0x55ad31ac95a1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #30: <unknown function> + 0xb496b (0x55ad31ae696b in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #31: <unknown function> + 0xa10ae (0x55ad31ad30ae in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #32: <unknown function> + 0xacbf1 (0x55ad31adebf1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #33: <unknown function> + 0x621aee (0x55ad32053aee in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #34: <unknown function> + 0xacbc0 (0x55ad31adebc0 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #35: <unknown function> + 0x9b83c (0x55ad31acd83c in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #36: <unknown function> + 0x23290 (0x7fdade03c290 in /usr/lib/libc.so.6)\nframe #37: __libc_start_main + 0x8a (0x7fdade03c34a in /usr/lib/libc.so.6)\nframe #38: <unknown function> + 0x91905 (0x55ad31ac3905 in $HOME.cargo-target/debug/examples/stable-diffusion)\n")', $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/wrappers/tensor_generated.rs:6457:72
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: core::result::unwrap_failed
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1785:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1078:23
   4: tch::wrappers::tensor_generated::<impl tch::wrappers::tensor::Tensor>::conv2d
             at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/wrappers/tensor_generated.rs:6457:9
   5: <tch::nn::conv::Conv<[i64; 2]> as tch::nn::module::Module>::forward
             at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/nn/conv.rs:216:9
   6: tch::nn::module::<impl tch::wrappers::tensor::Tensor>::apply
             at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/nn/module.rs:47:9
   7: diffusers::models::unet_2d::UNet2DConditionModel::forward
             at ./src/models/unet_2d.rs:237:18
   8: stable_diffusion::run
             at ./examples/stable-diffusion/main.rs:167:30
   9: stable_diffusion::main
             at ./examples/stable-diffusion/main.rs:200:9
  10: core::ops::function::FnOnce::call_once
             at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Not sure if this is regarding the library or some other issue.

DirectML Support

Hello!
Are there any plans for additional backends?
I'd love to see DirectML support to allow it to be used on nvidia/amd cards without requiring installing additional drivers (on Windows/Xbox, at least).

In a similar vein for Apple, CoreML.

Thanks again for your work on this project!

Please consider packaging the final weights

In the readme you mention:

If there is some interest in having the final weight files available, open an issue and we could consider packaging them.

Having to run a bunch of python code first to convert the weights manually is quite annoying, so yes, please consider packaging them in an easy download in the right format.

I'd really like to use this, but doing all those steps manually for converting the weights is too difficult.

Running the 1.5 model is not possible without having the 2.1 weights installed

As far as I understand it, this command should generate an image with SD 1.5:

cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."

and this command should generate an image with SD 2.1:

cargo run --example stable-diffusion-2 --features clap -- --prompt "A rusty robot holding a fire torch."

I am trying to run the first command, that should use SD 1.5. But I get this error:

Error: Internal torch error: open file failed because of errno 2 on fopen: No such file or directory, file path: data/clip_v2.1.ot

I have the weights for 1.5 in the data folder, but no weights for 2.1.

Feature Request: Add flag for height and width of the image

It would be very useful to have a simple flag like --H 512 and --W 512 for setting the resolution to generate an image at.

Cannot link when used together with cxx-qt crate

I'm trying to build a simple Qt gui that uses diffusers-rs. I can compile and run both the diffusers example and my cxx-qt gui app separately but when adding both cxx-qt and diffusers in Cargo.toml linking fails.

As a test, I modified the diffusers Cargo.toml as follows with no other changes:

$ git diff Cargo.toml
diff --git a/Cargo.toml b/Cargo.toml
index db4b4d3..70dce57 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -17,6 +17,9 @@ exclude = [
]

[dependencies]
+cxx = "1.0"
+cxx-qt = "0.5"
+cxx-qt-lib = "0.5"
anyhow = "1"
thiserror = "1"
regex = "1.6.0"
@@ -48,3 +51,6 @@ doc-only = ["tch/doc-only"]

[package.metadata.docs.rs]
features = ["doc-only"]
+
+[build-dependencies]
+cxx-qt-build = "0.5"

which resulted in the following error:

= note: /usr/bin/ld: /home/mneilly/RustProjects/third_party/DL/diffusers-rs/target/debug/deps/libtorch_sys-34524f8ee2d3acbe.rlib(torch_api_generated.o): undefined reference to symbol '_ZN3c104warnERKNS_7WarningE'
/usr/bin/ld: /home/mneilly/RustProjects/sandboxes/tch-sandbox/pytorch-2.0/lib/python3.11/site-packages/torch/lib/libc10.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
= note: some extern functions couldn't be found; some native libraries may need to be installed or have their path specified
= note: use the -l flag to specify native libraries to link
= note: use the cargo:rustc-link-lib directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)
error: could not compile diffusers (example "stable-diffusion") due to previous error

Note that leaving out cxx-qt-lib causes the linker error to go away.

I've attached the full output from the build.

diffqt.log