laurentmazare / diffusers-rs
An implementation of the diffusers API in Rust
License: Apache License 2.0
When stable diffusion was released, the most requested features were always about reducing GPU requirements as this makes it available to more users with cheaper GPUs. I think this is one important feature to implement in diffusers-rs.
Here is a list of the diffusers library optimizations: https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx
The main memory reduction feature, I think, would be allowing half-precision models (fp16), which AFAIK is not yet implemented (but I might have missed the conversion somewhere).
I don't know if there is a better option to set the fp16 requirement in advance, but otherwise just loading the VarStore in a CPU device, calling half() on it, and moving it to the GPU using set_device should do the trick. This can be done in the example without touching the library I think.
Then, the next important thing would be Sliced Attention. This should be really straightforward: as we can see here, it just splits the normal attention into slices and computes one slice at a time in a loop: https://github.com/huggingface/diffusers/blob/08a6dc8a5840e0cc09e65e71e9647321ab9bb254/src/diffusers/models/attention.py#L526
Then it's just a matter of exposing a slice_size configuration element on CrossAttention, and adding an attention_slice_size to everything that uses it.
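As a rough illustration of the slicing idea (not the diffusers or diffusers-rs code), here is a minimal std-only Rust sketch. It assumes each attention head is a small row-major matrix; the names Matrix, attention_head and sliced_attention are made up for this example. In a tensor library each slice would presumably be one batched matmul, so only slice_size score matrices would be alive at a time.

```rust
/// Row-major [seq][dim] matrix; a stand-in for a real tensor type.
type Matrix = Vec<Vec<f32>>;

/// Scaled dot-product attention for a single head: softmax(q k^T / sqrt(d)) v.
fn attention_head(q: &Matrix, k: &Matrix, v: &Matrix) -> Matrix {
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    q.iter()
        .map(|q_row| {
            // Scores of this query row against every key row.
            let scores: Vec<f32> = k
                .iter()
                .map(|k_row| q_row.iter().zip(k_row).map(|(a, b)| *a * *b).sum::<f32>() * scale)
                .collect();
            // Numerically stable softmax over the scores.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (*s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            // Weighted sum of the value rows.
            let mut out = vec![0.0f32; v[0].len()];
            for (w, v_row) in exps.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(v_row) {
                    *o += *w / sum * *x;
                }
            }
            out
        })
        .collect()
}

/// Processes the heads `slice_size` at a time instead of all at once.
/// The result is the same as the unsliced computation; only the amount of
/// work (and intermediate memory) per loop iteration changes.
fn sliced_attention(heads: &[(Matrix, Matrix, Matrix)], slice_size: usize) -> Vec<Matrix> {
    assert!(slice_size > 0, "slice_size must be at least 1");
    let mut outputs = Vec::with_capacity(heads.len());
    for slice in heads.chunks(slice_size) {
        let part: Vec<Matrix> = slice.iter().map(|(q, k, v)| attention_head(q, k, v)).collect();
        outputs.extend(part);
    }
    outputs
}
```

Calling sliced_attention(&heads, heads.len()) corresponds to the unsliced case, and smaller slice sizes should give the same output with a lower peak of intermediate data.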
Supporting Flash Attention from https://github.com/HazyResearch/flash-attention would be nice, but it's much more complicated as it needs to be compiled for CUDA and only works with NVIDIA cards. But this optimization is the main attention-related one used by the xformers library and provides a very good speedup in many cases.
Finally, the last important missing piece would be allowing some models to be moved to the CPU when not in use (see huggingface/diffusers#850), or even running them on the CPU as needed, leaving only the unet on the GPU (see huggingface/diffusers#537).
They use the accelerate library, but I think this can be implemented directly on tch. I think just providing a set_device method on all the models would be enough, as everything else can be handled directly on the example or the user code.
What do you think? I'm planning to play a little bit with the diffusers-rs library and Stable Diffusion over the next few weeks and I can try to implement a few optimizations, but my knowledge of ML is still very basic.
I have tried the Rust implementation of Stable Diffusion v2 on an A100 GPU with 40 GB of RAM. The normal Stable Diffusion pipeline from Hugging Face takes around 7-8s to generate an image, whereas the Rust implementation takes around 12-13s. It would be really helpful if someone could explain why Hugging Face takes less time than the Rust implementation. Or am I missing something when running the Rust implementation?
Thanks!!
Thanks so much for your Rust implementation, and especially for the Stable Diffusion example.
I have a noob question: how do I enable CUDA?
Even though my machine (on Windows):
when running the Stable Diffusion example without the arg --cpu
it still resorts to my CPU instead.
Would someone please kindly help?
Hello,
That might be a silly question, but will SDXL be supported in the near future? It seems to me that the XL version is pretty different, so I figured we couldn't use it as is.
Thanks for your help!
Is there a way to specify GPU use for M1 Macs? It's using my CPU when generating, but I have more than 8GB of memory. I'm using the command:
cargo run --example stable-diffusion --features clap -- --prompt "A very rusty robot holding a fire torch." --cpu all --sd-version v1-5
Many model files are saved in the safetensors format. How can such files be loaded?
For example: https://civitai.com/models/5041/cheese-daddys-landscapes-mix
safetensors also has a Rust API: https://crates.io/crates/safetensors
When trying
cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."
I got
Error: Internal torch error: PytorchStreamReader failed reading zip archive: failed finding central directory
Exception raised from valid at /Users/runner/work/pytorch/pytorch/pytorch/caffe2/serialize/inline_container.cc:183 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x10818f992 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x10818e0aa in libc10.dylib)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 128 (0x11806ddc0 in libtorch_cpu.dylib)
frame #3: caffe2::serialize::PyTorchStreamReader::init() + 315 (0x11806d1cb in libtorch_cpu.dylib)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 261 (0x11806cfe5 in libtorch_cpu.dylib)
frame #5: torch::jit::import_ir_module(std::__1::shared_ptr<torch::jit::CompilationUnit>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, c10::optional<c10::Device>, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > >&) + 519 (0x1194ed787 in libtorch_cpu.dylib)
frame #6: torch::jit::import_ir_module(std::__1::shared_ptr<torch::jit::CompilationUnit>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, c10::optional<c10::Device>) + 69 (0x1194ed4c5 in libtorch_cpu.dylib)
frame #7: torch::jit::load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, c10::optional<c10::Device>) + 238 (0x1194ee5ce in libtorch_cpu.dylib)
frame #8: at_load_callback_with_device + 148 (0x10790e9f4 in stable-diffusion)
frame #9: tch::wrappers::tensor::Tensor::load_multi_with_device::h76a379da6fb3bc44 + 420 (0x10771b7c4 in stable-diffusion)
frame #10: tch::nn::var_store::VarStore::named_tensors::haabc631e8482797f + 246 (0x10771c826 in stable-diffusion)
frame #11: tch::nn::var_store::VarStore::load_internal::h57d18ea780681827 + 69 (0x10771bff5 in stable-diffusion)
frame #12: tch::nn::var_store::VarStore::load::h7744a82ce429d62d + 214 (0x10771cbd6 in stable-diffusion)
frame #13: diffusers::pipelines::stable_diffusion::StableDiffusionConfig::build_clip_transformer::h6195c2a8233c895b + 239 (0x10773b7af in stable-diffusion)
frame #14: stable_diffusion::run::h8a551142c83d2c48 + 4025 (0x107703c09 in stable-diffusion)
frame #15: stable_diffusion::main::hd0cfa2a2c29f23e0 + 121 (0x107705b19 in stable-diffusion)
frame #16: core::ops::function::FnOnce::call_once::h4135dd5f1b217b3c + 14 (0x1076f3ace in stable-diffusion)
frame #17: std::sys_common::backtrace::__rust_begin_short_backtrace::h8da8413ad3c77f72 + 17 (0x1077016f1 in stable-diffusion)
frame #18: std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h65c09f346b6d0be1 + 20 (0x1076e9584 in stable-diffusion)
frame #19: std::rt::lang_start_internal::haf0419567751b65f + 708 (0x107a8d744 in stable-diffusion)
frame #20: std::rt::lang_start::he551d2857570b3e6 + 55 (0x1076e9557 in stable-diffusion)
frame #21: main + 24 (0x10770a4b8 in stable-diffusion)
frame #22: start + 1 (0x7fff6abcacc9 in libdyld.dylib)
This is what my data folder looks like. It runs just fine for --sd-version v1-5, but I get the above error for 2.1:
bpe_simple_vocab_16e6.txt
clip_v2.1.ot
pytorch_model.ot
unet.ot
unet_v2.1.ot
vae.ot
vae_v2.1.ot
Not sure what files or flags I'm missing for 2.1?
Would it be possible to use models based on the CompVis style used by stabilityai and supported in HF diffusers? My personal goals are:
I tried the following to convert the file over, and got the names of the tensors using the tensor tools. Maybe these can be extracted and compiled back together?
import numpy as np
import torch
model = torch.load("./data/analog-diffusion-1.0.ckpt")
x = {k: v.numpy() for k, v in model["state_dict"].items()}
np.savez("./data/analog-diffusion-1.0.npz", **x)
cargo run --release --example tensor-tools cp ./data/analog-diffusion-1.0.npz ./data/analog-diffusion-1.0.ot
cargo run --release --example tensor-tools ls ./data/analog-diffusion-1.0.ot
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.0.0.weight Tensor[[320, 4, 3, 3], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.0.0.bias Tensor[[320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.0.weight Tensor[[1280, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.0.bias Tensor[[1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.2.weight Tensor[[1280, 1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.2.bias Tensor[[1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.norm.weight Tensor[[320], Half]
...
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.out_proj.weight Tensor[[768, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.self_attn.out_proj.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm1.weight Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm1.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc1.weight Tensor[[3072, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc1.bias Tensor[[3072], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc2.weight Tensor[[768, 3072], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.mlp.fc2.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm2.weight Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.11.layer_norm2.bias Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.final_layer_norm.weight Tensor[[768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.final_layer_norm.bias Tensor[[768], Half]
Full list analog-diffusion-1.0.ot.log
Thanks!
Hi,
since my NVIDIA GPU doesn't have enough memory, I resorted to Google Colab and wrote a notebook to run diffusion experiments on the 16GB GPU available there. Here you can access it. Would you be interested in it, e.g. by adding instructions on how to use it to the README?
Hello @LaurentMazare and thanks for this awesome Rust implementation!
I'm playing around with it but I find the overall setup a little tedious, as it requires switching from bash to Python a couple of times and placing things in the right directory.
Therefore, I wrote a shell script to automate the whole process. Would you be interested in such a contribution (e.g. under a scripts folder in the root of the repository)?
Cheers
Would be great to have support for the DPM MultiStep Scheduler.
Diffusers recommends using it as it's the fastest scheduler at the moment: https://huggingface.co/docs/diffusers/v0.9.0/en/api/pipelines/stable_diffusion_2#available-checkpoints
Link to diffusers python implementation:
Stable Diffusion 2.x is a lot more dependent on negative prompts for good-looking outputs than SD 1.5 was.
So it would be very useful if support for negative prompts were added.
Hi @LaurentMazare,
now that this repository has grown, I was thinking we could move most of the code of the examples inside a StableDiffusion pipeline, as HF does.
This would also allow us to support more features, such as negative prompts, and then more diffusion pipelines, such as latent diffusion and so on.
Do you think this is a good idea?
My GPU has 12GB memory (11GB free) but I still get CUDA out of memory.
nvidia-smi
Thu May 4 18:32:15 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN Xp Off| 00000000:2B:00.0 On | N/A |
| 24% 40C P0 64W / 250W| 1079MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."
Finished dev [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/examples/stable-diffusion --prompt 'A rusty robot holding a fire torch.'
Cuda available: true
Cudnn available: true
MPS available: false
Running with prompt "hello".
Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/30
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 11.87 GiB total capacity; 8.27 GiB already allocated; 1.68 GiB free; 8.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:936 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faacc05a6bb in /home/jarrod/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x2f176 (0x7faacba2f176 in /home/jarrod/libtorch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2fc12 (0x7faacba2fc12 in /home/jarrod/libtorch/lib/libc10_cuda.so)
...
frame #34: <unknown function> + 0x23790 (0x7faacbe3c790 in /usr/lib/libc.so.6)
frame #35: __libc_start_main + 0x8a (0x7faacbe3c84a in /usr/lib/libc.so.6)
frame #36: <unknown function> + 0x4f0f5 (0x55e9bd45e0f5 in ./target/release/examples/stable-diffusion)
")', /home/jroddev/.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.11.0/src/wrappers/tensor_generated.rs:15578:36
Is there a setting/flag that I'm missing? Should 12GB be sufficient to run this?
This is a feature request to support the inpainting pipeline. I will probably work on this in the coming days, unless somebody comes up with an implementation first.
Will this support Stable Diffusion 2.0?
I'm trying the same example with the 2.1 configuration. I downloaded the appropriate CLIP, UNet and VAE weights and converted them correctly, but it does not seem to work.
Command:
cargo run --example stable-diffusion-inpaint --features clap -- --sd-version="v2-1" --prompt "Face of a yellow cat, high resolution, sitting on a park bench." --input-image="temp/dog.png" --mask-image="temp/dog_mask.png" --width=512 --height=512
This is the output that I get with the dog/cat example that works perfectly with Stable Diffusion 1.5:
It seems that the inpainted area doesn't populate correctly.
Any possible reason for that? Should the example/code be adapted for some additional steps for the 2.1 version?
The readme mentions this:
At each step, some sd_*.png image is generated, the last one sd_30.png should be the least noisy one.
I can't find those files anywhere though. I only see the final file called sd_final.png.
Currently, it prints out something like this when generating an image:
Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/5
Timestep 1/5
Timestep 2/5
Timestep 3/5
Timestep 4/5
Generating the final image for sample 1/1.
It would be nice if it would print out this instead:
Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/5 | 10 seconds
Timestep 1/5 | 9 seconds
Timestep 2/5 | 10 seconds
Timestep 3/5 | 11 seconds
Timestep 4/5 | 10 seconds
Generating the final image for sample 1/1
Total Elapsed Time: 52 seconds
That would make it easier to compare different settings, like different schedulers, different SD version, CPU vs GPU, etc, and see how much speed difference there is.
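A minimal sketch of how such per-step timing could be collected with std::time::Instant. The run_timed helper and its closure argument are hypothetical stand-ins for the example's denoising loop, not existing diffusers-rs code. On a GPU the kernels may run asynchronously, so a real measurement would probably need a device synchronization before reading the clock.

```rust
use std::time::{Duration, Instant};

/// Runs `steps` iterations of `step`, prints how long each one took,
/// prints the total at the end, and returns the per-step durations.
fn run_timed<F: FnMut(usize)>(steps: usize, mut step: F) -> Vec<Duration> {
    let start = Instant::now();
    let mut durations = Vec::with_capacity(steps);
    for i in 0..steps {
        let t0 = Instant::now();
        step(i); // stand-in for one denoising step
        let dt = t0.elapsed();
        println!("Timestep {}/{} | {:.2} seconds", i, steps, dt.as_secs_f64());
        durations.push(dt);
    }
    println!("Total Elapsed Time: {:.2} seconds", start.elapsed().as_secs_f64());
    durations
}
```

Returning the durations (rather than only printing them) would also make it easy to compare schedulers or devices programmatically.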
This issue aims at tracking the main schedulers missing in this Rust implementation:
I will be working on the first four in the following days/weeks, if you agree. I just wanted to discuss the following aspect with you: they all require a linear interpolator. The Python version uses np.interp(...). Do you prefer to introduce an extra dependency on a third-party interpolation lib, or to implement a linear interpolator "manually" and place it, for instance, at package level in mod.rs?
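For reference, a "manual" interpolator can be quite small. The sketch below is one possible std-only version that mimics the clamping behaviour of NumPy's interp for a single query point; the function name and signature are assumptions for illustration, not something that exists in the crate.

```rust
/// Piecewise-linear interpolation in the spirit of NumPy's `interp`.
/// `xp` must be sorted in increasing order and have the same length as `fp`.
/// Values of `x` outside the range of `xp` are clamped to the end values.
fn interp(x: f64, xp: &[f64], fp: &[f64]) -> f64 {
    assert_eq!(xp.len(), fp.len(), "xp and fp must have the same length");
    assert!(!xp.is_empty(), "xp must not be empty");
    let last = xp.len() - 1;
    if x <= xp[0] {
        return fp[0];
    }
    if x >= xp[last] {
        return fp[last];
    }
    // First index whose abscissa is strictly greater than x (binary search).
    let hi = xp.partition_point(|&p| p <= x);
    let lo = hi - 1;
    // Position of x between the two neighbouring sample points, in [0, 1).
    let t = (x - xp[lo]) / (xp[hi] - xp[lo]);
    fp[lo] + t * (fp[hi] - fp[lo])
}
```

A version working on a whole slice of query points would just map this function over the inputs.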
Feel free to update this issue by adding any other scheduler you want to include.
Footnote: only the relevant part for Stable Diffusion, i.e. skipping Runge-Kutta steps.
I'm going to start understanding the Img2Img pipeline and what is needed to port it.
Also, are you interested in continuing to grow the SD example, or do you prefer to keep it simple? I'm asking because, for my use case of building an SD-based image editor, I'm going to need to create a separate library tailored to my needs, but maybe it would also be interesting to include some features like Img2Img in the SD example.
Running `target\debug\examples\stable-diffusion.exe --prompt "A very rusty robot holding a fire torch." --cpu all`
Cuda available: false
Cudnn available: false
MPS available: false
Running with prompt "A very rusty robot holding a fire torch.".
Building the Clip transformer.
Error: The system cannot find the file specified. (os error 2)
error: process didn't exit successfully: `target\debug\examples\stable-diffusion.exe --prompt "A very rusty robot holding a fire torch." --cpu all` (exit code: 1)
This is the output I get after running cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch." in cmd. I am on Windows. I ran scripts/download_weights.sh in WSL and moved the /scripts/data directory to /data beforehand as well. I don't know what file it's referring to by "Error: The system cannot find the file specified. (os error 2)". Help would be appreciated!
As we can see in huggingface/diffusers#556, adding support for padding_mode makes it possible to generate images that can be seamlessly tiled.
Ideally, we would like to be able to reconfigure the padding_mode of all the conv networks without needing to recreate the pipeline.
I'm trying to run this again, but it does not compile. I have cloned this repo with
git clone https://github.com/LaurentMazare/diffusers-rs.git
and then run
cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."
like the readme says. It then compiles everything, until it reaches
Compiling diffusers v0.2.0
and then fails with
error: linking with `link.exe` failed: exit code: 1120
and the actual issue it complains about is this:
= note: Creating library C:\Other\StableDiffusion\Rust\Testing\fromgithub\diffusers-rs\target\debug\examples\stable_diffusion.lib and object C:\Other\StableDiffusion\Rust\Testing\fromgithub\diffusers-rs\target\debug\examples\stable_diffusion.exp
libtorch_sys-56f4e8e25c10d9d7.rlib(torch_api.o) : error LNK2001: unresolved external symbol __imp___tls_index_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
libtorch_sys-56f4e8e25c10d9d7.rlib(torch_api.o) : error LNK2001: unresolved external symbol __imp___tls_offset_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
C:\Other\StableDiffusion\Rust\Testing\fromgithub\diffusers-rs\target\debug\examples\stable_diffusion.exe : fatal error LNK1120: 2 unresolved externals
Am I doing something wrong, or is the current main branch maybe just not compiling? I know I was able to run this fine in the past.
When running this on the CPU, it would be nice to have a flag for setting how many threads it should be allowed to use. Sometimes it might make sense to not use all available threads, and sometimes using more than the available threads might also make sense.
Currently, I see that if I run this on my CPU (Ryzen 3950X), I get a consistent 58% CPU usage, so it appears to use too few threads for me by default. It would be nice to be able to bring it up to 100%.
Right now we are adding the schedulers, but it is difficult to work with since swapping the scheduler doesn't work well. This also slows down testing and evaluation of the schedulers, as a separate script needs to be made each time to test the samplers. I also was implementing these into an application and swapping the schedulers wasn't working (due to different types at runtime).
I experimented a bit with adding a trait, so we can use impl Scheduler. I came up with the following, but it raises some points of contention.
&mut self for some schedulers
(f64 and usize) currently
usize or f64 (or maybe other ones?)
pub trait Scheduler {
    fn step<T: SomeTraitThatWouldTakef64AndUsize>(&mut self, model_output: &Tensor, timestep: T, sample: &Tensor) -> Tensor;
    fn timesteps(&self) -> &[usize];
    fn add_noise(&self, original_samples: &Tensor, noise: Tensor, timestep: usize) -> Tensor;
    fn init_noise_sigma(&self) -> f64;
    fn scale_model_input(&self, sample: Tensor, timestep: usize) -> Tensor;
}
And then I could do the following (I'm still learning traits):
sample<T: Scheduler>(
...,
mut scheduler: T
)
And/or we could also do a Scheduler enum.
enum SamplerScheduler {
Dpmpp2m(dpmsolver_multistep::DPMSolverMultistepScheduler),
Dpmpp2s(dpmsolver_singlestep::DPMSolverSinglestepScheduler),
Ddim(ddim::DDIMScheduler),
Ddpm(ddpm::DDPMScheduler),
EulerDiscrete(euler_discrete::EulerDiscreteScheduler),
}
I'm not 100% sure what's the best approach.
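To make the trade-off a bit more concrete, here is a small sketch of the trait-object route. Everything in it is an assumption made for illustration: Sample is a Vec<f64> standing in for tch's Tensor, ToyEuler and ToyMultistep are toy schedulers rather than the real ones, and the timestep is fixed to usize so that the trait has no generic methods and stays usable as dyn Scheduler (a generic step method would rule that out).

```rust
/// Stand-in for a tensor type so the sketch compiles without libtorch.
type Sample = Vec<f64>;

/// Object-safe scheduler interface: no generic methods, so `dyn Scheduler` works.
trait Scheduler {
    fn timesteps(&self) -> &[usize];
    fn init_noise_sigma(&self) -> f64;
    /// Takes `&mut self` so multistep schedulers can keep state between calls.
    fn step(&mut self, model_output: &Sample, timestep: usize, sample: &Sample) -> Sample;
}

/// Hypothetical single-step scheduler: moves against the model output.
struct ToyEuler {
    timesteps: Vec<usize>,
    step_size: f64,
}

impl Scheduler for ToyEuler {
    fn timesteps(&self) -> &[usize] {
        &self.timesteps
    }
    fn init_noise_sigma(&self) -> f64 {
        1.0
    }
    fn step(&mut self, model_output: &Sample, _timestep: usize, sample: &Sample) -> Sample {
        sample.iter().zip(model_output).map(|(x, m)| *x - self.step_size * *m).collect()
    }
}

/// Hypothetical multistep scheduler: averages with the previous model output.
struct ToyMultistep {
    timesteps: Vec<usize>,
    prev: Option<Sample>,
}

impl Scheduler for ToyMultistep {
    fn timesteps(&self) -> &[usize] {
        &self.timesteps
    }
    fn init_noise_sigma(&self) -> f64 {
        1.0
    }
    fn step(&mut self, model_output: &Sample, _timestep: usize, sample: &Sample) -> Sample {
        let avg: Sample = match &self.prev {
            Some(prev) => prev.iter().zip(model_output).map(|(p, m)| (*p + *m) / 2.0).collect(),
            None => model_output.clone(),
        };
        self.prev = Some(model_output.clone());
        sample.iter().zip(&avg).map(|(x, a)| *x - 0.5 * *a).collect()
    }
}

/// Sampling loop that works with any scheduler chosen at runtime.
fn sample(scheduler: &mut dyn Scheduler, init: Sample) -> Sample {
    let sigma = scheduler.init_noise_sigma();
    let mut x: Sample = init.iter().map(|v| *v * sigma).collect();
    let timesteps = scheduler.timesteps().to_vec();
    for t in timesteps {
        let model_output = x.clone(); // stand-in for the UNet prediction
        x = scheduler.step(&model_output, t, &x);
    }
    x
}

/// Picks a scheduler by name, e.g. from a command line flag.
fn build(name: &str) -> Option<Box<dyn Scheduler>> {
    let timesteps = vec![1, 0];
    let scheduler: Box<dyn Scheduler> = match name {
        "euler" => Box::new(ToyEuler { timesteps, step_size: 0.5 }),
        "multistep" => Box::new(ToyMultistep { timesteps, prev: None }),
        _ => return None,
    };
    Some(scheduler)
}
```

The enum approach would avoid the Box and dynamic dispatch at the cost of a match in every method; for a loop dominated by UNet calls the difference is probably negligible either way.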
I've been testing out the default test script and every time it says:
Cuda available: false
Cudnn available: false
MPS available: false
Even though I've tested in PyTorch and TensorFlow that my GPU is available and detected.
I think it would be nice to have ControlNet support.
Not sure how hard this task would be.
Hello, I'm a Windows user trying to enable CUDA in my otherwise functioning app. I have locally installed PyTorch 2.0.0 plus the NVIDIA packages, and I can check in a Python terminal that CUDA is enabled (torch.cuda.is_available() gives me True). My current problem seems to have to do with Windows 11: I get this error when trying to launch cargo run:
exit code: 0xc0000135, STATUS_DLL_NOT_FOUND
The program doesn't even launch, so main isn't called and I can't debug anything beyond that stage. I'm trying to read about this error but I could use some help. I understand that I have to download and install a missing DLL, but which one should I target? Has anyone had the same problem who could give me a hint about how to solve this issue?
Thanks!
Hi, is there an example of using Stable Diffusion in a WebAssembly runtime (Wagi / Spin / WasmEdge etc.), using the GPU and serving modules over HTTPS requests? If not, how can I achieve such a use case? Any help would be appreciated, thanks.
I noticed exactly the same issue as this one in runwayml, and I'm asking if there is something that we are missing.
I also tried to adapt the Python code from here (also mentioned here), but without any luck.
I just receive some weird runtime errors like:
Given groups=1, weight of size [320, 9, 3, 3], expected input[2, 4, 64, 64] to have 9 channels, but got 4 channels instead
Does anyone have a suggestion on how I can use in-painting correctly?
The intention for this issue is to provide a comprehensive outline of all the core features and capabilities other distributions of Stable Diffusion (primarily A1111) provide. It's a big list, but not all are nearly as high priority as others. Some items in this outline will be turned into GitHub issues for discussing and tracking progress on implementation. Please comment on this issue to suggest additions, clarifications, and sub-features and I'll aim to keep the outline up to date.
space ship (sci-fi) vs. space AND ship (sailing ship in space)
(beautiful:1.5) tree (with autumn leaves:0.8)
Some features are described at https://github.com/Mikubill/sd-webui-controlnet but I don't currently have time to make a list of them. Help with such a list would be appreciated.
VRAM reduction strategies: things like xformers and floating point precision? I don't understand this stuff well enough to really get it. Other methods remove certain parts of the pipeline from VRAM after that stage has completed, which trades time for lower VRAM requirements. I'll need help creating a list out of this.
Some upscalers are entirely separate models and are thus likely out of scope. Other upscalers, I think, are part of the SD pipeline. Some are scripts, but I think others are actual models which require being implemented in the actual pipeline? Those ones should probably be included here, but I need help creating a list.
Did I miss something? Probably! Hopefully the community can help me keep this list updated so it's as comprehensive as possible. Thanks.
Ideally these capabilities would be modular, allowing for composability and opting in and out of specific features at will for any desired image generation pipeline. In our use case with Graphite, we want to put different options into nodes within a node graph so they are user-configurable. (I should also mention that keeping the MIT/Apache 2.0 license is important for Graphite, since our project is also Apache 2.0, so I'd humbly request that some care be taken to not copy from copyleft code which would force this library to change its license, thanks.)
Running
cargo run --example stable-diffusion --features clap -- --prompt "A very rusty robot holding a fire torch." --cpu all
Yields
Error: No such file or directory (os error 2)
But it doesn't tell me which file or directory is missing. I assume it can't find the model weights, but I have no idea how to fix it. Can this error be made clearer?
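One way such an error could be made clearer is to attach the offending path before propagating it. The helper below is a hypothetical std-only sketch (the name open_with_context is made up, and it is not how the example currently loads weights); it keeps the original error kind and adds the path to the message.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Opens `path` and, on failure, returns an error whose message names the path.
fn open_with_context(path: &Path) -> io::Result<File> {
    File::open(path).map_err(|e| {
        // Keep the original kind (e.g. NotFound) but say which file was involved.
        io::Error::new(e.kind(), format!("failed to open {}: {}", path.display(), e))
    })
}
```

With something like this in place, a missing weight file would be reported together with its path instead of a bare "os error 2".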
I was wondering how it is possible to load embeddings in pt format?
If I try to load it as Clip I get an error:
Expected Tensor but got GenericDict
Also it would be nice to support more than one external text embedding.
Hi @LaurentMazare,
have you benchmarked this against Hugging Face's Python diffusers? Should I expect it to be any faster? If yes, can you give me some intuition as to why?
DDIMScheduler's first time step is effectively inference_steps * train_timesteps / inference_steps, which, if there is no truncation, puts it exactly at train_timesteps, just out of bounds.
thread '<unnamed>' panicked at 'index out of bounds: the len is 1000 but the index is 1000'
stack backtrace:
0: std::panicking::begin_panic_handler
at /rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\panicking.rs:584
1: core::panicking::panic_fmt
at /rustc/897e37553bba8b42751c67658967889d11ecd120/library\core\src\panicking.rs:142
2: core::panicking::panic_bounds_check
at /rustc/897e37553bba8b42751c67658967889d11ecd120/library\core\src\panicking.rs:84
3: core::slice::index::impl$2::index<f64>
at /rustc/897e37553bba8b42751c67658967889d11ecd120\library\core\src\slice\index.rs:250
4: core::slice::index::impl$0::index
at /rustc/897e37553bba8b42751c67658967889d11ecd120\library\core\src\slice\index.rs:18
5: alloc::vec::impl$16::index<f64,usize,alloc::alloc::Global>
at /rustc/897e37553bba8b42751c67658967889d11ecd120\library\alloc\src\vec\mod.rs:2628
6: diffusers::schedulers::ddim::DDIMScheduler::step
at diffusers-0.1.2\src\schedulers\ddim.rs:74
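The arithmetic behind the report can be reproduced in a few lines. This is only a sketch under assumptions: the first function encodes the timestep formula as described above, and the second shows one possible in-bounds alternative (counting from inference_steps - 1 down to 0), which is not necessarily the fix the crate adopted.

```rust
/// Timesteps as described in the report: multiples of the step ratio from
/// `inference_steps` down to 1, so the first one equals `train_timesteps`
/// whenever the integer division is exact.
fn timesteps_as_reported(train_timesteps: usize, inference_steps: usize) -> Vec<usize> {
    let ratio = train_timesteps / inference_steps;
    (1..=inference_steps).rev().map(|s| s * ratio).collect()
}

/// One possible alternative (an assumption, not the crate's actual patch):
/// multiples from `inference_steps - 1` down to 0, which always stay below
/// `train_timesteps`.
fn timesteps_in_bounds(train_timesteps: usize, inference_steps: usize) -> Vec<usize> {
    let ratio = train_timesteps / inference_steps;
    (0..inference_steps).rev().map(|s| s * ratio).collect()
}
```

With 1000 training timesteps and 50 inference steps the first function starts at 1000, which is one past the end of a 1000-element table. With 30 inference steps the integer division truncates (ratio 33), so it starts at 990 and the problem stays hidden.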
Building the autoencoder.
Building the unet.
Timestep 0/30
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("cuDNN error: CUDNN_STATUS_NOT_INITIALIZED\nException raised from createCuDNNHandle at /build/python-pytorch/src/pytorch-1.13.0-cuda/aten/src/ATen/cudnn/Handle.cpp:9 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x92 (0x7fdb34705bf2 in /usr/lib/libc10.so)\nframe #1: <unknown function> + 0xd89413 (0x7fdaeab89413 in /usr/lib/libtorch_cuda.so)\nframe #2: at::native::getCudnnHandle() + 0x7b8 (0x7fdaeaeded18 in /usr/lib/libtorch_cuda.so)\nframe #3: <unknown function> + 0x1055f89 (0x7fdaeae55f89 in /usr/lib/libtorch_cuda.so)\nframe #4: <unknown function> + 0x10505f4 (0x7fdaeae505f4 in /usr/lib/libtorch_cuda.so)\nframe #5: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0xad (0x7fdaeae50a2d in /usr/lib/libtorch_cuda.so)\nframe #6: <unknown function> + 0x3882bf4 (0x7fdaed682bf4 in /usr/lib/libtorch_cuda.so)\nframe #7: <unknown function> + 0x3882cad (0x7fdaed682cad in /usr/lib/libtorch_cuda.so)\nframe #8: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x226 (0x7fdae0416e46 in /usr/lib/libtorch_cpu.so)\nframe #9: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1097 (0x7fdadf821ff7 in /usr/lib/libtorch_cpu.so)\nframe #10: <unknown function> + 0x258518e (0x7fdae078518e in /usr/lib/libtorch_cpu.so)\nframe #11: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, 
long, bool, bool, bool, bool) + 0x299 (0x7fdadff96759 in /usr/lib/libtorch_cpu.so)\nframe #12: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x111 (0x7fdadf814bd1 in /usr/lib/libtorch_cpu.so)\nframe #13: <unknown function> + 0x2584c3e (0x7fdae0784c3e in /usr/lib/libtorch_cpu.so)\nframe #14: at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x15d (0x7fdadff43b3d in /usr/lib/libtorch_cpu.so)\nframe #15: <unknown function> + 0x43b0226 (0x7fdae25b0226 in /usr/lib/libtorch_cpu.so)\nframe #16: <unknown function> + 0x43b1060 (0x7fdae25b1060 in /usr/lib/libtorch_cpu.so)\nframe #17: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x247 (0x7fdadff95a27 in /usr/lib/libtorch_cpu.so)\nframe #18: at::native::conv2d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x20d (0x7fdadf81908d in /usr/lib/libtorch_cpu.so)\nframe #19: <unknown function> + 0x273d3f6 (0x7fdae093d3f6 in /usr/lib/libtorch_cpu.so)\nframe #20: at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x202 (0x7fdae053fad2 in /usr/lib/libtorch_cpu.so)\nframe #21: <unknown function> + 0x2f359e (0x55ad31d2559e in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #22: <unknown function> + 0x2fe5da (0x55ad31d305da in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #23: <unknown function> + 0x2b79bd (0x55ad31ce99bd in 
$HOME.cargo-target/debug/examples/stable-diffusion)\nframe #24: <unknown function> + 0x2bd561 (0x55ad31cef561 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #25: <unknown function> + 0x2e1ad0 (0x55ad31d13ad0 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #26: <unknown function> + 0x2c05f1 (0x55ad31cf25f1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #27: <unknown function> + 0xd90f5 (0x55ad31b0b0f5 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #28: <unknown function> + 0x96438 (0x55ad31ac8438 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #29: <unknown function> + 0x975a1 (0x55ad31ac95a1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #30: <unknown function> + 0xb496b (0x55ad31ae696b in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #31: <unknown function> + 0xa10ae (0x55ad31ad30ae in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #32: <unknown function> + 0xacbf1 (0x55ad31adebf1 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #33: <unknown function> + 0x621aee (0x55ad32053aee in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #34: <unknown function> + 0xacbc0 (0x55ad31adebc0 in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #35: <unknown function> + 0x9b83c (0x55ad31acd83c in $HOME.cargo-target/debug/examples/stable-diffusion)\nframe #36: <unknown function> + 0x23290 (0x7fdade03c290 in /usr/lib/libc.so.6)\nframe #37: __libc_start_main + 0x8a (0x7fdade03c34a in /usr/lib/libc.so.6)\nframe #38: <unknown function> + 0x91905 (0x55ad31ac3905 in $HOME.cargo-target/debug/examples/stable-diffusion)\n")', $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/wrappers/tensor_generated.rs:6457:72
stack backtrace:
0: rust_begin_unwind
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
2: core::result::unwrap_failed
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1785:5
3: core::result::Result<T,E>::unwrap
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1078:23
4: tch::wrappers::tensor_generated::<impl tch::wrappers::tensor::Tensor>::conv2d
at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/wrappers/tensor_generated.rs:6457:9
5: <tch::nn::conv::Conv<[i64; 2]> as tch::nn::module::Module>::forward
at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/nn/conv.rs:216:9
6: tch::nn::module::<impl tch::wrappers::tensor::Tensor>::apply
at $HOME.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.9.0/src/nn/module.rs:47:9
7: diffusers::models::unet_2d::UNet2DConditionModel::forward
at ./src/models/unet_2d.rs:237:18
8: stable_diffusion::run
at ./examples/stable-diffusion/main.rs:167:30
9: stable_diffusion::main
at ./examples/stable-diffusion/main.rs:200:9
10: core::ops::function::FnOnce::call_once
at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I'm not sure whether this is caused by the library or by some other issue.
Hello!
Are there any plans for additional backends?
I'd love to see DirectML support, which would allow it to be used on NVIDIA/AMD cards without installing additional drivers (on Windows/Xbox, at least).
In a similar vein, Core ML for Apple devices.
Thanks again for your work on this project!
In the readme you mention:
If there is some interest in having the final weight files available, open an issue and we could consider packaging them.
Having to run a bunch of Python code first to convert the weights manually is quite annoying, so yes, please consider packaging them as an easy download in the right format.
I'd really like to use this, but doing all those steps manually for converting the weights is too difficult.
As far as I understand it, this command should generate an image with SD 1.5:
cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."
and this command should generate an image with SD 2.1:
cargo run --example stable-diffusion-2 --features clap -- --prompt "A rusty robot holding a fire torch."
I am trying to run the first command, which should use SD 1.5, but I get this error:
Error: Internal torch error: open file failed because of errno 2 on fopen: No such file or directory, file path: data/clip_v2.1.ot
I have the weights for 1.5 in the data folder, but no weights for 2.1.
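To make this kind of failure easier to diagnose, a pre-flight check could list which weight files are absent before anything is handed to torch. The following is only a sketch using the standard library: `missing_weights` is a hypothetical helper and not part of diffusers-rs, and the only file name taken from this report is `clip_v2.1.ot`.

```rust
use std::path::{Path, PathBuf};

/// Returns the paths under `data_dir` that are expected but not present.
/// Hypothetical helper for illustration; not a diffusers-rs API.
fn missing_weights(data_dir: &Path, required: &[&str]) -> Vec<PathBuf> {
    required
        .iter()
        .map(|name| data_dir.join(name))
        // Keep only entries that are not regular files on disk.
        .filter(|path| !path.is_file())
        .collect()
}
```

Calling `missing_weights(Path::new("data"), &["clip_v2.1.ot"])` before loading would return the missing path instead of surfacing it as an internal torch error, which makes it clearer which model version the example is actually trying to load.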
It would be very useful to have simple flags like --H 512 and --W 512 for setting the resolution to generate an image at.
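As a rough illustration of the request, the flags could be parsed along these lines. This is a standard-library sketch rather than the clap-based parsing the examples use; `Resolution` and `parse_resolution` are hypothetical names, the 512x512 default mirrors the values in the request, and the multiple-of-8 check is an assumption based on the latents being 8x smaller than the image in Stable Diffusion.

```rust
/// Requested image size in pixels.
#[derive(Debug, PartialEq)]
struct Resolution {
    height: i64,
    width: i64,
}

/// Parses `--H <n>` and `--W <n>` from an argument list, defaulting to 512x512.
/// Hypothetical helper; the flag names come from the feature request.
fn parse_resolution(args: &[&str]) -> Result<Resolution, String> {
    let mut res = Resolution { height: 512, width: 512 };
    let mut it = args.iter();
    while let Some(&flag) = it.next() {
        let target = match flag {
            "--H" => &mut res.height,
            "--W" => &mut res.width,
            _ => continue, // ignore unrelated arguments such as --prompt
        };
        let value = it.next().ok_or_else(|| format!("{flag} expects a value"))?;
        let parsed: i64 = value
            .parse()
            .map_err(|_| format!("{flag}: '{value}' is not an integer"))?;
        // Assumption: sizes must be positive multiples of 8 so that the
        // latent dimensions (size / 8) are whole numbers.
        if parsed <= 0 || parsed % 8 != 0 {
            return Err(format!("{flag} must be a positive multiple of 8, got {parsed}"));
        }
        *target = parsed;
    }
    Ok(res)
}
```

With this sketch, `--H 768 --W 640` yields a 768x640 request, while omitting both flags keeps the 512x512 default.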
I'm trying to build a simple Qt GUI that uses diffusers-rs. I can compile and run both the diffusers example and my cxx-qt GUI app separately, but when I add both cxx-qt and diffusers to Cargo.toml, linking fails.
As a test, I modified the diffusers Cargo.toml as follows with no other changes:
$ git diff Cargo.toml
diff --git a/Cargo.toml b/Cargo.toml
index db4b4d3..70dce57 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -17,6 +17,9 @@ exclude = [
]
[dependencies]
+cxx = "1.0"
+cxx-qt = "0.5"
+cxx-qt-lib = "0.5"
anyhow = "1"
thiserror = "1"
regex = "1.6.0"
@@ -48,3 +51,6 @@ doc-only = ["tch/doc-only"]
[package.metadata.docs.rs]
features = ["doc-only"]
+
+[build-dependencies]
+cxx-qt-build = "0.5"
which resulted in the following error:
= note: /usr/bin/ld: /home/mneilly/RustProjects/third_party/DL/diffusers-rs/target/debug/deps/libtorch_sys-34524f8ee2d3acbe.rlib(torch_api_generated.o): undefined reference to symbol '_ZN3c104warnERKNS_7WarningE'
/usr/bin/ld: /home/mneilly/RustProjects/sandboxes/tch-sandbox/pytorch-2.0/lib/python3.11/site-packages/torch/lib/libc10.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
= note: some `extern` functions couldn't be found; some native libraries may need to be installed or have their path specified
= note: use the `-l` flag to specify native libraries to link
= note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)
error: could not compile `diffusers` (example "stable-diffusion") due to previous error
Note that leaving out cxx-qt-lib causes the linker error to go away.
I've attached the full output from the build.
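The "DSO missing from command line" message means libc10.so was not named explicitly on the link line, and the rustc note points at the `cargo:rustc-link-lib` directive. One thing that might be worth trying is emitting that directive from a build script. The helper below is a hypothetical sketch: `link_directives` is not part of any crate, the library directory is whatever your libtorch install uses, and whether this resolves the interaction with cxx-qt-lib is untested.

```rust
use std::path::Path;

/// Builds the Cargo build-script directives that add `lib_dir` to the linker
/// search path and link each library in `libs` dynamically.
/// Hypothetical helper for illustration only.
fn link_directives(lib_dir: &Path, libs: &[&str]) -> Vec<String> {
    let mut out = vec![format!(
        "cargo:rustc-link-search=native={}",
        lib_dir.display()
    )];
    out.extend(
        libs.iter()
            .map(|lib| format!("cargo:rustc-link-lib=dylib={lib}")),
    );
    out
}

// In build.rs, each directive would be printed on its own line, e.g.:
//   for line in link_directives(Path::new("<libtorch>/lib"), &["c10"]) {
//       println!("{line}");
//   }
// where "<libtorch>/lib" stands for the directory containing libc10.so.
```

Naming `c10` explicitly follows from the undefined symbol living in libc10.so; other libtorch libraries may need the same treatment if further symbols turn up missing.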