
leejet / stable-diffusion.cpp


Stable Diffusion in pure C/C++

License: MIT License

CMake 0.05% C++ 99.83% Dockerfile 0.01% Shell 0.01% C 0.11%
ai cplusplus diffusion ggml image-generation latent-diffusion stable-diffusion text2image txt2img image2image

stable-diffusion.cpp's People

Contributors

bssrdf, cyberhan123, dmikey, drasticactions, eltociear, fssrepo, fszontagh, ggerganov, green-sky, kreijstal, leejet, phudtran, rbledsaw3, sean-bailey, spadi0, ursg


stable-diffusion.cpp's Issues

flash attention leads to error

I tried to use ggml_flash_attn to speed up inference, so I replaced the ggml_mul_mat calls in the UNet cross-attention in stable-diffusion.cpp:

...
#if 1
                struct ggml_tensor * kqv = ggml_flash_attn(ctx, q, k, v, true);
#else
                struct ggml_tensor* kq = ggml_mul_mat(ctx, k, q);  // [N * n_head, h * w, h * w]
                // kq = ggml_diag_mask_inf_inplace(ctx, kq, 0);
                kq = ggml_soft_max_inplace(ctx, kq);

                struct ggml_tensor* kqv = ggml_mul_mat(ctx, v, kq);  // [N * n_head, h * w, d_head]
#endif
...

But this leads to an error. It looks like max_position = 2, N = 64, and const int64_t P = nek1 - N; ends up less than 0. Can someone help me? Thanks a lot!
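For context, here is my reading of the ggml source (a hint, not a confirmed diagnosis): ggml_flash_attn computes P = nek1 - N, i.e. it assumes there are at least as many key positions as query positions. That holds for self-attention, but in UNet cross-attention the keys come from the short text context while the queries are the h*w spatial positions, so P goes negative. A tiny illustrative helper expressing that precondition:

// Illustrative helper (not part of the project): ggml_flash_attn's
// "P = nek1 - N >= 0" precondition only holds when there are at least
// as many keys as queries, which is not the case for UNet cross-attention.
#include <cstdint>

static bool flash_attn_shape_ok(int64_t n_queries, int64_t n_keys) {
    // self-attention over h*w positions: n_keys == n_queries          -> ok
    // cross-attention against a 77-token text context: n_keys < n_queries -> not ok
    return n_keys >= n_queries;
}

So keeping the ggml_mul_mat + ggml_soft_max path for cross-attention (and only using ggml_flash_attn for the self-attention blocks) may be the safer option.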

First impressions info dump

Hey, finally stable diffusion for ggml 😄

Did a test run

$ ./sd -t 8 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal"
[INFO]  stable-diffusion.cpp:2189 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[INFO]  stable-diffusion.cpp:2214 - ftype: q8_0
[INFO]  stable-diffusion.cpp:2259 - params ctx size =  1618.72 MB
[INFO]  stable-diffusion.cpp:2399 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.46s
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2822 - get_learned_condition completed, taking 0.16s
[INFO]  stable-diffusion.cpp:2830 - start sampling
[INFO]  stable-diffusion.cpp:2674 - step 1 sampling completed, taking 18.34s
[INFO]  stable-diffusion.cpp:2674 - step 2 sampling completed, taking 18.24s
[INFO]  stable-diffusion.cpp:2674 - step 3 sampling completed, taking 18.65s
[INFO]  stable-diffusion.cpp:2674 - step 4 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 5 sampling completed, taking 18.31s
[INFO]  stable-diffusion.cpp:2674 - step 6 sampling completed, taking 18.18s
[INFO]  stable-diffusion.cpp:2674 - step 7 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 8 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 9 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 10 sampling completed, taking 18.28s
[INFO]  stable-diffusion.cpp:2674 - step 11 sampling completed, taking 18.19s
[INFO]  stable-diffusion.cpp:2674 - step 12 sampling completed, taking 18.00s
[INFO]  stable-diffusion.cpp:2674 - step 13 sampling completed, taking 18.03s
[INFO]  stable-diffusion.cpp:2674 - step 14 sampling completed, taking 18.54s
[INFO]  stable-diffusion.cpp:2674 - step 15 sampling completed, taking 18.32s
[INFO]  stable-diffusion.cpp:2674 - step 16 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 17 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 18 sampling completed, taking 18.51s
[INFO]  stable-diffusion.cpp:2674 - step 19 sampling completed, taking 18.62s
[INFO]  stable-diffusion.cpp:2674 - step 20 sampling completed, taking 18.11s
[INFO]  stable-diffusion.cpp:2686 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO]  stable-diffusion.cpp:2835 - sampling completed, taking 366.14s
[INFO]  stable-diffusion.cpp:2766 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO]  stable-diffusion.cpp:2842 - decode_first_stage completed, taking 57.66s
[INFO]  stable-diffusion.cpp:2843 - txt2img completed in 423.96s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

(output image attached)

Pain point: the extra Python libs needed for conversion. I got a pip install error because I already have an incompatible version of something installed, but convert.py worked anyway. :)

Timings: I used the q8_0 quantization and ran with different thread counts.
I have a 12-core (24-thread) CPU.
Each value is the time for a single sampling step. (A small reproduction loop follows the table.)

quant   q8_0     q4_0     f16
-t 1    75.31s   75.20s   82.92s
-t 2    42.44s   -        -
-t 4    28.65s   29.23s   30.00s
-t 6    21.68s   -        -
-t 8    18.34s   18.89s   19.05s
-t 10   16.38s   16.78s   17.61s
-t 12   16.26s   16.98s   18.11s
-t 14   17.93s   -        -
-t 16   16.80s   -        -
-t 18   16.70s   -        -
-t 20   16.20s   -        -
-t 22   16.96s   -        -
-t 24   18.93s   -        -
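For reference, the sweep can be reproduced with a small loop like this (illustrative only: it reuses the q8_0 model from the run above, an arbitrary short prompt, and a reduced step count so each run stays quick):

# Illustrative thread-count sweep; model path taken from the run above,
# prompt and step count chosen only to keep the benchmark short.
for t in 1 2 4 6 8 10 12 14 16 18 20 22 24; do
    echo "== threads: $t =="
    ./sd -t "$t" -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin \
         -p "a lovely cat" --steps 3
done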

Additional questions:

  1. Do you have/plan to support token weighting? (e.g. (cinematic:1.3); a toy parsing sketch is included at the end of this issue)
  2. Are you looking into supporting the CUDA/OpenCL backends from ggml?
  3. Are you looking into k-quants (like llama.cpp) and some form of quality measurement for quantizations? (since k-quants use different quant types for different parts of the model)
  4. It would be nice if the tool printed the "system line" (see https://github.com/ggerganov/llama.cpp/blob/f64d44a9b9581cd58f7ec40f4fa1c3ca5ca18e1e/llama.cpp#L4267 )
  5. I did not see it mentioned: does it support SD 2.x / do you plan to add support for that?
  6. My little benchmark suggests the bottleneck is not the model file but the dynamic data. What number type do you use for it? llama.cpp has shown little to no quality degradation when using f16 instead of f32 for the KV cache.

edit: added f16 timings
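Regarding question 1, a toy illustration (not the project's implementation) of pulling (text:weight) spans such as (cinematic:1.3) out of a prompt:

// Toy example only: extract "(text:weight)" spans from a prompt string.
#include <cstdio>
#include <regex>
#include <string>

int main() {
    std::string prompt = "alps, (cinematic:1.3), (ArtStation:1.2), masterpiece";
    std::regex re(R"(\(([^():]+):([0-9]*\.?[0-9]+)\))");
    for (std::sregex_iterator it(prompt.begin(), prompt.end(), re), end; it != end; ++it) {
        printf("token '%s' weight %.2f\n", (*it)[1].str().c_str(), std::stof((*it)[2].str()));
    }
    return 0;
}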

Problems with smaller images

Hey, I am using this repo successfully in a very strange single-threaded environment, and it's working. Great work! I really love the idea of a CPU-based generator that can be both single-threaded and statically built. One problem I have, though, is that images smaller than 512x512 seem to fail all the time. This issue also happens in a regular Linux terminal. Example:

$ ./build/bin/sd -m models/sd-v1-4-ggml-model-q4_1.bin -W 256 -H 256 --seed 42 --steps 12 -p "A lovely cat, high quality" -o sd.png

Option: 
    n_threads:       16
    mode:            txt2img
    model_path:      models/sd-v1-4-ggml-model-q4_1.bin
    output_path:     sd.png
    init_img:        
    prompt:          A lovely cat, high quality
    negative_prompt: 
    cfg_scale:       7.00
    width:           128
    height:          128
    sample_method:   eular a
    sample_steps:    12
    strength:        0.75
    seed:            42
System Info: 
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0

(resulting image attached)

$ ./build/bin/sd -m models/sd-v1-4-ggml-model-q4_1.bin -W 128 -H 128 --seed 42 --steps 12 -p "A lovely cat, high quality" -o sd.png

(resulting image attached)

$ ./build/bin/sd -v -m models/sd-v1-4-ggml-model-q4_1.bin -W 512 -H 512 --seed 42 --steps 12 -p "A lovely cat, high quality" -o sd.png

(resulting image attached)

These images were created on my AMD Ryzen 9 7950X machine. I am looking into this problem now, just creating this issue to track the problem.

Changing models doesn't help. I am experimenting with -march settings now.

Make a PR updating ggml and support CUDA backend

@leejet I intend to create a pull request that updates to the latest version of ggml so I can use ggml-alloc and ggml-backend to add GPU acceleration to this project. The issue is that I need some feedback to make progress: I'm not sure whether you're already working on something similar, and I'd like to avoid redoing work that's already done.

AVX512 is not auto-detected

Hi,
I tried to build this project on a Xeon W-2135 system, with both gcc-11 and gcc-12. The CPU supports AVX512 (but nothing more advanced). After building, the binary indicated support for AVX and AVX2, but not for AVX512.

So I went ahead and changed the AVX512 flag in ggml/CMakeLists.txt from OFF to ON, deleted all the generated files and ran cmake again. The logs indicated that the AVX512 flag was indeed ON, but CFLAGS/CXXFLAGS still did not contain "-mavx512f". No AVX512 in the binary yet, either.

Eventually I just put set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mavx512f") somewhere into the CMakeLists.txt (same for CMAKE_CXX_FLAGS) and got a functional AVX512-enabled result. But I still consider it a bug that AVX512 is not detected by default, and not applied when the flag is set manually, either.
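For reference, the workaround boils down to something along these lines (a sketch of what I added, not a proper fix; GGML_AVX512 is the existing option name, and the real bug is presumably in the flag-detection logic):

# Workaround sketch: force the AVX-512 compiler flag whenever GGML_AVX512 is
# enabled, since the automatic handling never added it to CFLAGS/CXXFLAGS here.
if (GGML_AVX512)
    set(CMAKE_C_FLAGS   "${CMAKE_C_FLAGS} -mavx512f")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mavx512f")
endif()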

This happened with both "CC=gcc-11" and "CC=gcc-12" when running cmake.

Best Regards

Error while loading weight

I got this error while loading the model:

[INFO]  stable-diffusion.cpp:2500 - loading model from '/path/to/models/meichidarkMix_meichidarkV4-ggml-model-q4_0.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q4_0
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1431.33 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[WARN]  stable-diffusion.cpp:2650 - unknown tensor 'control_model.input_blocks.0.0.bias' in model file
[WARN]  stable-diffusion.cpp:2650 - unknown tensor 'control_model.input_blocks.0.0.weight' in model file
[WARN]  stable-diffusion.cpp:2650 - unknown tensor 'control_model.input_blocks.1.0.emb_layers.1.bias' in model file
[WARN]  stable-diffusion.cpp:2650 - unknown tensor 'control_model.input_blocks.1.0.emb_layers.1.weight' in model file
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create
Aborted (core dumped)

I'm using an Intel i5-8250U laptop with Ubuntu and 12 GB of RAM.
Am I doing something wrong while converting and quantizing the model, or...?

Thank you

"clblast.h": No such file or directory

PS D:\src\sd-cpp-src> $env:CLBLAST_HOME = "C:\vcpkg\installed\x64-windows\"
PS D:\src\sd-cpp-src> ls $env:CLBLAST_HOME\bin


    Directory: C:\vcpkg\installed\x64-windows\bin


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         2023/8/31     16:47        3394560 clblast.dll
-a----         2023/8/31     17:07        1707520 openblas.dll
-a----         2023/8/31     10:43          54784 OpenCL.dll

PS D:\src\sd-cpp-src> ls $env:CLBLAST_HOME\include


    Directory: C:\vcpkg\installed\x64-windows\include


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         2023/8/31     10:43                CL
d-----         2023/8/31     17:07                openblas
-a----         2021/1/20      4:19          43027 clblast.h
-a----         2021/1/20      4:19         146525 clblast_c.h
-a----         2021/1/20      4:19          35227 clblast_half.h
-a----         2023/8/26      3:43           1238 openblas_common.h


PS D:\src\sd-cpp-src>> cmake -B build -DGGML_CLBLAST=ON
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.19044.
-- The C compiler identification is MSVC 19.34.31933.0
-- The CXX compiler identification is MSVC 19.34.31933.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.34.31933/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.34.31933/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Deprecation Warning at ggml/CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- x86 detected
-- clBLAST found
-- Configuring done (4.1s)
-- Generating done (0.0s)
-- Build files have been written to: D:/src/sd-cpp-src/build

PS D:\src\sd-cpp-src>> cmake --build build -j --config Release
MSBuild version 17.4.0+18d5aef85 for .NET Framework
  1>Checking Build System
  Building Custom Rule D:/src/sd-cpp-src/ggml/src/CMakeLists.txt
  ggml.c
  ggml-alloc.c
  Generating code...
  ggml-opencl.cpp
D:\src\sd-cpp-src\ggml\src\ggml-opencl.cpp(10,10): fatal error C1083: Cannot open include file: 'clblast.h': No such file or directory
 [D:\src\sd-cpp-src\build\ggml\src\ggml.vcxproj]
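A possible workaround, heavily hedged since the configure step above already reports "clBLAST found": the failure is only that the header is not on the compiler's include path at build time, and cl.exe also searches the directories listed in the INCLUDE environment variable, so adding the vcpkg include directory there may get the build past this point.

# Hedged workaround: put vcpkg's include directory on MSVC's INCLUDE path
# before building (path follows the vcpkg layout shown above).
$env:INCLUDE = "$env:INCLUDE;C:\vcpkg\installed\x64-windows\include"
cmake --build build -j --config Release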

Build fails

I tried what is suggested in #54, but it is not working:

Error:

 fatal error: intrin.h: No such file or directory
 #include <intrin.h>
                    ^
compilation terminated.
ggml\src\CMakeFiles\ggml.dir\build.make:75: recipe for target 'ggml/src/CMakeFiles/ggml.dir/ggml.c.obj' failed
mingw32-make.exe[2]: *** [ggml/src/CMakeFiles/ggml.dir/ggml.c.obj] Error 1
CMakeFiles\Makefile2:157: recipe for target 'ggml/src/CMakeFiles/ggml.dir/all' failed
mingw32-make.exe[1]: *** [ggml/src/CMakeFiles/ggml.dir/all] Error 2
Makefile:134: recipe for target 'all' failed
mingw32-make.exe: *** [all] Error 2

An illegal hardware instruction error

Device Info:
MacBook Pro, Apple M2, macOS Ventura 13.3.1

Error Info:
% ./sd -m revAnimated_v11-ggml-model-f16.bin -p "a lovely pig" -v
Option:
n_threads: 4
mode: txt2img
model_path: revAnimated_v11-ggml-model-f16.bin
output_path: output.png
init_img:
prompt: a lovely pig
negative_prompt:
cfg_scale: 7.00
width: 512
height: 512
sample_method: eular a
sample_steps: 20
strength: 0.75
seed: 42
System Info:
BLAS = 1
SSE3 = 1
AVX = 1
AVX2 = 0
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 0
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[INFO] stable-diffusion.cpp:2698 - loading model from 'revAnimated_v11-ggml-model-f16.bin'
[DEBUG] stable-diffusion.cpp:2706 - verifying magic
[DEBUG] stable-diffusion.cpp:2717 - loading hparams
[INFO] stable-diffusion.cpp:2723 - ftype: f16
[DEBUG] stable-diffusion.cpp:2729 - loading vocab
[DEBUG] stable-diffusion.cpp:2757 - ggml tensor size = 288 bytes
zsh: illegal hardware instruction ./sd -m revAnimated_v11-ggml-model-f16.bin -p "a lovely pig" -v
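One thing that might be worth checking (purely a guess on my part: SSE3/AVX/F16C in the System Info above are x86 features, which would be unexpected in a native Apple Silicon build): confirm the binary's architecture and rebuild natively if it turns out to be x86_64.

# Check whether the binary is arm64 or x86_64.
file ./sd
# If it reports x86_64, reconfigure and rebuild natively:
rm -rf build
cmake -B build
cmake --build build --config Release -j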

Latent Consistency Models support would be great in this project

In the last week, there has been a lot of talk about a new type of model, Latent Consistency Models, which significantly improves Stable Diffusion's performance by generating images in far fewer steps.

It apparently works with a LoRA adapter that can be applied to any existing model. I'm not sure whether any specific changes to the UNet architecture are needed, but at minimum a new sampler, the LCM solver, has to be added.

After completing the CUDA acceleration support, which is almost finished, I will see if I can work on adding LoRA support. This will require a complete change in the current project structure. Following that, I'll add the new solver and conduct the necessary tests.

ggml error when using "--schedule kerras" with clblast

[INFO]  stable-diffusion.cpp:2830 - loading model from '/home/cwillu/ext/work/models/sd/cyberrealistic_v33-ggml-model-f16.bin'
[INFO]  stable-diffusion.cpp:2858 - model type: SD1.x
[INFO]  stable-diffusion.cpp:2866 - ftype: f16
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1012:xnack-'
ggml_opencl: device FP16 support: true
[INFO]  stable-diffusion.cpp:3090 - total params size = 1969.98MB (clip 235.01MB, unet 1640.46MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:3096 - loading model from '/home/cwillu/ext/work/models/sd/cyberrealistic_v33-ggml-model-f16.bin' completed, taking 0.64s
[INFO]  stable-diffusion.cpp:3121 - running in eps-prediction mode
[INFO]  stable-diffusion.cpp:3365 - condition graph use 248.59MB of memory: params 235.01MB, runtime 13.58MB (static 10.65MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:3365 - condition graph use 248.59MB of memory: params 235.01MB, runtime 13.58MB (static 10.65MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:4097 - get_learned_condition completed, taking 2.71s
[INFO]  stable-diffusion.cpp:4113 - start sampling
[INFO]  stable-diffusion.cpp:3753 - sampling using modified DPM++ (2M) method
ggml_opencl: ggml_cl_h2d_tensor_2d(queue, d_X, 0, src0, i03, i02, NULL) error -30 at /media/cwillu/External/cwillu/work/stable-diffusion.cpp/ggml/src/ggml-opencl.cpp:1505

I also get a similar error when using models that aren't f16 (e.g., f32, q4, etc.), regardless of any other options, but that's probably a related-but-separate issue.

Print seed when using random

I've started the program with -v -s "-1" and it shows:

    seed:            -1

It would be helpful if there was a random seed option and the log output included the actual seed used.
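A sketch of what this could look like (illustrative only, not the project's code): when the requested seed is negative, draw one from std::random_device and log it so the run can be reproduced later.

// Illustrative only: resolve -1 into a real, printed seed.
#include <cstdint>
#include <cstdio>
#include <random>

int64_t resolve_seed(int64_t requested_seed) {
    if (requested_seed < 0) {
        std::random_device rd;
        requested_seed = static_cast<int64_t>(rd()) & 0x7FFFFFFF;  // keep it non-negative
        printf("[INFO] using random seed: %lld\n", (long long)requested_seed);
    }
    return requested_seed;
}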

ggml Errors during build

arm64 macOS Ventura, CMake version 3.27.3

[ 57%] Building CXX object CMakeFiles/stable-diffusion.dir/stable-diffusion.cpp.o
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:137:43: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
~~~ ^~~~~~~~~~~~~
%lld
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:137:58: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
~~~ ^~~~~~~~~~~~~
%lld
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:137:73: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
~~~ ^~~~~~~~~~~~~
%lld
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:137:88: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
~~~ ^~~~~~~~~~~~~
%lld
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:902:18: error: use of undeclared identifier 'ggml_group_norm'
auto h = ggml_group_norm(ctx, x);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1125:13: error: use of undeclared identifier 'ggml_group_norm'
x = ggml_group_norm(ctx, x);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1379:28: error: use of undeclared identifier 'ggml_get_dynamic'; did you mean 'ggml_get_name'?
bool dynamic = ggml_get_dynamic(ctx);
^~~~~~~~~~~~~~~~
ggml_get_name
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:664:35: note: 'ggml_get_name' declared here
GGML_API const char * ggml_get_name (const struct ggml_tensor * tensor);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1379:45: error: cannot initialize a parameter of type 'const struct ggml_tensor *' with an lvalue of type 'struct ggml_context *'
bool dynamic = ggml_get_dynamic(ctx);
^~~
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:273:12: note: 'ggml_context' is not defined, but forward declared here; conversion would be valid if it was derived from 'ggml_tensor'
struct ggml_context;
^
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:664:79: note: passing argument to parameter 'tensor' here
GGML_API const char * ggml_get_name (const struct ggml_tensor * tensor);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1380:13: error: use of undeclared identifier 'ggml_set_dynamic'; did you mean 'ggml_set_name'?
ggml_set_dynamic(ctx, false);
^~~~~~~~~~~~~~~~
ggml_set_name
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:665:35: note: 'ggml_set_name' declared here
GGML_API struct ggml_tensor * ggml_set_name ( struct ggml_tensor * tensor, const char * name);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1380:30: error: cannot initialize a parameter of type 'struct ggml_tensor *' with an lvalue of type 'struct ggml_context *'
ggml_set_dynamic(ctx, false);
^~~
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:273:12: note: 'ggml_context' is not defined, but forward declared here; conversion would be valid if it was derived from 'ggml_tensor'
struct ggml_context;
^
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:665:79: note: passing argument to parameter 'tensor' here
GGML_API struct ggml_tensor * ggml_set_name ( struct ggml_tensor * tensor, const char * name);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1382:13: error: use of undeclared identifier 'ggml_set_dynamic'; did you mean 'ggml_set_name'?
ggml_set_dynamic(ctx, dynamic);
^~~~~~~~~~~~~~~~
ggml_set_name
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:665:35: note: 'ggml_set_name' declared here
GGML_API struct ggml_tensor * ggml_set_name ( struct ggml_tensor * tensor, const char * name);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1382:30: error: cannot initialize a parameter of type 'struct ggml_tensor *' with an lvalue of type 'struct ggml_context *'
ggml_set_dynamic(ctx, dynamic);
^~~
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:273:12: note: 'ggml_context' is not defined, but forward declared here; conversion would be valid if it was derived from 'ggml_tensor'
struct ggml_context;
^
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:665:79: note: passing argument to parameter 'tensor' here
GGML_API struct ggml_tensor * ggml_set_name ( struct ggml_tensor * tensor, const char * name);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1427:13: error: use of undeclared identifier 'ggml_upscale'
x = ggml_upscale(ctx, x); // [N, channels, h*2, w*2]
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1801:21: error: use of undeclared identifier 'ggml_concat'; did you mean 'ggml_context'?
h = ggml_concat(ctx, h, h_skip);
^
/Users/saidm/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:273:12: note: 'ggml_context' declared here
struct ggml_context;
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1818:13: error: use of undeclared identifier 'ggml_group_norm'
h = ggml_group_norm(ctx, h);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:1922:18: error: use of undeclared identifier 'ggml_group_norm'
auto h = ggml_group_norm(ctx, z);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2031:19: error: use of undeclared identifier 'ggml_group_norm'
auto h_ = ggml_group_norm(ctx, x);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2256:13: error: use of undeclared identifier 'ggml_group_norm'
h = ggml_group_norm(ctx, h);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2438:13: error: use of undeclared identifier 'ggml_group_norm'
h = ggml_group_norm(ctx, h);
^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2757:20: error: no member named 'dynamic' in 'ggml_init_params'
params.dynamic = false;
~~~~~~ ^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2776:20: error: no member named 'dynamic' in 'ggml_init_params'
params.dynamic = false;
~~~~~~ ^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2797:20: error: no member named 'dynamic' in 'ggml_init_params'
params.dynamic = false;
~~~~~~ ^
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2901:49: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
name.data(), nelements, ggml_nelements(tensor));
^~~~~~~~~~~~~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:41:68: note: expanded from macro 'LOG_ERROR'
#define LOG_ERROR(format, ...) SD_LOG(SDLogLevel::ERROR, format, ##VA_ARGS)
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:28:80: note: expanded from macro 'SD_LOG'
printf("[DEBUG] %s:%-4d - " format "\n", FILENAME, LINE, ##VA_ARGS);
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2901:49: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
name.data(), nelements, ggml_nelements(tensor));
^~~~~~~~~~~~~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:41:68: note: expanded from macro 'LOG_ERROR'
#define LOG_ERROR(format, ...) SD_LOG(SDLogLevel::ERROR, format, ##VA_ARGS)
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:30:80: note: expanded from macro 'SD_LOG'
printf("[INFO] %s:%-4d - " format "\n", FILENAME, LINE, ##VA_ARGS);
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2901:49: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
name.data(), nelements, ggml_nelements(tensor));
^~~~~~~~~~~~~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:41:68: note: expanded from macro 'LOG_ERROR'
#define LOG_ERROR(format, ...) SD_LOG(SDLogLevel::ERROR, format, ##VA_ARGS)
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:32:89: note: expanded from macro 'SD_LOG'
fprintf(stderr, "[WARN] %s:%-4d - " format "\n", FILENAME, LINE, ##VA_ARGS);
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2901:49: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
name.data(), nelements, ggml_nelements(tensor));
^~~~~~~~~~~~~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:41:68: note: expanded from macro 'LOG_ERROR'
#define LOG_ERROR(format, ...) SD_LOG(SDLogLevel::ERROR, format, ##VA_ARGS)
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:34:89: note: expanded from macro 'SD_LOG'
fprintf(stderr, "[ERROR] %s:%-4d - " format "\n", FILENAME, LINE, ##VA_ARGS);
~~~~~~ ^~~~~~~~~~~
/Users/saidm/stable-diffusion.cpp/stable-diffusion.cpp:2961:20: error: no member named 'dynamic' in 'ggml_init_params'
params.dynamic = dynamic;
~~~~~~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
8 warnings and 20 errors generated.
make[2]: *** [CMakeFiles/stable-diffusion.dir/stable-diffusion.cpp.o] Error 1
make[1]: *** [CMakeFiles/stable-diffusion.dir/all] Error 2
make: *** [all] Error 2

Image generation doesn't work.

(Cuda) PS D:\stable-diffusion.cpp> ./build/bin/Release/sd.exe -m "D:\stable-diffusion.cpp\models\v1-5-pruned-emaonly-ggml-model-f32.bin" -p "neko, catgirl, cute" -o "D:\stable-diffusion.cpp\outputs\output.png" -v -t 12
Option:
n_threads: 12
mode: txt2img
model_path: D:\stable-diffusion.cpp\models\v1-5-pruned-emaonly-ggml-model-f32.bin
output_path: D:\stable-diffusion.cpp\outputs\output.png
init_img:
prompt: neko, catgirl, cute
negative_prompt:
cfg_scale: 7.00
width: 512
height: 512
sample_method: eular a
sample_steps: 20
strength: 0.75
seed: 42
System Info:
BLAS = 0
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[INFO] stable-diffusion.cpp:2687 - loading model from 'D:\stable-diffusion.cpp\models\v1-5-pruned-emaonly-ggml-model-f32.bin'
[DEBUG] stable-diffusion.cpp:2695 - verifying magic
[DEBUG] stable-diffusion.cpp:2706 - loading hparams
[INFO] stable-diffusion.cpp:2712 - ftype: f32
[DEBUG] stable-diffusion.cpp:2718 - loading vocab
[DEBUG] stable-diffusion.cpp:2746 - ggml tensor size = 272 bytes
[DEBUG] stable-diffusion.cpp:2751 - clip params ctx size = 470.72 MB
[DEBUG] stable-diffusion.cpp:2770 - unet params ctx size = 2156.43 MB
[DEBUG] stable-diffusion.cpp:2791 - vae params ctx size = 95.51 MB
[DEBUG] stable-diffusion.cpp:2812 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2828 - loading weights
[DEBUG] stable-diffusion.cpp:2932 - model size = 2719.24MB
[INFO] stable-diffusion.cpp:2941 - total params size = 2719.53MB (clip 469.50MB, unet 2155.53MB, vae 94.51MB)
[INFO] stable-diffusion.cpp:2943 - loading model from 'D:\stable-diffusion.cpp\models\v1-5-pruned-emaonly-ggml-model-f32.bin' completed, taking 43.51s
(Cuda) PS D:\stable-diffusion.cpp>

It stops and there is no output.

build error with ggml

At first I got many errors with the latest version of ggml. Following issue #18, I changed the ggml checkout to ed522bb8051658899b2f4a5bbb5483a5d21fcfb2, but it still gives errors when I build the source:

[ 16%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 33%] Linking C static library libggml.a
[ 33%] Built target ggml
[ 50%] Building CXX object CMakeFiles/stable-diffusion.dir/stable-diffusion.cpp.o
/mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp: In function โ€˜ggml_tensor* ggml_group_norm_32(ggml_context*, ggml_tensor*)โ€™:
/mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp:266:38: error: too many arguments to function โ€˜ggml_tensor* ggml_group_norm(ggml_context*, ggml_tensor*)โ€™
  266 |     return ggml_group_norm(ctx, a, 32);
      |                                      ^
In file included from /mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp:16:
/mnt/d/yeqing/github/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:929:34: note: declared here
  929 |     GGML_API struct ggml_tensor* ggml_group_norm(
      |                                  ^~~~~~~~~~~~~~~
/mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp: In member function โ€˜ggml_tensor* ResBlock::forward(ggml_context*, ggml_tensor*, ggml_tensor*)โ€™:
/mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp:985:47: error: too many arguments to function โ€˜ggml_tensor* ggml_group_norm_inplace(ggml_context*, ggml_tensor*)โ€™
  985 |         h = ggml_group_norm_inplace(ctx, h, 32);
      |                                               ^
In file included from /mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp:16:
/mnt/d/yeqing/github/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:933:34: note: declared here
  933 |     GGML_API struct ggml_tensor* ggml_group_norm_inplace(
      |                                  ^~~~~~~~~~~~~~~~~~~~~~~
/mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp: In member function โ€˜ggml_tensor* UpSample::forward(ggml_context*, ggml_tensor*)โ€™:
/mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp:1480:35: error: too many arguments to function โ€˜ggml_tensor* ggml_upscale(ggml_context*, ggml_tensor*)โ€™
 1480 |         x = ggml_upscale(ctx, x, 2);  // [N, channels, h*2, w*2]
      |                                   ^
In file included from /mnt/d/yeqing/github/stable-diffusion.cpp/stable-diffusion.cpp:16:
/mnt/d/yeqing/github/stable-diffusion.cpp/ggml/src/../include/ggml/ggml.h:1329:34: note: declared here
 1329 |     GGML_API struct ggml_tensor* ggml_upscale(
      |                                  ^~~~~~~~~~~~
make[2]: *** [CMakeFiles/stable-diffusion.dir/build.make:76: CMakeFiles/stable-diffusion.dir/stable-diffusion.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:133: CMakeFiles/stable-diffusion.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Which version of ggml should I use to build from source?
THANKS!
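For what it's worth, a general suggestion rather than anything specific from the maintainer: assuming ggml is tracked as a git submodule of this repository, letting git check out the pinned revision avoids picking a ggml commit by hand.

# Hedged sketch, assuming ggml is a git submodule of stable-diffusion.cpp:
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
git submodule update --init --recursive
cmake -B build && cmake --build build --config Release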

miniSD/nanoSD (256x256 and 128x128 image generation) results

I'm sorry for polluting the GitHub issues with non-bugs, but since that precedent was already set by #1 and there's no Discussions enabled, I thought it may be appropriate to share this here.

Laptop CPUs are always rather underpowered. As said in #15, even old desktop CPUs perform much better than modern mid-range laptops. Even more so, phones and ARM micro-computers are laughably slow.

The sampling can be much sped up by using a lower resolution, but models expectedly perform very poorly at resolutions lower than trained, resulting in colorful abstract shapes only vaguely resembling the expected objects.

But someone on HuggingFace managed to fine-tune models on 256x256 and 128x128 images to the point of getting coherent outputs!

This is great news for CPU inference, since the sampling time was cut in half! The outputs might have looked slightly less detailed, but were perfectly coherent.

I haven't investigated if there are any differences in outputs between stable-diffusion.cpp and official implementation, or if quantization has greater impact at lower resolution, but it does seem promising for real-life usage of this project.

Segmentation fault

Error says:

$ ./bin/sd -m ../models/sd-v1-4-ggml-model-q4_0.bin -p "cat"
[INFO]  stable-diffusion.cpp:2830 - loading model from '../models/sd-v1-4-ggml-model-q4_0.bin'
[INFO]  stable-diffusion.cpp:2858 - model type: SD1.x
[INFO]  stable-diffusion.cpp:2866 - ftype: q4_0
[INFO]  stable-diffusion.cpp:3094 - total params size = 1431.17MB (clip 66.46MB, unet 1270.21MB, vae 94.50MB)
[INFO]  stable-diffusion.cpp:3096 - loading model from '../models/sd-v1-4-ggml-model-q4_0.bin' completed, taking 2.13s
[INFO]  stable-diffusion.cpp:3121 - running in eps-prediction mode
[INFO]  stable-diffusion.cpp:3372 - condition graph use 79.79MB of memory: params 66.46MB, runtime 13.33MB (static 10.40MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:3372 - condition graph use 79.79MB of memory: params 66.46MB, runtime 13.33MB (static 10.40MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:4228 - get_learned_condition completed, taking 0.89s
[INFO]  stable-diffusion.cpp:4244 - start sampling
[INFO]  stable-diffusion.cpp:3565 - sampling using Euler A method
Segmentation fault

I don't know what the issue is, because nothing else is printed.
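A generic way to get more information (standard debugging advice, nothing specific to this project): rebuild with debug symbols and capture a backtrace of the crash.

# Rebuild with debug info and run under gdb to see where the segfault happens.
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j
gdb --args ./bin/sd -m ../models/sd-v1-4-ggml-model-q4_0.bin -p "cat"
# inside gdb:
#   (gdb) run
#   (gdb) bt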

Uncaught exception (regex)

Hello,

I'm assuming that I'm not using this on a fully supported system, but I cannot get it to run on Android with Termux.

I generated a q4_1 version of the standard 1.5 model with the Python script and followed the basic guide to setting everything up.

~/stable-diffusion.cpp/build $ ./bin/sd -m ../models/v1-5-pruned-emaonly-ggml-model-q4_1.bin -p "A cityscape at sunset, oil painting" -v                                
Option:                                                     
n_threads:       4                                      
mode:            txt2img                                
model_path:      ../models/v1-5-pruned-emaonly-ggml-model-q4_1.bin                                              
output_path:     output.png                             
init_img:                                               
prompt:          A cityscape at sunset, oil painting   
negative_prompt:                                        
cfg_scale:       7.00                                   
width:           512                                    
height:          512                                    
sample_method:   eular a                                
sample_steps:    20                                     
strength:        0.75                                   
seed:            42                                 
System Info:                                                
BLAS = 0                                                
SSE3 = 0                                                
AVX = 0                                                 
AVX2 = 0                                                
AVX512 = 0                                              
AVX512_VBMI = 0                                         
AVX512_VNNI = 0                                         
FMA = 0                                                 
NEON = 1                                                
ARM_FMA = 1                                             
F16C = 0                                                
FP16_VA = 0                                             
WASM_SIMD = 0                                           
VSX = 0                                             
[INFO]  stable-diffusion.cpp:2687 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q4_1.bin'      
[DEBUG] stable-diffusion.cpp:2695 - verifying magic     
[DEBUG] stable-diffusion.cpp:2706 - loading hparams     
[INFO]  stable-diffusion.cpp:2712 - ftype: q4_1         
[DEBUG] stable-diffusion.cpp:2718 - loading vocab       
[DEBUG] stable-diffusion.cpp:2746 - ggml tensor size = 272 bytes                                                
[DEBUG] stable-diffusion.cpp:2751 - clip params ctx size =  75.02 MB                                            
[DEBUG] stable-diffusion.cpp:2770 - unet params ctx size =  1287.24 MB                                          
[DEBUG] stable-diffusion.cpp:2791 - vae params ctx size =  95.51 MB                                             
[DEBUG] stable-diffusion.cpp:2812 - preparing memory for the weights                                            
[DEBUG] stable-diffusion.cpp:2828 - loading weights     
[DEBUG] stable-diffusion.cpp:2932 - model size = 1454.34MB                                                      
[INFO]  stable-diffusion.cpp:2941 - total params size = 1454.64MB (clip 73.80MB, unet 1286.34MB, vae 94.51MB)   
[INFO]  stable-diffusion.cpp:2943 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q4_1.bin' completed, taking 1.32s                                      
terminating with uncaught exception of type std::__ndk1::regex_error: The parser did not consume the entire regular expression.                                         
Aborted

Support clip skip

I think most models suggest a specific clip skip value. It would be very useful if it were supported.

OpenCL seems to almost work

@leejet @Green-Sky @ggerganov

I do not know C++ and do not have a solid grasp of how ggml works, but building the repo with cmake -DGGML_CLBLAST=ON seems to work: GPU utilization goes up and it's very fast (10s vs 80s per step on a higher-end CPU). It completes all the steps and finishes sampling too, but then crashes at line 1505 of ggml-opencl.

If it is a matter of spending time to make this work, is it simple enough for one of you to explain what needs to be done? If so, I would be happy to give it a shot, but I don't know where to start.

My limited understanding is that sampling is what takes all the effort, so is there a way to maybe switch from GPU to CPU to save the file? Or am I missing some context/knowledge?

Edit: Fixed typo. Flag used is clblast, not openblas.

Feature request: Exif metadata

I think it's a great feature that tools like automatic1111 write all the parameters from inference into the images' metadata. That way you can re-discover old pictures on your hard drive and see what model hash, seed, and prompt you used.

The README of the stb library mentions this small lib:

(Edit: I think that library is just to read metadata, not write it 😞 )
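Since the generator already writes PNGs, here is one hedged sketch of how the parameters could be embedded without an extra metadata library. Assumptions: zlib is available for crc32, the "parameters" keyword simply mirrors what automatic1111 uses, and add_png_parameters is a made-up helper for this example.

// Hypothetical sketch: insert a PNG tEXt chunk ("parameters" -> prompt, seed,
// etc.) right after the IHDR chunk of an already-written PNG file.
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include <zlib.h>

static void put_be32(std::vector<unsigned char>& out, uint32_t v) {
    out.push_back((v >> 24) & 0xFF);
    out.push_back((v >> 16) & 0xFF);
    out.push_back((v >> 8) & 0xFF);
    out.push_back(v & 0xFF);
}

bool add_png_parameters(const std::string& path, const std::string& params) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::vector<unsigned char> png((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());
    // 8-byte signature + 25-byte IHDR chunk (4 length + 4 type + 13 data + 4 crc)
    const size_t insert_at = 8 + 25;
    if (png.size() < insert_at) return false;

    // tEXt chunk body: keyword, NUL separator, then the text
    std::string data = std::string("parameters") + '\0' + params;
    std::vector<unsigned char> chunk;
    put_be32(chunk, (uint32_t)data.size());
    const unsigned char type[4] = {'t', 'E', 'X', 't'};
    chunk.insert(chunk.end(), type, type + 4);
    chunk.insert(chunk.end(), data.begin(), data.end());
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, type, 4);
    crc = crc32(crc, (const Bytef*)data.data(), (uInt)data.size());
    put_be32(chunk, (uint32_t)crc);

    png.insert(png.begin() + insert_at, chunk.begin(), chunk.end());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write((const char*)png.data(), (std::streamsize)png.size());
    return (bool)out;
}

Something like add_png_parameters("output.png", "prompt: ...; seed: 42") right after the image is saved would then show up in tools that read PNG text chunks.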

Benchmark ?

Can you share how many seconds per iteration (or it/s) you get with your hardware (CPU/GPU/RAM)?

converting .safetensors not working

I am trying to convert v1-5-pruned-emaonly.safetensors, but the generated file is not working.

convert.exe v1-5-pruned-emaonly.safetensors -t q4_0
loading model 'v1-5-pruned-emaonly.safetensors'
model type: checkpoint
Stable Diffusion 1.x - v1-5-pruned-emaonly.safetensors
preprocessing 0 tensors
using embedded vocab
converting 0 tensors
alphas_cumprod computed

CLIP Model Tensor count: 0
UNET Model Tensor count: 0
VAE Model Tensor count: 0

saving gguf file
model saved 'v1-5-pruned-emaonly-q4_0.gguf' correctly.

and then

sd.exe -m v1-5-pruned-emaonly-q4_0.gguf -p "anorange cat, realistic"
[INFO]  stable-diffusion.cpp:3715 - loading model from 'v1-5-pruned-emaonly-q4_0.gguf'
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | v1-5-pruned-emaonly.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: q4_0
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.embeddings.position_embedding.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.embeddings.token_embedding.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm1.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm1.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc1.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc1.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.k_proj.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.k_proj.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.out_proj.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.out_proj.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.q_proj.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.q_proj.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.v_proj.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.v_proj.weight' not in model file

.
.
.

[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.to_out.0.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.to_q.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.to_v.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.ff.net.0.proj.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.ff.net.0.proj.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.ff.net.2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.ff.net.2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.norm1.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.norm1.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.norm2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.norm2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.norm3.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.1.transformer_blocks.0.norm3.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.2.conv.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.8.2.conv.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.emb_layers.1.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.emb_layers.1.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.in_layers.0.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.in_layers.0.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.in_layers.2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.in_layers.2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.out_layers.0.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.out_layers.0.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.out_layers.3.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.out_layers.3.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.skip_connection.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.0.skip_connection.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.norm.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.norm.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.proj_in.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.proj_in.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.proj_out.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.proj_out.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn1.to_k.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn1.to_out.0.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn1.to_out.0.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn1.to_q.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn1.to_v.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_k.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_out.0.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_out.0.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_q.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_v.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.ff.net.0.proj.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.ff.net.0.proj.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.ff.net.2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.ff.net.2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.norm1.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.norm1.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.norm2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.norm2.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.norm3.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.output_blocks.9.1.transformer_blocks.0.norm3.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.time_embed.0.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.time_embed.0.weight' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.time_embed.2.bias' not in model file
[ERROR] stable-diffusion.cpp:3889 - tensor 'model.diffusion_model.time_embed.2.weight' not in model file

bug: tokenizer only matches exact words and no subwords

E.g. spacestation should probably be tokenized as space + station</w>, but right now it is just an unhandled token.

[DEBUG] stable-diffusion.cpp:1077 - parse 'spacestation' to [['spacestation', 1], ]
[DEBUG] stable-diffusion.cpp:469  - split prompt "spacestation" to tokens ["<|endoftext|>", ]

(resulting images have nothing to do with a spacestation)
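For illustration only (a greedy longest-match split, which is a simplification of CLIP's real BPE merge procedure): a toy helper that breaks an out-of-vocabulary word into known sub-tokens, so spacestation can become space + station</w> instead of falling through to <|endoftext|>.

// Toy longest-match subword split; vocab maps token strings (with "</w>" on
// word-final pieces) to ids, unk_token is returned when nothing matches.
#include <string>
#include <unordered_map>
#include <vector>

std::vector<int> split_subwords(const std::string& word,
                                const std::unordered_map<std::string, int>& vocab,
                                int unk_token) {
    std::vector<int> tokens;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        int token = unk_token;
        for (; end > start; --end) {          // try the longest piece first
            std::string piece = word.substr(start, end - start);
            if (end == word.size()) {         // word-final piece carries "</w>"
                auto it = vocab.find(piece + "</w>");
                if (it != vocab.end()) { token = it->second; break; }
            }
            auto it = vocab.find(piece);
            if (it != vocab.end()) { token = it->second; break; }
        }
        if (token == unk_token) { tokens.push_back(unk_token); break; }
        tokens.push_back(token);
        start = end;
    }
    return tokens;
}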

Poor img2img results. Is it ok?

txt2img works fine for me but img2img gives blurry abstract images

(original/initial image attached)

./sd.exe -m v2.bin -p "old two-storied american mansion entrance porch, bushes, second floor, door and windows nailed up with boards" -t 6 --sampling-method dpm++2mv2 --mode img2img -i Untitled.jpg --strength 0.2 --seed -1
(result image for strength 0.2 attached)

With --strength 0.7: (result image attached)

I tried several images and different sampling methods, and tried the negative prompt "blur, blurry"; all give results like these.
The model is 512-base-ema.ckpt (the v2 base model, which works fine for txt2img).

Full output
$ ./sd.exe -m v2.bin -p "old two-storied american mansion entrance porch, bushes, second floor, door and windows nailed up with boards" -t 6 --sampling-method dpm++2mv2 --seed -1 --mode img2img -i Untitled.jpg --strength 0.7 -v
Option:
    n_threads:       6
    mode:            img2img
    model_path:      v2.bin
    output_path:     output.png
    init_img:        Untitled.jpg
    prompt:          old two-storied american mansion entrance porch, bushes, second floor, door and windows nailed up with boards
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   dpm++2mv2
    schedule:        default
    sample_steps:    20
    strength:        0.70
    rng:             cuda
    seed:            28994
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2832 - loading model from 'v2.bin'
[DEBUG] stable-diffusion.cpp:2840 - verifying magic
[DEBUG] stable-diffusion.cpp:2851 - loading hparams
[INFO]  stable-diffusion.cpp:2860 - model type: SD2.x
[INFO]  stable-diffusion.cpp:2868 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2874 - loading vocab
[DEBUG] stable-diffusion.cpp:2902 - ggml tensor size = 320 bytes
[DEBUG] stable-diffusion.cpp:2907 - clip params ctx size =  360.00 MB
[DEBUG] stable-diffusion.cpp:2926 - unet params ctx size =  1406.42 MB
[DEBUG] stable-diffusion.cpp:2947 - vae params ctx size =  179.12 MB
[DEBUG] stable-diffusion.cpp:2968 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2984 - loading weights
[DEBUG] stable-diffusion.cpp:3087 - model size = 1923.54MB
[INFO]  stable-diffusion.cpp:3096 - total params size = 1923.98MB (clip 358.70MB, unet 1405.51MB, vae 159.77MB)
[INFO]  stable-diffusion.cpp:3098 - loading model from 'v2.bin' completed, takin
[DEBUG] stable-diffusion.cpp:3431 - diffusion context need 16.61MB static memory, with work_size needing 5.31MB
[INFO]  stable-diffusion.cpp:3892 - sampling using modified DPM++ (2M) method
[INFO]  stable-diffusion.cpp:3561 - step 1 sampling completed, taking 31.30s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 2 sampling completed, taking 30.70s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 3 sampling completed, taking 32.15s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 4 sampling completed, taking 31.58s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 5 sampling completed, taking 31.49s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 6 sampling completed, taking 32.33s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 7 sampling completed, taking 32.68s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 8 sampling completed, taking 32.13s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 9 sampling completed, taking 31.19s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 10 sampling completed, taking 31.08s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 11 sampling completed, taking 31.47s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 12 sampling completed, taking 33.07s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 13 sampling completed, taking 39.65s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 14 sampling completed, taking 34.51s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3561 - step 15 sampling completed, taking 38.46s
[DEBUG] stable-diffusion.cpp:3565 - diffusion graph use 396.74MB runtime memory: static 16.61MB, dynamic 380.13MB
[DEBUG] stable-diffusion.cpp:3566 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3960 - diffusion graph use 1802.26MB of memory: params 1405.51MB, runtime 396.74MB (static 16.61MB, dynamic 380.13MB)
[DEBUG] stable-diffusion.cpp:3961 - 66560 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:4367 - sampling completed, taking 493.82s
[DEBUG] stable-diffusion.cpp:4131 - vae context need 10.16MB static memory, with work_size needing 0.00MB
[DEBUG] stable-diffusion.cpp:4162 - computing vae graph completed, taking 71.56s
[INFO]  stable-diffusion.cpp:4185 - vae graph use 2220.92MB of memory: params 159.77MB, runtime 2061.16MB (static 10.16MB, dynamic 2051.00MB)
[DEBUG] stable-diffusion.cpp:4186 - 3146752 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:4379 - decode_first_stage completed, taking 71.61s
[INFO]  stable-diffusion.cpp:4393 - img2img completed in 599.98s, use 3535.86MB of memory: peak params memory 1923.98MB, peak runtime memory 2061.16MB
save result image to 'output.png'

--threads not respected by openblas / default --threads to number of CPU cores

By default, OpenBLAS will use the maximum number of available threads.

You can set the OpenBLAS thread count at runtime with:

void goto_set_num_threads(int num_threads);
void openblas_set_num_threads(int num_threads);

https://github.com/xianyi/OpenBLAS#setting-the-number-of-threads-at-runtime

The default for --threads is std::thread::hardware_concurrency(), which returns the maximum number of hardware threads, including hyper-threads. This is not the same as the number of CPU cores. Using threads == cores usually gives the best performance. Here is how you can determine the number of CPU cores: https://github.com/ggerganov/llama.cpp/blob/d783f7982e0e823a2626a9956359c0d36c1a7e21/examples/common.cpp#L34-L68
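
For reference, here is a minimal sketch of how an application could cap OpenBLAS at a fixed thread count before doing the heavy BLAS work. openblas_set_num_threads() is the exported OpenBLAS entry point quoted above; the "physical cores ≈ hardware_concurrency() / 2" guess is only a heuristic for SMT machines (it is wrong on CPUs without hyper-threading) and is not something stable-diffusion.cpp does today:

#include <algorithm>
#include <thread>

// Declared by OpenBLAS (cblas.h); repeated here so the sketch is self-contained.
extern "C" void openblas_set_num_threads(int num_threads);

// Rough guess at the physical core count: assumes 2-way SMT, which is only a
// heuristic and can underestimate on CPUs without hyper-threading.
static int guess_physical_cores() {
    unsigned int hw = std::thread::hardware_concurrency();  // logical threads
    return static_cast<int>(std::max(1u, hw / 2));
}

int main() {
    int n = guess_physical_cores();
    openblas_set_num_threads(n);  // cap OpenBLAS instead of letting it grab every thread
    // ... then pass the same value to sd with `-t n` ...
    return 0;
}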

Windows OpenBLAS build

Could you add it, please? It is supposed to be faster than the AVX2 version, right? Sorry for the dumb question, I'm a newbie to AI technology.

A Dockerfile for the repo?

It's a really amazing project, but I think it would be great to have a Dockerfile to quickly test the app. What do you think?

Windows x64 executable for stable-diffusion.cpp

I tried to compile your app using CMake but encountered an error. I'm a newbie and know nothing about C and C++ compilation. Could you make a binary release for Windows, please?

P.S.: CMake's error was "Invalid character escape '\M'".

Running img2img failed

Thanks for your great work. The text2image mode works fine, but I hit an error when using the image2image mode. Any suggestions?

(mlc)- stable-diffusion.cpp % ./cmake-build-debug/bin/sd --mode img2img -m models/stable-diffusion-nano-2-1-ggml-model-q8_0.bin -p "Cat" -i ./nano_cat_q8_0.png -o ./img2img_output_v21_1.png --strength 0.4
[INFO]  stable-diffusion.cpp:2830 - loading model from 'models/stable-diffusion-nano-2-1-ggml-model-q8_0.bin'
[INFO]  stable-diffusion.cpp:2858 - model type: SD2.x
[INFO]  stable-diffusion.cpp:2866 - ftype: q8_0
[WARN]  stable-diffusion.cpp:3028 - unknown tensor 'cond_stage_model.model.transformer.text_model.embeddings.position_ids' in model file
[INFO]  stable-diffusion.cpp:3094 - total params size = 1923.94MB (clip 358.69MB, unet 1405.49MB, vae 159.76MB)
[INFO]  stable-diffusion.cpp:3096 - loading model from 'models/stable-diffusion-nano-2-1-ggml-model-q8_0.bin' completed, taking 0.86s
[INFO]  stable-diffusion.cpp:3244 - check is_using_v_parameterization_for_sd2 completed, taking 0.99s
[INFO]  stable-diffusion.cpp:3121 - running in eps-prediction mode
[INFO]  stable-diffusion.cpp:4296 - img2img 128x128
[INFO]  stable-diffusion.cpp:4300 - target t_enc is 8 steps
Assertion failed: (sizeof(dst->nb[0]) == sizeof(float)), function asymmetric_pad, file stable-diffusion.cpp, line 1407.
zsh: abort      ./cmake-build-debug/bin/sd --mode img2img -m  -p "Cat" -i ./nano_cat_q8_0.png

The input image was generated by nano-SD2.1 at 128x128 resolution.

I tried the example provided, and the same error occurs.

[INFO]  stable-diffusion.cpp:2830 - loading model from './models/sd-v1-4-ggml-model-f16.bin'
[INFO]  stable-diffusion.cpp:2858 - model type: SD1.x
[INFO]  stable-diffusion.cpp:2866 - ftype: f16
[INFO]  stable-diffusion.cpp:3094 - total params size = 2035.23MB (clip 235.01MB, unet 1640.46MB, vae 159.76MB)
[INFO]  stable-diffusion.cpp:3096 - loading model from './models/sd-v1-4-ggml-model-f16.bin' completed, taking 1.75s
[INFO]  stable-diffusion.cpp:3121 - running in eps-prediction mode
[INFO]  stable-diffusion.cpp:4296 - img2img 512x512
[INFO]  stable-diffusion.cpp:4300 - target t_enc is 0 steps
Assertion failed: (sizeof(dst->nb[0]) == sizeof(float)), function asymmetric_pad, file stable-diffusion.cpp, line 1407.
zsh: abort      ./cmake-build-debug/bin/sd --mode img2img -m  -p "cat with blue eyes" -i  -o 

Edit:
By temporarily commenting out the assertions at lines 1407-1409, it works fine:

//        assert(sizeof(dst->nb[0]) == sizeof(float));
//        assert(sizeof(a->nb[0]) == sizeof(float));
//        assert(sizeof(b->nb[0]) == sizeof(float));
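
For what it's worth, commenting the asserts out probably works because the check itself looks wrong, not because padding is unsafe here: in ggml, nb[0] holds the byte stride of the innermost dimension and is a size_t, so sizeof(dst->nb[0]) compares the size of that field (8 on 64-bit builds) against sizeof(float) (4) and can never hold. A small standalone illustration of the difference, using a stand-in struct rather than the real ggml_tensor:

#include <cassert>
#include <cstddef>

struct fake_tensor {   // stand-in for ggml_tensor, illustration only
    size_t nb[4];      // nb[0] = byte stride of the innermost dimension
};

int main() {
    fake_tensor dst{};
    dst.nb[0] = sizeof(float);  // a contiguous f32 tensor

    // Original check: compares sizeof(size_t) to sizeof(float), which is
    // always false on a 64-bit build, so the assert fires unconditionally.
    // assert(sizeof(dst.nb[0]) == sizeof(float));

    // What the check was presumably meant to verify (contiguous float data):
    assert(dst.nb[0] == sizeof(float));
    return 0;
}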

Also, I found that img2img can't change the resolution of the image. Can we pad the input image so that the output reaches the target resolution?
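
If a different output size really does require a different input size, one workaround is to pad the image yourself before handing it to img2img. A minimal sketch of such pre-processing; nothing here comes from stable-diffusion.cpp, and it simply assumes an 8-bit, row-major RGB buffer, keeping the original image in the top-left corner:

#include <cstdint>
#include <cstring>
#include <vector>

// Pad an 8-bit RGB image to (new_w, new_h) with a solid fill colour.
std::vector<uint8_t> pad_rgb(const uint8_t* src, int w, int h,
                             int new_w, int new_h, uint8_t fill = 0) {
    std::vector<uint8_t> dst(static_cast<size_t>(new_w) * new_h * 3, fill);
    for (int y = 0; y < h && y < new_h; y++) {
        int copy_w = (w < new_w) ? w : new_w;  // clip if the target is narrower
        std::memcpy(&dst[static_cast<size_t>(y) * new_w * 3],
                    &src[static_cast<size_t>(y) * w * 3],
                    static_cast<size_t>(copy_w) * 3);
    }
    return dst;
}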

CUDA cannot generate images

I encountered a strange problem: when running with CUDA, I get a pure green picture, but it works fine on another computer.

sd_cuda.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 7.42s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'


[attached image: output.png (the pure green result described above)]

Please, add support for Segmind Distilled diffusion models

I found ckpt versions of the Segmind distilled diffusion models ( https://github.com/segmind/distill-sd, https://huggingface.co/segmind ):

https://huggingface.co/ClashSAN/small-sd/resolve/main/smallSDdistilled.ckpt
https://huggingface.co/ClashSAN/small-sd/resolve/main/tinySDdistilled.ckpt

I ran the convert.py script from your repo to make a ggml f32 conversion of tinySDdistilled.ckpt. Then I tried to load the generated ggml file in stable-diffusion.cpp but got this error:

[ERROR] stable-diffusion.cpp:2898 - tensor 'model.diffusion_model.output_blocks.1.0.in_layers.0.weight' has wrong shape in model file: got [1920, 1, 1, 1], expected [2560, 1, 1, 1]

Google Pixel 8 Pro error during "cmake --build . --config Release"

This looks like a new error as of Clang 16, according to this article:
https://www.redhat.com/en/blog/new-warnings-and-errors-clang-16
I have Clang version 17.0.5 (target: aarch64-unknown-linux-android24).

~/stable-diffusion.cpp/build $ cmake --build . --config Release
[  7%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:1221:5: warning: implicit conversion increases floating-point precision: 'float32_t' (aka 'float') to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1221 |     GGML_F16_VEC_REDUCE(sumf, sum);
(notes: expanded from macro 'GGML_F16_VEC_REDUCE' (ggml.c:748) -> 'GGML_F32Cx4_REDUCE' (ggml.c:738) -> 'GGML_F32x4_REDUCE' (ggml.c:668) -> 'GGML_F32x4_REDUCE_ONE' / vaddvq_f32 (ggml.c:653))
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:1269:9: warning: implicit conversion increases floating-point precision: 'float32_t' (aka 'float') to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1269 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
(notes: same macro expansion chain as above)
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:3155:6: warning: no previous prototype for function 'ggml_broadcast' [-Wmissing-prototypes]
 3155 | void ggml_broadcast(
(note: declare 'static' if the function is not intended to be used outside of this translation unit)
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:11953:11: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
 11953 |     const so2 = ne00 * ne01;
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:11954:11: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
 11954 |     const so3 = ne00 * ne01 * ne02;
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:11955:11: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
 11955 |     const do2 = ne0 * ne1;
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:11956:11: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
 11956 |     const do3 = ne0 * ne1 * ne2;
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:11948:15: warning: unused variable 'padding_factor' [-Wunused-variable]
 11948 |     const int padding_factor = dst->op_params[0];
/data/data/com.termux/files/home/stable-diffusion.cpp/ggml/src/ggml.c:19127:28: warning: comparison of integers of different signs: 'const size_t' (aka 'const unsigned long') and 'const int' [-Wsign-compare]
 19127 |             if (offset_pad != cur_offset) {
5 warnings and 4 errors generated.
make[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:76: ggml/src/CMakeFiles/ggml.dir/ggml.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:212: ggml/src/CMakeFiles/ggml.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
~/stable-diffusion.cpp/build $
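
The four hard errors come from the implicit-int declarations at ggml.c:11953-11956, which Clang 16+ rejects. A minimal, standalone sketch of the likely local fix (the variable and dimension names are taken from the diagnostics above; everything else is a placeholder so the snippet compiles on its own, and it is not a patch taken from the repository):

#include <cstdint>

// Give the stride constants an explicit integer type instead of relying on
// implicit int, which newer Clang treats as a hard error.
static void compute_strides(int64_t ne00, int64_t ne01, int64_t ne02,
                            int64_t ne0, int64_t ne1, int64_t ne2) {
    const int64_t so2 = ne00 * ne01;         // was: const so2 = ne00 * ne01;
    const int64_t so3 = ne00 * ne01 * ne02;  // was: const so3 = ne00 * ne01 * ne02;
    const int64_t do2 = ne0 * ne1;           // was: const do2 = ne0 * ne1;
    const int64_t do3 = ne0 * ne1 * ne2;     // was: const do3 = ne0 * ne1 * ne2;
    (void)so2; (void)so3; (void)do2; (void)do3;  // silence unused warnings in this sketch
}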

insufficient memory error

The error first appears in master-7620b92; other versions have no problem.

OS: macOS 14, M1, 16 GB RAM

Error message:

[INFO]  stable-diffusion.cpp:3084 - running in eps-prediction mode
ggml_aligned_malloc: insufficient memory (attempted to allocate 2268047721628.89 MB)
GGML_ASSERT: /Volumes/SOFT/Dev/stable-diffusion-cpp/stable-diffusion.cpp/ggml/src/ggml.c:4467: ctx->mem_buffer != NULL

Unable to convert model

I wanted to convert this model. It's a fine-tuned model based on Stable Diffusion 1.5. I got this error message:

python3 convert.py ~/gameIconInstituteV10_v10.safetensors --out_type f16

loading model from ~/gameIconInstituteV10_v10.safetensors
loading model from ~/gameIconInstituteV10_v10.safetensors completed
Stable diffuison 1.x
no alphas_cumprod in file, generate new one
Saving GGML compatible file to /home/user/stable-diffusion.cpp/models/gameIconInstituteV10_v10-ggml-model-f16.bin
Traceback (most recent call last):
  File "/home/user/stable-diffusion.cpp/models/convert.py", line 369, in <module>
    convert(args.model_path, args.out_type, args.out_file)
  File "/home/user/stable-diffusion.cpp/models/convert.py", line 317, in convert
    data = state_dict[name].numpy()
TypeError: Got unsupported ScalarType BFloat16

Is it possible to add support for this kind of model, please?

The updated ggml backend seems broken

I'm using the latest commit, but I got:

ggml_aligned_malloc: insufficient memory (attempted to allocate 12320886367328.45 MB)
GGML_ASSERT: /Users/raykkk/Desktop/llama.cpp/sd/sync_ggml/stable-diffusion.cpp/ggml/src/ggml.c:4767: ctx->mem_buffer != NULL

Edit:
The commit 09cab2a2ae5006718c334d1b0e285c9d655002cb works fine for me.
The bug first appears in fbd18e10593fc71f3825d151bd5d8b0a29f8f8bd.

Is it possible to output images of intermediate sampling steps?

Thanks for your great work!
I wonder if it is possible to output images of intermediate sampling steps. For example, with sampling steps set to 50, I would like to save an image every 5 steps during the diffusion.

If not, could you kindly tell me how to achieve this? (This might be a feature request.) Great thanks! :)
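
There is no such option in the CLI as far as I can tell, but the general shape of the change would be to decode and dump the latent every few steps inside the sampling loop instead of only after the last step. A purely hypothetical sketch: decode_latent() and save_png() below stand in for the real VAE decode and PNG writer and do not exist under those names in stable-diffusion.cpp; they are stubbed out so the snippet is self-contained.

#include <string>

struct latent_image {};  // placeholder for the latent tensor

// Hypothetical stand-ins for "run the latent through the VAE decoder" and
// "write the decoded image to disk".
static latent_image decode_latent(const latent_image& x) { return x; }
static void save_png(const latent_image& /*img*/, const std::string& /*path*/) {}

static void sample_with_snapshots(latent_image x, int steps, int save_every) {
    for (int i = 1; i <= steps; i++) {
        // ... one denoising step that updates x ...
        if (i % save_every == 0 || i == steps) {
            save_png(decode_latent(x), "step_" + std::to_string(i) + ".png");
        }
    }
}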
