stable-diffusion.cpp

Inference of Stable Diffusion in pure C/C++

Features

  • Plain C/C++ implementation based on ggml, working in the same way as llama.cpp
  • Super lightweight and without external dependencies
  • SD1.x and SD2.x support
  • SD-Turbo support
  • 16-bit, 32-bit float support
  • 4-bit, 5-bit and 8-bit integer quantization support
  • Accelerated memory-efficient CPU inference
    • Only ~2.3 GB is required to generate a 512x512 image with txt2img at fp16 precision; enabling Flash Attention reduces this to ~1.8 GB
  • AVX, AVX2 and AVX512 support for x86 architectures
  • Full CUDA backend for GPU acceleration
  • Can load ckpt, safetensors, and diffusers models/checkpoints, as well as standalone VAE models
    • No need to convert to .ggml or .gguf anymore!
  • Flash Attention for memory usage optimization (CPU only for now)
  • Original txt2img and img2img modes
  • Negative prompt
  • stable-diffusion-webui style tokenizer (not all the features, only token weighting for now)
  • LoRA support, same as stable-diffusion-webui
  • Latent Consistency Models support (LCM/LCM-LoRA)
  • Faster and more memory-efficient latent decoding with TAESD
  • Multiple sampling methods (see --sampling-method under Run)
  • Cross-platform reproducibility (--rng cuda, consistent with the stable-diffusion-webui GPU RNG); see the example after this list
  • Embeds generation parameters into the PNG output as a webui-compatible text string
  • Supported platforms
    • Linux
    • macOS
    • Windows
    • Android (via Termux)
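
A minimal sketch of the reproducibility feature above, assuming the sd-v1-4.ckpt model path used in the examples below: generating the same prompt twice with a fixed seed and --rng cuda should produce identical images.

./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat" -s 42 --rng cuda -o first.png
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat" -s 42 --rng cuda -o second.png
# first.png and second.png should be pixel-identical, and should match
# stable-diffusion-webui output for the same seed when it uses the GPU RNG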

TODO

  • More sampling methods
  • Make inference faster
    • The current implementation of ggml_conv_2d is slow and has high memory usage
    • Implement Winograd Convolution 2D for 3x3 kernel filtering
  • Continue to reduce memory usage (e.g. by quantizing the weights of ggml_conv_2d)
  • Implement BPE Tokenizer
  • Implement Real-ESRGAN upscaler
  • k-quants support

Usage

Get the Code

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp

  • If you have already cloned the repository, you can use the following commands to update it to the latest code:

cd stable-diffusion.cpp
git pull origin master
git submodule init
git submodule update

Download weights
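
For example, the checkpoints used in the examples below can be fetched from Hugging Face. The exact download URLs are assumptions based on the upstream model cards; any compatible .ckpt or .safetensors checkpoint works:

# Stable Diffusion v1.4 (assumed URL)
curl -L -O https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
# Stable Diffusion v1.5, pruned EMA-only (assumed URL)
curl -L -O https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors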

Build

Build from scratch

mkdir build
cd build
cmake ..
cmake --build . --config Release
Using OpenBLAS
cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release
Using CUBLAS

This provides BLAS acceleration using the CUDA cores of your NVIDIA GPU. Make sure to have the CUDA toolkit installed. You can install it from your Linux distro's package manager (e.g. apt install nvidia-cuda-toolkit) or download it from here: CUDA Toolkit. It is recommended to have at least 4 GB of VRAM.

cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release

Using Flash Attention

Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.

cmake .. -DSD_FLASH_ATTN=ON
cmake --build . --config Release

Run

usage: sd [arguments]

arguments:
  -h, --help                         show this help message and exit
  -M, --mode [txt2img or img2img]    generation mode (default: txt2img)
  -t, --threads N                    number of threads to use during computation (default: -1).
                                     If threads <= 0, then threads will be set to the number of CPU physical cores
  -m, --model [MODEL]                path to model
  --vae [VAE]                        path to vae
  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
                                     If not specified, the default is the type of the weight file.
  --lora-model-dir [DIR]             lora model directory
  -i, --init-img [IMAGE]             path to the input image, required by img2img
  -o, --output OUTPUT                path to write result image to (default: ./output.png)
  -p, --prompt [PROMPT]              the prompt to render
  -n, --negative-prompt PROMPT       the negative prompt (default: "")
  --cfg-scale SCALE                  unconditional guidance scale (default: 7.0)
  --strength STRENGTH                strength for noising/unnoising (default: 0.75)
                                     1.0 corresponds to full destruction of information in init image
  -H, --height H                     image height, in pixel space (default: 512)
  -W, --width W                      image width, in pixel space (default: 512)
  --sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, lcm}
                                     sampling method (default: "euler_a")
  --steps  STEPS                     number of sample steps (default: 20)
  --rng {std_default, cuda}          RNG (default: cuda)
  -s SEED, --seed SEED               RNG seed (default: 42, use random seed for < 0)
  -b, --batch-count COUNT            number of images to generate.
  --schedule {discrete, karras}      Denoiser sigma schedule (default: discrete)
  -v, --verbose                      print extra info

Quantization

You can specify the model weight type using the --type parameter. The weights are automatically converted when loading the model.

  • f16 for 16-bit floating-point
  • f32 for 32-bit floating-point
  • q8_0 for 8-bit integer quantization
  • q5_0 or q5_1 for 5-bit integer quantization
  • q4_0 or q4_1 for 4-bit integer quantization
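
For example, to load an f16/f32 checkpoint and quantize its weights to 8-bit integers at load time (a sketch using the model path from the examples below; the other types work the same way):

./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --type q8_0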

txt2img example

./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat"
# ./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat"

Using formats of different precisions will yield results of varying quality.

(Image grid comparing outputs at f32, f16, q8_0, q5_0, q5_1, q4_0, and q4_1 precision.)

img2img example

  • ./output.png is the image generated from the above txt2img pipeline
./bin/sd --mode img2img -m ../models/sd-v1-4.ckpt -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4

with LoRA

  • You can specify the directory where the LoRA weights are stored via --lora-model-dir. If not specified, the default is the current working directory.

  • LoRA is specified via the prompt, just like stable-diffusion-webui.

Here's a simple example:

./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models

../models/marblesh.safetensors or ../models/marblesh.ckpt will be applied to the model

LCM/LCM-LoRA

  • Download LCM-LoRA from https://huggingface.co/latent-consistency/lcm-lora-sdv1-5
  • Specify LCM-LoRA by adding <lora:lcm-lora-sdv1-5:1> to the prompt
  • It's advisable to set --cfg-scale to 1.0 instead of the default 7.0. For --steps, a range of 2-8 steps is recommended. For --sampling-method, lcm or euler_a is recommended.

Here's a simple example:

./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1
(Comparison images: without LCM-LoRA (--cfg-scale 7) vs. with LCM-LoRA (--cfg-scale 1).)

Using TAESD for faster decoding

You can use TAESD to accelerate the decoding of latent images by following these steps:

  • Download the TAESD decoder weights, for example with curl:

curl -L -O https://huggingface.co/madebyollin/taesd/resolve/main/diffusion_pytorch_model.safetensors

  • Specify the model path using the --taesd PATH parameter. Example:

sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../models/diffusion_pytorch_model.safetensors

Docker

Building using Docker

docker build -t sd .

Run

docker run -v /path/to/models:/models -v /path/to/output/:/output sd [args...]
# For example
# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4.ckpt -p "a lovely cat" -v -o /output/output.png

Memory Requirements

precision                                        f32    f16    q8_0   q5_0   q5_1   q4_0   q4_1
Memory (txt2img, 512x512)                        ~2.8G  ~2.3G  ~2.1G  ~2.0G  ~2.0G  ~2.0G  ~2.0G
Memory (txt2img, 512x512) with Flash Attention   ~2.4G  ~1.9G  ~1.6G  ~1.5G  ~1.5G  ~1.5G  ~1.5G

Contributors

Thank you to all the people who have already contributed to stable-diffusion.cpp!

