
generative-models's Introduction

Generative Models by Stability AI


News

July 24, 2024

  • We are releasing Stable Video 4D (SV4D), a video-to-4D diffusion model for novel-view video synthesis. For research purposes:
    • SV4D was trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 context frames (the input video), and 8 reference views (synthesised from the first frame of the input video, using a multi-view diffusion model like SV3D) of the same size, ideally white-background images with one object.
    • To generate longer novel-view videos (21 frames), we propose a novel sampling method using SV4D, by first sampling 5 anchor frames and then densely sampling the remaining frames while maintaining temporal consistency.
    • To run the community-built gradio demo locally, run python -m scripts.demo.gradio_app_sv4d.
    • Please check our project page, tech report and video summary for more details.

QUICKSTART : python scripts/sampling/simple_video_sample_4d.py --input_path assets/sv4d_videos/test_video1.mp4 --output_folder outputs/sv4d (after downloading sv4d.safetensors and sv3d_u.safetensors from HuggingFace into checkpoints/)

To run SV4D on a single input video of 21 frames:

  • Download SV3D models (sv3d_u.safetensors and sv3d_p.safetensors) from here and SV4D model (sv4d.safetensors) from here to checkpoints/

  • Run python scripts/sampling/simple_video_sample_4d.py --input_path <path/to/video>

    • input_path : The input video <path/to/video> can be
      • a single video file in gif or mp4 format, such as assets/sv4d_videos/test_video1.mp4, or
      • a folder containing images of video frames in .jpg, .jpeg, or .png format, or
      • a file name pattern matching images of video frames.
    • num_steps : default is 20; can be increased to 50 for better quality at the cost of longer sampling time.
    • sv3d_version : To specify the SV3D model to generate reference multi-views, set --sv3d_version=sv3d_u for SV3D_u or --sv3d_version=sv3d_p for SV3D_p.
    • elevations_deg : To generate novel-view videos at a specified elevation (default elevation is 10) using SV3D_p (default is SV3D_u), run python scripts/sampling/simple_video_sample_4d.py --input_path assets/sv4d_videos/test_video1.mp4 --sv3d_version sv3d_p --elevations_deg 30.0
    • Background removal : For input videos with plain background, (optionally) use rembg to remove background and crop video frames by setting --remove_bg=True. To obtain higher quality outputs on real-world input videos with noisy background, try segmenting the foreground object using Clipdrop or SAM2 before running SV4D.
    • Low VRAM environment : To run on GPUs with low VRAM, try setting --encoding_t=1 (number of frames encoded at a time) and --decoding_t=1 (number of frames decoded at a time), or use a lower video resolution such as --img_size=512.


March 18, 2024

  • We are releasing SV3D, an image-to-video model for novel multi-view synthesis, for research purposes:
    • SV3D was trained to generate 21 frames at resolution 576x576, given 1 context frame of the same size, ideally a white-background image with one object.
    • SV3D_u: This variant generates orbital videos based on single image inputs without camera conditioning.
    • SV3D_p: Extending the capability of SV3D_u, this variant accommodates both single images and orbital views, allowing for the creation of 3D videos along specified camera paths.
    • We extend the streamlit demo scripts/demo/video_sampling.py and the standalone python script scripts/sampling/simple_video_sample.py for inference of both models.
    • Please check our project page, tech report and video summary for more details.

To run SV3D_u on a single image:

  • Download sv3d_u.safetensors from https://huggingface.co/stabilityai/sv3d to checkpoints/sv3d_u.safetensors
  • Run python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_u

To run SV3D_p on a single image:

  1. Generate a static orbit at a specified elevation, e.g. 10.0 : python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_p --elevations_deg 10.0
  2. Generate a dynamic orbit at specified elevations and azimuths: specify sequences of 21 elevations (in degrees, within [-90, 90]) to elevations_deg, and 21 azimuths (in degrees, within [0, 360], in sorted order from 0 to 360) to azimuths_deg. For example: python scripts/sampling/simple_video_sample.py --input_path <path/to/image.png> --version sv3d_p --elevations_deg [<list of 21 elevations in degrees>] --azimuths_deg [<list of 21 azimuths in degrees>]
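The two 21-element lists can be generated programmatically. The sketch below builds a constant-elevation orbit with evenly spaced azimuths; this is one valid choice, not necessarily the script's own default spacing, and make_orbit is a hypothetical helper:

```python
def make_orbit(elevation_deg=10.0, n_frames=21):
    """Build n_frames elevations and sorted azimuths in [0, 360) for SV3D_p.

    Sketch only: constant elevation, evenly spaced azimuths. Any sorted
    sequence of 21 azimuths within [0, 360] is accepted by the script.
    """
    elevations = [float(elevation_deg)] * n_frames
    azimuths = [i * 360.0 / n_frames for i in range(n_frames)]
    return elevations, azimuths

elevations, azimuths = make_orbit(30.0)
command = (
    "python scripts/sampling/simple_video_sample.py "
    "--input_path <path/to/image.png> --version sv3d_p "
    f"--elevations_deg '{elevations}' --azimuths_deg '{azimuths}'"
)
```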

To run SVD or SV3D on a streamlit server: streamlit run scripts/demo/video_sampling.py


November 30, 2023

  • Following the launch of SDXL-Turbo, we are releasing SD-Turbo.

November 28, 2023

  • We are releasing SDXL-Turbo, a lightning-fast text-to-image model. Alongside the model, we release a technical report.

    • Usage:
      • Follow the installation instructions or update the existing environment with pip install streamlit-keyup.
      • Download the weights and place them in the checkpoints/ directory.
      • Run streamlit run scripts/demo/turbo.py.


November 21, 2023

  • We are releasing Stable Video Diffusion, an image-to-video model, for research purposes:

    • SVD: This model was trained to generate 14 frames at resolution 576x1024 given a context frame of the same size. We use the standard image encoder from SD 2.1, but replace the decoder with a temporally-aware deflickering decoder.
    • SVD-XT: Same architecture as SVD but finetuned for 25 frame generation.
    • You can run the community-built gradio demo locally by running python -m scripts.demo.gradio_app.
    • We provide a streamlit demo scripts/demo/video_sampling.py and a standalone python script scripts/sampling/simple_video_sample.py for inference of both models.
    • Alongside the model, we release a technical report.


July 26, 2023


July 4, 2023

  • A technical report on SDXL is now available here.

June 22, 2023

  • We are releasing two new diffusion models for research purposes:
    • SDXL-base-0.9: The base model was trained on a variety of aspect ratios on images with resolution 1024^2. The base model uses OpenCLIP-ViT/G and CLIP-ViT/L for text encoding whereas the refiner model only uses the OpenCLIP model.
    • SDXL-refiner-0.9: The refiner has been trained to denoise small noise levels of high quality data and as such is not expected to work as a text-to-image model; instead, it should only be used as an image-to-image model.

If you would like to access these models for your research, please apply using one of the following links: SDXL-0.9-Base model, and SDXL-0.9-Refiner. This means that you can apply via either of the two links, and if you are granted access, you can access both. Please log in to your Hugging Face account with your organization email to request access. We plan to do a full release soon (July).

The codebase

General Philosophy

Modularity is king. This repo implements a config-driven approach where we build and combine submodules by calling instantiate_from_config() on objects defined in yaml configs. See configs/ for many examples.
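The pattern behind instantiate_from_config can be sketched as a minimal re-implementation (the repo's version lives in sgm/util.py and has a few extra conveniences; the timedelta example below is just a stdlib stand-in for a real config target):

```python
import importlib

def get_obj_from_str(string):
    """Import 'pkg.module.ClassName' and return the class/function object."""
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config):
    """Build an object from {'target': 'pkg.mod.Cls', 'params': {...}}.

    Minimal sketch of the repo's config-driven construction pattern.
    """
    if "target" not in config:
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))

# Example: build a stdlib object from a config dict.
delta = instantiate_from_config(
    {"target": "datetime.timedelta", "params": {"days": 2, "hours": 12}}
)
```

In the yaml configs, every such object is a nested mapping with `target` and `params` keys, so arbitrary submodules can be swapped by editing the config alone.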

Changelog from the old ldm codebase

For training, we use PyTorch Lightning, but it should be easy to use other training wrappers around the base modules. The core diffusion model class (formerly LatentDiffusion, now DiffusionEngine) has been cleaned up:

  • No more extensive subclassing! We now handle all types of conditioning inputs (vectors, sequences and spatial conditionings, and all combinations thereof) in a single class: GeneralConditioner, see sgm/modules/encoders/modules.py.
  • We separate guiders (such as classifier-free guidance, see sgm/modules/diffusionmodules/guiders.py) from the samplers (sgm/modules/diffusionmodules/sampling.py), and the samplers are independent of the model.
  • We adopt the "denoiser framework" for both training and inference (the most notable change is probably the option to train continuous-time models):
    • Discrete-time models (denoisers) are simply a special case of continuous-time models (denoisers); see sgm/modules/diffusionmodules/denoiser.py.
    • The following features are now independent: weighting of the diffusion loss function (sgm/modules/diffusionmodules/denoiser_weighting.py), preconditioning of the network (sgm/modules/diffusionmodules/denoiser_scaling.py), and sampling of noise levels during training (sgm/modules/diffusionmodules/sigma_sampling.py).
  • Autoencoding models have also been cleaned up.
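The separation of preconditioning from the raw network can be illustrated with a toy denoiser wrapper. This is a simplified sketch, not the repo's classes; the scalings follow the EDM convention of Karras et al., which denoiser_scaling.py implements as one option:

```python
import torch

class EDMScaling:
    """EDM-style preconditioning: c_skip, c_out, c_in, c_noise from sigma."""
    def __init__(self, sigma_data=0.5):
        self.sigma_data = sigma_data

    def __call__(self, sigma):
        s2 = self.sigma_data ** 2
        c_skip = s2 / (sigma ** 2 + s2)
        c_out = sigma * self.sigma_data / (sigma ** 2 + s2).sqrt()
        c_in = 1.0 / (sigma ** 2 + s2).sqrt()
        c_noise = 0.25 * sigma.log()
        return c_skip, c_out, c_in, c_noise

class Denoiser(torch.nn.Module):
    """Wraps a raw network F into a denoiser D via the chosen scaling:
    D(x, sigma) = c_skip * x + c_out * F(c_in * x, c_noise)."""
    def __init__(self, network, scaling):
        super().__init__()
        self.network = network
        self.scaling = scaling

    def forward(self, noised_input, sigma):
        # broadcast sigma over all non-batch dimensions
        sigma = sigma.view(-1, *([1] * (noised_input.ndim - 1)))
        c_skip, c_out, c_in, c_noise = self.scaling(sigma)
        return (
            self.network(noised_input * c_in, c_noise.flatten()) * c_out
            + noised_input * c_skip
        )
```

Because the scaling, the loss weighting, and the sigma sampling are separate objects, each can be swapped in the config without touching the others.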

Installation:

1. Clone the repo

git clone https://github.com/Stability-AI/generative-models.git
cd generative-models

2. Setting up the virtualenv

This is assuming you have navigated to the generative-models root after cloning it.

NOTE: This is tested under python3.10. For other python versions, you might encounter version conflicts.

PyTorch 2.0

# install required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt

3. Install sgm

pip3 install .

4. Install sdata for training

pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata

Packaging

This repository uses PEP 517 compliant packaging using Hatch.

To build a distributable wheel, install hatch and run hatch build (specifying -t wheel will skip building an sdist, which is not necessary).

pip install hatch
hatch build -t wheel

You will find the built package in dist/. You can install the wheel with pip install dist/*.whl.

Note that the package does not currently specify dependencies; you will need to install the required packages, depending on your use case and PyTorch version, manually.

Inference

We provide a streamlit demo for text-to-image and image-to-image sampling in scripts/demo/sampling.py. We provide file hashes for the complete file as well as for only the saved tensors in the file (see Model Spec for a script to evaluate that). The following models are currently supported:

  • SDXL-base-1.0
    File Hash (sha256): 31e35c80fc4829d14f90153f4c74cd59c90b779f6afe05a74cd6120b893f7e5b
    Tensordata Hash (sha256): 0xd7a9105a900fd52748f20725fe52fe52b507fd36bee4fc107b1550a26e6ee1d7
    
  • SDXL-refiner-1.0
    File Hash (sha256): 7440042bbdc8a24813002c09b6b69b64dc90fded4472613437b7f55f9b7d9c5f
    Tensordata Hash (sha256): 0x1a77d21bebc4b4de78c474a90cb74dc0d2217caf4061971dbfa75ad406b75d81
    
  • SDXL-base-0.9
  • SDXL-refiner-0.9
  • SD-2.1-512
  • SD-2.1-768

Weights for SDXL:

SDXL-1.0: The weights of SDXL-1.0 are available (subject to a CreativeML Open RAIL++-M license) here:

SDXL-0.9: The weights of SDXL-0.9 are available and subject to a research license. If you would like to access these models for your research, please apply using one of the following links: SDXL-base-0.9 model, and SDXL-refiner-0.9. This means that you can apply via either of the two links, and if you are granted access, you can access both. Please log in to your Hugging Face account with your organization email to request access.

After obtaining the weights, place them into checkpoints/. Next, start the demo using

streamlit run scripts/demo/sampling.py --server.port <your_port>

Invisible Watermark Detection

Images generated with our code use the invisible-watermark library to embed an invisible watermark into the model output. We also provide a script to easily detect that watermark. Please note that this watermark is not the same as in previous Stable Diffusion 1.x/2.x versions.

To run the script you need to either have a working installation as above or try an experimental import using only a minimal amount of packages:

python -m venv .detect
source .detect/bin/activate

pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark

The script is then usable in the following ways (don't forget to activate your virtual environment beforehand, e.g. source .pt2/bin/activate):

# test a single file
python scripts/demo/detect.py <your filename here>
# test multiple files at once
python scripts/demo/detect.py <filename 1> <filename 2> ... <filename n>
# test all files in a specific folder
python scripts/demo/detect.py <your folder name here>/*

Training:

We are providing example training configs in configs/example_training. To launch a training, run

python main.py --base configs/<config1.yaml> configs/<config2.yaml>

where configs are merged from left to right (later configs overwrite the same values). This can be used to combine model, training and data configs. However, all of them can also be defined in a single config. For example, to run a class-conditional pixel-based diffusion model training on MNIST, run

python main.py --base configs/example_training/toy/mnist_cond.yaml
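The left-to-right merge can be sketched as a recursive dict merge. The actual implementation relies on OmegaConf; the stand-in below only illustrates the overwrite semantics, and the config keys shown are made up for the example:

```python
def merge_configs(*configs):
    """Merge dicts left to right; later configs overwrite earlier values.

    Illustrative stand-in for the OmegaConf merge performed on the
    --base configs; nested dicts are merged recursively.
    """
    merged = {}
    for cfg in configs:
        for key, value in cfg.items():
            if (
                key in merged
                and isinstance(merged[key], dict)
                and isinstance(value, dict)
            ):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = value
    return merged

model_cfg = {"model": {"base_lr": 1e-4, "target": "sgm.models.DiffusionEngine"}}
train_cfg = {"model": {"base_lr": 3e-5}, "data": {"batch_size": 8}}
cfg = merge_configs(model_cfg, train_cfg)  # the later base_lr wins
```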

NOTE 1: Using the non-toy-dataset configs configs/example_training/imagenet-f8_cond.yaml, configs/example_training/txt2img-clipl.yaml and configs/example_training/txt2img-clipl-legacy-ucg-training.yaml for training will require edits depending on the dataset used (which is expected to be stored in tar files in the webdataset format). To find the parts which have to be adapted, search for comments containing USER: in the respective config.

NOTE 2: This repository supports both pytorch1.13 and pytorch2 for training generative models. However, for autoencoder training, as e.g. in configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml, only pytorch1.13 is supported.

NOTE 3: Training latent generative models (as e.g. in configs/example_training/imagenet-f8_cond.yaml) requires retrieving the checkpoint from Hugging Face and replacing the CKPT_PATH placeholder in this line. The same is to be done for the provided text-to-image configs.

Building New Diffusion Models

Conditioner

The GeneralConditioner is configured through the conditioner_config. Its only attribute is emb_models, a list of different embedders (all inherited from AbstractEmbModel) that are used to condition the generative model. All embedders should define whether or not they are trainable (is_trainable, default False), the classifier-free guidance dropout rate used (ucg_rate, default 0), and an input key (input_key), for example, txt for text-conditioning or cls for class-conditioning. When computing conditionings, the embedder will get batch[input_key] as input. We currently support two- to four-dimensional conditionings, and conditionings of different embedders are concatenated appropriately. Note that the order of the embedders in the conditioner_config is important.
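As a concrete illustration of that contract, here is a minimal, hypothetical class-conditioning embedder. The real AbstractEmbModel interface may differ in details; only the input_key / ucg_rate / is_trainable convention is taken from the text above:

```python
import torch
import torch.nn as nn

class ClassEmbedder(nn.Module):
    """Minimal embedder following the sketched GeneralConditioner contract:
    an input_key selecting batch[input_key], a ucg_rate for classifier-free
    guidance dropout, and an is_trainable flag."""
    def __init__(self, num_classes, embed_dim, input_key="cls",
                 ucg_rate=0.1, is_trainable=True):
        super().__init__()
        self.input_key = input_key
        self.ucg_rate = ucg_rate
        self.is_trainable = is_trainable
        self.embedding = nn.Embedding(num_classes, embed_dim)

    def forward(self, batch):
        emb = self.embedding(batch[self.input_key])
        if self.training and self.ucg_rate > 0:
            # randomly drop conditionings to enable classifier-free guidance
            keep = (torch.rand(emb.shape[0], 1) > self.ucg_rate).float()
            emb = emb * keep
        return emb
```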

Network

The neural network is set through the network_config. This used to be called unet_config, which is not general enough as we plan to experiment with transformer-based diffusion backbones.

Loss

The loss is configured through loss_config. For standard diffusion model training, you will have to set sigma_sampler_config.

Sampler config

As discussed above, the sampler is independent of the model. In the sampler_config, we set the type of numerical solver, the number of steps, and the type of discretization, as well as, for example, guidance wrappers for classifier-free guidance.
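For reference, a sampler_config in this style might look like the following (class paths and values are illustrative; see the files under configs/ for the real ones):

```yaml
sampler_config:
  target: sgm.modules.diffusionmodules.sampling.EulerEDMSampler
  params:
    num_steps: 40
    discretization_config:
      target: sgm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization
    guider_config:
      target: sgm.modules.diffusionmodules.guiders.VanillaCFG
      params:
        scale: 7.5
```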

Dataset Handling

For large-scale training we recommend using the data pipelines from our data pipelines project. The project is contained in the requirements and automatically installed when following the steps from the Installation section. Small map-style datasets should be defined here in the repository (e.g., MNIST, CIFAR-10, ...), and return a dict of data keys/values, e.g.,

example = {"jpg": x,  # this is a tensor -1...1 chw
           "txt": "a beautiful image"}

where we expect images in -1...1, channel-first format.
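A minimal map-style dataset following this convention might look like the toy sketch below (random stand-in images; ToyImageTextDataset is not part of the repo):

```python
import torch
from torch.utils.data import Dataset

class ToyImageTextDataset(Dataset):
    """Minimal map-style dataset returning the expected dict format:
    images as float tensors in [-1, 1], channel-first (C, H, W)."""
    def __init__(self, num_samples=100, size=32):
        self.num_samples = num_samples
        self.size = size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # random stand-in image scaled from [0, 1] to [-1, 1], chw
        x = torch.rand(3, self.size, self.size) * 2.0 - 1.0
        return {"jpg": x, "txt": "a beautiful image"}
```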

generative-models's People

Contributors

akx, benjaminaubin, brycedrennan, chunhanyao-stable, eltociear, gen-ai-experts, jenuk, johngull, lantiga, palp, patrickvonplaten, pharmapsychotic, qp-qp, rromb, timudk, voletiv, ymxie97, yvrjsharma


generative-models's Issues

No module named 'scripts' when trying to do Inference

I am following the inference part of the README step by step, but I get No module named 'scripts' when running

streamlit run scripts/demo/sampling.py --server.port <your_port>

Did anyone meet this issue and solve it?

The environment I chose is pt2.

wrong output for sdxl

I played with the provided streamlit demo and set fp16 for the diffusion model (to save VRAM); however, the output is not correct. For example,


while I can get the correct output for the SD v2.1 768 version.

Can someone help me with this problem?

Potential bug in BasicTransformerBlock

In the forward method of BasicTransformerBlock, the keywords additional_tokens and n_times_crossframe_attn_in_self are ignored, because these keywords are added to kwargs, but kwargs is not passed to checkpoint; only context is.

So it is impossible to change these keywords right now from the default, which might cause unexpected behaviour.

https://github.com/Stability-AI/generative-models/blob/76e549dd94de6a09f9e75eff82d62377274c00f8/sgm/modules/attention.py#L460C47-L460C47

Preview of above:

    def forward(
        self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0
    ):
        kwargs = {"x": x}

        if context is not None:
            kwargs.update({"context": context})

        if additional_tokens is not None:
            kwargs.update({"additional_tokens": additional_tokens})

        if n_times_crossframe_attn_in_self:
            kwargs.update(
                {"n_times_crossframe_attn_in_self": n_times_crossframe_attn_in_self}
            )

        # return mixed_checkpoint(self._forward, kwargs, self.parameters(), self.checkpoint)
        return checkpoint(
            self._forward, (x, context), self.parameters(), self.checkpoint
        )
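The dropped-kwargs behaviour, and one possible fix (binding the non-tensor options with functools.partial before checkpointing), can be reproduced with a self-contained toy. The checkpoint function below is a simplified stand-in that, like the real utility, only forwards the positional inputs tuple; the fix shown is a sketch, not a tested patch against the repo:

```python
from functools import partial

def checkpoint(func, inputs, params, flag):
    """Simplified stand-in for the repo's checkpoint utility: it only
    forwards the positional `inputs` tuple to func."""
    return func(*inputs)

class Block:
    """Toy stand-in for BasicTransformerBlock, illustrating the report."""

    def parameters(self):
        return ()

    def _forward(self, x, context=None, additional_tokens=None,
                 n_times_crossframe_attn_in_self=0):
        # record what actually arrived
        self.seen_n_times = n_times_crossframe_attn_in_self
        return x

    def forward_buggy(self, x, context=None, additional_tokens=None,
                      n_times_crossframe_attn_in_self=0):
        kwargs = {"x": x}
        if context is not None:
            kwargs.update({"context": context})
        if additional_tokens is not None:
            kwargs.update({"additional_tokens": additional_tokens})
        if n_times_crossframe_attn_in_self:
            kwargs.update(
                {"n_times_crossframe_attn_in_self": n_times_crossframe_attn_in_self}
            )
        # kwargs is assembled but only (x, context) is forwarded:
        return checkpoint(self._forward, (x, context), self.parameters(), True)

    def forward_fixed(self, x, context=None, additional_tokens=None,
                      n_times_crossframe_attn_in_self=0):
        # one possible fix: bind the non-tensor options before checkpointing,
        # so only tensors travel through the inputs tuple
        fn = partial(
            self._forward,
            additional_tokens=additional_tokens,
            n_times_crossframe_attn_in_self=n_times_crossframe_attn_in_self,
        )
        return checkpoint(fn, (x, context), self.parameters(), True)
```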

training config for SDXL

Hi, thanks for sharing the model! I'm trying to finetune the SDXL model but found some inconsistencies in the training and inference configs. For example, configs/example_training/txt2img-clipl.yaml has a different conditioner setup than configs/inference/sd_xl_base.yaml. Could you provide a config file suitable for finetuning the model? Thanks!

FrozenT5Embedder

Could you tell me how to use FrozenT5Embedder (line 260 in the sgm/modules/encoders/modules.py file) with the 2.1 model?

Design Choices

Hi! First of all, thanks for releasing such a great model and accompanying paper. Could you clarify a few design choices in SDXL?

  1. Why do you use both the previous CLIP-L and the new OpenCLIP ViT-bigG? Have you tried using only the latter one? Wouldn't it be enough?
  2. The crop-conditioning, while avoiding the generation of too many cropped images, seems to generate more duplicated cases, where the object of interest is present everywhere instead of being a single instance. See these comparisons. I wonder why not use multi-aspect (aka rectangle) training during the whole training process, rather than only during fine-tuning.

Release of SD-XL unclip version

Hi, SD-XL is a great work for image generation with better visual quality and text-image alignment. I found there is a demo REIMAGINE XL which should be an XL version of SD v2.1 unclip. I am just kindly wondering if the weights of unclip version of SD-XL (i.e. image embedding as input for image variations) will be released? Thanks a lot!

ImportError: libGL.so.1 on reproducible environment

Hi, following the provided instructions, I've created a gitpod branch to automate them, but I keep getting this error even after installing the required OS packages:

  File "/workspace/generative-models/.pt2/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/workspace/generative-models/scripts/demo/sampling.py", line 2, in <module>
    from scripts.demo.streamlit_helpers import *
  File "/workspace/generative-models/scripts/demo/streamlit_helpers.py", line 10, in <module>
    from imwatermark import WatermarkEncoder
  File "/workspace/generative-models/.pt2/lib/python3.10/site-packages/imwatermark/__init__.py", line 1, in <module>
    from .watermark import WatermarkEncoder, WatermarkDecoder
  File "/workspace/generative-models/.pt2/lib/python3.10/site-packages/imwatermark/watermark.py", line 5, in <module>
    import cv2
  File "/workspace/generative-models/.pt2/lib/python3.10/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/workspace/generative-models/.pt2/lib/python3.10/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/home/gitpod/.pyenv/versions/3.10.9/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Do I have to download a specific model from huggingface?

Little inefficiencies in "sgm/util.py" code

There are a few parts of the code that are a little inefficient or unnecessary:

  1. Unnecessary Conversion: In the log_txt_as_img function, txts is converted to a NumPy array and then immediately converted to a Torch tensor. This conversion can be skipped, and txts can be directly returned as a NumPy array.

  2. Inefficient Loop: In the log_txt_as_img function, the loop that creates the txt images can be optimized. Instead of creating a new txt image in each iteration, the loop can be simplified by creating a single txt image and then modifying it for each caption.

I am also submitting a PR with these changes.

Autoencoder issues

The published autoencoder weights do not seem to match the model defined in this repo. Specifically when I try to load the weights, the following keys are missing
post_quant_conv.bias
post_quant_conv.weight
quant_conv.bias
quant_conv.weight

Also would be really helpful if the actual yaml config used to train the SDXL autoencoder is published (the example in this repo does not seem to correspond to the original).

Stable Diffusion XL - M1 mac doesn't work with fp16 on tutorial script - LLVM ERROR: Failed to infer result type(s)

Still getting this issue when trying the basic tutorial for SDXL inference (16 GB MacBook Pro M1).

This mostly works (if I strip out the tutorial's recommendation of fp16), but takes forever (iteration time 66 seconds) and then dies on the 50th iteration due to "MPS backend out of memory":

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True)
pipe = pipe.to("mps")
pipe.enable_attention_slicing()
prompt = "An astronaut riding a green horse"
images = pipe(prompt=prompt).images[0]

The recommended call:

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")

results in the error previously mentioned:

loc("varianceEps"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<1x77x1xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
Abort trap: 6

/Users/mike/miniconda3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Running torch 2.0.1, installed from the requirements.txt as per the README on this repo.

Anything I can do? I've got it working successfully on a 1080 Ti and a T4 (just following tutorial with no modifications), but I'm stuck on the M1.

How to get stable-datasets?

I wanted to try the toy training examples, but I ran into a problem: they require a stable-datasets submodule which is not accessible.
Please assist in getting this module.

Thanks.

Is it possible to implement openai-like image edit and image variation with SDXL

We are trying to replace the image generation of our app with SDXL; we have implemented the image generation API in an OpenAI style. Here is our llm-as-openai repo, which tries to replace the OpenAI chat APIs with open-source language models and replace image generation with SDXL.

However, I'm quite new to SDXL. Is it possible to implement the OpenAI Image Edit and the Create image variation with SDXL?

The Image Edit API is used to create an edited or extended image given an original image and a prompt. And the Create image variation is to create a variation of a given image.

Cannot use with error

Easy Diffusion v2.5.48
operating system Arch-based
got:
Error: Could not load the stable-diffusion model! Reason: Missing key unet_config full_key: model.params.unet_config object_type=dict

img2img

I tried passing in an image as I did with earlier models:
images = pipe(prompt=prompt, image=init_img).images[0]
and learned that this doesn't work.

Is there a way to do img2img operations with this model?

Image to Image - Masking Does not work in SDXL 0.9

Hello SD Team, while using the API and playing with the Image to Image / Masking endpoint, I noticed, and it has been confirmed by a developer on Discord, that the masking is being partially ignored.

Can anyone look into this bug and fix it? It's important that nothing behind the black pixels is modified by the diffusion.

Awaiting response.

Thank you.

pip3.exe install -r requirements_pt2.txt error

When I run this command on Windows 11:

$ pip3.exe install -r requirements_pt2.txt
Obtaining taming-transformers from git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers (from -r requirements_pt2.txt (line 38))

......
ERROR: Ignored the following versions that require a different python version: 0.55.2 Requires-Python <3.5; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement triton==2.0.0 (from versions: none)
ERROR: No matching distribution found for triton==2.0.0
(.pt2)

Tierney@BDM /d/Github/generative-models (main)
$ python -V
Python 3.10.6
(.pt2)

Tierney@BDM /d/Github/generative-models (main)
$ pip -V
pip 23.2 from D:\Github\generative-models.pt2\lib\site-packages\pip (python 3.10)
(.pt2)

Tierney@BDM /d/Github/generative-models (main)
$ ^C
(.pt2)

SD XL 0.9 base OutOfMemoryError with 24 GB VRAM GPU

I was wondering how much VRAM is required to use this model. An NVIDIA GeForce RTX 3090 GPU with 24 GB VRAM seems to be insufficient when using the default settings on a clean install of Lubuntu 18.04 with CUDA 11.4 and PyTorch 2.0.1+cu117.

Is this the expected behavior or are there things that can be done to reduce memory usage? I tried varying the resolution, but it did not seem to have much of an effect.

Here is a graph of the memory usage when trying to generate an image at either 512 or 1024 resolution with streamlit run scripts/demo/sampling.py --server.port <your_port>

gpu_usage

The script takes roughly 50 seconds to initialize and uses about 15 GB VRAM until the prompt can be entered. After submitting the prompt, it takes roughly 35 seconds to crash.

This is the corresponding output.

Global seed set to 42
Building a Downsample layer with 2 dims.
  --> settings are: 
 in-chn: 320, out-chn: 320, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
WARNING: SpatialTransformer: Found context dims [2048] of depth 1, which does not match the specified 'depth' of 2. Setting context_dim to [2048, 2048] now.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 2048 and using 10 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 2048 and using 10 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
WARNING: SpatialTransformer: Found context dims [2048] of depth 1, which does not match the specified 'depth' of 2. Setting context_dim to [2048, 2048] now.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 2048 and using 10 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 2048 and using 10 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
Building a Downsample layer with 2 dims.
  --> settings are: 
 in-chn: 640, out-chn: 640, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
WARNING: SpatialTransformer: Found context dims [2048] of depth 1, which does not match the specified 'depth' of 10. Setting context_dim to [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048] now.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 2048 and using 20 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
[... the three lines above repeat identically for each of the 10 transformer blocks ...]
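The attention shapes in these log lines all satisfy `query_dim = heads * dim_head` with a fixed per-head dimension of 64, so the head count is just the channel count divided by 64. A small sketch of that arithmetic (hypothetical helper, not the repo's code):

```python
# The logged attention configs use a fixed per-head dimension of 64, so
# the number of heads is channels // 64. Illustrative helper only.
DIM_HEAD = 64

def num_heads(query_dim: int, dim_head: int = DIM_HEAD) -> int:
    assert query_dim % dim_head == 0, "channels must be divisible by the head dim"
    return query_dim // dim_head
```

This matches the log: 640-channel blocks use 10 heads, 1280-channel blocks use 20.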
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
WARNING: SpatialTransformer: Found context dims [2048] of depth 1, which does not match the specified 'depth' of 10. Setting context_dim to [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048] now.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 2048 and using 20 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
[... the three lines above repeat identically for each of the 10 transformer blocks, and this depth-10 SpatialTransformer construction itself repeats 5 times ...]
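The repeated SpatialTransformer WARNING above is benign: a single context dim is simply broadcast across the transformer depth. A minimal sketch of that normalization logic (hypothetical helper, not the repo's exact code):

```python
# Sketch of the context_dim broadcasting that triggers the WARNING lines
# above: a length-1 context_dim list is replicated to match the depth.
# Hypothetical helper, not the repository's actual implementation.
def broadcast_context_dim(context_dim, depth):
    if context_dim is not None and not isinstance(context_dim, list):
        context_dim = [context_dim]
    if context_dim is not None and len(context_dim) == 1 and depth > 1:
        print(f"WARNING: context dims {context_dim} of depth 1 do not match "
              f"the specified depth of {depth}; broadcasting.")
        context_dim = context_dim * depth
    return context_dim
```

So `[2048]` with depth 10 becomes `[2048] * 10`, exactly as the warning reports.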
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
WARNING: SpatialTransformer: Found context dims [2048] of depth 1, which does not match the specified 'depth' of 2. Setting context_dim to [2048, 2048] now.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 2048 and using 10 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads with a dimension of 64.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 2048 and using 10 heads with a dimension of 64.
BasicTransformerBlock is using checkpointing
[... the block above repeats identically for two more depth-2 SpatialTransformers ...]
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.1.self_attn.q_proj.weight', 'vision_model.encoder.layers.7.mlp.fc2.bias', 'vision_model.encoder.layers.22.mlp.fc2.weight', ...]
[... full list of unused 'vision_model.*', 'visual_projection.weight', 'text_projection.weight', and 'logit_scale' tensors truncated ...]
'vision_model.encoder.layers.16.self_attn.v_proj.bias', 'vision_model.encoder.layers.0.self_attn.out_proj.bias', 'vision_model.encoder.layers.0.self_attn.v_proj.weight', 'vision_model.encoder.layers.17.self_attn.k_proj.bias', 'vision_model.encoder.layers.21.mlp.fc1.weight', 'vision_model.encoder.layers.21.mlp.fc2.weight', 'vision_model.encoder.layers.9.mlp.fc2.weight', 'vision_model.encoder.layers.9.mlp.fc2.bias', 'vision_model.encoder.layers.0.mlp.fc2.bias', 'vision_model.encoder.layers.11.self_attn.out_proj.weight', 'vision_model.encoder.layers.11.self_attn.k_proj.bias', 'vision_model.encoder.layers.17.mlp.fc1.weight', 'vision_model.encoder.layers.11.layer_norm2.bias', 'vision_model.encoder.layers.20.self_attn.v_proj.bias', 'vision_model.encoder.layers.1.mlp.fc1.weight', 'vision_model.encoder.layers.1.self_attn.out_proj.weight', 'vision_model.encoder.layers.21.self_attn.out_proj.bias', 'vision_model.encoder.layers.21.layer_norm2.weight', 'vision_model.encoder.layers.15.layer_norm1.weight', 'vision_model.encoder.layers.12.layer_norm2.weight', 'vision_model.encoder.layers.17.mlp.fc2.bias', 'vision_model.encoder.layers.16.layer_norm1.weight', 'vision_model.encoder.layers.14.layer_norm1.weight', 'vision_model.encoder.layers.14.self_attn.q_proj.bias', 'vision_model.encoder.layers.21.layer_norm1.weight', 'vision_model.encoder.layers.7.layer_norm1.bias', 'vision_model.encoder.layers.6.self_attn.out_proj.bias', 'vision_model.encoder.layers.21.self_attn.q_proj.bias', 'vision_model.encoder.layers.3.mlp.fc2.bias', 'vision_model.encoder.layers.14.mlp.fc1.weight', 'vision_model.encoder.layers.18.self_attn.k_proj.bias', 'vision_model.encoder.layers.10.self_attn.v_proj.bias', 'vision_model.encoder.layers.8.mlp.fc1.weight', 'vision_model.encoder.layers.23.self_attn.v_proj.weight', 'vision_model.encoder.layers.10.mlp.fc2.weight', 'vision_model.encoder.layers.4.self_attn.v_proj.weight', 'vision_model.encoder.layers.2.self_attn.out_proj.bias', 
'vision_model.encoder.layers.3.mlp.fc1.bias', 'vision_model.embeddings.class_embedding', 'vision_model.encoder.layers.9.mlp.fc1.weight', 'vision_model.encoder.layers.12.layer_norm1.bias', 'vision_model.encoder.layers.9.layer_norm2.weight', 'vision_model.encoder.layers.8.self_attn.q_proj.bias', 'vision_model.encoder.layers.3.self_attn.k_proj.bias', 'vision_model.encoder.layers.1.self_attn.k_proj.weight', 'vision_model.encoder.layers.10.layer_norm2.weight', 'vision_model.encoder.layers.10.layer_norm1.bias', 'vision_model.encoder.layers.16.self_attn.out_proj.bias', 'vision_model.encoder.layers.9.self_attn.k_proj.weight', 'vision_model.encoder.layers.18.self_attn.v_proj.bias', 'vision_model.encoder.layers.1.mlp.fc2.bias', 'vision_model.encoder.layers.17.self_attn.out_proj.weight', 'vision_model.encoder.layers.11.mlp.fc1.weight', 'vision_model.encoder.layers.13.mlp.fc2.weight', 'vision_model.encoder.layers.6.self_attn.k_proj.weight', 'vision_model.encoder.layers.12.self_attn.v_proj.bias', 'vision_model.encoder.layers.4.layer_norm1.bias', 'vision_model.encoder.layers.15.self_attn.v_proj.bias', 'vision_model.encoder.layers.2.self_attn.q_proj.weight', 'vision_model.encoder.layers.8.mlp.fc1.bias', 'vision_model.encoder.layers.13.layer_norm1.weight', 'vision_model.encoder.layers.5.self_attn.q_proj.weight', 'vision_model.encoder.layers.12.mlp.fc1.bias', 'vision_model.encoder.layers.3.self_attn.q_proj.bias', 'vision_model.encoder.layers.20.self_attn.q_proj.bias', 'vision_model.encoder.layers.4.mlp.fc2.weight', 'vision_model.encoder.layers.20.self_attn.v_proj.weight', 'vision_model.encoder.layers.1.layer_norm2.weight', 'vision_model.encoder.layers.10.layer_norm1.weight', 'vision_model.encoder.layers.15.self_attn.q_proj.bias', 'vision_model.encoder.layers.18.layer_norm1.bias', 'vision_model.encoder.layers.8.self_attn.k_proj.bias', 'vision_model.encoder.layers.2.self_attn.v_proj.bias', 'vision_model.encoder.layers.15.self_attn.k_proj.weight', 
'vision_model.encoder.layers.15.mlp.fc2.bias', 'vision_model.encoder.layers.2.layer_norm1.bias', 'vision_model.encoder.layers.8.mlp.fc2.weight', 'vision_model.encoder.layers.6.self_attn.v_proj.weight', 'vision_model.encoder.layers.2.layer_norm2.weight', 'vision_model.encoder.layers.12.self_attn.k_proj.weight', 'vision_model.encoder.layers.15.mlp.fc2.weight', 'vision_model.encoder.layers.8.self_attn.out_proj.weight', 'vision_model.encoder.layers.7.self_attn.k_proj.bias', 'vision_model.encoder.layers.20.self_attn.out_proj.bias', 'vision_model.encoder.layers.5.mlp.fc1.weight', 'vision_model.encoder.layers.21.self_attn.q_proj.weight', 'vision_model.encoder.layers.15.self_attn.v_proj.weight', 'vision_model.pre_layrnorm.bias', 'vision_model.encoder.layers.5.layer_norm1.bias', 'vision_model.encoder.layers.14.layer_norm2.bias', 'vision_model.encoder.layers.21.layer_norm2.bias', 'vision_model.encoder.layers.20.mlp.fc1.bias', 'vision_model.encoder.layers.9.layer_norm1.bias', 'vision_model.encoder.layers.4.mlp.fc2.bias', 'vision_model.embeddings.position_ids', 'vision_model.encoder.layers.21.mlp.fc1.bias', 'vision_model.encoder.layers.11.self_attn.v_proj.weight', 'vision_model.encoder.layers.22.self_attn.out_proj.weight', 'vision_model.encoder.layers.21.layer_norm1.bias', 'vision_model.encoder.layers.18.mlp.fc1.bias']
- This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Initialized embedder #0: FrozenCLIPEmbedder with 123060480 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694659841 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #3: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #4: ConcatTimestepEmbedderND with 0 params. Trainable: False
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Loading model from checkpoints/sd_xl_base_0.9.safetensors
unexpected keys:
['denoiser.log_sigmas']
Global seed set to 42
Global seed set to 42
Global seed set to 42
txt [69, 69]
target_size_as_tuple torch.Size([2, 2])
original_size_as_tuple torch.Size([2, 2])
crop_coords_top_left torch.Size([2, 2])
##############################  Sampling setting  ##############################
Sampler: EulerEDMSampler
Discretization: LegacyDDPMDiscretization
Guider: VanillaCFG
Sampling with EulerEDMSampler for 51 steps:  98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▊  | 50/51 [00:27<00:00,  1.79it/s]
2023-07-08 11:23:12.395 Uncaught app exception
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/sdxl/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/home/username/Desktop/generative-models/scripts/demo/sampling.py", line 292, in <module>
    out = run_txt2img(
  File "/home/username/Desktop/generative-models/scripts/demo/sampling.py", line 133, in run_txt2img
    out = do_sample(
  File "/home/username/Desktop/generative-models/scripts/demo/streamlit_helpers.py", line 522, in do_sample
    samples_x = model.decode_first_stage(samples_z)
  File "/home/username/miniconda3/envs/sdxl/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/username/Desktop/generative-models/sgm/models/diffusion.py", line 121, in decode_first_stage
    out = self.first_stage_model.decode(z)
  File "/home/username/Desktop/generative-models/sgm/models/autoencoder.py", line 315, in decode
    dec = self.decoder(z, **decoder_kwargs)
  File "/home/username/miniconda3/envs/sdxl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/username/Desktop/generative-models/sgm/modules/diffusionmodules/model.py", line 728, in forward
    h = self.up[i_level].block[i_block](h, temb, **kwargs)
  File "/home/username/miniconda3/envs/sdxl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/username/Desktop/generative-models/sgm/modules/diffusionmodules/model.py", line 131, in forward
    h = nonlinearity(h)
  File "/home/username/Desktop/generative-models/sgm/modules/diffusionmodules/model.py", line 46, in nonlinearity
    return x * torch.sigmoid(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 23.70 GiB total capacity; 20.91 GiB already allocated; 464.69 MiB free; 21.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
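The allocator hint at the end of the message can be applied before launching the demo; a minimal sketch (the 512 MiB split size is an illustrative value, not a repo recommendation):

```shell
# Allocator hint from the error message; 512 MiB is an illustrative value,
# smaller values trade throughput for less fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```

Set this in the shell that launches `streamlit run scripts/demo/sampling.py`; if the OOM persists, lowering the output resolution or batch size in the demo UI is usually the next step.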

SDXL 1.0 Tutorials - Not An Issue Thread

Design question: Why don't you use v-prediction target?

Hi! First of all, thanks for a very good model. Stable Diffusion v2 used the v-prediction target and argued that it is better than the default epsilon prediction, so why did you go back to the epsilon target for SDXL training?
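For readers unfamiliar with the two targets: with a noised sample x_t = α_t·x0 + σ_t·ε, the ε-objective predicts the noise ε, while the v-objective (Salimans & Ho, "Progressive Distillation") predicts v = α_t·ε − σ_t·x0. A hedged sketch with scalars standing in for image tensors (values are illustrative, α²+σ²=1):

```python
import math

# Illustrative noise level; alpha^2 + sigma^2 = 1 as in the variance-preserving setup.
t = 0.3
alpha, sigma = math.cos(t), math.sin(t)
x0, eps = 0.8, -0.5          # clean sample and Gaussian noise (illustrative scalars)

x_t = alpha * x0 + sigma * eps       # noised sample
v = alpha * eps - sigma * x0         # v-prediction target
# The epsilon-prediction target is simply eps itself.

# Either target lets the sampler recover x0 from x_t:
x0_from_eps = (x_t - sigma * eps) / alpha
x0_from_v = alpha * x_t - sigma * v
```

Both parameterizations are equivalent at convergence; they differ in how loss weight is distributed across noise levels during training.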

Can't get triton package on Windows

Is this repo only for Linux? When I try to install the requirements, it fails at triton.

I checked around and the triton package only supports Linux. Do I need a separate Linux boot to get this repo working, or can I get it working with WSL? Right now I'm stuck in WSL trying to get a disk mounted, but that's a whole other issue down to inexperience. I'll work it out tomorrow. Hopefully someone can chime in before then.

fp16 in the UNet

Hi! SDXL runs out of memory when I finetune the UNet. I noticed it doesn't use fp16 here. Will fp16 training be supported?
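For what it's worth, mixed precision can usually be requested through the Lightning trainer section of the training config rather than the model code; a hedged sketch (the `lightning.trainer` layout follows this repo's config convention, but exact keys and file names may vary):

```yaml
lightning:
  trainer:
    precision: 16   # fp16 mixed precision; try "bf16" on Ampere+ GPUs if fp16 overflows
```

Whether this alone fits SDXL finetuning on a given GPU is a separate question; gradient checkpointing and smaller batch sizes are the usual companions.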

RuntimeError: Error(s) in loading state_dict for LatentDiffusion

Dear devs,
when I change the model to SDXL, I get the error below:
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
size mismatch for model.diffusion_model.time_embed.0.weight: copying a param with shape torch.Size([1536, 384]) from checkpoint, the shape in current model is torch.Size([1280, 320])
…
How can I resolve this problem?
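The size mismatches mean the checkpoint and the instantiated model are different architectures: SDXL's UNet is wider (1536/384 in the checkpoint vs. 1280/320 in the instantiated model), so the SDXL checkpoints need this repo's `sgm` configs rather than an SD 1.x/2.x `LatentDiffusion` config. A hedged sketch of how to spot this kind of mismatch, with plain dicts of shapes standing in for real state dicts:

```python
def shape_mismatches(ckpt_shapes, model_shapes):
    """Return keys present in both dicts whose tensor shapes differ."""
    return {
        k: (ckpt_shapes[k], model_shapes[k])
        for k in ckpt_shapes.keys() & model_shapes.keys()
        if ckpt_shapes[k] != model_shapes[k]
    }

# Shapes taken from the error above: SDXL checkpoint vs. an SD 1.x-style model.
ckpt = {"model.diffusion_model.time_embed.0.weight": (1536, 384)}
model = {"model.diffusion_model.time_embed.0.weight": (1280, 320)}
print(shape_mismatches(ckpt, model))
```

In practice you would build the two shape dicts from `{k: tuple(v.shape) for k, v in state_dict.items()}`; any non-empty result means the config does not match the checkpoint.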

SDXL Different Weights with CLIP2

There seems to be an inconsistency in the text encoder 2 weights. In sd_xl_base_1.0.safetensors, I view conditioner.embedders.1.model.text_projection and it is:

tensor([[-0.0348, 0.0218, -0.0025, ..., -0.0034, -0.0138, -0.0075],
[-0.0037, 0.0202, -0.0014, ..., -0.0082, -0.0011, 0.0170],
[ 0.0261, -0.0221, -0.0099, ..., -0.0233, -0.0178, -0.0061],
...,
[ 0.0042, -0.0068, 0.0086, ..., 0.0008, -0.0030, -0.0042],
[-0.0236, 0.0094, 0.0040, ..., -0.0098, 0.0330, 0.0147],
[ 0.0084, -0.0021, -0.0049, ..., 0.0026, -0.0055, -0.0294]],
dtype=torch.float16)

However if I go into text_encoder_2/model.fp16.safetensors, I see text_projection.weight is:

tensor([[-0.0348, -0.0037, 0.0261, ..., 0.0042, -0.0236, 0.0084],
[ 0.0218, 0.0202, -0.0221, ..., -0.0068, 0.0094, -0.0021],
[-0.0025, -0.0014, -0.0099, ..., 0.0086, 0.0040, -0.0049],
...,
[-0.0034, -0.0082, -0.0233, ..., 0.0008, -0.0098, 0.0026],
[-0.0138, -0.0011, -0.0178, ..., -0.0030, 0.0330, -0.0055],
[-0.0075, 0.0170, -0.0061, ..., -0.0042, 0.0147, -0.0294]],
dtype=torch.float16)

Which one should be correct? They do lead to different results.
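Both are very likely correct: open_clip stores `text_projection` as a matrix applied as `x @ W`, while the Hugging Face export stores it as an `nn.Linear` weight applied as `x @ W.T`, so the two tensors should simply be transposes of each other (note how the rows of one match the columns of the other above). A minimal sketch of that equivalence, with small nested lists standing in for the fp16 tensors:

```python
def transpose(m):
    """Transpose a nested-list matrix."""
    return [list(row) for row in zip(*m)]

def matvec_right(x, w):
    """Compute x @ w, the way open_clip applies text_projection."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

# Toy stand-ins for the two storage conventions (values from the post, truncated).
w_openclip = [[-0.0348, 0.0218], [-0.0037, 0.0202]]   # stored for x @ W
w_hf = transpose(w_openclip)                          # Linear weight, used as x @ W.T

x = [1.0, 2.0]
assert matvec_right(x, w_openclip) == matvec_right(x, transpose(w_hf))
```

If the two real tensors are exact transposes of each other, the projections they produce are identical; genuinely "different results" would then point at something else in the pipeline (dtype, padding, pooling) rather than the weights.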

Question about the finetuning.

Hi, I would like to finetune SDXL on a custom dataset. I merged the inference config with the training config in order to load the pretrained model, but I get an error about target_size_as_tuple.

Where should I put the target_height and target_width in my own dataset?
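The size conditionings are not config entries but batch keys: each `ConcatTimestepEmbedderND` reads its `input_key` (`original_size_as_tuple`, `crop_coords_top_left`, `target_size_as_tuple`) from the training batch, so the dataset should emit them per sample. A hedged stdlib-only sketch of a `__getitem__` (names are illustrative, and real code would return torch tensors, not lists):

```python
class ToySDXLDataset:
    """Illustrative dataset emitting the conditioning keys SDXL's embedders read."""

    def __init__(self, items, target_height=1024, target_width=1024):
        self.items = items                      # e.g. list of (image, caption) pairs
        self.target_size = (target_height, target_width)

    def __getitem__(self, idx):
        image, caption = self.items[idx]
        return {
            "jpg": image,                       # image tensor in the real pipeline
            "txt": caption,
            "original_size_as_tuple": list(self.target_size),  # pre-crop (H, W)
            "crop_coords_top_left": [0, 0],                    # crop offset used
            "target_size_as_tuple": list(self.target_size),    # output (H, W)
        }

ds = ToySDXLDataset([("<image>", "a photo")])
print(ds[0]["target_size_as_tuple"])  # [1024, 1024]
```

The key names must match the `input_key` fields of the embedders in the conditioner config, which is why the error names `target_size_as_tuple` specifically.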

Running SDXL on multi GPUs

Enabling Multi-GPU Support for SDXL

Dear developers,

I am currently using SDXL for my project, and I am encountering some difficulties enabling multi-GPU support. I have four Nvidia 3090 GPUs at my disposal, but so far I have only been able to run the software on one of them; running such a large model on a single 3090 is very slow and memory-consuming.

My attempts to distribute the workload among all GPUs have been unsuccessful. These have included:

  • Attempt: using accelerate to configure multiple GPUs and run.

Unfortunately, these methods have not resulted in successful multi-GPU utilization.

For your reference, here is some information about my setup:

  • Operating System: Ubuntu
  • Python Version: 3.10.12
  • CUDA Version: 12.2
  • Deep Learning Framework: PyTorch 2.0.1

I'm not sure if I'm missing something or if there's an issue with SDXL itself. Therefore, I'm writing to ask if you could provide some guidance on this matter. Is there a built-in way to enable multi-GPU support, or are there any additional steps that I might have overlooked? Thank you!

How convert img2img to inpaint

Hi, this is great work. I now want to write an inpaint script on top of img2img, following the stablediffusion 2.1 project. I need to fuse the image and mask information into the network; the following is the method I am using now, but unfortunately it is wrong. Can you help me?
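For reference, in SD 2.x's inpainting setup the UNet input is the noisy latent concatenated channel-wise with the downsampled mask and the masked-image latent (4 + 1 + 4 = 9 channels), which also requires a UNet configured with `in_channels: 9`. A hedged sketch of the concatenation, with nested lists standing in for `[C, H, W]` tensors:

```python
def concat_channels(*tensors):
    """Channel-wise concat of [C, H, W]-shaped nested lists (analogue of torch.cat on the channel dim)."""
    out = []
    for t in tensors:
        out.extend(t)
    return out

h = w = 2                                                      # tiny latent spatial size
z_noisy = [[[0.0] * w for _ in range(h)] for _ in range(4)]    # 4-ch noisy latent
mask = [[[1.0] * w for _ in range(h)]]                         # 1-ch mask, resized to latent size
z_masked = [[[0.0] * w for _ in range(h)] for _ in range(4)]   # 4-ch latent of the masked image

unet_input = concat_channels(z_noisy, mask, z_masked)
print(len(unet_input))  # 9 channels
```

A common failure mode is concatenating without also widening the UNet's first conv to 9 input channels, which produces exactly the kind of shape error described above.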

sampling.py demo crashing

I am pretty new to this stuff. I got WSL up and running and set up CUDA. I got everything downloaded and I am able to start the streamlit application, but it quickly gets up to 28 GB of my RAM and then the application is killed.

This may be obvious to some but not to me right now. Am I supposed to be changing a setting somewhere to get this working right?

I've got 32gb ram and a 3090ti with 24gb vram. It doesn't seem to be using the gpu much if at all and then I assume it's running out of memory. I am not sure though. I've tried talking to people in the discord but usually my questions get ignored. I don't think many people there are trying to use this repo.

This is what it looks like when it crashes

  • This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Initialized embedder #0: FrozenCLIPEmbedder with 123060480 params. Trainable: False
    Killed

if you need any other information I am happy to provide it, just let me know. I'd really like to learn this and get this running as intended.
