cloneofsimo / paint-with-words-sd

Implementation of Paint-with-Words with Stable Diffusion: a method from eDiff-I that lets you generate images from a text-labeled segmentation map.

License: MIT License

Python 5.47% Jupyter Notebook 94.53%
diffusion generative-model stable-diffusion

paint-with-words-sd's Introduction

Paint-with-Words, Implemented with Stable Diffusion

Subtle Control of the Image Generation

Notice how, without PwW, the cloud is missing.

Notice how, without PwW, the abandoned city is missing and the road becomes purple as well.

Shift the object: same seed, only the position in the segmentation map differs

"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed."

Notice how nearly all of the composition remains the same, other than the position of the moon.


Recently, researchers from NVIDIA proposed eDiffi. In the paper, they suggest a method that allows "painting with words". Basically, this is like Make-A-Scene, but using only adjusted cross-attention scores. You can see the results and the detailed method in the paper.

Their paper and their method were not open-sourced. Yet paint-with-words can be implemented with Stable Diffusion, since the two models share a common cross-attention module. So, I implemented it with Stable Diffusion.

Installation

pip install git+https://github.com/cloneofsimo/paint-with-words-sd.git

Basic Usage

Before running, fill in the HF_TOKEN variable in the .env file with your Hugging Face token for Stable Diffusion, and call load_dotenv().
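For reference, the .env file only needs the token line; the value below is a placeholder, not a real token:

# .env (placeholder value -- replace with your own Hugging Face token)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

load_dotenv() then reads this file and exposes HF_TOKEN as an environment variable for the library to pick up.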

Prepare a segmentation map and a color-to-label mapping such as the one below. Keys are in (R, G, B) format, and values are tag labels.

{
    (0, 0, 0): "cat,1.0",
    (255, 255, 255): "dog,1.0",
    (13, 255, 0): "tree,1.5",
    (90, 206, 255): "sky,0.2",
    (74, 18, 1): "ground,0.2",
}

Each value needs to be in the format "{label},{strength}", where strength is the additional weight given to the attention score of that label during generation, i.e., a higher strength means the label has more effect.
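Purely for illustration (this is not the repo's internal parser), a value decomposes like this:

# Illustrative only: how a "{label},{strength}" value splits apart.
value = "tree,1.5"
label, strength = value.rsplit(",", 1)
print(label, float(strength))  # tree 1.5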

import dotenv
from PIL import Image

from paint_with_words import paint_with_words

settings = {
    "color_context": {
        (0, 0, 0): "cat,1.0",
        (255, 255, 255): "dog,1.0",
        (13, 255, 0): "tree,1.5",
        (90, 206, 255): "sky,0.2",
        (74, 18, 1): "ground,0.2",
    },
    "color_map_img_path": "contents/example_input.png",
    "input_prompt": "realistic photo of a dog, cat, tree, with beautiful sky, on sandy ground",
    "output_img_path": "contents/output_cat_dog.png",
}


dotenv.load_dotenv()

color_map_image = Image.open(settings["color_map_img_path"]).convert("RGB")
color_context = settings["color_context"]
input_prompt = settings["input_prompt"]

img = paint_with_words(
    color_context=color_context,
    color_map_image=color_map_image,
    input_prompt=input_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    device="cuda:0",
)

img.save(settings["output_img_path"])

There is a minimal, self-contained working example in runner.py. Please have a look!
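To run it (after setting up the .env file as above):

python runner.py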


Weight Scaling

In the paper, they used $w \log (1 + \sigma) \max (Q^T K)$ to scale the attention weight appropriately. However, after a few tests (by CookiePPP), this turned out not to be optimal. You can check out the effect of the functions below:

$w' = w \log (1 + \sigma) std (Q^T K)$

$w' = w \log (1 + \sigma) \max (Q^T K)$

$w' = w \log (1 + \sigma^2) std (Q^T K)$
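As a sketch, the three variants above map onto weight_function lambdas like the ones below, using the same (w, sigma, qk) signature as the examples in this README: w is the per-label strength from the color context, sigma is the current noise level, and qk is the raw cross-attention score tensor. The 0.4 prefactor is just an example scale, not a recommended value.

import math

# Sketches of the three scaling variants; tune the prefactor to taste.
w_std  = lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma) * qk.std()
w_max  = lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma) * qk.max()
w_std2 = lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma**2) * qk.std()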

You can define your own weight function and further tweak the configuration by passing the weight_function argument to paint_with_words.

Example:

import math

# Custom weight function: scale by the standard deviation of the attention scores.
w_f = lambda w, sigma, qk: 0.4 * w * math.log(sigma**2 + 1) * qk.std()

img = paint_with_words(
    color_context=color_context,
    color_map_image=color_map_image,
    input_prompt=input_prompt,
    num_inference_steps=20,
    guidance_scale=7.5,
    device="cuda:0",
    preloaded_utils=loaded,  # optional: preloaded models from pww_load_tools (see below)
    weight_function=w_f,
)

More on the weight functions, but at a higher scale:

$w' = w \log (1 + \sigma) std (Q^T K)$

$w' = w \log (1 + \sigma) \max (Q^T K)$

$w' = w \log (1 + \sigma^2) std (Q^T K)$

Region-based seeding

Following this example, where the random seed for the whole image is 0,

"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed."

the random seeds for 'boat', 'moon', and 'mountain' are set to the various values shown in the top row.

Example:

EXAMPLE_SETTING_4_seed = {
    "color_context": {
        (7, 9, 182): "aurora,0.5,-1",
        (136, 178, 92): "full moon,1.5,-1",
        (51, 193, 217): "mountains,0.4,-1",
        (61, 163, 35): "a half-frozen lake,0.3,-1",
        (89, 102, 255): "boat,2.0,2077",
    },
    "color_map_img_path": "contents/aurora_1.png",
    "input_prompt": "A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed.",
    "output_img_path": "contents/aurora_1_seed_output.png",
}

where the third item of each color context entry is the random seed for that object. Use -1 to follow the seed set in the paint_with_words function. In this example, the random seed of the boat is set to 2077.
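A minimal sketch of running this setting, reusing the imports from the Basic Usage example and assuming paint_with_words accepts a seed argument for the global seed (as paint_with_words_inpaint does in the inpainting example below):

color_map_image = Image.open(EXAMPLE_SETTING_4_seed["color_map_img_path"]).convert("RGB")

img = paint_with_words(
    color_context=EXAMPLE_SETTING_4_seed["color_context"],
    color_map_image=color_map_image,
    input_prompt=EXAMPLE_SETTING_4_seed["input_prompt"],
    num_inference_steps=30,
    guidance_scale=7.5,
    device="cuda:0",
    seed=0,  # global seed; a per-region seed in the color context overrides it, -1 keeps it
)
img.save(EXAMPLE_SETTING_4_seed["output_img_path"])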

Image inpainting

Following the previous example, the figure below shows the results of image inpainting with Paint-with-Words,

where the top row shows an example of editing the moon's size by inpainting. The bottom row shows an example of re-synthesizing the moon by inpainting with the original input color map used for the text-to-image Paint-with-Words.
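The call below expects init_image and mask_image as PIL images. The file names here are placeholders for your own inputs; check runner_inpaint.py for the exact mask convention the repo uses (conventionally, the mask marks the region to repaint).

from PIL import Image

# Placeholder paths -- substitute your own source image and inpainting mask.
init_image = Image.open("init.png").convert("RGB")
mask_image = Image.open("mask.png").convert("L")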

Example

import math

from paint_with_words import paint_with_words_inpaint


img = paint_with_words_inpaint(
    color_context=color_context,
    color_map_image=color_map_image,
    init_image=init_image,
    mask_image=mask_image,
    input_prompt=input_prompt,
    num_inference_steps=150,
    guidance_scale=7.5,
    device="cuda:0",
    seed=81,
    weight_function=lambda w, sigma, qk: 0.15 * w * math.log(1 + sigma) * qk.max(),
    strength=1.0,
)

To run inpainting

python runner_inpaint.py

Using other Fine-tuned models

If you come from the Automatic1111 community, you may be used to native LDM checkpoint formats rather than the diffusers checkpoint format. Luckily, there is a quick script, change_model_path.py, that converts between them:

python change_model_path.py --checkpoint_path custom_model.ckpt --scheduler_type ddim --dump_path custom_model_diffusion_format

Now, use the converted model in the paint_with_words function.

import math

from diffusers import LMSDiscreteScheduler

from paint_with_words import paint_with_words, pww_load_tools

loaded = pww_load_tools(
    "cuda:0",
    scheduler_type=LMSDiscreteScheduler,
    local_model_path="./custom_model_diffusion_format",
)
# ...
img = paint_with_words(
    color_context=color_context,
    color_map_image=color_map_image,
    input_prompt=input_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    device="cuda:0",
    weight_function=lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma) * qk.max(),
    preloaded_utils=loaded,
)

Example Notebooks

You can view the minimal working notebook here, or open it in Colab.


Gradio interface

Paint-with-word

To launch the Gradio app

python gradio_pww.py

Note that the "Color context" field should follow the format defined in the example in runner.py. For example,

{(7, 9, 182): "aurora,0.5,-1",(136, 178, 92): "full moon,1.5,-1",(51, 193, 217): "mountains,0.4,-1",(61, 163, 35): "a half-frozen lake,0.3,-1",(89, 102, 255): "boat,2.0,2077",}

Color content extraction

One can extract the color content from the "Segmentation map" by expanding the "Color content option". Press the "Extract color content" button to extract the unique colors of the image.

In "Color content option", the extracted colors are shown as separate items. One can then replace "obj" with the object that appears in the prompt. Importantly, don't use "," in the object name, as it is the separator of the color content.

Click the "Generate color content" button to collect all the entries into the "Color content" textbox as the final input to Paint-with-Words.

The same functionality is supported for Paint-with-Words image inpainting, as shown below.

Paint-with-word for image inpainting

To launch the Gradio app

python gradio_pww_inpaint.py

Paint-with-Words (PwW) + ControlNet Extension for the AUTOMATIC1111 (A1111) stable-diffusion-webui

This extension provides additional PwW control on top of ControlNet. See sd-webui-controlnet-pww for the repository of this module.

The demo is shown below.

[Screenshot: PwW + ControlNet demo in the A1111 webui]

The implementation is based on the great controlnet extension for A1111

Benchmark of ControlNet + PwW

The following figure shows the comparison between the ControlNet results and the ControlNet+PwW results for the boat examples.

Note that PwW makes the background, e.g. the aurora and mountains, more realistic as the weight function scale increases.

The setup is detailed as follows.

Scribble and Segmentation map:

Prompts:

"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed."

Color contents:

"{(7, 9, 182): "[email protected]@-1",(136, 178, 92): "full [email protected]@-1",(51, 193, 217): "[email protected]@-1",(61, 163, 35): "a half-frozen [email protected]@-1",(89, 102, 255): "[email protected]@-1",}"

Note that the A1111 extension now uses "@" as the separator instead of ",".

Assign the material for the specific region in scribble

One can use PwW to assign a material to a specific region of the scribble; see the results comparing ControlNet and ControlNet+PwW below.

Note that the material of the turtle shell specified by PwW is significantly improved, as shown in the right blocks. Please see sd-webui-controlnet-pww for the experimental setup.

Installation

(1) Clone the source code to A1111 webui extensions

One can install it by cloning the pww_controlnet directory into the extensions directory of the A1111 webui:

cp -rf pww_controlnet path/stable-diffusion-webui/extensions/

or simply

cd path/stable-diffusion-webui/extensions/
git clone git@github.com:lwchen6309/sd-webui-controlnet-pww.git

where path is the location of A1111 webui.

(2) Setup pretrained model of ControlNet

Please follow the instructions of the ControlNet extension to get the pretrained models.

IMPORTANT: This extension is currently NOT compatible with the ControlNet extension, as reported in this issue. Hence, please disable the ControlNet extension before you install ControlNet+PwW.

However, one can still make them compatible by following the installation instructions.

TODO

  • Make extensive comparisons for different weight scaling functions.
  • Create word latent-based cross-attention generations.
  • Check if the statement "making the background weight smaller is better" is justifiable, using some standard metrics.
  • Create AUTOMATIC1111's interface
  • Create Gradio interface
  • Create tutorial
  • See if starting with some "known image latent" is helpful. If it is, we might as well hard-code some initial latent.
  • Region-based seeding, where we set a seed for each region. Can be simply implemented with an extra argument in COLOR_CONTEXT.
  • Sentence-wise text separation. Currently the token is the smallest unit that influences cross-attention. This needs to be fixed. (Can be done pretty trivially.)
  • Allow different models to be used. Use this.
  • "Negative region", where we can set some region to "not" have some semantics. Can be done with classifier-free guidance.
  • Img2ImgPaintWithWords -> Img2Img, but with an extra text segmentation map for better control.
  • InpaintPaintwithWords -> inpaint, but with an extra text segmentation map for better control.
  • Support for other schedulers.

Acknowledgement

Thanks to ControlNet for the inspiring Gradio interface.

Thanks for the wonderful A1111 extension of ControlNet, which serves as the baseline of our implementation.

paint-with-words-sd's People

Contributors

cloneofsimo, jinwonkim93, lwchen6309, shreydan


paint-with-words-sd's Issues

Paint with words with Lora?

Hi, I have been wondering if it is possible to use the concept of paint-with-words for LoRA concepts, considering how much better they are than textual inversion for teaching a model specific concepts. The issue is how differently TIs and LoRAs work: TIs just add themselves to the dictionary of the model, while LoRAs essentially modify the whole model itself. So I was wondering if it's possible to apply LoRAs to specific areas of an image, given how LoRAs work.

Is this something you happen to know about, @lwchen6309? Thanks in advance.

Suggestion about the tool to generate mask image?

Thanks for your great work. I am trying out "paint with words" using your code, but I cannot find an easy way to create the mask image. I have tried Microsoft Paint, but there are many small holes around the edges. Can you share the tool you use to generate the mask image?

Unable to run runner.py

Running the script with python runner.py is stuck and no output is given. I checked and the script is stuck at

from paint_with_words import paint_with_words

Using with lora

Hello,

I imagine I can't just monkeypatch the PwW pipeline with LoRAs. Can you point me to what exactly needs to be adjusted in this repo to make it compatible with your LoRA library?

Thank you for your help!

Conflict between the A1111 extension of PwW+ControlNet and the original ControlNet extension

The current implementation conflicts with the original ControlNet extension, as reported by @mykeehu here.

This is caused by the same arguments being defined in preload.py in both extensions.

def preload(parser):
    parser.add_argument("--controlnet-dir", type=str, help="Path to directory with ControlNet models", default=None)
    parser.add_argument("--no-half-controlnet", action='store_true', help="do not switch the ControlNet models to 16-bit floats (only needed without --no-half)", default=None)

I'm trying to fix this bug and am opening this issue for further discussion, if there is any.

NotImplementedError: Module [ModuleList] is missing the required "forward" function

Hi, have you encountered this problem? I have downloaded the latest diffusers. My torch version is 1.12.1.

Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.13.self_attn.k_proj.weight', 'vision_model.encoder.layers.12.self_attn.k_proj.bias', 'vision_model.encoder.layers.23.self_attn.q_proj.weight', 'vision_model.encoder.layers.6.mlp.fc1.weight', 'vision_model.post_layernorm.bias', 'vision_model.encoder.layers.0.self_attn.q_proj.weight', 'vision_model.encoder.layers.22.layer_norm2.bias', 'vision_model.encoder.layers.1.self_attn.k_proj.bias', 'vision_model.encoder.layers.13.mlp.fc2.weight', 'vision_model.encoder.layers.12.layer_norm2.bias', 'vision_model.encoder.layers.14.self_attn.k_proj.weight', 'vision_model.encoder.layers.8.layer_norm1.bias', 'vision_model.encoder.layers.10.self_attn.q_proj.weight', 'vision_model.encoder.layers.1.layer_norm2.bias', 'vision_model.encoder.layers.17.mlp.fc1.bias', 'vision_model.encoder.layers.12.self_attn.out_proj.weight', 'vision_model.encoder.layers.12.layer_norm2.weight', 'vision_model.post_layernorm.weight', 'vision_model.encoder.layers.4.self_attn.v_proj.bias', 'vision_model.encoder.layers.4.mlp.fc1.bias', 'vision_model.encoder.layers.11.layer_norm1.bias', 'vision_model.encoder.layers.12.layer_norm1.bias', 'vision_model.encoder.layers.7.self_attn.k_proj.weight', 'vision_model.encoder.layers.16.layer_norm2.bias', 'vision_model.encoder.layers.11.self_attn.k_proj.bias', 'vision_model.pre_layrnorm.weight', 'vision_model.encoder.layers.9.self_attn.out_proj.weight', 'vision_model.encoder.layers.14.self_attn.q_proj.bias', 'vision_model.encoder.layers.5.mlp.fc1.weight', 'vision_model.encoder.layers.5.self_attn.v_proj.bias', 'vision_model.encoder.layers.23.layer_norm1.weight', 'vision_model.encoder.layers.15.mlp.fc2.weight', 'vision_model.encoder.layers.6.self_attn.q_proj.bias', 'vision_model.encoder.layers.19.mlp.fc1.weight', 'vision_model.encoder.layers.23.layer_norm2.bias', 'vision_model.encoder.layers.7.self_attn.q_proj.weight', 'vision_model.encoder.layers.18.self_attn.q_proj.bias', 'vision_model.encoder.layers.19.mlp.fc2.weight', 'vision_model.encoder.layers.12.self_attn.out_proj.bias', 'vision_model.encoder.layers.6.mlp.fc1.bias', 'vision_model.encoder.layers.7.self_attn.out_proj.weight', 'vision_model.encoder.layers.2.mlp.fc2.weight', 'vision_model.encoder.layers.0.layer_norm1.bias', 'vision_model.encoder.layers.14.self_attn.out_proj.bias', 'vision_model.encoder.layers.18.self_attn.v_proj.bias', 'vision_model.encoder.layers.18.self_attn.k_proj.bias', 'vision_model.encoder.layers.21.mlp.fc1.bias', 'vision_model.encoder.layers.2.layer_norm1.bias', 'vision_model.encoder.layers.3.self_attn.v_proj.bias', 'vision_model.encoder.layers.0.layer_norm2.bias', 'vision_model.encoder.layers.0.mlp.fc1.weight', 'vision_model.encoder.layers.11.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.self_attn.v_proj.weight', 'vision_model.encoder.layers.0.layer_norm1.weight', 'vision_model.encoder.layers.13.layer_norm2.bias', 'vision_model.encoder.layers.6.layer_norm2.bias', 'vision_model.encoder.layers.12.mlp.fc2.bias', 'vision_model.encoder.layers.8.self_attn.q_proj.weight', 'vision_model.encoder.layers.3.mlp.fc2.weight', 'vision_model.encoder.layers.15.self_attn.q_proj.bias', 'vision_model.encoder.layers.4.mlp.fc2.weight', 'vision_model.encoder.layers.8.layer_norm2.weight', 'vision_model.encoder.layers.2.layer_norm2.weight', 'vision_model.encoder.layers.8.mlp.fc1.bias', 
'vision_model.encoder.layers.3.self_attn.q_proj.weight', 'vision_model.encoder.layers.9.layer_norm1.bias', 'vision_model.encoder.layers.20.layer_norm2.weight', 'vision_model.encoder.layers.21.layer_norm1.bias', 'vision_model.encoder.layers.23.self_attn.k_proj.bias', 'vision_model.encoder.layers.0.self_attn.k_proj.weight', 'vision_model.encoder.layers.11.layer_norm1.weight', 'vision_model.encoder.layers.16.layer_norm2.weight', 'vision_model.encoder.layers.19.mlp.fc2.bias', 'vision_model.encoder.layers.20.mlp.fc2.weight', 'vision_model.encoder.layers.5.mlp.fc1.bias', 'vision_model.encoder.layers.21.self_attn.q_proj.bias', 'vision_model.encoder.layers.14.mlp.fc2.weight', 'vision_model.encoder.layers.9.self_attn.q_proj.bias', 'vision_model.encoder.layers.18.mlp.fc2.weight', 'vision_model.encoder.layers.20.self_attn.out_proj.weight', 'vision_model.encoder.layers.23.mlp.fc2.weight', 'vision_model.encoder.layers.11.self_attn.q_proj.weight', 'vision_model.encoder.layers.23.layer_norm1.bias', 'vision_model.encoder.layers.3.layer_norm1.weight', 'vision_model.encoder.layers.22.self_attn.out_proj.weight', 'vision_model.encoder.layers.16.mlp.fc2.bias', 'vision_model.encoder.layers.5.self_attn.out_proj.bias', 'vision_model.encoder.layers.16.self_attn.v_proj.bias', 'vision_model.encoder.layers.22.mlp.fc1.bias', 'vision_model.encoder.layers.2.self_attn.q_proj.weight', 'vision_model.encoder.layers.4.self_attn.q_proj.bias', 'vision_model.encoder.layers.16.layer_norm1.bias', 'vision_model.encoder.layers.8.mlp.fc1.weight', 'vision_model.encoder.layers.0.mlp.fc2.bias', 'vision_model.encoder.layers.11.self_attn.q_proj.bias', 'vision_model.encoder.layers.15.mlp.fc1.weight', 'vision_model.encoder.layers.12.self_attn.q_proj.bias', 'vision_model.encoder.layers.14.mlp.fc1.weight', 'vision_model.encoder.layers.7.layer_norm1.weight', 'vision_model.encoder.layers.18.mlp.fc1.bias', 'vision_model.encoder.layers.10.layer_norm2.bias', 'vision_model.encoder.layers.3.self_attn.out_proj.bias', 'vision_model.encoder.layers.7.self_attn.q_proj.bias', 'vision_model.encoder.layers.14.mlp.fc2.bias', 'vision_model.encoder.layers.22.layer_norm2.weight', 'vision_model.encoder.layers.3.self_attn.k_proj.weight', 'vision_model.encoder.layers.4.mlp.fc2.bias', 'vision_model.encoder.layers.0.mlp.fc1.bias', 'vision_model.encoder.layers.11.layer_norm2.weight', 'vision_model.encoder.layers.1.self_attn.k_proj.weight', 'vision_model.encoder.layers.13.layer_norm2.weight', 'vision_model.encoder.layers.22.mlp.fc1.weight', 'vision_model.encoder.layers.1.self_attn.v_proj.bias', 'vision_model.encoder.layers.20.self_attn.k_proj.bias', 'vision_model.encoder.layers.5.self_attn.k_proj.bias', 'vision_model.encoder.layers.5.layer_norm1.bias', 'vision_model.encoder.layers.2.self_attn.k_proj.weight', 'vision_model.encoder.layers.21.self_attn.k_proj.weight', 'vision_model.encoder.layers.17.mlp.fc1.weight', 'vision_model.encoder.layers.18.self_attn.k_proj.weight', 'vision_model.encoder.layers.4.mlp.fc1.weight', 'vision_model.encoder.layers.22.self_attn.q_proj.bias', 'vision_model.encoder.layers.9.mlp.fc2.bias', 'vision_model.encoder.layers.19.self_attn.q_proj.bias', 'vision_model.encoder.layers.13.self_attn.q_proj.weight', 'vision_model.encoder.layers.8.layer_norm2.bias', 'vision_model.encoder.layers.14.self_attn.v_proj.bias', 'vision_model.embeddings.position_embedding.weight', 'vision_model.encoder.layers.14.self_attn.v_proj.weight', 'vision_model.encoder.layers.16.self_attn.k_proj.bias', 'vision_model.encoder.layers.20.mlp.fc1.weight', 
'vision_model.encoder.layers.16.self_attn.q_proj.bias', 'vision_model.encoder.layers.12.self_attn.v_proj.bias', 'vision_model.encoder.layers.21.layer_norm2.bias', 'vision_model.encoder.layers.18.self_attn.v_proj.weight', 'vision_model.encoder.layers.18.layer_norm1.weight', 'vision_model.encoder.layers.2.mlp.fc1.weight', 'vision_model.encoder.layers.9.layer_norm2.bias', 'vision_model.encoder.layers.5.self_attn.q_proj.bias', 'vision_model.encoder.layers.10.self_attn.k_proj.weight', 'vision_model.encoder.layers.19.layer_norm2.weight', 'vision_model.encoder.layers.15.self_attn.v_proj.bias', 'vision_model.encoder.layers.0.self_attn.v_proj.bias', 'vision_model.encoder.layers.5.self_attn.k_proj.weight', 'vision_model.encoder.layers.13.self_attn.out_proj.bias', 'vision_model.encoder.layers.4.self_attn.out_proj.bias', 'vision_model.encoder.layers.0.self_attn.q_proj.bias', 'vision_model.encoder.layers.5.layer_norm2.bias', 'vision_model.encoder.layers.8.layer_norm1.weight', 'vision_model.encoder.layers.17.self_attn.out_proj.bias', 'vision_model.encoder.layers.3.mlp.fc2.bias', 'vision_model.encoder.layers.0.mlp.fc2.weight', 'vision_model.encoder.layers.7.mlp.fc2.weight', 'vision_model.encoder.layers.15.layer_norm1.weight', 'vision_model.embeddings.patch_embedding.weight', 'vision_model.encoder.layers.10.layer_norm1.bias', 'vision_model.encoder.layers.20.layer_norm1.bias', 'vision_model.encoder.layers.11.self_attn.v_proj.bias', 'vision_model.encoder.layers.7.self_attn.out_proj.bias', 'vision_model.encoder.layers.10.mlp.fc1.weight', 'vision_model.encoder.layers.5.mlp.fc2.bias', 'vision_model.encoder.layers.22.mlp.fc2.weight', 'vision_model.encoder.layers.23.self_attn.q_proj.bias', 'vision_model.encoder.layers.2.mlp.fc1.bias', 'vision_model.encoder.layers.23.mlp.fc1.weight', 'vision_model.encoder.layers.23.mlp.fc2.bias', 'vision_model.encoder.layers.7.mlp.fc2.bias', 'vision_model.encoder.layers.21.self_attn.v_proj.bias', 'vision_model.pre_layrnorm.bias', 'vision_model.encoder.layers.4.self_attn.out_proj.weight', 'vision_model.encoder.layers.19.layer_norm1.weight', 'vision_model.encoder.layers.21.self_attn.v_proj.weight', 'text_projection.weight', 'vision_model.encoder.layers.1.self_attn.out_proj.weight', 'vision_model.encoder.layers.19.self_attn.out_proj.weight', 'vision_model.encoder.layers.13.layer_norm1.weight', 'vision_model.encoder.layers.13.layer_norm1.bias', 'vision_model.encoder.layers.18.mlp.fc2.bias', 'vision_model.encoder.layers.1.mlp.fc1.weight', 'vision_model.encoder.layers.1.mlp.fc2.bias', 'vision_model.encoder.layers.19.self_attn.k_proj.bias', 'vision_model.encoder.layers.1.mlp.fc1.bias', 'vision_model.encoder.layers.14.mlp.fc1.bias', 'vision_model.embeddings.class_embedding', 'vision_model.encoder.layers.23.self_attn.v_proj.weight', 'vision_model.encoder.layers.9.self_attn.q_proj.weight', 'vision_model.encoder.layers.0.self_attn.out_proj.bias', 'vision_model.encoder.layers.10.self_attn.v_proj.weight', 'visual_projection.weight', 'vision_model.encoder.layers.20.self_attn.k_proj.weight', 'vision_model.encoder.layers.0.self_attn.v_proj.weight', 'vision_model.encoder.layers.13.self_attn.v_proj.bias', 'vision_model.encoder.layers.20.mlp.fc2.bias', 'vision_model.encoder.layers.16.self_attn.k_proj.weight', 'vision_model.encoder.layers.10.self_attn.v_proj.bias', 'vision_model.encoder.layers.7.self_attn.v_proj.bias', 'vision_model.encoder.layers.22.self_attn.out_proj.bias', 'vision_model.encoder.layers.15.self_attn.out_proj.bias', 'vision_model.encoder.layers.10.self_attn.out_proj.bias', 
'vision_model.encoder.layers.15.self_attn.q_proj.weight', 'vision_model.encoder.layers.0.self_attn.k_proj.bias', 'vision_model.encoder.layers.6.mlp.fc2.weight', 'vision_model.encoder.layers.22.mlp.fc2.bias', 'vision_model.encoder.layers.14.layer_norm1.weight', 'vision_model.encoder.layers.23.self_attn.k_proj.weight', 'vision_model.encoder.layers.17.self_attn.q_proj.weight', 'vision_model.encoder.layers.22.self_attn.v_proj.weight', 'vision_model.encoder.layers.1.layer_norm1.weight', 'vision_model.encoder.layers.10.layer_norm2.weight', 'vision_model.encoder.layers.15.mlp.fc2.bias', 'vision_model.encoder.layers.2.self_attn.v_proj.bias', 'vision_model.embeddings.position_ids', 'vision_model.encoder.layers.11.self_attn.out_proj.weight', 'vision_model.encoder.layers.18.self_attn.out_proj.weight', 'vision_model.encoder.layers.17.self_attn.v_proj.bias', 'vision_model.encoder.layers.19.self_attn.v_proj.weight', 'vision_model.encoder.layers.12.mlp.fc1.weight', 'vision_model.encoder.layers.8.self_attn.k_proj.bias', 'vision_model.encoder.layers.21.mlp.fc2.bias', 'vision_model.encoder.layers.13.self_attn.v_proj.weight', 'vision_model.encoder.layers.2.self_attn.q_proj.bias', 'vision_model.encoder.layers.2.layer_norm2.bias', 'vision_model.encoder.layers.8.self_attn.q_proj.bias', 'vision_model.encoder.layers.13.self_attn.k_proj.bias', 'vision_model.encoder.layers.1.layer_norm2.weight', 'vision_model.encoder.layers.8.mlp.fc2.bias', 'vision_model.encoder.layers.23.self_attn.out_proj.weight', 'vision_model.encoder.layers.12.mlp.fc2.weight', 'vision_model.encoder.layers.17.self_attn.v_proj.weight', 'vision_model.encoder.layers.8.self_attn.k_proj.weight', 'vision_model.encoder.layers.17.layer_norm2.weight', 'vision_model.encoder.layers.13.mlp.fc2.bias', 'vision_model.encoder.layers.5.self_attn.q_proj.weight', 'vision_model.encoder.layers.21.self_attn.q_proj.weight', 'vision_model.encoder.layers.7.self_attn.v_proj.weight', 'vision_model.encoder.layers.21.layer_norm1.weight', 'vision_model.encoder.layers.8.mlp.fc2.weight', 'vision_model.encoder.layers.18.layer_norm2.weight', 'vision_model.encoder.layers.2.layer_norm1.weight', 'vision_model.encoder.layers.1.layer_norm1.bias', 'vision_model.encoder.layers.16.mlp.fc1.weight', 'vision_model.encoder.layers.20.self_attn.v_proj.weight', 'vision_model.encoder.layers.6.self_attn.v_proj.bias', 'vision_model.encoder.layers.15.self_attn.out_proj.weight', 'vision_model.encoder.layers.3.self_attn.q_proj.bias', 'vision_model.encoder.layers.6.self_attn.k_proj.bias', 'vision_model.encoder.layers.16.mlp.fc2.weight', 'vision_model.encoder.layers.7.layer_norm2.bias', 'vision_model.encoder.layers.15.self_attn.v_proj.weight', 'vision_model.encoder.layers.16.mlp.fc1.bias', 'vision_model.encoder.layers.21.mlp.fc1.weight', 'vision_model.encoder.layers.23.layer_norm2.weight', 'vision_model.encoder.layers.4.self_attn.v_proj.weight', 'logit_scale', 'vision_model.encoder.layers.13.mlp.fc1.bias', 'vision_model.encoder.layers.5.self_attn.v_proj.weight', 'vision_model.encoder.layers.12.self_attn.q_proj.weight', 'vision_model.encoder.layers.6.self_attn.v_proj.weight', 'vision_model.encoder.layers.15.mlp.fc1.bias', 'vision_model.encoder.layers.23.mlp.fc1.bias', 'vision_model.encoder.layers.10.self_attn.out_proj.weight', 'vision_model.encoder.layers.18.self_attn.q_proj.weight', 'vision_model.encoder.layers.23.self_attn.out_proj.bias', 'vision_model.encoder.layers.1.self_attn.q_proj.bias', 'vision_model.encoder.layers.20.self_attn.out_proj.bias', 
'vision_model.encoder.layers.20.self_attn.q_proj.weight', 'vision_model.encoder.layers.6.layer_norm2.weight', 'vision_model.encoder.layers.5.self_attn.out_proj.weight', 'vision_model.encoder.layers.16.layer_norm1.weight', 'vision_model.encoder.layers.21.self_attn.out_proj.bias', 'vision_model.encoder.layers.16.self_attn.out_proj.bias', 'vision_model.encoder.layers.7.mlp.fc1.weight', 'vision_model.encoder.layers.1.self_attn.v_proj.weight', 'vision_model.encoder.layers.15.self_attn.k_proj.bias', 'vision_model.encoder.layers.3.mlp.fc1.weight', 'vision_model.encoder.layers.11.mlp.fc1.bias', 'vision_model.encoder.layers.21.self_attn.k_proj.bias', 'vision_model.encoder.layers.13.mlp.fc1.weight', 'vision_model.encoder.layers.11.layer_norm2.bias', 'vision_model.encoder.layers.3.layer_norm2.weight', 'vision_model.encoder.layers.19.mlp.fc1.bias', 'vision_model.encoder.layers.11.mlp.fc2.weight', 'vision_model.encoder.layers.9.layer_norm2.weight', 'vision_model.encoder.layers.4.layer_norm2.weight', 'vision_model.encoder.layers.11.self_attn.out_proj.bias', 'vision_model.encoder.layers.17.self_attn.out_proj.weight', 'vision_model.encoder.layers.6.layer_norm1.weight', 'vision_model.encoder.layers.17.self_attn.k_proj.bias', 'vision_model.encoder.layers.14.self_attn.out_proj.weight', 'vision_model.encoder.layers.3.self_attn.out_proj.weight', 'vision_model.encoder.layers.12.self_attn.k_proj.weight', 'vision_model.encoder.layers.18.layer_norm1.bias', 'vision_model.encoder.layers.6.self_attn.out_proj.weight', 'vision_model.encoder.layers.19.layer_norm2.bias', 'vision_model.encoder.layers.11.self_attn.v_proj.weight', 'vision_model.encoder.layers.3.self_attn.v_proj.weight', 'vision_model.encoder.layers.15.layer_norm2.weight', 'vision_model.encoder.layers.19.self_attn.q_proj.weight', 'vision_model.encoder.layers.9.layer_norm1.weight', 'vision_model.encoder.layers.0.layer_norm2.weight', 'vision_model.encoder.layers.2.self_attn.out_proj.bias', 'vision_model.encoder.layers.9.self_attn.v_proj.bias', 'vision_model.encoder.layers.10.layer_norm1.weight', 'vision_model.encoder.layers.20.self_attn.q_proj.bias', 'vision_model.encoder.layers.14.layer_norm2.bias', 'vision_model.encoder.layers.0.self_attn.out_proj.weight', 'vision_model.encoder.layers.23.self_attn.v_proj.bias', 'vision_model.encoder.layers.21.layer_norm2.weight', 'vision_model.encoder.layers.17.self_attn.q_proj.bias', 'vision_model.encoder.layers.9.self_attn.out_proj.bias', 'vision_model.encoder.layers.2.mlp.fc2.bias', 'vision_model.encoder.layers.19.layer_norm1.bias', 'vision_model.encoder.layers.18.self_attn.out_proj.bias', 'vision_model.encoder.layers.1.self_attn.q_proj.weight', 'vision_model.encoder.layers.3.mlp.fc1.bias', 'vision_model.encoder.layers.14.self_attn.q_proj.weight', 'vision_model.encoder.layers.10.self_attn.k_proj.bias', 'vision_model.encoder.layers.6.self_attn.q_proj.weight', 'vision_model.encoder.layers.7.layer_norm2.weight', 'vision_model.encoder.layers.22.self_attn.q_proj.weight', 'vision_model.encoder.layers.9.mlp.fc1.weight', 'vision_model.encoder.layers.4.layer_norm1.bias', 'vision_model.encoder.layers.1.mlp.fc2.weight', 'vision_model.encoder.layers.3.self_attn.k_proj.bias', 'vision_model.encoder.layers.14.layer_norm2.weight', 'vision_model.encoder.layers.12.layer_norm1.weight', 'vision_model.encoder.layers.8.self_attn.v_proj.weight', 'vision_model.encoder.layers.17.layer_norm1.weight', 'vision_model.encoder.layers.4.layer_norm2.bias', 'vision_model.encoder.layers.9.mlp.fc1.bias', 'vision_model.encoder.layers.14.layer_norm1.bias', 
'vision_model.encoder.layers.10.mlp.fc2.bias', 'vision_model.encoder.layers.22.self_attn.k_proj.bias', 'vision_model.encoder.layers.12.mlp.fc1.bias', 'vision_model.encoder.layers.9.self_attn.k_proj.bias', 'vision_model.encoder.layers.19.self_attn.out_proj.bias', 'vision_model.encoder.layers.19.self_attn.v_proj.bias', 'vision_model.encoder.layers.20.layer_norm1.weight', 'vision_model.encoder.layers.6.self_attn.k_proj.weight', 'vision_model.encoder.layers.2.self_attn.out_proj.weight', 'vision_model.encoder.layers.6.self_attn.out_proj.bias', 'vision_model.encoder.layers.4.self_attn.q_proj.weight', 'vision_model.encoder.layers.22.layer_norm1.bias', 'vision_model.encoder.layers.8.self_attn.out_proj.bias', 'vision_model.encoder.layers.19.self_attn.k_proj.weight', 'vision_model.encoder.layers.7.self_attn.k_proj.bias', 'vision_model.encoder.layers.17.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.mlp.fc2.weight', 'vision_model.encoder.layers.7.layer_norm1.bias', 'vision_model.encoder.layers.10.mlp.fc2.weight', 'vision_model.encoder.layers.15.layer_norm2.bias', 'vision_model.encoder.layers.8.self_attn.v_proj.bias', 'vision_model.encoder.layers.2.self_attn.v_proj.weight', 'vision_model.encoder.layers.4.layer_norm1.weight', 'vision_model.encoder.layers.21.self_attn.out_proj.weight', 'vision_model.encoder.layers.5.mlp.fc2.weight', 'vision_model.encoder.layers.17.mlp.fc2.weight', 'vision_model.encoder.layers.18.mlp.fc1.weight', 'vision_model.encoder.layers.13.self_attn.q_proj.bias', 'vision_model.encoder.layers.22.layer_norm1.weight', 'vision_model.encoder.layers.6.layer_norm1.bias', 'vision_model.encoder.layers.17.layer_norm1.bias', 'vision_model.encoder.layers.17.mlp.fc2.bias', 'vision_model.encoder.layers.15.layer_norm1.bias', 'vision_model.encoder.layers.8.self_attn.out_proj.weight', 'vision_model.encoder.layers.16.self_attn.v_proj.weight', 'vision_model.encoder.layers.16.self_attn.q_proj.weight', 'vision_model.encoder.layers.15.self_attn.k_proj.weight', 'vision_model.encoder.layers.6.mlp.fc2.bias', 'vision_model.encoder.layers.22.self_attn.v_proj.bias', 'vision_model.encoder.layers.21.mlp.fc2.weight', 'vision_model.encoder.layers.4.self_attn.k_proj.weight', 'vision_model.encoder.layers.17.layer_norm2.bias', 'vision_model.encoder.layers.14.self_attn.k_proj.bias', 'vision_model.encoder.layers.13.self_attn.out_proj.weight', 'vision_model.encoder.layers.22.self_attn.k_proj.weight', 'vision_model.encoder.layers.2.self_attn.k_proj.bias', 'vision_model.encoder.layers.5.layer_norm2.weight', 'vision_model.encoder.layers.11.mlp.fc2.bias', 'vision_model.encoder.layers.20.mlp.fc1.bias', 'vision_model.encoder.layers.3.layer_norm2.bias', 'vision_model.encoder.layers.12.self_attn.v_proj.weight', 'vision_model.encoder.layers.20.layer_norm2.bias', 'vision_model.encoder.layers.11.mlp.fc1.weight', 'vision_model.encoder.layers.16.self_attn.out_proj.weight', 'vision_model.encoder.layers.4.self_attn.k_proj.bias', 'vision_model.encoder.layers.20.self_attn.v_proj.bias', 'vision_model.encoder.layers.10.mlp.fc1.bias', 'vision_model.encoder.layers.18.layer_norm2.bias', 'vision_model.encoder.layers.10.self_attn.q_proj.bias', 'vision_model.encoder.layers.5.layer_norm1.weight', 'vision_model.encoder.layers.3.layer_norm1.bias', 'vision_model.encoder.layers.7.mlp.fc1.bias', 'vision_model.encoder.layers.1.self_attn.out_proj.bias']

  • This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
0%  0/30 [00:01<?, ?it/s]
Traceback (most recent call last):
  :1 in <module>
  /opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27 in decorate_context
    return func(*args, **kwargs)
  /opt/conda/lib/python3.8/site-packages/torch/amp/autocast_mode.py:12 in decorate_autocast
    return func(*args, **kwargs)
  /workspace/docker/jw93/paint-with-words-sd/paint_with_words/paint_with_words.py:244 in paint_with_words
    noise_pred_text = unet(
  /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1130 in _call_impl
    return forward_call(*input, **kwargs)
  /opt/conda/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py:307 in forward
    sample, res_samples = downsample_block(
  /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1130 in _call_impl
    return forward_call(*input, **kwargs)
  /opt/conda/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py:598 in forward
    hidden_states = attn(hidden_states, encoder_hidden_states=encoder
  /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1130 in _call_impl
    return forward_call(*input, **kwargs)
  /opt/conda/lib/python3.8/site-packages/diffusers/models/attention.py:202 in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, tim
  /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1130 in _call_impl
    return forward_call(*input, **kwargs)
  /opt/conda/lib/python3.8/site-packages/diffusers/models/attention.py:404 in forward
    hidden_states = self.attn1(norm_hidden_states) + hidden_states
  /workspace/docker/jw93/paint-with-words-sd/paint_with_words/paint_with_words.py:83 in inj_forward
    return self.to_out(hidden_states)
  /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1130 in _call_impl
    return forward_call(*input, **kwargs)
  /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:201 in _forward_unimplemented
    raise NotImplementedError(f"Module [{type(self).__name__}] is missing the requ
NotImplementedError: Module [ModuleList] is missing the required "forward" function

SMB World 1-1 Segmentation Map and Key.

0-0
(0,255,255): "Sky,1.0",
(0,0,0): "Black,1.0",
(1,1,1): "Cave,1.0",
(5,5,5): "Green Sewer Pipe,1.0",
(22,22,22): "Red Soil,1.0",
(25,25,25): "Sky,1.0",
(28,28,28): "Grassy Hill,1.0",
(30,30,30): "Green Bush,1.0",
(31,31,31): "Green,1.0",
(35,35,35): "Orange Block,1.0",
(37,37,37): "Orange Bricks,1.0",
(53,53,53): "Castle Door,1.0",
(55,55,55): "Castle,1.0",
(59,59,59): "Mario Bros Question Mark Block,1.0",
(64,64,64): "Fluffy Cloud,1.0",
(73,73,73): "Grassy Cliff Edge,1.0",

I have written a TOOL which dissects PNG 2D videogame maps (NES, SNES, GBA, SEGA ETC) into Segmentation maps / keys for use with this REPO.

EDIT : GOT IT WORKING!

I have also heavily modified an NES Emulator and am working on loading these outputs right back INTO the games so the new graphics replace the old ones in-game.

I have all the maps for SMB1 output like this.
I will post some results here in a little while. :)

  • MiLO

Torch missing

There is an import of torch required that seems to be missing from requirements.txt.

It should be added or documented.

having trouble getting cuda installed with this

I've been using Python 3.10.6. Is that incompatible with this repo?

Traceback (most recent call last):
  D:\paint\runner.py:71 in <module>
    img = paint_with_words(
  D:\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py:27 in decorate_context
    return func(*args, **kwargs)
  D:\Python\Python310\lib\site-packages\torch\amp\autocast_mode.py:14 in decorate_autocast
    return func(*args, **kwargs)
  D:\paint\paint_with_words\paint_with_words.py:255 in paint_with_words
    pww_load_tools(
  D:\paint\paint_with_words\paint_with_words.py:142 in pww_load_tools
    vae.to(device), unet.to(device), text_encoder.to(device)
  D:\Python\Python310\lib\site-packages\torch\nn\modules\module.py:987 in to
    return self._apply(convert)
  D:\Python\Python310\lib\site-packages\torch\nn\modules\module.py:639 in _apply
    module._apply(fn)
  D:\Python\Python310\lib\site-packages\torch\nn\modules\module.py:639 in _apply
    module._apply(fn)
  D:\Python\Python310\lib\site-packages\torch\nn\modules\module.py:662 in _apply
    param_applied = fn(param)
  D:\Python\Python310\lib\site-packages\torch\nn\modules\module.py:985 in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No
  D:\Python\Python310\lib\site-packages\torch\cuda\__init__.py:221 in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Only exception after runner.py

OS: Win10, no CUDA

  1. Token (type: read) was set
  2. file ".env" available
  3. runner.py was started (unchanged), just being curious what would happen

Result: only a cascade of exceptions on the command line, as follows:

Traceback (most recent call last):
File "E:\Programme\Anaconda3\envs\apps\lib\site-packages\huggingface_hub\utils_errors.py", line 213, in hf_raise_for_status
response.raise_for_status()
File "E:\Programme\Anaconda3\envs\apps\lib\site-packages\requests\models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/vae/diffusion_pytorch_model.bin

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "E:\Programme\Anaconda3\envs\apps\lib\site-packages\diffusers\modeling_utils.py", line 394, in from_pretrained
model_file = hf_hub_download(
File "E:\Programme\Anaconda3\envs\apps\lib\site-packages\huggingface_hub\file_download.py", line 1053, in hf_hub_download
metadata = get_hf_file_metadata(
File "E:\Programme\Anaconda3\envs\apps\lib\site-packages\huggingface_hub\file_download.py", line 1359, in get_hf_file_metadata
hf_raise_for_status(r)
File "E:\Programme\Anaconda3\envs\apps\lib\site-packages\huggingface_hub\utils_errors.py", line 254, in hf_raise_for_status
raise HfHubHTTPError(str(HTTPError), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: <class 'requests.exceptions.HTTPError'> (Request ID: iJKkSfoiLnBa4bMVHLKZ2)

Incredibly High VRAM Usage

When running 2 ControlNet models on a 768x768 SD 1.5 model image generation through WebUI, my VRAM usage sits at around 6gb, and then goes up to 8gb when using the hires fix to continue generating at 1420x1420.

When I do the same setup but with PWW enabled, my VRAM usage spikes up to 21gb during the 768x768 generation, and then crashes with an OOM error when trying to use the hires fix at 1420x1420, saying that it tried to allocate 29gb of VRAM (!!!), which is more than my 3090 has.

I don't recall having this same issue when running a generation at 512x512 and then hires fixing at 1024x1024 around a month ago; I'm just going based on memory, but I'm pretty sure it only used 12gb and 16gb respectively. I don't currently have a 512 model to test it on at the moment, but running a 768 model at 512x512 resulted in 14gb of VRAM usage, and a crash with an OOM error trying to allocate 2gb when hires fixed to 1024x1024.

Not sure if this is intentional behavior, but if it is, this would make the extension pretty much useless to anyone with any consumer grade GPU, even the highest end available.
But if it's not intentional behavior, I thought I'd bring it up.

AUTOMATIC1111 extension of PwW + ControlNet

Hi @cloneofsimo, I just implemented a PwW extension for the AUTOMATIC1111 webui here. (I haven't updated any readme yet.)

Now the issue is that my implementation combines PwW and ControlNet, and is based on the ControlNet extension, which makes the code much more complicated.

I wonder, shall we import the whole PwW+ControlNet repo as a submodule here, or just leave it as an independent repo? What do you say?

For the UI of PwW+ControlNet, please see the following example:

[Screenshot: PwW + ControlNet UI in the A1111 webui]

I can't seem to run this

I am getting this error, I don't know if I have done something wrong.

File "B:\paint-with-words-sd\paint_with_words\paint_with_words.py", line 174, in _tokens_img_attention_weight
assert is_in == 1, f"token {v_as_tokens} not found in text"


"CROSS_ATTENTION_WEIGHT_6400" Error

I hit the error 'CROSS_ATTENTION_WEIGHT_6400', but I don't know how to fix this issue.

Could you tell me what I should do next?

#Executed script
import math
import os
import dotenv
from PIL import Image
from paint_with_words import paint_with_words

settings = {
    "color_context": {
        (7, 9, 175): "sky,1.1",
        (145, 177, 102): "moon,0.8",
        (98, 190, 214): "mountain,0.5",
        (90, 161, 58): "ground,0.2",
        (90, 102, 246): "lake,0.3"
    },
    "color_map_img_path": "input.png",
    "input_prompt": "illustration of a beautiful sky and moon with snowy mountain and tiny lake on sandy ground",
    "output_img_path": "/output.png",
}

if __name__ == "__main__":
    
    try:
        
        dotenv.load_dotenv()

        color_map_image = Image.open(settings["color_map_img_path"]).convert("RGB")
        color_context = settings["color_context"]
        input_prompt = settings["input_prompt"]

        img = paint_with_words(
            color_context=color_context,
            color_map_image=color_map_image,
            input_prompt=input_prompt,
            num_inference_steps=30,
            guidance_scale=7.5,
            device="cuda:0"
            #weight_function=lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma) * qk.max(),
        )

        img.save(settings["output_img_path"])

    except Exception as e:
        print(f'Error:{e}')
#output
CompVis/stable-diffusion-v1-4
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.13.self_attn.q_proj.weight', 'vision_model.encoder.layers.16.self_attn.q_proj.weight', ...] (long list of vision-tower weight names omitted)
- This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
-------
0% 0/30 [00:00<?, ?it/s]
-------
Error:'CROSS_ATTENTION_WEIGHT_6400'
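The missing key 6400 would correspond to an 80x80 cross-attention map, i.e. a 640x640 input somewhere, while the per-resolution weight tensors are precomputed from the color map. This is not confirmed from the log alone, but a plausible first check is whether the color map is 512x512 (or at least matches the generation size and is a multiple of 64). A minimal sketch of that check, reusing the settings dict from the script above:

import math
from PIL import Image

# Hedged sketch: inspect / normalize the color-map size before calling
# paint_with_words. Forcing 512x512 (the default SD 1.x resolution) is a
# reasonable first experiment; NEAREST keeps the label colors exact.
color_map_image = Image.open(settings["color_map_img_path"]).convert("RGB")
print("color map size:", color_map_image.size)

if color_map_image.size != (512, 512):
    color_map_image = color_map_image.resize((512, 512), resample=Image.NEAREST)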

Cannot reproduce results

Hi, when running runner.py I obtain different results from the ones shown in the readme file.

(output image: aurora_0.5-full moon_1.5-mountains_0.4-a half-frozen lake_0.3-boat_2.0)

This is what I got using EXAMPLE_SETTING_3.
I tried increasing the strengths of "full moon" and "boat"; I am able to obtain the moon, but the boat never shows up.

Any guess?

Suggestion

Make this an extension for the AUTOMATIC1111 webui and development will grow exponentially...

diffusers multicontrolnet pipeline with paint with words

Hi, I'm not at your level and was wondering how I could add paint-with-words to my multi-ControlNet pipeline. Here's code that works, for example (partial):

import torch
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)

# Three ControlNets applied together (multi-ControlNet).
controlnet = [
    ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    safety_checker=None,
    requires_safety_checker=False,
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()

# prompt, images (one conditioning image per ControlNet), n_prompt and
# weights (per-ControlNet conditioning scales) are defined elsewhere.
fimage = pipe(
    prompt,
    images,
    num_inference_steps=20,
    negative_prompt=n_prompt,
    controlnet_conditioning_scale=weights,
)

I would appreciate your input on it and am ready to pay if necessary

I am trying to run this on Stable Diffusion 2.1 but I keep getting black images

Trying to run the code on Stable Diffusion 2.1 returns black images (filled with NaN values).

After investigating, noise_pred_text gets all -Inf values on SD 2.1, whereas on 1.4 it gets valid values.

Any idea what changed between the two that might have caused this?

(Running on Linux; I am able to run SD 2.1 image generation and DreamBooth training without issues.)
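One way to narrow this down (a rough diagnostic, not a confirmed fix): SD 2.x is generally more prone to fp16 overflow than 1.x, and an overflow anywhere along the attention path would propagate -Inf/NaN all the way to a black decoded image. Checking each intermediate tensor for non-finite values, and retrying in float32 if the first offender is an fp16 tensor, would at least localize the problem. The tensor name in the comment comes from the description above; everything else is an assumption:

import torch

# Rough diagnostic sketch: report non-finite values in an intermediate tensor.
# If the first offender is float16, retrying the run in float32 is a plausible
# next experiment.
def report_nonfinite(name: str, t: torch.Tensor) -> None:
    n_bad = (~torch.isfinite(t)).sum().item()
    if n_bad:
        print(f"{name}: {n_bad}/{t.numel()} non-finite values (dtype={t.dtype})")

# e.g. inside the denoising loop:
# report_nonfinite("noise_pred_text", noise_pred_text)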

token [3293] not found in text

OS: Windows 11, CUDA, Jupyter Notebook

This issue happens when running test.py (copied below; it is almost the same as "Basic Usage").

  1. The token (type: read) is set in .env
  2. .env is in the same folder as test.py
  3. CUDA & cuDNN are installed correctly
  4. A simple diffusers script runs fine

I don't know why this script can't get the token from the .env file, or whether there is some other issue.

#test.py

import math
import os
import dotenv
from PIL import Image
from paint_with_words import paint_with_words

settings = {
    "color_context": {
        (7, 9, 175): "sky,1.0",
        (145, 177, 102): "moon,1.0",
        (98, 190, 214): "mountain,0.2",
        (90, 161, 58): "ground,0.2",
        (90, 102, 246): "lake,0.2"
    },
    "color_map_img_path": "input.png",
    "input_prompt": "realistic photo of a dog, cat, tree, with beautiful sky, on sandy ground",
    "output_img_path": "/output.png",
}

if __name__ == "__main__":
    
    try:
        
        dotenv.load_dotenv()

        color_map_image = Image.open(settings["color_map_img_path"]).convert("RGB")
        color_context = settings["color_context"]
        input_prompt = settings["input_prompt"]

        img = paint_with_words(
            color_context=color_context,
            color_map_image=color_map_image,
            input_prompt=input_prompt,
            num_inference_steps=30,
            guidance_scale=7.5,
            device="cuda:0",
            weight_function=lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma) * qk.max(),
        )

        img.save(settings["output_img_path"])


    except Exception as e:
        print(e)

#output
CompVis/stable-diffusion-v1-4
CompVis/stable-diffusion-v1-4
Some weights of the model c....
......
.....
...
..
.
token [3293] not found in text
token [3293] not found in text
#diffusers simple script

import os

from diffusers import StableDiffusionPipeline
from torch import autocast

TOKEN = os.getenv("HF_TOKEN")

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=TOKEN,
).to("cuda:0")

prompt = "cute cat with colorful eye, jumping on the grass."
with autocast("cuda"):
    images = pipe(prompt, guidance_scale=7.5).images
images[0].save("output.png")

Commas, Periods and Tokens

This isn't technically an issue, but I thought posting it here could help alleviate a problem some people might face.

Because of the way Paint with Words is implemented, you cannot use commas in your prompt, for the reasons mentioned in the description of this GitHub. This can be a problem, since commas are commonly used to separate tokens in a prompt. A prompt without commas can therefore have drastically fewer tokens, and the generated image can be both drastically less detailed and of drastically worse quality, since the model has less information to work with. In my testing without PWW enabled, the prompt I was testing dropped from 52 to 34 tokens after removing the commas, and the output was far lower quality.

However, simply replacing commas with periods seems to work perfectly well as a substitute, and results in the same number of tokens as when using commas. There is a tiny difference in very small details of the generation (clouds and other background details can move slightly, along with other small, inconsequential changes), but with the same seed the generated image is almost exactly the same, and the quality is about equal.
Importantly, with this change you can also put the exact same text, with periods, into the PWW obj field without it throwing an error the way it does with commas.

I thought I would bring this up since it isn't mentioned in the description of this GitHub; the description simply says "don't use commas". Changing it to say "replace commas with periods in both the prompt and the obj field" would be much more helpful to people who don't know about this workaround.
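A minimal sketch of this workaround (the helper name is made up; the "{label},{strength}" color-context format is the one used throughout the examples here):

# Hedged sketch: swap commas for periods in the prompt, and in the label part
# of each color-context value, while keeping the trailing ",{strength}" intact.
def periodize(prompt, color_context):
    new_prompt = prompt.replace(",", ".")
    new_context = {}
    for rgb, value in color_context.items():
        label, strength = value.rsplit(",", 1)
        new_context[rgb] = f"{label.replace(',', '.')},{strength}"
    return new_prompt, new_context

prompt, context = periodize(
    "a dog. a cat. a tree. with beautiful sky. on sandy ground",
    {(0, 0, 0): "cat,1.0", (255, 255, 255): "dog,1.0"},
)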

Text and control input alignment

Thank you for your great work!

I have a question about the ControlNet extension. It seems the text is spatially aligned with the latent embeddings originally from SD, but how is the spatial alignment between the text and the geometric control (e.g. a scribble) done?

Reading through the code here, I think there is no alignment between the text embeddings and the geometric control embeddings. Am I right?

Thank you!

About porting to the diffusers pipeline

Thank you very much for implementing this awesome paint-with-words technique based on Stable Diffusion.

I am in the process of porting this project to diffusers as a custom pipeline, so that it can be used more easily by a wider variety of people.

Once the port is complete, I will clearly indicate that my implementation is based on yours. Would you allow me to do so?
