sayakpaul commented on July 20, 2024

Cc: @fabiorigano @asomoza

asomoza commented on July 20, 2024

> Is it possible to use the style and layout from 2 reference images with a single IP Adapter?

If you want to use the style of one image and the layout of the other, you'll need to load two IP Adapters; if you pass multiple images to just one IP Adapter, it will grab the combined features of both of them.

You shouldn't be able to pass a list of scales to a single IP Adapter, so I think we're missing a check there.
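
A minimal sketch of that two-adapter approach, assuming an SDXL pipeline is already set up; the checkpoint names and block choices are illustrative, not a definitive recipe:

# Load the same IP Adapter weights twice so each copy gets its own
# reference image and its own scale.
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors"] * 2,
    image_encoder_folder="models/image_encoder",
)

# InstantStyle-like block dicts: the first adapter scales only the style
# blocks, the second only the layout blocks (block choices vary per model).
scale_style = {"up": {"block_0": [0.0, 1.0, 0.0]}}
scale_layout = {"down": {"block_2": [0.0, 1.0]}}
pipeline.set_ip_adapter_scale([scale_style, scale_layout])

# One reference image per adapter: style from the first, layout from the second.
images = pipeline(
    prompt="a cat, best quality",
    ip_adapter_image=[style_image, layout_image],
    num_inference_steps=30,
).images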

darshats commented on July 20, 2024

I think there is an issue with the scale function. The docs show this syntax in the context of using two masks:
pipeline.set_ip_adapter_scale([[0.7, 0.7]])

however, as @chrismaltais notes above (and I got the same error), if we do this:
pipeline.set_ip_adapter_scale([[layout, style]])

we get the error:
TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'

So a block specification is not allowed, but scalar values are?

asomoza commented on July 20, 2024

Oh yeah, you're right. The dict (block) scaling was added later with InstantStyle and affects the IP Adapter attention layers; the list of scale values (floats) was added to allow a different scale for each image.

I can see why this gets confusing really fast, so maybe we need to improve the docs?

  • You can't use one IP Adapter with two images where one provides the style and the other the layout.
  • If you pass a dict with the blocks, or a list of dicts, you're using InstantStyle.
  • If you pass a list of floats, you're setting a scale for each loaded IP Adapter.
  • If you pass a list of lists of floats, you need to pass multiple images, and you're setting the scale for each image.
  • You can't pass a list of a list of dicts, because that would be trying to set the scale of the attention layers per image (see the sketch after this list).
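
A quick sketch of those forms, assuming the adapters are already loaded; the values are illustrative, not recommendations:

# One IP Adapter loaded:
pipeline.set_ip_adapter_scale(0.6)           # single float: one global scale
pipeline.set_ip_adapter_scale({              # dict of blocks: InstantStyle scaling
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
})
pipeline.set_ip_adapter_scale([[0.7, 0.7]])  # list of lists of floats: one scale
                                             # per image (pass two images)

# Two IP Adapters loaded:
pipeline.set_ip_adapter_scale([0.7, 0.5])    # list of floats: one scale per adapter

# Not supported: a list of a list of dicts (per-image block scaling);
# this is what fails at inference with the TypeError above.
# pipeline.set_ip_adapter_scale([[layout, style]])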

I think this is correct, but can you please confirm, @fabiorigano?

cc: @stevhliu

darshats commented on July 20, 2024

Isn't that a code bug though, that scalars are possible but not an InstantStyle specification in a nested list? From what I understood, the block specification in the default case is equivalent to a scalar scale of 1, but also permits a finer-grained spec. It does look like an InstantStyle parsing issue.

whiterose199187 commented on July 20, 2024

Hello,

It would be great to get guidance on how to use IP Adapter masks. I'm getting some unpredictable results with IP Adapter: the output is sometimes just one person with both identities blended together. Please advise if I'm doing something incorrect.

Thanks in advance.

Input images: ip_mask_girl2, ip_mask_girl1

Result: (generated image)

Code:

import torch
from diffusers import AutoencoderKL, LCMScheduler, StableDiffusionPipeline
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K").to(dtype=torch.float16)
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"
pipeline = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    torch_dtype=torch.float16,
    vae=vae,
    image_encoder=image_encoder,
    safety_checker=None,
).to("cuda")
pipeline.load_lora_weights(lcm_lora_id)
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name=["ip-adapter-plus-face_sd15.bin"], image_encoder_folder=None)
pipeline.set_ip_adapter_scale([[0.9, 0.9]])

# Load and preprocess masks
mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

output_height = 512
output_width = 512

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]

# face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
# face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

# these are same as above but resized to 512x512
face_image1 = load_image("/content/ip_mask_girl1.png")
face_image2 = load_image("/content/ip_mask_girl2.png")

ip_images = [[face_image1, face_image2]]

# Set generator
generator = torch.Generator(device="cpu").manual_seed(1480)
prompts = ["2 girls"]
negative_prompt="(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), black and white, text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"


# Run pipeline
images = pipeline(
    prompt=prompts,
    ip_adapter_image=ip_images,
    negative_prompt=[negative_prompt],
    num_inference_steps=10, num_images_per_prompt=3,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": masks},
    strength=0.45,  # note: ignored by the text-to-image StableDiffusionPipeline
    width=512,
    height=512,
    guidance_scale=2.0,
).images

asomoza commented on July 20, 2024

Hi, you're using scales that are too high. At most you should use 0.7, but ideally 0.5; the higher the scale, the more likely you are to get one person with blended identities.

The values in the docs are examples; you should use ones better suited to your use case. The docs example uses SDXL, which has a higher resolution and, IMO, understands the input from IP Adapters better; the masks are also more precise at a 1024x1024 resolution.

Other issues I found in your code, if you're interested:

  • The prompt is too simple; with "2 girls" on a realistic model you're not giving it much to work with.
  • You're using the same mask, the same prompt, higher scales, and "non-realistic images" with a model that was trained only on realistic images.
  • Your negative prompt also uses weighting syntax, like (anime, ...., :1.4), which is just noise with default diffusers (see the sketch after this list).
  • On top of all that, you're using LCM with low guidance and fewer steps, so you're giving the model even less room to work with.
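
On the weighting point: default diffusers doesn't parse (word:1.4)-style emphasis from the prompt string; a library such as Compel is the usual way to get weighted prompts. A minimal sketch for an SD 1.5 pipeline (the prompt text is illustrative):

from compel import Compel

# Build weighted embeddings instead of putting (word:1.4) in the string;
# Compel uses a (phrase)weight syntax.
compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
prompt_embeds = compel("two girls standing in a park, photorealistic")
negative_embeds = compel("(deformed iris, cartoon, anime)1.4, blurry, bad anatomy")
# Pad both tensors to the same token length so the pipeline accepts them
[prompt_embeds, negative_embeds] = compel.pad_conditioning_tensors_to_same_length(
    [prompt_embeds, negative_embeds]
)

images = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=10,
    guidance_scale=2.0,
).images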

I think the results you got are really good if we take all this into account and I'm kind of surprised that you got them.

whiterose199187 commented on July 20, 2024

hi @asomoza

Thank you for the detailed feedback. I will incorporate your suggestions going forward. To give details on some of the points:

I did have a more verbose prompt with realistic images, but didn't share those to preserve the privacy of the subjects involved, and tried to reproduce the issue with the documentation examples for this report. Even then, I had to try a few times to get this result (blended identity); it was fine for the initial few tests.

For my use case, I decided to use an openpose ControlNet for both subjects; so far I haven't seen this problem when I clearly segregate the subjects with ControlNet.

One question on scale: does a higher/lower scale impact the likeness of the result to the input images?

Thanks again for taking the time to provide this feedback! :)

asomoza commented on July 20, 2024

Yeah, using ControlNet really helps with this; I can even generate a group of people with each one having different characteristics or even styles.

> does higher/lower scale impact the likeness of the result to input images?

Yes, the scale affects the likeness, but it all depends on the type of IP Adapter and the image. The plus IP Adapters are a lot stronger, so you'll need to lower the scale. For faces, if you're going to use a plus-face IP Adapter, you can also use a separate mask for each face and give each one a higher scale to improve the likeness.

So I recommend using ControlNet. I much prefer something like MistoLine with the contours of the people, a plus IP Adapter with masks for each person at a lower scale, and a face IP Adapter with face masks for each one at higher scales; a rough sketch is below.
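
A rough sketch of that combination, assuming SDXL; the checkpoint names are illustrative, and the image/mask wiring follows the same pattern as the code above:

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from transformers import CLIPVisionModelWithProjection

# ViT-H image encoder used by the "plus" SDXL adapters
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)
controlnet = ControlNetModel.from_pretrained(
    "TheMistoAI/MistoLine", torch_dtype=torch.float16
)
pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
).to("cuda")

# One general "plus" adapter at a lower scale, one face adapter at higher scales
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name=[
        "ip-adapter-plus_sdxl_vit-h.safetensors",
        "ip-adapter-plus-face_sdxl_vit-h.safetensors",
    ],
    image_encoder_folder=None,  # encoder was already passed to the pipeline
)
# Per-adapter, per-person scales: [plus adapter, face adapter]
pipeline.set_ip_adapter_scale([[0.5, 0.5], [0.8, 0.8]])

At call time, ip_adapter_image would then be a list of two image lists (one per adapter) and ip_adapter_masks a list of two preprocessed mask batches.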
