Cc: @fabiorigano @asomoza
from diffusers.
Is it possible to use the style and layout from 2 reference images with a single IP Adapter?
If you want to use the style of one image and the layout of the other, you'll need to load two IP Adapters; if you pass multiple images to a single IP Adapter, it will grab the combined features of both.
You shouldn't be able to pass a list of scales to a single IP Adapter, so I think we're missing a check there.
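For the two-adapter case, the per-adapter scale entries might look like this. A minimal sketch, assuming a pipeline with two copies of the same IP Adapter checkpoint already loaded (e.g. via `load_ip_adapter` with a two-element `weight_name` list); the block names follow the InstantStyle convention from the diffusers docs, and `validate_scales` is a hypothetical helper for illustration, not a diffusers API:

```python
# One adapter restricted to the style block, the other to the layout block
# (InstantStyle-style per-block dicts).
STYLE_ONLY = {"up": {"block_0": [0.0, 1.0, 0.0]}}  # style is injected in the up blocks
LAYOUT_ONLY = {"down": {"block_2": [0.0, 1.0]}}    # layout is injected in the down blocks

def validate_scales(scales, num_adapters):
    """set_ip_adapter_scale expects one entry (float or per-block dict) per loaded adapter."""
    if len(scales) != num_adapters:
        raise ValueError(f"expected {num_adapters} scale entries, got {len(scales)}")
    return scales

# Intended usage (pipeline setup omitted):
# pipeline.set_ip_adapter_scale(validate_scales([STYLE_ONLY, LAYOUT_ONLY], 2))
# images = pipeline(prompt=..., ip_adapter_image=[style_image, layout_image]).images
scales = validate_scales([STYLE_ONLY, LAYOUT_ONLY], num_adapters=2)
```

The point is that each loaded adapter gets exactly one scale entry, and each adapter sees its own reference image.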
I think there is an issue with the scale function. The docs show this syntax in the context of using two masks:
pipeline.set_ip_adapter_scale([[0.7, 0.7]])
However, as @chrismaltais notes above (and I got the same error), if we do this:
pipeline.set_ip_adapter_scale([[layout, style]])
we get the error:
TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'
So the block specification is not allowed, but scalar values are?
Oh yeah, you're right. The dict (block) scaling was added later with InstantStyle and affects the IP Adapter attention layers; the list of scale values (floats) was added to be able to set a different scale for each image.
I can see why this gets confusing really fast, so maybe we need to improve the docs?
- You can't use one IP Adapter for two images where you want to use one as the style and the other as the layout.
- If you use a dict with the blocks, or a list of dicts, you're using InstantStyle.
- If you pass a list of floats, you're setting a scale for each IP Adapter.
- If you pass a list of lists of floats, you need to pass multiple images, and you're setting the scale for each image.
- You can't pass a list of lists of dicts, because that would mean setting the scale of the attention layers for each image.
I think this is correct, but can you confirm please, @fabiorigano?
cc: @stevhliu
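The rules above can be sketched as a small classifier. This is a hypothetical illustration of the taxonomy, not the actual diffusers validation code; `classify_scale` is an invented name:

```python
def classify_scale(scale):
    """Map a set_ip_adapter_scale argument to the behaviour it selects (sketch)."""
    if isinstance(scale, (int, float)):
        return "single scale for one IP Adapter"
    if isinstance(scale, dict):
        return "InstantStyle: per-block scales for one IP Adapter"
    if isinstance(scale, list):
        if all(isinstance(s, (int, float)) for s in scale):
            return "one scale per IP Adapter"
        if all(isinstance(s, dict) for s in scale):
            return "InstantStyle: per-block scales per IP Adapter"
        if all(isinstance(s, list) and all(isinstance(v, (int, float)) for v in s)
               for s in scale):
            return "one scale per image (requires multiple images)"
        if any(isinstance(s, list) and any(isinstance(v, dict) for v in s)
               for s in scale):
            return "invalid: per-image block scales are not supported"
    return "invalid"
```

Under this reading, `[[0.7, 0.7]]` is per-image scales, while `[[layout, style]]` (dicts nested in a list) falls into the unsupported case that currently surfaces as a TypeError instead of a clear message.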
Isn't that a code bug though, that scalars are possible but an InstantStyle specification in a nested list is not? From what I understood, the block specification in the default case is equivalent to a scalar config of 1, but also permits a finer-grained spec. It does look like an InstantStyle parsing issue.
Hello,
It would be great to get guidance on how to use IP Adapter masks. I am getting some unpredictable results with IP Adapter: the output is sometimes just one person with both identities sort of blended together. Please advise if I'm doing something incorrectly.
Thanks in advance.
Input Images:
Result:
Code:
import torch
from diffusers import AutoencoderKL, LCMScheduler, StableDiffusionPipeline
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K").to(dtype=torch.float16)
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"
pipeline = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    torch_dtype=torch.float16,
    vae=vae,
    image_encoder=image_encoder,
    safety_checker=None,
).to("cuda")
pipeline.load_lora_weights(lcm_lora_id)
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name=["ip-adapter-plus-face_sd15.bin"], image_encoder_folder=None)
pipeline.set_ip_adapter_scale([[0.9, 0.9]])
# Load and preprocess masks
mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")
output_height = 512
output_width = 512
processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]
# face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
# face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")
# these are same as above but resized to 512x512
face_image1 = load_image("/content/ip_mask_girl1.png")
face_image2 = load_image("/content/ip_mask_girl2.png")
ip_images = [[face_image1, face_image2]]
# Set generator
generator = torch.Generator(device="cpu").manual_seed(1480)
prompts = ["2 girls"]
negative_prompt="(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), black and white, text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
# Run pipeline
images = pipeline(
    prompt=prompts,
    ip_adapter_image=ip_images,
    negative_prompt=[negative_prompt],
    num_inference_steps=10,
    num_images_per_prompt=3,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": masks},
    strength=0.45,
    width=512,
    height=512,
    guidance_scale=2.0,
).images
Hi, you're using scales that are too high; use at most 0.7, and ideally 0.5. The higher the scale, the more likely you are to get one person with both identities blended.
The values in the docs are examples and you should use something better suited to your use case. The docs example uses SDXL, which has a higher resolution and, in my opinion, understands the input from IP Adapters better; the masks are also more precise at a 1024x1024 resolution.
Other issues I found in your code if you're interested:
- The prompt is too simple; with "2 girls" on a realistic model you're not giving it much to work with.
- You're using the same mask, the same prompt, higher scales, and "non-realistic images" with a model that was trained only on realistic images.
- Your negative prompt uses weighting too, like (anime, ...., :1.4), which is just noise with default diffusers.
- On top of all that, you're using LCM with low guidance and fewer steps, so you're giving the model even less room to work with.
I think the results you got are really good if we take all this into account and I'm kind of surprised that you got them.
hi @asomoza
Thank you for the detailed feedback. I will incorporate your suggestions going forward. To give details on some of the points:
I did have a more verbose prompt with realistic images, but did not share those to preserve the privacy of the subjects involved, and tried to reproduce the issue with the documentation examples for this report. Even then, I had to try a few times to get this result (blended identity); it was fine for the initial few tests.
For my use case I decided to use an OpenPose ControlNet for both subjects; so far I have not seen this problem when I clearly segregate the subjects with ControlNet.
One question on scale: does a higher/lower scale impact the likeness of the result to the input images?
Thanks again for taking the time to provide this feedback! :)
Yeah, using ControlNet really helps with this; I can even generate a group of people, each with different characteristics or even styles.
does higher/lower scale impact the likeness of the result to input images?
Yes, the scale affects the likeness, but it all depends on the type of IP Adapter and the image. The plus IP Adapters are a lot stronger, so you'll need to lower the scale. For faces, if you're going to use a plus-face IP Adapter, you can also use a separate mask for each face and give each one a higher scale to improve the likeness.
So I recommend using ControlNet (I much prefer something like MistoLine with the contours of the people), a plus IP Adapter with a mask per person at a lower scale, and a face IP Adapter with a face mask per person at higher scales.
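That recommended setup boils down to some input bookkeeping: one image list, one mask list, and one scale per loaded IP Adapter, with the face adapter scaled higher than the general "plus" adapter. A sketch, where `build_ip_adapter_inputs` is a hypothetical helper for illustration (not a diffusers API) and the file names are placeholders:

```python
def build_ip_adapter_inputs(adapters):
    """Pair each IP Adapter with its reference images, masks, and scale."""
    images, masks, scales = [], [], []
    for spec in adapters:
        # Each reference image needs a matching mask for its subject.
        if len(spec["images"]) != len(spec["masks"]):
            raise ValueError(f"{spec['name']}: need one mask per reference image")
        images.append(spec["images"])
        masks.append(spec["masks"])
        scales.append(spec["scale"])
    return images, masks, scales

images, masks, scales = build_ip_adapter_inputs([
    {"name": "ip-adapter-plus_sd15",       # overall appearance, lower scale
     "images": ["person1.png", "person2.png"],
     "masks": ["mask1.png", "mask2.png"], "scale": 0.5},
    {"name": "ip-adapter-plus-face_sd15",  # face likeness, higher scale
     "images": ["face1.png", "face2.png"],
     "masks": ["face_mask1.png", "face_mask2.png"], "scale": 0.8},
])
```

The resulting `scales` list would go to `set_ip_adapter_scale`, and the image and mask lists to the pipeline call, alongside the ControlNet conditioning.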