MAAE: Mask-Guided Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Jiaqi Wu

Advisor: Prof. Eytan Adar

Description

This work is built upon Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models (SIGGRAPH 2023), referred to below as AAE.

Highlights:

The method generates images guided by multi-object prompts while following a soft mask layout supplied by the user. In other words, the input masks do not need to be precise; they only need to indicate roughly where each object should be. In some cases, the method also produces reasonable results where the original AAE and Stable Diffusion fail to align with the spatial relationship described in the prompt (e.g., a monkey riding a tiger).
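As an illustration of how coarse such a layout can be, the sketch below builds rough rectangular masks for two tokens. The helper `box_mask`, the constant `ATTN_RES`, the 16x16 resolution, and the box coordinates are all assumptions made for this example; they are not taken from the project's code.

```python
import numpy as np

def box_mask(size, top, left, bottom, right):
    """Return a size x size binary mask with ones inside a coarse box.

    Coordinates are fractions of the image side, in [0, 1].
    """
    mask = np.zeros((size, size), dtype=np.float32)
    r0, r1 = int(round(top * size)), int(round(bottom * size))
    c0, c1 = int(round(left * size)), int(round(right * size))
    mask[r0:r1, c0:c1] = 1.0
    return mask

# Hypothetical layout for the prompt "a monkey riding a tiger":
# the monkey roughly in the upper middle, the tiger roughly in the lower half.
ATTN_RES = 16  # assumed attention-map resolution
masks = {
    "monkey": box_mask(ATTN_RES, 0.10, 0.30, 0.55, 0.70),
    "tiger": box_mask(ATTN_RES, 0.45, 0.10, 0.95, 0.90),
}
```

The two boxes overlap slightly on purpose: a soft layout only hints at placement, so the masks do not have to partition the image.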

Research question: How can we enable soft layout conditions on multiple objects in Stable Diffusion?

Contributions:

• Built a new algorithm upon Attend-and-Excite (Attention-Based Semantic Guidance for Text-to-Image Diffusion Models)

• Introduced a new loss function that directly regularizes the attention maps with the input masks at the inference stage

• The new architecture can generate images guided by multi-object prompts and follow the mask layout
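For context, Attend-and-Excite applies its guidance at inference time by shifting the noised latent along the negative gradient of a loss on the cross-attention maps, without updating any model weights. Assuming MAAE keeps this mechanism and only replaces the loss with the mask-guided one, a single guidance step at timestep t can be written as below. The symbols z_t (latent), \eta_t (step size), and \mathcal{L}_{\text{MAAE}} (mask-guided loss) are notation chosen for this sketch.

```latex
z_t' \;\leftarrow\; z_t \;-\; \eta_t \,\nabla_{z_t}\, \mathcal{L}_{\text{MAAE}}
```

The denoising step then proceeds from z_t' instead of z_t.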

Method

We provide an additional binary mask input for selected key tokens of the prompt. The intuition is to regularize the attention map of each token so that it is as close to its mask as possible. Here is how we construct our new loss function:

1. We want to maximize the attention values inside the mask (we call the matrix that keeps only these values Inner) and minimize the values outside the mask (we call the matrix that keeps only these values Outer).

To compute this, for each attention map we compute the matrix Diff = Inner - Outer and use the value mean(Diff) in our loss function, so that attention inside the mask is rewarded and attention outside it is penalized.

2. We also want the attention map to be regularized by the edges of the masks. To do this, we compute the edge matrix of each attention map.

In our loss function, we maximize the edge matrix by minimizing the value -sum(edge matrix).

Putting these together, our loss function combines the mask term based on mean(Diff) with the edge term -sum(edge matrix).

Note that in practice the edge term can be dropped when edge regularization is not important for the task at hand. α is a hyperparameter, usually set to 100.
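The two terms described above can be sketched for a single token as follows. This is a minimal NumPy illustration, not the project's implementation, and it rests on several assumptions: the mask term enters the loss as -mean(Diff), so that minimizing the loss raises Inner and lowers Outer; α scales the mask term; and the edge matrix is the finite-difference gradient magnitude of the attention map. The function name `mask_guided_loss` and the flag `use_edge` are hypothetical.

```python
import numpy as np

def mask_guided_loss(attn, mask, alpha=100.0, use_edge=True):
    """Toy mask-guided loss for one token (lower is better).

    attn : (H, W) attention map with values in [0, 1]
    mask : (H, W) binary mask, 1 where the object should appear
    """
    inner = attn * mask          # attention kept inside the mask
    outer = attn * (1.0 - mask)  # attention kept outside the mask
    diff = inner - outer
    # Assumed sign and weighting: minimizing -alpha * mean(Diff)
    # raises Inner and lowers Outer.
    loss = -alpha * diff.mean()
    if use_edge:
        # Assumed edge matrix: gradient magnitude of the attention map.
        gy, gx = np.gradient(attn)
        edge = np.sqrt(gx ** 2 + gy ** 2)
        loss += -edge.sum()
    return float(loss)

# Usage: attention concentrated inside the mask should score lower
# than attention concentrated outside it.
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0
attn_inside = 0.9 * mask
attn_outside = 0.9 * (1.0 - mask)
loss_inside = mask_guided_loss(attn_inside, mask)
loss_outside = mask_guided_loss(attn_outside, mask)
```

Passing `use_edge=False` drops the edge term, which mirrors the note that edge regularization is optional. In a real pipeline the same computation would have to run in an autodiff framework so that gradients can flow back to the latent.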

Full architecture overview

[Figure: MAAE overview]

[Figure: original AAE overview from the paper, for comparison]

Limitations

Although this is an efficient and effective way to apply mask-guided layout in Stable Diffusion, and it requires no training, there are several limitations:

• The method is not suitable for tasks that require a precise mask or shape layout.

• In some cases, the images generated by MAAE have worse quality than those from the original AAE or Stable Diffusion.

• The method also inherits Stable Diffusion's poor performance on certain prompts (such as unusual prompts).

• The method sometimes fails to generate reasonable images because of the limits of inference-stage optimization; changing the seed can partly solve this problem.

