kolvacs-w / mask-guided-attention-semantic-guidance-for-text-to-image-diffusion-models

The method generates images guided by multi-object prompts while following a soft mask layout condition supplied by the user. This work builds on Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models (SIGGRAPH 2023).

MAAE: Mask-Guided Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Jiaqi Wu

Advisor: Prof. Eytan Adar

Description

This work builds on Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models (SIGGRAPH 2023).

Highlights:

The method can generate images guided by multi-object prompts while following a soft mask layout condition supplied by the user. In other words, the input mask images do not need to be precise; they only need to indicate roughly where each object should be. In some cases, the method also produces reasonable results where the original Attend-and-Excite (AAE) and Stable Diffusion fail to align with the spatial relationship specified in the prompt (e.g., "a monkey riding a tiger").

Research question: How can we enable soft conditions on the layout of multiple objects in Stable Diffusion?

Contributions:

• Built a new algorithm on top of Attend-and-Excite (Attention-Based Semantic Guidance for Text-to-Image Diffusion Models)

• Introduced a new loss function that directly regularizes attention maps with the input masks at the inference stage

• The new architecture can generate images guided by multi-object prompts while following the mask layout

Method

We provide an additional binary mask input for selected key tokens of the prompt. The intuition is to regularize the attention map of each token so that it is as close to its mask as possible. Here is how we construct our new loss function:

1. We want to maximize the values inside the mask (we call the matrix keeping only these values Inner) and minimize the values outside the mask (we call the matrix keeping only these values Outer).

[Figure: Inner and Outer matrices]

To compute this, for each attention map we compute the matrix Diff = Inner - Outer, and try to minimize the value mean(Diff) in our loss function.
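As a minimal sketch of this step, the following computes Inner, Outer, and mean(Diff) for a single attention map. The function name `masked_diff_mean` and the toy values are illustrative and not taken from the repository. Plain Python lists stand in for the tensors a real pipeline would use, and the mask is assumed to contain only 0s and 1s.

```python
def masked_diff_mean(attn, mask):
    """Return mean(Inner - Outer) for one attention map and its binary mask.

    attn: 2D list of attention values for one token.
    mask: 2D list of 0/1 values with the same shape (assumed binary).
    """
    rows, cols = len(attn), len(attn[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            inner = attn[i][j] * mask[i][j]        # value kept only inside the mask
            outer = attn[i][j] * (1 - mask[i][j])  # value kept only outside the mask
            total += inner - outer                 # one entry of Diff
    return total / (rows * cols)


# Toy 2x2 example: the mask covers the bottom row.
attn = [[0.1, 0.2],
        [0.3, 0.4]]
mask = [[0, 0],
        [1, 1]]
score = masked_diff_mean(attn, mask)  # about 0.1 for this toy input
```

A larger value means that more of the token's attention falls inside the mask than outside it. How this value enters the final loss (its sign and scaling) is set by the loss equation in the original figure and is not assumed here.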

2. We also want the attention map to be regularized by the edges of the masks. To do this, we compute the edge matrix of each attention map:

[Figure: edge matrix of an attention map]

In our loss function, we also try to maximize the edge matrix by minimizing the value -sum(edge matrix).
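The exact definition of the edge matrix is given only in the figure, so the sketch below rests on an assumption: the edge matrix keeps the attention values at mask pixels that touch at least one non-mask pixel (4-neighbourhood). The helper names `mask_edge` and `edge_term` are hypothetical, and the repository may compute the edge differently.

```python
def mask_edge(mask):
    """Mark mask pixels that have at least one 4-neighbour outside the mask.

    Assumption: this is one plausible definition of "edge of the mask";
    neighbours beyond the image border are ignored.
    """
    rows, cols = len(mask), len(mask[0])
    edge = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if mask[i][j] != 1:
                continue
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols and mask[ni][nj] == 0:
                    edge[i][j] = 1
                    break
    return edge


def edge_term(attn, mask):
    """Return -sum(edge matrix), where the edge matrix is attn restricted to the mask edge."""
    edge = mask_edge(mask)
    return -sum(
        attn[i][j] * edge[i][j]
        for i in range(len(attn))
        for j in range(len(attn[0]))
    )


# Toy 5x5 example: a 3x3 square mask in the centre, uniform attention of 0.5.
square_mask = [[1 if 1 <= i <= 3 and 1 <= j <= 3 else 0 for j in range(5)] for i in range(5)]
uniform_attn = [[0.5] * 5 for _ in range(5)]
edge_pixels = mask_edge(square_mask)
term = edge_term(uniform_attn, square_mask)
```

In this toy case the eight border pixels of the square are marked as edge and the centre pixel is not, so minimizing the term pushes attention toward the outline of the mask.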

Putting these together, our loss function is:

[Figure: combined loss function]

Note that in the actual implementation, the edge part of the loss function can be dropped, depending on whether edge regularization matters for the specific task. α is a hyperparameter that is usually set to 100.
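The following sketch shows one way the two terms and the optional edge switch could be combined. It is not the author's equation, which appears only in the figure. Two things are assumed here and should be checked against the repository: that α scales the inside/outside term, and that this term is negated so the loss falls as more attention lands inside the mask. The name `total_loss` and its parameters are hypothetical.

```python
def total_loss(diff_means, edge_sums, alpha=100.0, use_edge=True):
    """Combine per-token terms into a single scalar loss (assumed form).

    diff_means: mean(Diff) for each selected token.
    edge_sums:  sum(edge matrix) for each selected token.
    alpha:      weight on the inside/outside term (assumed placement).
    use_edge:   set to False to drop the edge regularization.
    """
    if len(diff_means) != len(edge_sums):
        raise ValueError("expected one diff mean and one edge sum per token")
    loss = 0.0
    for diff_mean, edge_sum in zip(diff_means, edge_sums):
        loss += -alpha * diff_mean  # assumed sign and scaling of the Diff term
        if use_edge:
            loss += -edge_sum       # the "-sum(edge matrix)" term from the text
    return loss


# Two tokens with toy values.
with_edge = total_loss([0.1, 0.05], [2.0, 1.0])
without_edge = total_loss([0.1, 0.05], [2.0, 1.0], use_edge=False)
```

Exposing the edge term as a flag mirrors the note above that it can be left out when edge regularization is not important for the task.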

Full architecture overview

MAAE overview:

[Figure: MAAE architecture overview]

For comparison, the original AAE overview from the paper:

[Figure: original AAE architecture overview]

Limitations

Although this is an efficient and effective way to guide layout with masks in Stable Diffusion, and it requires no training cost, there are several limitations:

The method is not suitable for tasks that require a precise mask or shape layout.

In some cases, the images generated by MAAE are of lower quality than those from the original AAE or Stable Diffusion.

The method also inherits Stable Diffusion's poor performance on specific prompts (such as unusual prompts).

The method sometimes fails to generate reasonable images because of the limitations of inference-stage training, although changing the seed can partly solve this problem.

Result gallery

[Figure: generated results]
