StructuredDiffusion

Training-Free Structured Diffusion Guidance
for Compositional Text-to-Image Synthesis

¹UC Santa Barbara, ²UC Santa Cruz, ³Google

Abstract

We improve the compositional skills of text-to-image (T2I) models, specifically attribute binding accuracy and image composition. To achieve this, we incorporate linguistic structures into the diffusion guidance process, building on the controllability of cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meaning associated with object layouts and content. We can therefore better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies.

Results

Concept Conjunction: two objects with different colors.

General prompts with multiple objects or colors

Method

Cross Attention Control

The spatial layout of a generated image depends on the cross-attention maps: these maps control layout and structure, while the values contain rich semantics that are mapped into the attended regions. We assume that image layout and content can be disentangled by controlling the attention maps and the values separately (see Prompt-to-Prompt).
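To make the disentanglement assumption concrete, here is a minimal sketch of single-head cross-attention in plain Python. It is an illustration under stated assumptions, not the project's code: the toy queries, keys, and values are made up, and the helper names (`attention_map`, `cross_attention`) are hypothetical. It shows that the attention map depends only on queries and keys, so swapping the values changes the content written into each region without changing the layout.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_map(q, k):
    # M = softmax(Q K^T / sqrt(d)); one row per image position, one column per text token.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def cross_attention(q, k, v):
    # Returns the attention map (layout) and the attended output (content).
    m = attention_map(q, k)
    return m, matmul(m, v)

# Toy inputs (assumed values): 3 image positions, 2 text tokens.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[2.0, 0.0], [0.0, 2.0]]
v_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # original values
v_b = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]   # swapped-in values

map_a, out_a = cross_attention(q, k, v_a)
map_b, out_b = cross_attention(q, k, v_b)
print(map_a == map_b)  # same keys and queries give the same attention map
print(out_a == out_b)  # different values give a different output
```

In a real diffusion model the queries come from image latents and the keys and values from learned projections of the text encoder output; the sketch skips those projections to keep the point about maps versus values visible.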

Structured Diffusion Guidance

We fuse structured representations (e.g., a constituency tree or scene graph) into the guidance process by encoding the individual concepts (i.e., noun phrases) separately. Features from these individual concepts replace the corresponding features from the full input prompt, which enhances the semantics of each individual word. In each cross-attention layer, the keys are computed from the unmodified prompt features, while the values are computed from the multiple replaced features.
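The replacement step and the split between keys and values can be sketched as follows, again in plain Python. Several things here are assumptions made for illustration: the toy feature vectors, the token spans, the identity projection matrices, the helper names `replace_span` and `structured_cross_attention`, and in particular the choice to average the outputs over all value sequences (including the unmodified prompt). The text above only states that keys come from the unmodified prompt and values from the replaced features; see the paper for the exact formulation.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def replace_span(full_feats, phrase_feats, span):
    # Overwrite the token features in [start, end) with the separately encoded phrase.
    start, end = span
    if end - start != len(phrase_feats):
        raise ValueError("span length must match the phrase feature length")
    return full_feats[:start] + phrase_feats + full_feats[end:]

def structured_cross_attention(q, full_feats, value_feats_list, w_k, w_v):
    # Keys (and hence the attention map) come from the unmodified prompt features.
    k = matmul(full_feats, w_k)
    d = len(k[0])
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(q, transpose(k))]
    # One attended output per value sequence, then a simple average (assumed aggregation).
    outputs = [matmul(attn, matmul(feats, w_v)) for feats in value_feats_list]
    n = len(outputs)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / n for j in range(cols)] for i in range(rows)]

# Toy features (assumed values) for the tokens: red, car, white, sheep.
full = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.5, 0.5]]
red_car = [[1.0, 0.0], [0.9, 0.1]]       # "red car" encoded on its own
white_sheep = [[0.0, 1.0], [0.1, 0.9]]   # "white sheep" encoded on its own

replaced_1 = replace_span(full, red_car, (0, 2))
replaced_2 = replace_span(full, white_sheep, (2, 4))

identity = [[1.0, 0.0], [0.0, 1.0]]      # stand-in for learned projections
q = [[1.0, 0.0], [0.0, 1.0]]             # two image positions
out = structured_cross_attention(q, full, [full, replaced_1, replaced_2], identity, identity)
print(out)
```

Because every value sequence shares the same attention map, the layout is unchanged across them; only the content attached to each token span differs.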

Analysis

More to come. Please refer to the paper (appendix) for now.

Related Works

Stable Diffusion and Latent Diffusion Models

Compositional Visual Generation with Composable Diffusion Models

Prompt-to-Prompt Image Editing with Cross Attention Control

DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models

Citation


@article{feng2022training,
  title={Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis},
  author={Feng, Weixi and He, Xuehai and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun and Narayana, Pradyumna and Basu, Sugato and Wang, Xin Eric and Wang, William Yang},
  journal={arXiv preprint arXiv:2212.05032},
  year={2022}
}

Acknowledgements

This project is funded by an unrestricted gift from Google.
The website template was borrowed from Jon Barron.
