Training-Free Structured Diffusion Guidance
for Compositional Text-to-Image Synthesis

Abstract


We improve the compositional skills of text-to-image (T2I) models, specifically achieving more accurate attribute binding and better image compositions. To this end, we incorporate linguistic structures into the diffusion guidance process, building on the controllable properties of cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meaning associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional generation in both qualitative and quantitative evaluations, leading to a 5-8% advantage in head-to-head user comparison studies.

Results


Concept Conjunction: two objects with different colors.


General prompts with multiple objects or colors.


Method

Cross Attention Control


The spatial layout of a generated image depends on the cross-attention maps: these maps control the layout and structure of the image, while the values carry rich semantics that are mapped into the attended regions. We assume that image layout and content can therefore be disentangled by controlling the attention maps and the values separately (see Prompt-to-Prompt).
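To make this concrete, below is a minimal single-head cross-attention sketch in PyTorch. It is an illustration only, not the exact Stable Diffusion implementation: the class name, dimensions, and the omission of multi-head splitting are simplifications. It shows where the attention map (layout) and the values (content) enter the computation.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention, as used in diffusion U-Nets.

    Queries come from image (latent) features; keys and values come from
    the text-encoder output. The softmax(QK^T) map decides *where* each
    token attends (layout), while V decides *what* content is written there.
    """
    def __init__(self, img_dim, txt_dim, inner_dim):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(img_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(txt_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(txt_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, img_dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, num_pixels, img_dim); txt_feats: (batch, num_tokens, txt_dim)
        q = self.to_q(img_feats)
        k = self.to_k(txt_feats)
        v = self.to_v(txt_feats)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)  # layout: pixel -> token
        out = attn @ v                                                 # content: values written into attended regions
        return self.to_out(out)
```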

Structured Diffusion Guidance


We fuse structured representations (e.g., a constituency tree or scene graph) into the guidance process by encoding the individual concepts (i.e., noun phrases) separately. Features from these individual concepts replace the corresponding features from the full input prompt, which enhances the semantics of each individual word. In each cross-attention layer, the keys are computed from the unmodified prompt features, while the values are computed from the multiple replaced feature sequences, as sketched below.
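The following sketch illustrates this key/value separation inside one cross-attention layer. It assumes the `CrossAttention` module above and that the concept-replaced embedding sequences (`concept_embs`) have already been built by encoding each noun phrase separately and pasting its features back into the full-prompt sequence. The function name and the simple averaging over value sets are illustrative choices, not the exact released implementation.

```python
import torch

def structured_cross_attention(attn_layer, img_feats, full_prompt_emb, concept_embs):
    """Sketch of structured guidance in one cross-attention layer.

    attn_layer      -- a CrossAttention module as defined above
    full_prompt_emb -- embeddings of the unmodified prompt, (B, T, txt_dim)
    concept_embs    -- list of (B, T, txt_dim) sequences in which each noun
                       phrase's span is replaced by its separately encoded features

    The attention map (layout) is computed once from the full-prompt keys;
    the values come from each concept-replaced sequence, and the resulting
    outputs are averaged.
    """
    q = attn_layer.to_q(img_feats)
    k = attn_layer.to_k(full_prompt_emb)
    attn = (q @ k.transpose(-1, -2) * attn_layer.scale).softmax(dim=-1)  # shared layout

    outs = []
    for emb in [full_prompt_emb] + concept_embs:
        v = attn_layer.to_v(emb)          # values carry the enhanced concept semantics
        outs.append(attn @ v)
    out = torch.stack(outs).mean(dim=0)   # combine contributions from all value sets
    return attn_layer.to_out(out)
```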

Analysis

More to come. Please refer to the appendix of the paper for now.


Citation

@article{feng2022training,
  title={Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis},
  author={Feng, Weixi and He, Xuehai and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun and Narayana, Pradyumna and Basu, Sugato and Wang, Xin Eric and Wang, William Yang},
  journal={arXiv preprint arXiv:2212.05032},
  year={2022}
}

Acknowledgements

This project is funded by an unrestricted gift from Google.
The website template was borrowed from Jon Barron.