Training-Free Structured Diffusion Guidance
for Compositional Text-to-Image Synthesis
Abstract
We improve the compositional skills of text-to-image (T2I) models, specifically more accurate attribute binding and better image compositions. To achieve this, we incorporate linguistic structures into the diffusion guidance process, building on the controllable properties of cross-attention layers in diffusion-based T2I models. We observe that the keys and values in cross-attention layers carry strong semantic meanings associated with object layout and content. Therefore, we can better preserve the compositional semantics of the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in both qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies.
Results
Concept Conjunction: two objects with different colors.
General prompts with multiple objects or colors
Method
Cross Attention Control
The spatial layout of a generated image depends on the cross-attention maps. These maps control the layout and structure of the image, while the values carry rich semantics that are mapped into the attended regions. We assume that image layout and content can be disentangled by controlling the attention maps and the values separately (see Prompt-to-Prompt).
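A minimal sketch of a single cross-attention layer (PyTorch-style pseudocode, not our released implementation) makes this concrete: the attention map computed from queries and keys determines where each text token acts, while the values determine what content is written into those regions.

import torch
import torch.nn.functional as F

def cross_attention(image_feats, text_feats, W_q, W_k, W_v):
    # image_feats: (N, d_img) flattened spatial features; text_feats: (L, d_txt) token features
    q = image_feats @ W_q                                    # queries from image features
    k = text_feats @ W_k                                     # keys from text features
    v = text_feats @ W_v                                     # values from text features
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (N, L) attention map: controls layout
    return attn @ v                                          # values are painted into attended regions

Keeping attn fixed while substituting a different v changes what appears in a region without changing where it appears, which is exactly the disentanglement assumed above.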
Structured Diffusion Guidance
We fuse structured representations (e.g., a constituency tree or scene graph) into the guidance process by encoding the individual concepts (i.e., noun phrases) separately. Features from these individual concepts are used to replace features from the full input prompt, which strengthens the semantics of each individual concept. In each cross-attention layer, the keys are computed from the unmodified prompt features, while the values are computed from the multiple replaced feature sequences.
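The sketch below illustrates this replacement scheme under simplifying assumptions. The helpers encode_text and find_span are hypothetical placeholders for a CLIP-style text encoder and a token-span lookup (they are not functions from any specific library), and averaging the attended outputs is used here as one simple way to combine the multiple value sets.

import torch
import torch.nn.functional as F

def structured_cross_attention(image_feats, prompt, noun_phrases,
                               encode_text, find_span, W_q, W_k, W_v):
    full = encode_text(prompt)                       # (L, d_txt) full-prompt token features
    # Build one modified sequence per noun phrase: overwrite the phrase's token span
    # with the features of that phrase encoded in isolation.
    replaced = []
    for np_text in noun_phrases:
        start, end = find_span(prompt, np_text)      # token indices of the phrase in the prompt
        seq = full.clone()
        seq[start:end] = encode_text(np_text)[:end - start]
        replaced.append(seq)

    q = image_feats @ W_q
    k = full @ W_k                                   # keys (and thus the attention map) from the unmodified prompt
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)

    # Values come from each replaced sequence; the attended outputs are combined by averaging.
    outs = [attn @ (seq @ W_v) for seq in replaced]
    return torch.stack(outs).mean(dim=0)

Because only the values are swapped, the layout fixed by the attention map is preserved while each noun phrase contributes its own, undiluted semantics to the attended regions.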
Analysis
More to come. Please refer to the paper (appendix) for now.
Citation
@article{feng2022training,
  title={Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis},
  author={Feng, Weixi and He, Xuehai and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun and Narayana, Pradyumna and Basu, Sugato and Wang, Xin Eric and Wang, William Yang},
  journal={arXiv preprint arXiv:2212.05032},
  year={2022}
}
Acknowledgements
This project is funded by an unrestricted gift from Google.
The website template was borrowed from Jon Barron.