Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
Investigates the significance of self-attention maps and the potential drawbacks of cross-attention maps in Text-Guided Image Editing
The paper analyzes the attention layers of Stable Diffusion models in the context of Text-Guided Image Editing (TIE), asking how cross-attention and self-attention maps affect the effectiveness of editing. The authors study how modifying attention maps contributes to diffusion-based TIE, combining probe analysis with a systematic exploration of attention map modification across different blocks of the diffusion model to understand the mechanisms behind diffusion-based editing.
The paper finds that editing cross-attention maps is optional for image editing with diffusion models: replacing or refining cross-attention maps between the source and target image generation processes is dispensable and can even cause the edit to fail. Self-attention maps, by contrast, are what allow the edited image to retain the layout information and shape details of the original. Based on these findings, the authors propose a simplified and effective algorithm called Free-Prompt-Editing (FPE), which edits an image by replacing the self-attention map in specific attention layers during denoising and does not require a source prompt. Not needing a source prompt makes the method useful in real image editing scenarios.
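To make the idea of replacing a self-attention map concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: the function names, tensor sizes, and random inputs are assumptions chosen for illustration, and the summary does not say which layers or denoising steps FPE modifies. The sketch only shows the core operation, in which the attention map computed during the source pass is reused in the target pass while the target pass keeps its own values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v, injected_map=None):
    """Single-head self-attention over a sequence of image tokens.

    Returns (output, attention_map). If injected_map is given, it is used
    in place of the map computed from q and k (hypothetical helper, for
    illustration only).
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    if injected_map is not None:
        attn = injected_map
    return attn @ v, attn


# Toy sizes and random features; real layers use learned projections.
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
src_q, src_k, src_v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
tgt_q, tgt_k, tgt_v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

# Source pass: record the self-attention map (carries layout information).
_, src_map = self_attention(src_q, src_k, src_v)

# Target pass: reuse the source map, but keep the target values.
edited_out, used_map = self_attention(tgt_q, tgt_k, tgt_v, injected_map=src_map)
```

In this sketch the target pass ignores its own queries and keys whenever a map is injected, so the spatial mixing pattern comes entirely from the source while the content being mixed comes from the target.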