ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Author: NewsCrawler
Published: 3/11/2024, 3:27:55 PM
Category: Resource

An adapter that uses Large Language Models to improve text-to-image models' ability to follow complex, dense prompts.

Paper: https://arxiv.org/abs/2403.05135

Project page: https://ella-diffusion.github.io/

Code: https://github.com/ELLA-Diffusion/ELLA

Efficient Large Language Model Adapter (ELLA) is an adapter designed to improve how well text-to-image generation models understand their prompts. These models, while capable, often struggle with dense prompts that involve many objects, detailed attributes, and complex relationships. Many existing models rely on CLIP as their text encoder, which limits their ability to interpret such prompts.

ELLA addresses this limitation by connecting a powerful Large Language Model (LLM) to the existing framework without training either the U-Net or the LLM. It uses a Timestep-Aware Semantic Connector (TSC), which dynamically extracts relevant conditions from the pre-trained LLM at different stages of the denoising process.
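To make the idea of a timestep-dependent connector more concrete, here is a minimal, dependency-free sketch. It is not the architecture from the paper: the additive timestep modulation, the single attention step, the random "learned" queries, and every name in it (`connector`, `timestep_embedding`, `llm_features`, `queries`) are assumptions chosen for illustration. It only shows the general pattern of a small module that reads fixed LLM token features and returns different conditions depending on the timestep.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def connector(llm_features, queries, t):
    """Toy timestep-aware connector (illustrative assumption, not ELLA's TSC).

    llm_features: token vectors from a frozen LLM, computed once per prompt
    queries:      query vectors (trained in a real model; random here)
    t:            current denoising timestep
    Returns one condition vector per query.
    """
    dim = len(queries[0])
    t_emb = timestep_embedding(t, dim)
    conditions = []
    for q in queries:
        # Assumption: the timestep modulates each query additively.
        q_t = [qi + ti for qi, ti in zip(q, t_emb)]
        # Cross-attention: score every LLM token against the modulated query.
        scores = [dot(q_t, tok) / math.sqrt(dim) for tok in llm_features]
        weights = softmax(scores)
        # The condition is an attention-weighted average of the LLM token features.
        cond = [sum(w * tok[d] for w, tok in zip(weights, llm_features))
                for d in range(dim)]
        conditions.append(cond)
    return conditions


random.seed(0)
DIM = 8
llm_features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]

early = connector(llm_features, queries, t=900)  # noisy, early denoising stage
late = connector(llm_features, queries, t=100)   # late, detail-refining stage
print(len(early), len(early[0]))  # 3 8
```

The same frozen `llm_features` produce different conditions at `t=900` and `t=100`, because the timestep changes which tokens each query attends to. In this sketch only the queries would need training, which mirrors the claim that neither the U-Net nor the LLM is retrained.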

This approach adapts the semantic features over the course of sampling, which helps the model interpret long, intricate prompts across multiple timesteps. ELLA is also versatile: it can be integrated with various community models and tools to improve their prompt-following ability without extensive modification.
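The sampling-time behaviour described above can be sketched as a loop in which the prompt is encoded once and the condition is re-derived at every step. All names here (`sample`, `encode_prompt`, `connector`, `denoise_step`) and the toy stand-in functions are hypothetical placeholders, not the ELLA codebase or any real library API; the arithmetic inside the stand-ins is arbitrary and exists only to make the loop runnable.

```python
def sample(prompt, timesteps, encode_prompt, connector, denoise_step, latent):
    """Generic sampling loop with a per-timestep condition (illustrative only)."""
    llm_features = encode_prompt(prompt)        # frozen LLM: run once per prompt
    for t in timesteps:
        cond = connector(llm_features, t)       # condition re-derived each step
        latent = denoise_step(latent, cond, t)  # frozen denoiser consumes it
    return latent


# Toy stand-ins so the loop can run; they count how often each part is called.
calls = {"llm": 0, "connector": 0}


def toy_encode(prompt):
    calls["llm"] += 1
    return [float(len(word)) for word in prompt.split()]


def toy_connector(features, t):
    calls["connector"] += 1
    scale = t / 1000.0
    return [f * scale for f in features]


def toy_denoise(latent, cond, t):
    return latent - 0.01 * sum(cond)


result = sample(
    "a red cube on a blue sphere",
    [800, 600, 400, 200],
    toy_encode,
    toy_connector,
    toy_denoise,
    latent=1.0,
)
print(calls)  # {'llm': 1, 'connector': 4}
```

Because the denoiser only sees `cond`, a connector of this kind can in principle be placed in front of any model that expects the same conditioning interface, which is one way to read the claim about compatibility with community models.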
