ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Adapter utilizing Large Language Models that enhances text-to-image models' ability to comprehend complex and dense prompts.
Efficient Large Language Model Adapter (ELLA) is an adapter designed to improve how well text-to-image generation models understand prompts. These models are capable, but they often struggle with dense prompts that describe many objects, detailed attributes, and complex relationships. Many existing models rely on CLIP as their text encoder, which limits how much of such a prompt they can capture.
ELLA addresses this limitation by connecting a powerful Large Language Model (LLM) to the existing framework without training either the U-Net or the LLM. It uses a Timestep-Aware Semantic Connector (TSC), which dynamically extracts relevant conditions from the pre-trained LLM at different stages of the denoising process.
This approach lets the semantic features adapt throughout the sampling process, which helps the model interpret long, intricate prompts across multiple timesteps. ELLA is also versatile: it can be integrated with various community models and tools, improving their prompt-following ability without extensive modifications.
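To make the idea of a timestep-aware connector more concrete, here is a toy sketch in plain Python. It is not the ELLA implementation. It assumes a resampler-style design in which a small set of query vectors cross-attends over token features from a frozen LLM, and it injects the timestep by simply adding a sinusoidal embedding to each query. The function names, the additive injection, and the random stand-ins for LLM features and learned queries are all assumptions made for illustration.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (dim is assumed to be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def timestep_aware_connector(llm_features, queries, t):
    """Hypothetical connector: timestep-shifted queries attend over LLM tokens.

    llm_features: token vectors from a frozen LLM (n_tokens x dim)
    queries:      query vectors, standing in for learned parameters (n_queries x dim)
    t:            current denoising timestep
    Returns one condition vector per query (n_queries x dim).
    """
    dim = len(queries[0])
    t_emb = timestep_embedding(t, dim)
    conditions = []
    for q in queries:
        # Assumption: inject the timestep by shifting the query additively.
        q_t = [qi + ti for qi, ti in zip(q, t_emb)]
        # Scaled dot-product attention weights over the LLM tokens.
        weights = softmax([dot(q_t, tok) / math.sqrt(dim) for tok in llm_features])
        # Each condition is an attention-weighted average of the token features.
        cond = [sum(w * tok[j] for w, tok in zip(weights, llm_features))
                for j in range(dim)]
        conditions.append(cond)
    return conditions


# Random stand-ins for frozen LLM features and learned queries.
rng = random.Random(0)
dim, n_tokens, n_queries = 8, 12, 4
llm_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]

early = timestep_aware_connector(llm_features, queries, t=900)  # noisy stage
late = timestep_aware_connector(llm_features, queries, t=50)    # near-final stage
```

In this sketch the same prompt features yield different condition vectors at `t=900` and `t=50`, because the timestep changes which tokens each query attends to. The output always has `n_queries` vectors regardless of prompt length, which is one way a connector could hand a fixed-size condition to an unmodified U-Net.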