MobileDiffusion: Rapid text-to-image generation on-device
Efficient latent diffusion model specifically designed for mobile devices
Text-to-image generation on mobile devices has been challenging because of the computational demands of leading models. MobileDiffusion, developed by Google researchers, addresses this challenge with an efficient latent diffusion model designed specifically for mobile devices. With a comparatively small size of 520M parameters, the model generates high-quality 512x512 images on mobile devices in half a second.
The relative inefficiency of text-to-image diffusion models on mobile devices stems from two sources: the iterative denoising required to generate an image, and the complexity of the network architecture, which involves a substantial number of parameters. Previous studies have focused on reducing the number of function evaluations (the number of passes through the denoising network) and on improving the architectural efficiency of text-to-image diffusion models. MobileDiffusion takes a more comprehensive approach, examining each constituent and computational operation within the model's architecture to achieve efficiency.
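To make the cost of iterative denoising concrete, the following is a minimal sketch that counts network evaluations in a toy sampling loop. The `toy_denoiser` function is a hypothetical stand-in for a diffusion UNet, and the step count of 50 is an illustrative assumption, not a figure from MobileDiffusion.

```python
import random


def toy_denoiser(x, t):
    """Hypothetical stand-in for a diffusion UNet: shrinks the sample toward zero."""
    return [v * (1.0 - 1.0 / (t + 1)) for v in x]


def sample(num_steps, dim=4, seed=0):
    """Run num_steps denoising passes and count how often the network is called."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    evaluations = 0
    for t in reversed(range(num_steps)):
        x = toy_denoiser(x, t)  # one function evaluation per step
        evaluations += 1
    return x, evaluations


# Assumed step counts, chosen only to contrast iterative and one-step sampling.
_, nfe_iterative = sample(num_steps=50)
_, nfe_one_step = sample(num_steps=1)
print(nfe_iterative, nfe_one_step)
```

Because each step is a full pass through the network, latency grows roughly in proportion to the number of steps, which is why both fewer steps and a cheaper network matter on a phone.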
MobileDiffusion's design follows that of latent diffusion models, consisting of a text encoder, a diffusion UNet, and an image decoder. The research team conducted a detailed examination of the diffusion UNet architecture, optimizing the transformer blocks and convolution blocks to improve efficiency. Additionally, they trained a variational autoencoder (VAE) to encode and decode images, resulting in a significant performance boost with better quality metrics.
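The three-stage layout described above can be sketched as a small pipeline with stubbed components. Everything here is hypothetical: the class name, the method names, and in particular the downsampling factor of 8 and the 4 latent channels are illustrative assumptions that the text does not state.

```python
from dataclasses import dataclass


@dataclass
class LatentPipeline:
    # Illustrative assumptions only; not values taken from MobileDiffusion.
    downsample: int = 8
    latent_channels: int = 4

    def encode_text(self, prompt):
        # Stand-in for the text encoder: one number per whitespace-separated word.
        return [len(word) for word in prompt.split()]

    def latent_shape(self, height, width):
        # The UNet works on a spatially smaller latent, not on raw pixels.
        return (self.latent_channels,
                height // self.downsample,
                width // self.downsample)

    def denoise(self, shape, text_embedding):
        # Stand-in for the diffusion UNet: returns an all-zero latent of the given shape.
        c, h, w = shape
        return [[[0.0] * w for _ in range(h)] for _ in range(c)]

    def decode(self, latent):
        # Stand-in for the image decoder: reports the pixel size it would produce.
        h = len(latent[0]) * self.downsample
        w = len(latent[0][0]) * self.downsample
        return (h, w)

    def generate(self, prompt, height=512, width=512):
        embedding = self.encode_text(prompt)
        latent = self.denoise(self.latent_shape(height, width), embedding)
        return self.decode(latent)


pipeline = LatentPipeline()
print(pipeline.latent_shape(512, 512))
print(pipeline.generate("a photo of a cat"))
```

The point of the sketch is the division of labor: the expensive denoising happens in the compact latent space, and the decoder maps the result back to a full-resolution image.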
In addition to optimizing the model architecture, MobileDiffusion adopts a DiffusionGAN hybrid to achieve one-step sampling during inference. This approach uses a pre-trained diffusion UNet to initialize both the generator and the discriminator, which streamlines training and allows convergence in fewer than 10K iterations.
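The initialization idea can be illustrated with a minimal sketch in which plain dictionaries stand in for model weights. The parameter names, the values, and the extra discriminator head are hypothetical and are not taken from the MobileDiffusion implementation.

```python
import copy

# Hypothetical pre-trained UNet weights; names and values are made up for illustration.
pretrained_unet = {
    "down.0.weight": [0.1, -0.2],
    "mid.weight": [0.3],
    "up.0.weight": [0.05, 0.4],
}


def init_from_pretrained(weights):
    """Return an independent copy so later training does not alter the source weights."""
    return copy.deepcopy(weights)


# Both networks start from the same pre-trained weights rather than from random values.
generator = init_from_pretrained(pretrained_unet)
discriminator = init_from_pretrained(pretrained_unet)

# Assumption: the discriminator needs an extra output layer to score real versus generated.
discriminator["head.weight"] = [0.0]

# Simulate one training update to the generator; the pre-trained copy stays unchanged.
generator["mid.weight"][0] += 0.01
print(pretrained_unet["mid.weight"], sorted(discriminator))
```

Starting both networks from weights that already model images is the plausible reason training can converge quickly, since neither network has to learn image features from scratch.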
The results show that MobileDiffusion can generate high-quality, diverse images across various domains on both iOS and Android devices, with latency measurements confirming that it produces 512x512 images within half a second. This rapid image generation could enable a range of use cases on mobile devices, making MobileDiffusion a promising option for mobile deployments.

