3D-aware Image Generation and Editing with Multi-modal Conditions
End-to-end 3D-aware image generation and editing model with disentanglement strategy and multi-modal control
The paper proposes a novel end-to-end 3D-aware image generation and editing model. Its first contribution is an exploration of the latent space of 3D-aware Generative Adversarial Networks (GANs) to identify latent features associated with semantic labels. Second, the authors introduce a disentanglement strategy based on a cross-attention mechanism that separates appearance from shape features, so that the same noise latent yields a consistent appearance across different semantic masks. Third, they propose a multi-modal interactive 3D-aware generation framework that accepts several kinds of conditional input: pure noise, text, and reference images. The framework supports flexible generation and editing tasks, including generating diverse images from distinct noise vectors, editing appearance through a text description, and performing style transfer from a reference RGB image.
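As a rough illustration of the cross-attention idea mentioned above, the following minimal sketch computes scaled dot-product cross-attention over plain Python lists. It is not the paper's module: the choice of shape tokens as queries and appearance tokens as keys and values, the token dimensions, and the names `cross_attention`, `shape_tokens`, and `appearance_tokens` are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: list of n_q vectors of dimension d
    keys:    list of n_kv vectors of dimension d
    values:  list of n_kv vectors of any shared dimension
    Returns one output vector per query, each a weighted average of `values`.
    """
    d = len(queries[0])
    dv = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)]
        )
    return outputs


# Hypothetical toy inputs: three shape tokens (e.g. from a semantic mask)
# attend to two appearance tokens (e.g. derived from a noise latent).
shape_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
appearance_tokens = [[0.5, -0.5], [2.0, 0.0]]
fused = cross_attention(shape_tokens, appearance_tokens, appearance_tokens)
```

Each row of `fused` is a convex combination of the appearance tokens, so every shape token receives appearance information without the appearance tokens themselves being altered by the mask.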
The proposed approach uses an interactive module to estimate appearance-aware shape features, which modulate the first several layers of the 3D GAN to generate shape structure with coarse appearance. The appearance code is then used to generate detailed texture features by adapting the remaining layers of the generator. Training is guided by losses that include semantic similarity and appearance consistency.
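The layer split described above can be pictured as assigning a different code to each generator layer. The sketch below is only an illustration under assumptions: the function name `build_layer_codes`, the number of layers, and the split point are hypothetical, and the summary does not state how many layers the paper assigns to shape.

```python
def build_layer_codes(shape_code, appearance_code, num_layers, num_shape_layers):
    """Build one modulation code per generator layer.

    The first `num_shape_layers` layers receive the (appearance-aware) shape
    code; the remaining layers receive the appearance code.
    """
    if not 0 <= num_shape_layers <= num_layers:
        raise ValueError("num_shape_layers must lie in [0, num_layers]")
    return [
        list(shape_code) if i < num_shape_layers else list(appearance_code)
        for i in range(num_layers)
    ]


# Hypothetical example: a 6-layer generator whose first 2 layers control shape.
codes = build_layer_codes([1.0, 2.0], [9.0, 8.0], num_layers=6, num_shape_layers=2)
```

Keeping the split point as a parameter makes it easy to test how moving layers between the two groups changes which code controls structure and which controls texture.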
The authors examine the latent space of 3D GANs and propose a dissection strategy that separates shape and appearance features during generation. They run style mixing experiments to explore the semantic latent space and propose a quantitative metric, based on semantic labels, for identifying the semantic properties embedded in each dimension of the style vector. The paper also introduces a strategy for disentangled generation of shape and appearance, addressing a limitation of existing methods, which struggle to generate arbitrary appearance because they lack interaction with the appearance code.
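To make the style mixing analysis more concrete, here is a small sketch of one way such an experiment could be scored. The summary does not define the paper's metric, so the label change rate used here is a stand-in proxy, and `mix_styles`, `label_change_rate`, and the toy masks are hypothetical names and data introduced only for illustration.

```python
def mix_styles(w_src, w_ref, layers):
    """Return a copy of w_src whose per-layer codes at `layers` come from w_ref."""
    layers = set(layers)
    return [
        list(w_ref[i]) if i in layers else list(w_src[i])
        for i in range(len(w_src))
    ]


def label_change_rate(mask_a, mask_b):
    """Fraction of pixels whose semantic label differs between two masks.

    Both masks are 2D lists of integer labels with the same shape.
    """
    total = 0
    changed = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            total += 1
            changed += int(a != b)
    return changed / total


# Toy per-layer codes: swap only layer 1 from the reference.
w_src = [[0.0], [0.0], [0.0]]
w_ref = [[1.0], [1.0], [1.0]]
w_mixed = mix_styles(w_src, w_ref, layers=[1])

# Toy semantic masks standing in for segmentations of the images generated
# before and after mixing; in practice they would come from a segmentation model.
mask_before = [[0, 0, 1], [0, 1, 1]]
mask_after = [[0, 1, 1], [0, 1, 1]]
rate = label_change_rate(mask_before, mask_after)
```

Under this proxy, a layer whose swap produces a high change rate would be read as carrying shape information, while a layer whose swap leaves the labels mostly intact would be read as carrying appearance.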


Comments
None