LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis
LATTE3D is a novel approach that efficiently generates high-quality textured 3D meshes from text prompts in just 400ms by combining 3D priors, amortized optimization, and surface rendering.
LATTE3D is NVIDIA's latest entry in large-scale amortized text-to-enhanced 3D synthesis. It addresses the limitations of existing text-to-3D generation approaches by combining 3D priors, amortized optimization, and a second surface-rendering stage, achieving fast, high-quality generation on a significantly larger prompt set. LATTE3D leverages 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization, making it robust to diverse and complex training prompts. Its key components are a scalable architecture and the amortization of both neural-field and textured-surface generation, producing highly detailed textured meshes in a single forward pass.
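To make the amortization concrete, here is a minimal sketch of a text-conditioned triplane generator in PyTorch, in the spirit of the architecture described above. The `AmortizedTextTo3D` class, its module names, and all layer sizes are illustrative assumptions, not the paper's actual network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmortizedTextTo3D(nn.Module):
    """Sketch: a text embedding conditions a network that emits triplane features,
    which small MLPs decode into geometry (SDF) and texture (RGB)."""
    def __init__(self, text_dim=768, plane_res=32, plane_ch=32):
        super().__init__()
        self.plane_res, self.plane_ch = plane_res, plane_ch
        # Map the prompt embedding to three axis-aligned feature planes (XY, XZ, YZ).
        self.to_triplane = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.SiLU(),
            nn.Linear(1024, 3 * plane_ch * plane_res * plane_res),
        )
        # G: geometry head predicting a signed distance value per 3D point.
        self.G = nn.Sequential(nn.Linear(3 * plane_ch, 64), nn.SiLU(), nn.Linear(64, 1))
        # T: texture head predicting an RGB color per 3D point.
        self.T = nn.Sequential(nn.Linear(3 * plane_ch, 64), nn.SiLU(), nn.Linear(64, 3))

    def sample_planes(self, planes, pts):
        # Bilinearly sample each plane at the 2D projections of pts in [-1, 1]^3.
        feats = []
        for i, dims in enumerate([(0, 1), (0, 2), (1, 2)]):
            grid = pts[..., dims].unsqueeze(1)                       # (B, 1, N, 2)
            f = F.grid_sample(planes[:, i], grid, align_corners=True)  # (B, C, 1, N)
            feats.append(f.squeeze(2).permute(0, 2, 1))              # (B, N, C)
        return torch.cat(feats, dim=-1)                              # (B, N, 3C)

    def forward(self, text_emb, pts):
        B = text_emb.shape[0]
        planes = self.to_triplane(text_emb).view(
            B, 3, self.plane_ch, self.plane_res, self.plane_res)
        feats = self.sample_planes(planes, pts)
        return self.G(feats), self.T(feats)  # SDF values, RGB colors
```

Because the prompt is baked into the triplane features by one network pass, querying geometry and texture at 3D points only requires cheap plane lookups and tiny MLPs, which is what makes single-forward-pass generation feasible.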
LATTE3D consists of two stages. The first stage uses volumetric rendering to train both texture and geometry, combining an SDS gradient from a 3D-aware image prior with a regularization loss that compares predicted shape masks to masks of 3D assets in a library. The second stage employs surface-based rendering to train only the texture, for enhanced quality. Both stages amortize optimization over a set of prompts so that generation stays fast. The method involves two networks, a texture network T and a geometry network G, which share encoders in the first stage; in the second stage the geometry network G is frozen, only the texture network T is updated, and the triplanes are upsampled with an MLP that takes the text embedding as input.
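The two-stage schedule can be summarized with a short training-loop sketch. This is a hedged illustration under stated assumptions, not the paper's released code: `render_volume`, `render_surface`, `sds_loss`, `mask_loss`, and the batch format are hypothetical stand-ins for the volumetric renderer, the surface renderer, the SDS gradient from the 3D-aware prior, and the shape-mask regularizer.

```python
def train_stage1(model, render_volume, sds_loss, mask_loss, batches, optimizer, lam_reg=1.0):
    # Stage 1: amortized over the prompt set; volumetric rendering supervises
    # both the texture network T and the geometry network G.
    for text_emb, cameras, library_mask in batches:
        image, mask = render_volume(model, text_emb, cameras)   # differentiable volume render
        loss = sds_loss(image, text_emb)                        # SDS gradient from 3D-aware prior
        loss = loss + lam_reg * mask_loss(mask, library_mask)   # regularize toward library shapes
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train_stage2(model, render_surface, sds_loss, batches, optimizer):
    # Stage 2: the geometry network G is frozen; surface (mesh) rendering
    # refines only the texture network T for enhanced quality.
    for p in model.G.parameters():
        p.requires_grad_(False)
    for text_emb, cameras, _ in batches:
        image = render_surface(model, text_emb, cameras)
        loss = sds_loss(image, text_emb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Freezing G in the second loop keeps the geometry fixed while the sharper surface rendering pushes gradient only into the texture, which mirrors the texture-only refinement described above.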
LATTE3D's trained models let users enter varied text prompts and interactively view high-quality 3D assets. The method generates high-quality 3D assets in about 400 ms for a wide range of text prompts, with the option to regularize generation towards a user-specified 3D shape. LATTE3D also supports user stylization: a model can be trained on a large set of prompts for realistic animals and then generate stylized animals based on user-supplied point clouds. Finally, the method enhances user controllability through interpolation between user-provided shapes and text prompts, allowing users to guide generation towards a provided point cloud.
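As a rough illustration of this interface, the sketch below assumes hypothetical helpers `encode_text`, `encode_pointcloud`, and `model.extract_mesh`; it shows single-forward-pass generation and a simple linear interpolation of the conditioning between a user-supplied point cloud and a text prompt, one plausible way to realize the described shape/text blending rather than the released API.

```python
import torch

@torch.no_grad()
def generate(model, encode_text, prompt):
    # Fast path: one forward pass, no per-prompt optimization (~400 ms in the paper).
    text_emb = encode_text(prompt)
    return model.extract_mesh(text_emb)

@torch.no_grad()
def stylize(model, encode_text, encode_pointcloud, prompt, points, alpha=0.5):
    # Blend a conditioning derived from the user's point cloud with the text
    # conditioning; alpha=0 follows the shape, alpha=1 follows the prompt alone.
    cond = (1 - alpha) * encode_pointcloud(points) + alpha * encode_text(prompt)
    return model.extract_mesh(cond)
```

Sweeping `alpha` from 0 to 1 would trace the interpolation between the provided shape and the pure text prompt, giving the user a dial for how strongly the output sticks to the supplied point cloud.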