VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation
A visual prompt-guided text-to-3D diffusion model for generating more realistic 3D models from text prompts
This work introduces VP3D, a text-to-3D generation approach that improves the quality of 3D models by incorporating visual prompts. The method has two steps: it first generates an image from the text prompt with a text-to-image diffusion model, then uses that image as a visual prompt to guide the optimization of a Neural Radiance Field (NeRF) with Score Distillation Sampling (SDS). Because the visual prompt supplies a realistic appearance and rich texture detail, it improves the fidelity of the resulting 3D assets. A differentiable reward function additionally encourages the 3D model to stay aligned with both the visual prompt and the text prompt.
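The SDS step described above can be illustrated with a minimal sketch. It assumes the standard SDS form, in which the gradient passed back to the rendered image is a timestep weight times the difference between predicted and injected noise. The names `sds_gradient`, `predict_noise`, `alpha_bar_t`, and `weight` are hypothetical, and `predict_noise` is only a stand-in for a diffusion model already conditioned on the text and visual prompts; this is not the authors' implementation.

```python
import math
import random


def sds_gradient(rendered, predict_noise, alpha_bar_t, weight, rng):
    """Return an SDS-style gradient with respect to the rendered pixels.

    rendered: flat list of pixel values rendered from the current 3D model.
    predict_noise: hypothetical stand-in for the diffusion model, assumed to
        be conditioned on the text prompt and the visual prompt already.
    alpha_bar_t: cumulative noise-schedule term for the sampled timestep.
    weight: timestep weighting w(t), an assumed scalar here.
    """
    # Sample Gaussian noise and form the noised image x_t.
    eps = [rng.gauss(0.0, 1.0) for _ in rendered]
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    noised = [a * x + s * e for x, e in zip(rendered, eps)]

    # SDS skips the denoiser Jacobian: grad = w(t) * (eps_hat - eps).
    eps_hat = predict_noise(noised)
    return [weight * (p - e) for p, e in zip(eps_hat, eps)]
```

In a real pipeline this gradient would be backpropagated through a differentiable renderer into the 3D parameters; the sketch stops at the image level to stay self-contained.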
In the implementation, the 3D representation is DMTet, optimized in two stages: coarse and fine. The coarse stage initializes the 3D shape and texture from the density and color fields, while the fine stage refines them into a high-fidelity mesh and texture. At each optimization step, images are rendered from the 3D model at different camera poses and optimized with a weighted combination of three losses: L_VP-SDS, L_vc-reward, and L_hf-reward. Trade-off weights balance the three terms, and the visual prompt condition weight is kept fixed for consistency.
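As a rough illustration of how the three losses might be combined, the sketch below forms a weighted sum. The `LossWeights` class, the `total_loss` function, and the default weight values are assumptions for illustration only; the summary does not state the actual trade-off values.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Placeholder trade-off weights (assumed, not taken from the paper).
    vp_sds: float = 1.0
    vc_reward: float = 1.0
    hf_reward: float = 1.0


def total_loss(l_vp_sds, l_vc_reward, l_hf_reward, weights=None):
    """Weighted sum of L_VP-SDS, L_vc-reward, and L_hf-reward."""
    w = weights if weights is not None else LossWeights()
    return (
        w.vp_sds * l_vp_sds
        + w.vc_reward * l_vc_reward
        + w.hf_reward * l_hf_reward
    )
```

For example, `total_loss(1.0, 2.0, 3.0, LossWeights(1.0, 0.5, 0.1))` scales the two reward terms down relative to the SDS term before summing.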

