RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
A technique for generating general, forward-facing 3D scenes from text descriptions, capable of synthesizing high-quality scenes in a wide variety of styles.
RealmDreamer generates 3D scenes from text descriptions by optimizing a 3D Gaussian Splatting representation. The method leverages pretrained 2D inpainting and depth diffusion models to initialize the splats and compute an occlusion volume. Optimization is then framed as a 3D inpainting task with image-conditional diffusion models, which fills in occluded regions, while distillation from a diffusion-based depth estimator enforces correct geometric structure. The model is finetuned on sharpened samples from image generators, yielding high-quality 3D scene synthesis without any video or multi-view training data.
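The core optimization can be pictured as a score-distillation loop driven by the inpainting diffusion model. The following is a minimal PyTorch sketch, not the authors' code: `renderer`, `inpaint_unet`, and `scheduler` are hypothetical stand-ins for a differentiable 3DGS rasterizer, the image-conditional inpainting model, and a DDPM-style noise scheduler.

```python
import torch

def inpaint_sds_step(splats, camera, ref_image, mask, prompt_emb,
                     renderer, inpaint_unet, scheduler, optimizer):
    """One score-distillation step driven by an inpainting diffusion model.

    `renderer`, `inpaint_unet`, and `scheduler` are hypothetical stand-ins:
    a differentiable 3DGS rasterizer, an image-conditional inpainting model
    that predicts noise, and a DDPM-style noise scheduler.
    """
    rendered = renderer(splats, camera)  # (3, H, W), differentiable w.r.t. splats

    # Diffuse the render to a random timestep.
    t = torch.randint(20, 980, (1,), device=rendered.device)
    noise = torch.randn_like(rendered)
    noisy = scheduler.add_noise(rendered, noise, t)

    # The inpainting model is conditioned on the reference image, the
    # occlusion mask, and the text embedding.
    with torch.no_grad():
        noise_pred = inpaint_unet(noisy, t, ref_image, mask, prompt_emb)

    # Score-distillation gradient: push the render toward the model's
    # denoised estimate, restricted to occluded regions via the mask.
    grad = (noise_pred - noise) * mask
    loss = (grad.detach() * rendered).sum()  # surrogate: d(loss)/d(rendered) = grad
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Detaching the gradient and folding it into a surrogate loss is the standard way to apply a diffusion model's score as supervision without backpropagating through the denoiser itself.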
The framework of RealmDreamer consists of three key stages. First, a robust scene initialization is built from 2D priors supplied by pretrained models. Next, a consistent 3D representation is learned by applying 2D inpainting priors within the occluded volume. Finally, distillation from diffusion-based depth estimators refines the geometric structure. The technique generates general forward-facing 3D scenes from text prompts, achieving state-of-the-art results with detailed appearance, high-fidelity geometry, and parallax effects.
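One way to realize the depth-distillation stage is a scale- and shift-invariant loss between the rendered depth and a depth map sampled from the depth diffusion model. A negative Pearson-correlation loss is a common choice in this line of work, though the paper's exact formulation may differ; a minimal sketch:

```python
import torch

def depth_distillation_loss(rendered_depth: torch.Tensor,
                            diffusion_depth: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation between rendered and predicted depth.

    Invariant to scale and shift, so it supervises relative geometry
    without requiring the two depth maps to share a metric scale.
    (A common choice in this line of work; the paper's loss may differ.)
    """
    r = rendered_depth.flatten()
    d = diffusion_depth.flatten()
    r = (r - r.mean()) / (r.std() + 1e-6)
    d = (d - d.mean()) / (d.std() + 1e-6)
    return -(r * d).mean()
```

Minimizing this loss aligns the relative depth structure of the splats with the depth diffusion model's prediction, without forcing agreement on absolute scale.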
The method uses DreamBooth to personalize the text-to-image diffusion model during fine-tuning, improving stylistic adherence to the reference image. Gaussian splatting is used to initialize the model for the inpainting stage, with parameters such as learning rates and decay schedules carefully tuned for optimal performance. Comparisons against existing baselines, including Text2Room, ProlificDreamer, and DreamFusion, use the official implementations, ensuring a fair evaluation of the proposed technique's efficacy.
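For the learning-rate decay schedules, the reference 3D Gaussian Splatting implementation decays the splat position learning rate via log-linear (exponential) interpolation. The sketch below uses that implementation's default values; RealmDreamer's tuned values are not reproduced here and may differ.

```python
import math

def expon_lr(step: int, lr_init: float, lr_final: float, max_steps: int) -> float:
    """Log-linear interpolation from lr_init to lr_final over max_steps,
    as in the reference 3D Gaussian Splatting implementation."""
    t = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))

# Defaults from the reference 3DGS code (position learning rate);
# RealmDreamer's tuned values may differ.
lr = expon_lr(step=1000, lr_init=1.6e-4, lr_final=1.6e-6, max_steps=30_000)
```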