SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
A latent video diffusion model that generates high-resolution orbital videos around a 3D object from a single input image
The paper reviews advances in 3D generation, focusing on Neural Radiance Fields (NeRFs) and diffusion-based generative models. It notes the progress in representing 3D scenes with NeRFs and efficient variants such as Instant-NGP, as well as hybrid surface representations like DMTet, which enable high-resolution 3D shape generation with improved memory efficiency. It also surveys diffusion-based approaches to 3D generation, emphasizing the generalization challenges caused by the scarcity of 3D data and the common practice of using image or multi-view diffusion models as guidance for 3D generation.
The main contribution of the paper is SV3D, a novel multi-view synthesis approach that adapts the temporal consistency of a video diffusion model into spatial 3D consistency around an object. Beyond using the SV3D-generated images directly, the 3D generation pipeline adds a soft-masked SDS loss for unseen areas, yielding denser, controllable, and more consistent multi-view images and improved 3D generation quality. The 3D generation framework follows a coarse-to-fine strategy: an Instant-NGP NeRF first learns a rough shape of the object, which is then refined using the DMTet representation.
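As a rough illustration of how such an objective could be assembled, the following is a minimal sketch in PyTorch-style code, assuming a differentiable renderer and a diffusion prior that exposes a score-distillation gradient; the method names (render_view, visibility_mask, sds_gradient) and the loss weight are hypothetical placeholders, not the paper's API.

```python
# Minimal sketch, not the authors' implementation: combine a reconstruction
# loss on SV3D-generated reference views with a soft-masked SDS loss on a
# novel view. All object methods below (render_view, visibility_mask,
# sds_gradient) are hypothetical placeholders.
import torch
import torch.nn.functional as F

def training_step(nerf, diffusion_prior, ref_images, ref_cameras, novel_camera,
                  sds_weight=0.1):
    # 1) Photometric reconstruction on views covered by SV3D-generated images.
    recon_loss = torch.zeros(())
    for img, cam in zip(ref_images, ref_cameras):
        rendered = nerf.render_view(cam)                # hypothetical renderer call
        recon_loss = recon_loss + F.mse_loss(rendered, img)

    # 2) Soft-masked SDS on a novel view: the mask (in [0, 1], 1 = unseen)
    #    restricts the diffusion prior's influence to regions that the
    #    reference orbit does not cover.
    novel_render = nerf.render_view(novel_camera)
    mask = nerf.visibility_mask(novel_camera)           # hypothetical soft mask
    sds_grad = diffusion_prior.sds_gradient(novel_render)  # score-distillation gradient
    # Standard SDS trick: inject the (detached) gradient through the render.
    sds_loss = (mask * sds_grad.detach() * novel_render).sum()

    return recon_loss + sds_weight * sds_loss           # weight is illustrative
```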
Furthermore, the paper details the optimization strategies and losses used in the 3D generation process, including photometric reconstruction losses, a normal loss, a smooth depth loss, and a spatial smoothness regularizer on the albedo. It evaluates the framework on several metrics, comparing SV3D-guided 3D generations against prior methods, and provides visual and quantitative comparisons showing that the output meshes have high-fidelity texture and geometry.
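To make the regularizers concrete, here is a minimal sketch of generic first-order smoothness penalties on rendered normal, depth, and albedo buffers; the exact formulations and weights in the paper may differ, and the tensor shapes and loss weights below are illustrative assumptions.

```python
# Minimal sketch of generic smoothness regularizers on rendered buffers.
# The exact loss definitions and weights used in the paper may differ.
import torch

def geometry_regularizers(normals, depth, albedo,
                          w_normal=1.0, w_depth=0.5, w_albedo=0.1):
    # normals: (H, W, 3), depth: (H, W), albedo: (H, W, 3)

    # Normal smoothness: neighboring normals should agree.
    normal_loss = ((normals[:, 1:] - normals[:, :-1]) ** 2).mean() \
                + ((normals[1:, :] - normals[:-1, :]) ** 2).mean()

    # Smooth depth: penalize first-order depth differences.
    depth_loss = (depth[:, 1:] - depth[:, :-1]).abs().mean() \
               + (depth[1:, :] - depth[:-1, :]).abs().mean()

    # Spatial smoothness on albedo: discourage baking shading into texture.
    albedo_loss = ((albedo[:, 1:] - albedo[:, :-1]) ** 2).mean() \
                + ((albedo[1:, :] - albedo[:-1, :]) ** 2).mean()

    return w_normal * normal_loss + w_depth * depth_loss + w_albedo * albedo_loss

# Example usage with random stand-ins for rendered buffers.
H, W = 64, 64
loss = geometry_regularizers(torch.randn(H, W, 3), torch.rand(H, W), torch.rand(H, W, 3))
```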