VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
The paper introduces an approach to building scalable 3D generative models from pre-trained video diffusion models. To overcome the scarcity of 3D training data, a video diffusion model trained on text, images, and videos is repurposed as a synthetic multi-view data generator, and the resulting dataset is used to train a feed-forward 3D generative model called VFusion3D.
VFusion3D generates high-quality 3D assets from a single image and can render them from arbitrary viewpoints. The method uses a video diffusion model, pre-trained on large amounts of text, images, and videos, as a 3D data generator. After pre-training on the generated synthetic data, the model is further fine-tuned on renderings from a 3D dataset to enhance its 3D generative capabilities. The proposed model is compared against several distillation-based and feed-forward 3D generative models using user studies and automated metrics. Ablation studies evaluate the impact of different design choices, including the fine-tuning strategy and the amount of 3D data used, and demonstrate the effectiveness of the proposed training strategies as well as the scalability of the approach as more 3D data is collected.
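Because the feed-forward model is meant to produce assets that can be rendered from arbitrary viewpoints, and the 3D fine-tuning data consists of multi-view renderings, a recurring building block is sampling camera poses on an orbit around the object. The sketch below shows a minimal, generic look-at construction in Python; the function name `orbit_camera`, the orbit radius, and the camera convention are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def orbit_camera(azimuth_deg, elevation_deg, radius=2.0):
    """Build a camera-to-world matrix on a circular orbit around the origin,
    looking at the object centre. Generic look-at construction; the exact
    camera convention used by the paper may differ."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    eye = radius * np.array([np.cos(el) * np.cos(az),
                             np.cos(el) * np.sin(az),
                             np.sin(el)])
    forward = -eye / np.linalg.norm(eye)                 # points at the origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1] = right, up
    c2w[:3, 2], c2w[:3, 3] = -forward, eye               # camera looks down -Z
    return c2w

# Sample a ring of viewpoints, e.g. for rendering a generated asset or for
# producing multi-view training renders of a 3D object.
poses = [orbit_camera(a, 20.0) for a in range(0, 360, 45)]
```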
The framework uses a multi-stage training approach to stabilize training and incorporates opacity supervision to further improve performance. The pre-trained VFusion3D model is fine-tuned with 3D data, using a dataset of rendered multi-view images and an L2 loss for novel-view supervision (see the sketch below). The study also examines how the amount of 3D data used to fine-tune the video diffusion model affects results, showing that more 3D data improves multi-view sequence generation without compromising visual quality.
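As a rough illustration of the supervision described above, the sketch below combines an L2 loss on rendered novel views with an opacity term. The helper name `novel_view_loss`, the tensor shapes, the weighting, and the precise form of the opacity loss are assumptions chosen for the example rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def novel_view_loss(pred_rgb, pred_alpha, gt_rgb, gt_alpha, opacity_weight=1.0):
    # L2 (MSE) supervision on rendered novel-view RGB, plus an opacity term
    # comparing rendered alpha against ground-truth object masks.
    rgb_loss = F.mse_loss(pred_rgb, gt_rgb)
    opacity_loss = F.mse_loss(pred_alpha, gt_alpha)
    return rgb_loss + opacity_weight * opacity_loss

# Dummy tensors standing in for a batch of renderings from several
# supervision viewpoints: (batch, views, channels, height, width).
pred_rgb = torch.rand(2, 4, 3, 128, 128, requires_grad=True)
pred_alpha = torch.rand(2, 4, 1, 128, 128, requires_grad=True)
gt_rgb = torch.rand(2, 4, 3, 128, 128)
gt_alpha = (torch.rand(2, 4, 1, 128, 128) > 0.5).float()

loss = novel_view_loss(pred_rgb, pred_alpha, gt_rgb, gt_alpha)
loss.backward()  # in practice gradients flow back into the 3D generator
```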

