InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
Feed-forward framework for instant 3D mesh generation from a single image
TencentARC introduces InstantMesh, a method for generating high-quality 3D assets from a single image. It builds on large reconstruction models (LRMs), which use a scalable transformer architecture to map image tokens directly to a 3D representation, specifically triplanes. By combining an LRM with image generation models, InstantMesh significantly improves generalization ability. Novel views are synthesized by querying the triplane features with a Multi-Layer Perceptron (MLP). However, decoding triplanes requires a memory-intensive volume rendering process, which limits training scale. To address this, recent works have adopted Gaussians as the 3D representation for rendering efficiency, although Gaussians are less well suited to geometric modeling.
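To make the triplane idea concrete, here is a minimal sketch of how per-point features could be read out of three axis-aligned feature planes before an MLP decodes them into density and color. This is an illustrative toy (nearest-neighbour sampling, random weights), not InstantMesh's actual implementation, and the function name `query_triplane` is hypothetical:

```python
import numpy as np

def query_triplane(triplanes, points):
    """Sample features from three axis-aligned planes and concatenate them.

    triplanes: array of shape (3, C, R, R) -- XY, XZ, YZ feature planes.
    points: array of shape (N, 3) with coordinates in [-1, 1].
    Returns per-point features of shape (N, 3*C).
    """
    _, C, R, _ = triplanes.shape
    # Project each 3D point onto the three planes.
    projections = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    feats = []
    for plane, uv in zip(triplanes, projections):
        # Map [-1, 1] -> [0, R-1]; nearest-neighbour sampling for brevity
        # (a real renderer would use bilinear interpolation).
        idx = np.clip(((uv + 1) / 2 * (R - 1)).round().astype(int), 0, R - 1)
        feats.append(plane[:, idx[:, 1], idx[:, 0]].T)  # (N, C)
    return np.concatenate(feats, axis=1)

# An MLP head would map these features to density/color for volume rendering.
rng = np.random.default_rng(0)
planes = rng.standard_normal((3, 8, 32, 32))  # C=8 channels, 32x32 resolution
pts = rng.uniform(-1, 1, size=(5, 3))
features = query_triplane(planes, pts)
print(features.shape)  # (5, 24)
```

Querying such features for every sample along every camera ray is exactly the volume-rendering cost the paragraph above refers to.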
InstantMesh is compared with four baselines: TripoSR, LGM, CRM, and SV3D. TripoSR is an LRM implementation known for its single-view reconstruction performance, while LGM and CRM are U-Net-based models that reconstruct Gaussians and 3D meshes, respectively, from multi-view images. SV3D is an image-conditioned diffusion model that generates orbital videos of objects. The evaluation uses PSNR, SSIM, and LPIPS for 2D visual quality, and Chamfer Distance (CD) and F-Score (FS) for 3D geometric quality. InstantMesh demonstrates competitive performance across these metrics, showing that combining an LRM with image generation models is an effective way to produce high-quality 3D assets from a single image.
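The 3D metrics above can be illustrated with a short sketch. Chamfer Distance averages nearest-neighbour distances between two point clouds in both directions, and F-Score is the harmonic mean of precision and recall under a distance threshold. This is a simple O(N·M) toy under an assumed threshold `tau`; actual benchmark code typically samples points from mesh surfaces and uses KD-trees:

```python
import numpy as np

def chamfer_and_fscore(pred, gt, tau=0.1):
    """Symmetric Chamfer Distance and F-Score between two point clouds.

    pred: (N, 3) predicted surface samples; gt: (M, 3) ground-truth samples.
    tau: distance threshold for counting a point as correctly reconstructed
    (an illustrative default, not the paper's evaluation setting).
    """
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    d_pred_to_gt = d.min(axis=1)  # nearest gt point for each prediction
    d_gt_to_pred = d.min(axis=0)  # nearest prediction for each gt point
    cd = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return cd, fscore

pts = np.random.default_rng(1).uniform(size=(256, 3))
cd, fs = chamfer_and_fscore(pts, pts)  # identical clouds: CD = 0, FS ~ 1
```

Lower CD and higher FS indicate better geometric agreement with the ground-truth shape.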