GSEdit: Efficient Text-Guided Editing of 3D Objects via Gaussian Splatting
GSEdit is a pipeline for text-guided 3D object editing using Gaussian Splatting models, enabling efficient and accurate editing of object shape and appearance while maintaining coherence and detail.
The paper introduces GSEdit, a method for efficient text-guided editing of 3D objects built on image diffusion models and the Gaussian splatting scene representation. It adapts the Score Distillation Sampling (SDS) loss that Tang et al. proposed for Gaussian Splatting (GS) generation to the task of GS editing, deriving the corresponding analytical gradients. Beyond the SDS-driven editing of the Gaussians, the pipeline comprises input preparation, mesh extraction from the Gaussian splatting scene, texture refinement, and mesh post-processing. The method accepts any Gaussian splatting scene as input, with a focus on single foreground objects; in practice, the inputs are GS reconstructions computed from multi-view renders of a given textured 3D mesh. Mesh extraction relies on local density query and color back-projection, texture refinement is guided by Stable Diffusion, and the extracted mesh is post-processed with decimation and remeshing.
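A minimal sketch of how SDS-style editing of a Gaussian splatting scene can be wired up, assuming a differentiable rasterizer and a diffusion-based gradient oracle are available; `render_gaussians`, `sds_gradient`, and the parameter layout are illustrative placeholders rather than the authors' actual interfaces.

```python
# Illustrative SDS-style editing loop over Gaussian splatting parameters.
# `render_gaussians` and `sds_gradient` are hypothetical callables standing in
# for a differentiable GS rasterizer and a diffusion-model gradient oracle.
import torch

def edit_gaussians(
    gaussian_params: dict[str, torch.Tensor],  # positions, scales, rotations, opacities, SH colors
    render_gaussians,                          # (params, camera) -> image tensor of shape (3, H, W)
    sds_gradient,                              # (image, prompt) -> gradient of the SDS loss w.r.t. the image
    cameras: list,                             # viewpoints sampled around the object
    prompt: str,
    steps: int = 500,
    lr: float = 1e-3,
) -> dict[str, torch.Tensor]:
    # Optimize a copy of the Gaussian parameters so the input scene is untouched.
    params = {k: v.detach().clone().requires_grad_(True) for k, v in gaussian_params.items()}
    opt = torch.optim.Adam(list(params.values()), lr=lr)
    for step in range(steps):
        cam = cameras[step % len(cameras)]
        image = render_gaussians(params, cam)   # differentiable rasterization of the current Gaussians
        grad = sds_gradient(image, prompt)      # score-distillation gradient from the 2D diffusion prior
        opt.zero_grad()
        image.backward(gradient=grad)           # inject dL/d(image) and backprop into the Gaussians
        opt.step()
    return {k: v.detach() for k, v in params.items()}
```

The key point is that the diffusion prior supplies a gradient with respect to the rendered image, which backpropagates through the rasterizer into the Gaussian positions, scales, rotations, opacities, and colors.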
Editing is driven by a diffusion U-Net conditioned jointly on the rendered source image and on text embeddings derived from the user's instruction; the interplay of the image condition and the instruction embedding steers the model toward contextually relevant edits while preserving coherence with the source views, allowing it to adapt to diverse editing tasks. Qualitative and quantitative results demonstrate the method's efficiency and its ability to consistently edit Gaussian splatting objects, modifying shape, color, and overall style. Compared with Instruct-NeRF2NeRF, GSEdit shows better coherence with the given prompts and requires on average only 16% of the end-to-end running time. However, the method is fundamentally dependent on the Instruct-Pix2Pix framework and inherits its constraints, such as perspective bias and difficulty with substantial spatial transformations.
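A hedged sketch of the Instruct-Pix2Pix-style conditioning described above, written against a generic denoising U-Net: the noisy latent is concatenated with the latent of the source render (image condition), the instruction embedding serves as the text condition, and classifier-free guidance is applied over both. Names such as `unet`, `alphas_cumprod`, and the two guidance scales are assumptions for illustration, not the paper's exact interface.

```python
# Illustrative Instruct-Pix2Pix-style SDS gradient with dual classifier-free guidance.
# All interfaces here are assumed for the sketch, not taken from the paper's code.
import torch

@torch.no_grad()
def instruct_sds_grad(
    latents: torch.Tensor,         # (B, 4, h, w) latents of the current render
    image_latents: torch.Tensor,   # (B, 4, h, w) latents of the source (unedited) render
    text_emb: torch.Tensor,        # (B, L, D) embedding of the editing instruction
    null_emb: torch.Tensor,        # (B, L, D) embedding of the empty prompt
    unet,                          # callable: (x, t, encoder_hidden_states) -> predicted noise
    alphas_cumprod: torch.Tensor,  # scheduler's cumulative alphas, indexed by timestep
    guidance_text: float = 7.5,
    guidance_image: float = 1.5,
) -> torch.Tensor:
    # Sample a timestep, noise the render latents accordingly.
    t = torch.randint(50, 950, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise

    # The image condition is concatenated along the channel dimension.
    x_img = torch.cat([noisy, image_latents], dim=1)                     # image-conditioned input
    x_none = torch.cat([noisy, torch.zeros_like(image_latents)], dim=1)  # fully unconditional input

    e_full = unet(x_img, t, encoder_hidden_states=text_emb)   # text + image condition
    e_img = unet(x_img, t, encoder_hidden_states=null_emb)    # image condition only
    e_none = unet(x_none, t, encoder_hidden_states=null_emb)  # no condition

    # Classifier-free guidance over both the image and the text condition.
    eps = e_none + guidance_image * (e_img - e_none) + guidance_text * (e_full - e_img)
    w = 1.0 - a_t                  # a common SDS weighting choice
    return w * (eps - noise)       # gradient with respect to the latents
```

The returned tensor plays the role of `sds_gradient` in the editing-loop sketch above: it is injected as the gradient of the rendered (encoded) image rather than evaluated as a conventional loss.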