SC-Diff: 3D Shape Completion with Latent Diffusion Models
Utilizes a 3D latent diffusion model for completing shapes from partial 3D scans
SC-Diff is an approach to 3D shape completion from partial scans that aims to produce realistic, high-fidelity complete shapes. It compresses 3D shapes into a latent space, which makes voxel grids of higher resolutions tractable and allows a single model to learn diffusion-based shape completion across multiple classes. The compression is performed by an auto-encoder, a Vector Quantised Variational AutoEncoder (VQ-VAE), trained with both 3D supervision and volume-rendering-based 2D supervision to map the input Truncated Signed Distance Function (TSDF) into a lower-dimensional, compact latent representation that the diffusion model can process efficiently. On top of this representation, the model combines two independent and complementary conditioning mechanisms in an integrated, ControlNet-inspired design: image-based conditioning via cross-attention, and spatially consistent conditioning that integrates 3D features from the partial scan, inspired by DiffComplete.
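The following is a minimal PyTorch sketch of this compression stage: a small 3D convolutional encoder followed by nearest-neighbour vector quantization maps a TSDF voxel grid to a compact latent grid. The class names (`TSDFEncoder`, `VectorQuantizer`), channel widths, codebook size, and the 64³→16³ compression factor are illustrative assumptions rather than the paper's exact configuration, and the volume-rendering 2D objective is omitted.

```python
# Sketch only: module names, channel widths, and resolutions are assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization over a 3D latent grid."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                     # z: (B, C, D, H, W)
        b, c, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, c)        # (B*D*H*W, C)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest code
        zq = self.codebook(idx).view(b, d, h, w, c).permute(0, 4, 1, 2, 3)
        # Straight-through estimator so gradients flow back to the encoder.
        return z + (zq - z).detach()

class TSDFEncoder(nn.Module):
    """Compresses a TSDF voxel grid into a lower-resolution quantized latent."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, 4, stride=2, padding=1),   # e.g. 64^3 -> 32^3
            nn.SiLU(),
            nn.Conv3d(32, 64, 4, stride=2, padding=1),  # 32^3 -> 16^3
            nn.SiLU(),
            nn.Conv3d(64, latent_dim, 3, padding=1),
        )
        self.quant = VectorQuantizer(code_dim=latent_dim)

    def forward(self, tsdf):                             # tsdf: (B, 1, D, H, W)
        return self.quant(self.net(tsdf))

tsdf = torch.randn(2, 1, 64, 64, 64)                     # toy TSDF grid
latent = TSDFEncoder()(tsdf)                             # (2, 64, 16, 16, 16)
```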
The diffusion model is implemented as a 3D U-Net with an encoder, a middle block, and a decoder that is connected to the encoder through skip connections. Both the encoder and decoder contain ResNet blocks, with downsampling blocks in the encoder and upsampling convolution layers in the decoder; the encoder, decoder, and middle block each include two spatial transformers. Images are encoded with a CLIP encoder, and diffusion timesteps are encoded with a 2-layer MLP over a positional encoding. The control branch shares the diffusion model's architecture up to the middle block and is connected to it through 1x1 3D convolutional projection layers, which keep the feature maps at consistent sizes so they can be aggregated, feeding the partial shape's intermediate representations into the diffusion model in a spatially consistent manner. By combining global shape cues from images with detailed 3D information from the partial scan, the model produces realistic, high-fidelity completions.
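The sketch below illustrates, under assumed shapes and module names (`ControlledEncoder`, `res_block`), how such a control branch can mirror the U-Net encoder and inject the partial scan's features through zero-initialized 1x1x1 projection convolutions added to the matching feature maps. The cross-attention image conditioning, timestep embedding, spatial transformers, and the middle block and decoder are omitted for brevity; the zero initialization is an assumption carried over from ControlNet-style designs.

```python
# Sketch only: block structure, widths, and zero-init are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def res_block(c_in, c_out):
    # Simplified stand-in for the ResNet blocks of the 3D U-Net encoder.
    return nn.Sequential(nn.Conv3d(c_in, c_out, 3, padding=1), nn.SiLU(),
                         nn.Conv3d(c_out, c_out, 3, padding=1), nn.SiLU())

class ControlledEncoder(nn.Module):
    """U-Net encoder whose features are summed with projected control features."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.main_blocks, self.ctrl_blocks, self.projs = (
            nn.ModuleList(), nn.ModuleList(), nn.ModuleList())
        c_prev = channels[0]
        for c in channels:
            self.main_blocks.append(res_block(c_prev, c))
            self.ctrl_blocks.append(res_block(c_prev, c))   # control branch mirrors the encoder
            proj = nn.Conv3d(c, c, kernel_size=1)           # 1x1x1 projection layer
            nn.init.zeros_(proj.weight)                     # zero init: control starts as a no-op
            nn.init.zeros_(proj.bias)
            self.projs.append(proj)
            c_prev = c

    def forward(self, noisy_latent, partial_latent):
        skips, x, c = [], noisy_latent, partial_latent
        for i, (mb, cb, proj) in enumerate(zip(self.main_blocks, self.ctrl_blocks, self.projs)):
            x, c = mb(x), cb(c)
            x = x + proj(c)          # spatially consistent aggregation of partial-scan features
            skips.append(x)
            if i < len(self.main_blocks) - 1:
                x, c = F.avg_pool3d(x, 2), F.avg_pool3d(c, 2)
        return x, skips

noisy = torch.randn(1, 64, 16, 16, 16)      # noisy complete-shape latent
partial = torch.randn(1, 64, 16, 16, 16)    # latent of the partial scan
out, skips = ControlledEncoder()(noisy, partial)
```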