SemCity: Semantic Scene Generation with Triplane Diffusion
The paper introduces "SemCity," a 3D diffusion model for generating real-world outdoor scenes by learning from a real-outdoor dataset and using a triplane representation.
The paper introduces "SemCity," a 3D diffusion model designed for generating semantic scenes in real-world outdoor environments. Unlike most 3D diffusion models that focus on generating single objects or synthetic indoor scenes, SemCity specifically targets the generation of real-world outdoor scenes. The model addresses the challenge of learning real-outdoor distributions from datasets that often contain more empty spaces due to sensor limitations. To overcome this issue, the authors exploit a triplane representation as a proxy form of scene distributions to be learned by the diffusion model. Additionally, they propose a triplane manipulation that seamlessly integrates with the diffusion model, enhancing its applicability in various downstream tasks related to outdoor scene generation, such as scene inpainting, outpainting, and semantic scene completion refinements.
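To make the triplane idea concrete, here is a minimal sketch of decoding 3D points from three axis-aligned feature planes; the plane layout, channel count, and concatenation-based aggregation are illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, coords):
    """Sample per-point features from three axis-aligned feature planes.

    planes: dict with 'xy', 'xz', 'yz' tensors of shape (1, C, H, W)
            (an assumed layout; the paper's exact resolution may differ).
    coords: (N, 3) points normalized to [-1, 1].
    Returns (N, 3*C) features, aggregated here by concatenation.
    """
    feats = []
    for key, idx in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        # Project each 3D point onto the plane and bilinearly interpolate.
        uv = coords[:, idx].view(1, -1, 1, 2)           # (1, N, 1, 2)
        f = F.grid_sample(planes[key], uv, align_corners=True)
        feats.append(f.squeeze(0).squeeze(-1).t())      # (N, C)
    return torch.cat(feats, dim=-1)

# Usage with hypothetical sizes: 32 channels per plane, 1024 query points.
planes = {k: torch.randn(1, 32, 64, 64) for k in ("xy", "xz", "yz")}
pts = torch.rand(1024, 3) * 2 - 1
features = query_triplane(planes, pts)                  # (1024, 96)
```

An implicit MLP decoder would then map each feature vector to per-point semantic logits; the autoencoder SemCity uses for this is described next.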
The diffusion model in SemCity follows a forward process that gradually transforms the data distribution into a Gaussian by corrupting the original data with noise over a series of steps. This process is modeled as a Markov chain in which each step adds a small amount of Gaussian noise; the generative model is trained to reverse this chain, denoising the sample step by step. The model also incorporates a triplane autoencoder consisting of a triplane encoder and an implicit MLP decoder. The encoder comprises six 3D convolutional layers with a skip connection, while the MLP decoder consists of four 128-dimensional fully-connected layers with a skip connection. Sinusoidal positional encoding is applied to the query coordinates before they are passed to the decoder.
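For concreteness, the forward process can be sampled in closed form at any timestep, which is what makes training efficient. The sketch below assumes a DDPM-style linear variance schedule and a triplane stored as a single (B, 3*C, H, W) tensor; both are illustrative choices, not the paper's exact settings.

```python
import torch

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # DDPM-style linear schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.

    x0: clean triplane tensor, e.g. (B, 3*C, H, W) with the three planes
        stacked along the channel axis (an assumed layout).
    t:  (B,) integer timesteps in [0, T).
    """
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```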
The authors set the norm factor of the triplane diffusion loss to either 1 (an L1 loss) or 2 (an L2 loss), which controls the sample diversity of the generation results, and they adopt a DDPM-like variance schedule for the diffusion settings. The model further extends to practical applications such as scene inpainting and outpainting, where triplanes are manipulated seamlessly during the reverse diffusion process to achieve realistic and diverse results. Additionally, the authors leverage ControlNet to translate the semantic scenes produced by SemCity into RGB images, further broadening the practical applications of the diffusion model.
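As a sketch of the norm-factor choice, the training objective can be read as a noise-prediction loss with a selectable norm; treating it as a plain L1/L2 penalty on the predicted noise is an assumption about its exact form.

```python
import torch

def triplane_diffusion_loss(eps, eps_pred, p=2):
    """Noise-prediction loss with a selectable norm factor p in {1, 2}."""
    if p == 1:
        return (eps - eps_pred).abs().mean()   # L1: reported to affect diversity
    return ((eps - eps_pred) ** 2).mean()      # L2: standard DDPM objective
```

For the inpainting/outpainting manipulation, the sketch below assumes a RePaint-style blend: at each reverse step, known triplane regions are re-noised from the data while unknown regions come from the model. It reuses the hypothetical q_sample helper from the earlier sketch; the mask semantics and blending rule are assumptions, not the paper's stated formulation.

```python
def inpaint_step(x_t, t, known_x0, mask, denoise_step):
    """One reverse step with triplane blending (assumes t >= 1).

    x_t:          current noisy triplane sample at step t
    known_x0:     clean triplane for the region to preserve
    mask:         1 where the scene is known, 0 where it is generated
    denoise_step: learned reverse step, (x_t, t) -> x_{t-1}
    """
    x_prev = denoise_step(x_t, t)           # model fills the unknown region
    known_prev = q_sample(known_x0, t - 1)  # re-noise the known region to t-1
    # Keep known regions faithful to the data; the model generates the rest.
    return mask * known_prev + (1.0 - mask) * x_prev
```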