CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization
Image-conditioned diffusion model that generates multi-view-consistent images of anime characters in a canonical pose from varying input poses
The paper presents a novel image-conditioned diffusion model that generates multi-view-consistent images of anime characters in a canonical A-pose from inputs with varying poses, addressing challenges such as self-occlusion and pose ambiguity. The diffusion model is combined with a transformer-based reconstruction model in a streamlined pipeline that efficiently turns a single-view input into a detailed 3D character model. The authors also introduce Anime3D, a curated dataset of 13,746 anime characters rendered in multiple poses and views, providing a diverse training and evaluation resource for the model and for future research in 3D character generation.
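To make the two-stage pipeline concrete, the following is a minimal sketch of how the stages compose: a multi-view diffusion stage that canonicalizes the posed input into four views, followed by a transformer-based reconstruction stage that lifts those views to 3D. The class names, module internals, and output format below are illustrative stand-ins, not the authors' implementation.

```python
# Hedged sketch of the two-stage pipeline; all internals are placeholders.
import torch
import torch.nn as nn


class MultiViewDiffusion(nn.Module):
    """Stand-in for the pose-canonicalizing multi-view diffusion model."""

    def __init__(self, num_views: int = 4):
        super().__init__()
        self.num_views = num_views
        # Placeholder network; the real model is an IDUNet / multi-view UNet pair.
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    @torch.no_grad()
    def forward(self, posed_image: torch.Tensor) -> torch.Tensor:
        # posed_image: (B, 3, H, W), a single image of a character in an arbitrary pose.
        b, c, h, w = posed_image.shape
        views = posed_image.unsqueeze(1).expand(-1, self.num_views, -1, -1, -1)
        views = views.reshape(b * self.num_views, c, h, w)
        # The real model runs an iterative denoising loop conditioned on the input;
        # a single placeholder pass stands in for that here.
        views = self.net(views)
        return views.reshape(b, self.num_views, c, h, w)


class SparseViewReconstructor(nn.Module):
    """Stand-in for the transformer-based sparse-view reconstruction model."""

    def __init__(self, num_points: int = 4096):
        super().__init__()
        self.num_points = num_points
        self.head = nn.LazyLinear(num_points * 3)

    @torch.no_grad()
    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, 3, H, W) -> a coarse point set (B, num_points, 3) as a toy 3D output.
        b = views.shape[0]
        feats = views.reshape(b, -1).float()
        return self.head(feats).reshape(b, self.num_points, 3)


if __name__ == "__main__":
    diffusion_stage = MultiViewDiffusion()
    reconstruction_stage = SparseViewReconstructor()
    image = torch.rand(1, 3, 64, 64)                   # posed single-view input
    canonical_views = diffusion_stage(image)           # (1, 4, 3, 64, 64)
    character_3d = reconstruction_stage(canonical_views)
    print(canonical_views.shape, character_3d.shape)
```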
The paper then reviews diffusion-based 3D object generation, highlighting how effective diffusion models have become as guidance for 3D generation tasks. It cites pioneering works such as DreamFusion, SJC, Magic3D, Fantasia3D, ProlificDreamer, Zero123, MVDream, ImageDream, and SyncDreamer, which rely on techniques such as score distillation sampling (SDS), implicit tetrahedral fields, variational score distillation (VSD), and multi-view diffusion models to guide the 3D generation process.
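Several of these methods build on score distillation sampling in some form: a frozen 2D diffusion model scores noisy renderings of a 3D representation, and the resulting gradient is pushed back into the 3D parameters. Below is a minimal, hedged sketch of one common SDS formulation; the `denoiser` callable, the weighting choice, and all names are assumptions for illustration, not code from any of the cited papers.

```python
# Minimal SDS sketch: the returned surrogate loss has gradient w(t) * (eps_hat - eps)
# with respect to the rendering, which backpropagates into the 3D parameters.
import torch


def sds_surrogate_loss(render: torch.Tensor,
                       denoiser,                    # frozen noise predictor eps_phi(x_t, t, cond)
                       cond: torch.Tensor,
                       alphas_cumprod: torch.Tensor) -> torch.Tensor:
    b = render.shape[0]
    # Sample a diffusion timestep per batch element.
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=render.device)
    alpha_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(render)
    # Forward-diffuse the differentiable rendering to timestep t.
    x_t = alpha_bar.sqrt() * render + (1 - alpha_bar).sqrt() * eps
    with torch.no_grad():
        eps_hat = denoiser(x_t, t, cond)
    w = 1 - alpha_bar                                # a common choice of weighting w(t)
    grad = w * (eps_hat - eps)
    # d(loss)/d(render) == grad, so optimizing this loss applies the SDS gradient.
    return (grad.detach() * render).sum()
```

A typical optimization loop would render the 3D representation differentiably, compute this surrogate loss, and step an optimizer over the 3D parameters.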
The paper also introduces IDUNet, a structure inspired by ControlNet that helps generate four consistent views: it extracts local pixel-level features from the posed input image to strengthen the multi-view UNet. The multi-view UNet itself aims to generate multi-view A-pose images with a highly consistent appearance from a single posed input image. Each transformer block in the multi-view UNet consists of a spatial self-attention module and a cross-attention module, enabling denoising of the four-view noisy latents.
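As a way to picture that block, here is a hedged sketch of a transformer block that applies self-attention jointly over the tokens of all four views and then cross-attends into IDUNet's pixel-level features. The dimensions, normalization choices, and class name are illustrative assumptions rather than the paper's exact architecture.

```python
# Illustrative multi-view transformer block: joint self-attention across views,
# then cross-attention into IDUNet features. Sizes and norms are assumptions.
import torch
import torch.nn as nn


class MultiViewTransformerBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8, num_views: int = 4):
        super().__init__()
        self.num_views = num_views
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, id_feats: torch.Tensor) -> torch.Tensor:
        # x:        (B * num_views, N, dim)  noisy latent tokens of the four views
        # id_feats: (B, M, dim)              pixel-level features from the IDUNet branch
        bv, n, d = x.shape
        b = bv // self.num_views

        # Spatial self-attention over the concatenated tokens of all four views,
        # which ties the appearance of the views together.
        h = x.reshape(b, self.num_views * n, d)
        q = self.norm1(h)
        h = h + self.self_attn(q, q, q, need_weights=False)[0]
        h = h.reshape(bv, n, d)

        # Cross-attention from each view's tokens into the IDUNet features.
        ctx = id_feats.repeat_interleave(self.num_views, dim=0)     # (B * num_views, M, dim)
        h = h + self.cross_attn(self.norm2(h), ctx, ctx, need_weights=False)[0]

        # Position-wise feed-forward.
        return h + self.ff(self.norm3(h))


if __name__ == "__main__":
    block = MultiViewTransformerBlock(dim=320, heads=8, num_views=4)
    latent_tokens = torch.rand(4, 64, 320)   # one batch element, four views, 64 tokens each
    id_features = torch.rand(1, 256, 320)    # pixel-level features from the posed input
    print(block(latent_tokens, id_features).shape)   # torch.Size([4, 64, 320])
```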