GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting

Author: VRAMrod
Published: 4/23/2024, 6:48:20 AM
Category: Research

Audio-driven talking head synthesis using 3D Gaussian Splatting that addresses limitations in pose and expression control seen with Neural Radiance Fields.


https://arxiv.org/abs/2404.14037

GaussianTalker is a framework for audio-driven talking head synthesis based on 3D Gaussian Splatting. It integrates the FLAME model to bridge facial animation and rendering. The framework comprises two main modules: the Speaker-specific Motion Translator and the Dynamic Gaussian Renderer. The Speaker-specific Motion Translator converts audio signals into speaker-specific FLAME parameter sequences that control facial animation. It is trained on a multilingual, multi-individual dataset to improve adaptability to diverse audio inputs. The Dynamic Gaussian Renderer uses FLAME to drive the 3D Gaussians and render dynamic talking heads in real time.
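To make "FLAME drives the 3D Gaussians" concrete, here is a minimal sketch of one common way to bind a Gaussian to a mesh: attach it to a triangle with fixed barycentric weights, so its center follows the triangle as the mesh deforms. This is an illustrative assumption, not the paper's confirmed binding scheme. The function name `gaussian_center` and the toy one-triangle mesh are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def gaussian_center(
    vertices: Sequence[Vec3],
    triangle: Tuple[int, int, int],
    bary: Vec3,
) -> Vec3:
    """Center of a Gaussian bound to one mesh triangle.

    Assumption: the Gaussian keeps fixed barycentric weights on its
    triangle, so moving the mesh vertices moves the Gaussian with them.
    """
    a, b, c = (vertices[i] for i in triangle)
    wa, wb, wc = bary
    return tuple(wa * a[k] + wb * b[k] + wc * c[k] for k in range(3))


# Toy mesh: a single triangle in its rest pose (hypothetical data).
rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# The same triangle after an animation step moved its vertices.
deformed = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 2.0, 2.0)]

tri = (0, 1, 2)
bary = (0.5, 0.25, 0.25)  # weights sum to 1

rest_center = gaussian_center(rest, tri, bary)
moved_center = gaussian_center(deformed, tri, bary)
```

In this toy case the Gaussian moves from (0.25, 0.25, 0.0) to (0.25, 0.5, 2.0) without any per-Gaussian animation parameters: only the mesh vertices changed. A full renderer would also update each Gaussian's rotation and scale, which this sketch leaves out.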

The Speaker-specific Motion Translator module employs identity decoupling and personalized embedding to achieve synchronized and natural lip movements specific to the target speaker. It uses a universal audio feature extraction method and customized lip motion generation to accurately capture the speaker's facial nuances. The Dynamic Gaussian Renderer module introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. It also incorporates a Gaussian semantic loss to clarify the binding relationship between 3D Gaussians and FLAME for normalized motion.
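The blendshape idea can be sketched as a linear model: each blendshape is a set of per-vertex offsets, and the deformed mesh is the template plus a weighted sum of those offsets. FLAME's expression space works this way. Treating the Speaker-specific BlendShapes as one extra basis whose weight comes from a latent pose is an assumption made here for illustration. The names `apply_blendshapes`, `jaw_open`, and `speaker_detail` are hypothetical, and the numbers are toy data.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def apply_blendshapes(
    template: Sequence[Vec3],
    bases: Sequence[Sequence[Vec3]],
    weights: Sequence[float],
) -> List[Vec3]:
    """Return template + sum_i weights[i] * bases[i], vertex by vertex."""
    if len(bases) != len(weights):
        raise ValueError("need exactly one weight per blendshape basis")
    out = [list(v) for v in template]
    for basis, w in zip(bases, weights):
        if len(basis) != len(template):
            raise ValueError("each basis needs one offset per vertex")
        for vi, offset in enumerate(basis):
            for k in range(3):
                out[vi][k] += w * offset[k]
    return [tuple(v) for v in out]


# Toy two-vertex "mesh" (hypothetical data).
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

# A generic expression basis: both vertices move down.
jaw_open = [(0.0, -1.0, 0.0), (0.0, -1.0, 0.0)]
# A stand-in for a speaker-specific basis: only vertex 0 moves forward.
speaker_detail = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.0)]

deformed = apply_blendshapes(
    template,
    bases=[jaw_open, speaker_detail],
    weights=[0.5, 1.0],  # in the real system these would be predicted
)
```

Here vertex 0 ends up at (0.0, -0.5, 0.5) and vertex 1 at (1.0, -0.5, 0.0). The generic basis gives the shared motion and the speaker-specific basis adds detail on top of it.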

The FLAME model is manually modified to include vertices and triangles for teeth, so that the teeth and inner mouth region are depicted accurately. All modules are trained with the Adam optimizer, and the Speaker-specific Motion Translator is trained for 100,000 iterations. The framework renders at 130 FPS on an NVIDIA RTX 4090 GPU, well above the threshold for real-time rendering. The authors report superior performance in talking head synthesis, with precise lip synchronization and high visual quality.
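For a sense of scale, the quoted frame rate can be converted into a per-frame time budget. The 130 FPS figure comes from the text above. The 25 FPS playback rate is an assumption, chosen as a common video frame rate rather than a threshold stated by the paper.

```python
def frame_time_ms(fps: float) -> float:
    """Milliseconds available per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


reported_fps = 130.0  # rendering speed quoted in the text
video_fps = 25.0      # assumption: typical playback rate, not from the paper

render_ms = frame_time_ms(reported_fps)  # time spent rendering one frame
budget_ms = frame_time_ms(video_fps)     # time allowed per frame at playback rate
headroom = reported_fps / video_fps      # how many times faster than playback

print(f"render: {render_ms:.2f} ms, budget: {budget_ms:.2f} ms, headroom: {headroom:.1f}x")
```

Under that assumption, each frame takes about 7.7 ms to render against a 40 ms budget, roughly 5x headroom. This measures rendering only and says nothing about audio processing latency.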