ID-Animator: Zero-Shot Identity-Preserving Human Video Generation
A zero-shot human-video generation approach that produces personalized, identity-preserving videos from a single reference facial image
ID-Animator is designed for zero-shot human-video generation: it couples a pre-trained text-to-video diffusion model with a lightweight face adapter module, so identity-specific videos can be generated without per-identity fine-tuning. Building on AnimateDiff for video generation and on IP-Adapter-style image prompting, the framework remains compatible and extensible for real-world applications where identity preservation is crucial.
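To make the adapter idea concrete, below is a minimal PyTorch sketch of how a lightweight face adapter can inject identity information through a decoupled cross-attention branch, in the style of IP-Adapter. All names here (FaceAdapter, DecoupledCrossAttention, the token counts and dimensions) are illustrative assumptions for exposition, not the released ID-Animator implementation.

```python
# Sketch only: class/parameter names are hypothetical, not the released API.
import torch
import torch.nn as nn

class FaceAdapter(nn.Module):
    """Projects a face embedding into a small set of prompt tokens."""
    def __init__(self, face_dim=512, cross_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(face_dim, num_tokens * cross_dim)
        self.norm = nn.LayerNorm(cross_dim)
        self.num_tokens = num_tokens
        self.cross_dim = cross_dim

    def forward(self, face_emb):                       # (B, face_dim)
        tokens = self.proj(face_emb)                   # (B, num_tokens * cross_dim)
        tokens = tokens.view(-1, self.num_tokens, self.cross_dim)
        return self.norm(tokens)                       # (B, num_tokens, cross_dim)

class DecoupledCrossAttention(nn.Module):
    """Adds a second attention branch over face tokens, scaled and summed
    with the output of the text cross-attention."""
    def __init__(self, dim=768, num_heads=8, scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scale = scale

    def forward(self, hidden, text_ctx, face_ctx):
        out_text, _ = self.text_attn(hidden, text_ctx, text_ctx)
        out_face, _ = self.face_attn(hidden, face_ctx, face_ctx)
        return out_text + self.scale * out_face

# Toy usage: one latent sequence attending to both text and face tokens.
adapter = FaceAdapter()
attn = DecoupledCrossAttention()
face_tokens = adapter(torch.randn(1, 512))             # from a face encoder
hidden = torch.randn(1, 16, 768)                       # U-Net hidden states
text_ctx = torch.randn(1, 77, 768)                     # CLIP text embeddings
out = attn(hidden, text_ctx, face_tokens)
print(out.shape)  # torch.Size([1, 16, 768])
```

The decoupled branch lets identity conditioning be added without touching the frozen text pathway, which is what keeps the adapter lightweight.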
The dataset construction pipeline plays a pivotal role in ID-Animator's performance. Captions are decoupled into human attributes and actions and rewritten with models such as ShareGPT4V and Video-LLaVA, yielding a unified human caption tailored for training. This process produces an identity-oriented human dataset whose detailed attribute and action descriptions improve model accuracy; a rough illustration of the unification step follows below.
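As a rough illustration of how decoupled captions might be merged into one training caption, consider the sketch below. The function name and template are hypothetical; in the actual pipeline the rewriting is performed by the captioning models themselves (ShareGPT4V for keyframe attributes, Video-LLaVA for clip-level actions).

```python
# Sketch only: helper name and merge template are assumptions for exposition.
def unify_caption(attribute_caption: str, action_caption: str) -> str:
    """Merge a human-attribute description (e.g. from ShareGPT4V on a
    keyframe) with an action description (e.g. from Video-LLaVA on the
    clip) into a single human-centric training caption."""
    attribute = attribute_caption.strip().rstrip(".")
    action = action_caption.strip().rstrip(".")
    return f"{attribute}. {action}."

print(unify_caption(
    "A middle-aged man with short gray hair and glasses",
    "He is speaking to the camera in a softly lit office",
))
# A middle-aged man with short gray hair and glasses. He is speaking to
# the camera in a softly lit office.
```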
Efficiency is a cornerstone of the ID-Animator framework: training completes within a day on a single A100 GPU. Because the face adapter module is trained separately, training and generation remain fast without compromising performance. During training, only the face adapter's parameters are updated while the pre-trained text-to-video model stays frozen, keeping resource usage low.
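A minimal sketch of this parameter-efficient setup, assuming a PyTorch training loop: the backbone is frozen with requires_grad_(False) and the optimizer is built only over the adapter's parameters. The module names and toy loss below are placeholders, not the released training script.

```python
# Sketch only: stand-in modules and a toy objective, not the real pipeline.
import torch
import torch.nn as nn

# Stand-ins for the real models; in practice these would be the
# AnimateDiff text-to-video backbone and the ID-Animator face adapter.
video_model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
face_adapter = nn.Linear(512, 768)

# Freeze every backbone parameter; only the adapter is trainable.
video_model.requires_grad_(False)
optimizer = torch.optim.AdamW(face_adapter.parameters(), lr=1e-4)

for step in range(3):  # toy loop standing in for diffusion-noise training
    face_emb = torch.randn(4, 512)
    target = torch.randn(4, 768)       # e.g. the noise the model should predict
    cond = face_adapter(face_emb)      # identity condition features
    pred = video_model(cond)           # frozen backbone consumes them
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()                    # gradients reach only the adapter
    optimizer.step()

trainable = sum(p.numel() for p in face_adapter.parameters())
frozen = sum(p.numel() for p in video_model.parameters())
print(f"trainable: {trainable}, frozen: {frozen}")
```

Note that gradients still flow through the frozen backbone's computation graph; freezing only prevents its weights from being updated, which is what makes single-GPU, single-day training feasible.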
Comments
The ID-Animator checkpoints and inference scripts have been released.