LLAniMAtion: LLAMA Driven Gesture Animation
This paper explores the use of LLAMA2 features extracted from text transcripts for generating co-speech gestures in character animation, comparing them against audio-derived features.
The study utilizes the GENEA Challenge 2023 dataset, derived from the Talking With Hands dataset, containing dyadic conversations with motion capture data in BVH format. The dataset includes speech audio and text transcripts from both speakers and is divided into train, validation, and test sets. Two approaches for combining the audio and text modalities are explored: post-extraction concatenation and cross-attention, referred to as Llanimation-+ and Llanimation-×, respectively.
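As a concrete illustration of the text-feature side, below is a minimal sketch of extracting per-token LLAMA2 hidden states from a transcript using Hugging Face transformers. The checkpoint name, the choice of the final hidden layer, and any alignment of token features to motion frames are assumptions not specified in this summary.

```python
# Minimal sketch: per-token LLAMA2 features from a transcript.
# The checkpoint and layer choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

# Gated checkpoint; requires access approval on Hugging Face.
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

text = "well I think that's a really good point"

with torch.no_grad():
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)

# Per-token features from the final layer: shape (1, num_tokens, 4096).
text_features = outputs.hidden_states[-1]
print(text_features.shape)
```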
For single-modality models, the speaker matrices are concatenated with either the audio or the text features along the feature dimension to form the training input. To combine modalities, the audio, text, and speaker matrices are concatenated along the feature dimension and passed through a linear layer; alternatively, a cross-attention mechanism is used to combine the two modalities (both strategies are sketched below).
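A minimal PyTorch sketch of the two combination strategies follows. All feature dimensions, the single attention layer, and the choice of audio as the query stream are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch of (a) concatenation + linear projection and (b) cross-attention.
import torch
import torch.nn as nn

T = 100                                       # frames in a sequence
d_audio, d_text, d_spk, d_model = 768, 4096, 17, 512  # assumed widths

audio = torch.randn(1, T, d_audio)            # e.g. PASE+ features
text = torch.randn(1, T, d_text)              # e.g. LLAMA2 features per frame
speaker = torch.randn(1, T, d_spk)            # speaker identity matrix

# (a) Post-extraction concatenation followed by a linear layer (Llanimation-+).
concat_proj = nn.Linear(d_audio + d_text + d_spk, d_model)
combined_cat = concat_proj(torch.cat([audio, text, speaker], dim=-1))

# (b) Cross-attention (Llanimation-×): project both modalities to a shared
# width, then let the audio stream attend to the text stream.
audio_proj = nn.Linear(d_audio, d_model)
text_proj = nn.Linear(d_text, d_model)
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
q, kv = audio_proj(audio), text_proj(text)
combined_xattn, _ = cross_attn(query=q, key=kv, value=kv)

print(combined_cat.shape, combined_xattn.shape)  # both (1, T, d_model)
```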
Audio beats are detected from onsets of the root-mean-square (RMS) energy of the audio signal, and motion beats are identified as local minima of joint velocity. Objective measures indicate that models using LLAMA2-based features achieve lower (better) scores than those using PASE+ features, suggesting more realistic motion generation. A user study confirms the preference for LLAMA2-based models over the PASE+ approach, with no significant difference between the Llanimation methods. The study concludes that LLAMA2 features are effective for gesture generation, with feature concatenation performing slightly better than cross-attention.
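The beat definitions above can be sketched as follows, assuming librosa and SciPy. The RMS-difference onset envelope, hop size, motion frame rate, and the use of mean joint speed are illustrative assumptions.

```python
# Sketch: audio beats from RMS onsets, motion beats from velocity minima.
import numpy as np
import librosa
from scipy.signal import argrelextrema

# --- Audio beats: onsets of the RMS energy envelope. ---
y, sr = librosa.load("speech.wav", sr=None)   # hypothetical audio file
hop = 512
rms = librosa.feature.rms(y=y, hop_length=hop)[0]
# Use positive increases in RMS as the onset envelope.
onset_env = np.maximum(0.0, np.diff(rms, prepend=rms[0]))
audio_beats = librosa.onset.onset_detect(
    onset_envelope=onset_env, sr=sr, hop_length=hop, units="time"
)

# --- Motion beats: local minima of joint speed. ---
fps = 30                                      # assumed motion frame rate
positions = np.random.randn(300, 25, 3)       # placeholder (frames, joints, xyz)
speed = np.linalg.norm(np.diff(positions, axis=0), axis=-1).mean(axis=-1)
motion_beats = argrelextrema(speed, np.less)[0] / fps  # minima, in seconds

print(audio_beats, motion_beats)
```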
The study compares the Llanimation and Llanimation-+ approaches against ground truth and the state-of-the-art CSMP-Diff method. Objective metrics show CSMP-Diff performing better on some measures, while the user study indicates that LLAMA2-based models are preferred over the PASE+ approach. The study suggests that integrating LLAMA2 features into gesture-generation models can help bridge the gap between machine-generated and natural gesturing.