Dance Any Beat: Blending Beats with Visuals in Dance Video Generation
The DabFusion model introduces music as a conditioning signal for generating rhythm-synchronized dance videos from still images.
The task of generating dance from music is addressed by DabFusion, a Dance Any Beat diffusion model that uses music as a conditional input to synthesize dance videos from still images, a first use of music to drive image-to-video synthesis. The method is developed in two stages. First, an auto-encoder predicts the latent optical flow between a reference frame and each driving frame, bypassing the need for precise joint annotations. Second, a U-Net-based diffusion model generates these latent optical flows, guided by music features encoded with CLAP. This two-stage process yields high-quality dance videos, though the initial model struggled with rhythm alignment; integrating explicit beat information mitigated this and improved synchronization.
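The second stage described above follows the standard diffusion recipe: noise the clean latent flow forward to a random timestep, then train a conditional network to predict that noise. The sketch below illustrates this with NumPy; the noise schedule, tensor shapes, embedding size, and the `toy_denoiser` stand-in are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (assumed; the paper's exact schedule is not given here).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Forward diffusion: corrupt the clean latent flow z0 to timestep t."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

# Toy latent flow (batch=2, 2 flow channels, 8x8 grid) and a music embedding.
z0 = rng.standard_normal((2, 2, 8, 8))
music_emb = rng.standard_normal((2, 512))  # e.g. a CLAP-style embedding (assumed size)
t = int(rng.integers(0, T))
eps = rng.standard_normal(z0.shape)

zt = q_sample(z0, t, eps)

def toy_denoiser(zt, t, cond):
    """Stand-in for the conditional U-Net; the real model predicts eps from (zt, t, cond)."""
    return np.zeros_like(zt)

# Training objective: MSE between the true and predicted noise (the usual DDPM loss).
loss = np.mean((eps - toy_denoiser(zt, t, music_emb)) ** 2)
```

At sampling time the learned denoiser is applied iteratively from pure noise, conditioned on the music embedding and the starting image, to produce the latent flows that warp the still image into a video.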
Training DabFusion involves two components: an auto-encoder that predicts latent flow, and a diffusion model that generates latent flows conditioned on the music and a starting image. The diffusion model learns to reverse the process that gradually transforms data into Gaussian noise, reconstructing the original latent flow from noise by modeling the conditional distribution; training minimizes the gap between the generated and true data distributions by optimizing a bound on the data's negative log-likelihood. Music is encoded with CLAP, which captures dance-style information from musical cues, while Librosa is used for beat extraction so that generated dance poses align with the music's beats, improving the coherence and rhythm of the generated videos.
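The beat-extraction step can be illustrated without the Librosa dependency. The sketch below is a simplified stand-in for `librosa.beat.beat_track`: it builds a crude energy-based onset envelope and estimates the tempo from the envelope's autocorrelation. The hop size, tempo range, and synthetic click track are all assumptions for this example.

```python
import numpy as np

def estimate_tempo(y, sr, hop=512):
    """Estimate tempo (BPM) via autocorrelation of an energy-based onset
    envelope -- a simplified stand-in for Librosa's beat tracker."""
    # Frame-wise energy as a crude onset strength signal.
    n = len(y) // hop
    env = np.array([np.sum(y[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    env = np.maximum(np.diff(env, prepend=0.0), 0.0)  # keep only energy rises
    env = env - env.mean()
    # Autocorrelation; the peak lag within a plausible range gives the beat period.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    fps = sr / hop
    lo, hi = int(fps * 60 / 200), int(fps * 60 / 60)  # search 60-200 BPM
    lag = lo + int(np.argmax(ac[lo:hi]))
    return 60.0 * fps / lag

# Synthetic click track at 120 BPM as a sanity check.
sr = 22050
y = np.zeros(sr * 8)
period = int(sr * 0.5)                             # one click every 0.5 s = 120 BPM
y[::period] = 1.0
y = np.convolve(y, np.hanning(256), mode="same")   # give each click some width

tempo = estimate_tempo(y, sr)  # should land near 120 BPM
```

In DabFusion the extracted beat times are the extra conditioning signal that aligns generated dance poses with the music's rhythm; a production pipeline would use `librosa.beat.beat_track`, which additionally returns the beat frame positions.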

