FlexiFilm: Long Video Generation with Flexible Conditions
A diffusion framework tailored for long video generation, addressing temporal inconsistency and color overexposure.
In the domain of video generation, producing long and consistent videos has gained considerable attention. Existing diffusion-based video generation models, though adept at short clips, falter on longer videos because their conditioning mechanisms and sampling strategies are simplistic, borrowed largely unchanged from image generation models.
To tackle these challenges, FlexiFilm offers a framework tailored to long video generation. It comprises three components: a Variational Autoencoder (VAE), a 3D U-Net, and a temporal conditioner.
The VAE projects input videos frame by frame into video latents, while the 3D U-Net, built upon a pretrained VideoCrafter-1 model, performs latent video diffusion. The temporal conditioner plays a pivotal role in enriching the generation process: it establishes a consistent relationship between conditional frames and generated frames, leveraging multi-modal guidance from long text descriptions and reference images or frames.
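To make the conditioning pathway concrete, below is a minimal PyTorch sketch of what such a temporal conditioner could look like. The class name, dimensions, and attention-over-the-frame-axis design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalConditioner(nn.Module):
    """Hypothetical conditioner sketch: project reference-frame latents to a
    guidance width, then attend along the frame axis so every reference frame
    can inform every generated frame."""

    def __init__(self, latent_dim=4, cond_dim=320, num_heads=8):
        super().__init__()
        # 1x1x1 projection from VAE latent channels to the guidance width.
        self.proj = nn.Conv3d(latent_dim, cond_dim, kernel_size=1)
        # Self-attention over time; these are the "temporal" parameters
        # that the co-training scheme would update jointly with the U-Net.
        self.temporal_attn = nn.MultiheadAttention(cond_dim, num_heads,
                                                   batch_first=True)

    def forward(self, cond_latents):
        # cond_latents: (B, C, T, H, W) latents of the reference frames.
        b, c, t, h, w = cond_latents.shape
        x = self.proj(cond_latents)                 # (B, D, T, H, W)
        x = x.permute(0, 3, 4, 2, 1)                # (B, H, W, T, D)
        x = x.reshape(b * h * w, t, -1)             # tokens along the time axis
        x, _ = self.temporal_attn(x, x, x)
        return x.reshape(b, h, w, t, -1)            # per-position guidance

if __name__ == "__main__":
    cond = torch.randn(1, 4, 8, 32, 32)             # 8 reference frames in latent space
    feats = TemporalConditioner()(cond)
    print(feats.shape)                              # torch.Size([1, 32, 32, 8, 320])
```

These guidance features would then be injected into the 3D U-Net at each denoising step, e.g. via cross-attention, alongside the text embedding.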
A co-training method bolsters inter-frame consistency: the temporal modules in the conditioner and the U-Net are trained jointly, encouraging temporal coherence across the generated frames and contributing to their overall consistency.
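A minimal sketch of one such co-training step follows, assuming a standard noise-prediction (epsilon) loss and a naming convention where temporal layers carry "temporal" in their parameter names; the U-Net call signature is hypothetical.

```python
import torch
import torch.nn.functional as F

def temporal_params(*modules):
    # Select only the temporally-aware parameters of each module (assumed
    # naming convention); spatial layers stay frozen under this optimizer.
    for m in modules:
        for name, p in m.named_parameters():
            if "temporal" in name:
                yield p

def cotrain_step(unet3d, conditioner, optimizer, noisy_latents, timestep,
                 text_emb, ref_latents, noise_target):
    # Forward through both modules so gradients couple them, then take a
    # step that updates only the temporal parameters selected above.
    guidance = conditioner(ref_latents)
    pred = unet3d(noisy_latents, timestep,
                  context=text_emb, temporal_cond=guidance)  # hypothetical signature
    loss = F.mse_loss(pred, noise_target)  # standard eps-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(temporal_params(unet3d, conditioner), lr=1e-4)
```

Because both sets of temporal layers receive gradients from the same denoising loss, the conditioner and the U-Net learn a shared notion of inter-frame consistency.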
Furthermore, a resampling strategy enables recursive multi-round inference, addressing the color overexposure and non-zero signal-to-noise ratio (SNR) issues encountered during prolonged video generation. Successive rounds can extend the video without compromising quality, facilitating the generation of longer videos while maintaining visual fidelity.
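The sketch below illustrates one way such a resampling loop could work: each round re-noises the previous clip's tail latents back to a high timestep before denoising again, so generation always starts from a properly noised state rather than an off-distribution latent. The callables, the `t_start` value, and the overlap handling are assumptions for illustration, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def renoise(latents, alphas_cumprod, t_start):
    # Forward-diffusion re-noising q(x_t | x_0): push the previous round's
    # clean latents back to timestep t_start with fresh Gaussian noise.
    # Skipping this step leaves residual signal in the "noise" (the non-zero
    # SNR problem), which accumulates across rounds as overexposure drift.
    a = alphas_cumprod[t_start].sqrt()
    s = (1.0 - alphas_cumprod[t_start]).sqrt()
    return a * latents + s * torch.randn_like(latents)

@torch.no_grad()
def multi_round_generate(denoise_from, encode_tail, alphas_cumprod,
                         rounds=4, overlap=8, t_start=999):
    # `denoise_from(init_latents, ref_latents)` and `encode_tail(frames)` are
    # hypothetical callables standing in for the sampler and frame encoder.
    video = denoise_from(init_latents=None, ref_latents=None)  # first round
    for _ in range(rounds - 1):
        ref = encode_tail(video[:, -overlap:])        # latents of the tail frames
        init = renoise(ref, alphas_cumprod, t_start)  # resampled starting point
        nxt = denoise_from(init_latents=init, ref_latents=ref)
        video = torch.cat([video, nxt[:, overlap:]], dim=1)  # drop the overlap
    return video
```

Conditioning each round on the tail of the previous one ties consecutive clips together, while the re-noising keeps every round on the training-time noise distribution.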
The paper also introduces FF-drive1, a dataset tailored for long video generation tasks. Curated with attention to legal compliance and privacy protection, it comprises detailed video-text pairs, each video over 20 seconds long at 24 frames per second (fps), totaling approximately 112 hours of footage.