ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Enhances visual consistency in image-to-video generation by introducing spatiotemporal attention mechanisms and a first-frame-conditioned noise initialization that preserve the subject, background, and style of the input image.
ConsistI2V is a diffusion-based approach to image-to-video (I2V) generation that prioritizes visual consistency throughout the generated video. At its core, ConsistI2V conditions on the first frame through spatiotemporal attention and a noise initialization scheme, FrameInit, which incorporates the low-frequency components of the first frame into the initial inference noise. This guidance anchors the layout of subsequent frames to the input image, mitigating drift in subject, background, and style and improving overall visual fidelity.
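The low-frequency mixing idea can be sketched as a frequency-domain blend: keep the low-frequency band of the (repeated) first-frame latent and the high-frequency band of the sampled noise. This is an illustrative NumPy sketch, not the paper's implementation; the function names, the box-shaped low-pass mask, and the `ratio` cutoff are all assumptions for demonstration.

```python
import numpy as np

def low_pass_mask(h, w, ratio):
    """Centered box low-pass mask in the shifted 2D frequency domain.
    `ratio` is the fraction of each spatial frequency axis kept as 'low'."""
    mask = np.zeros((h, w))
    ch, cw = int(h * ratio / 2), int(w * ratio / 2)
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = 1.0
    return mask

def frameinit(noise, first_frame, ratio=0.25):
    """Blend low frequencies of the first-frame latent (repeated for every
    video frame) into the initial noise, keeping the noise's high frequencies.
    Both inputs have shape (..., H, W); returns the blended initialization."""
    f_noise = np.fft.fftshift(np.fft.fft2(noise), axes=(-2, -1))
    f_img = np.fft.fftshift(np.fft.fft2(first_frame), axes=(-2, -1))
    mask = low_pass_mask(noise.shape[-2], noise.shape[-1], ratio)
    blended = f_img * mask + f_noise * (1.0 - mask)
    return np.fft.ifft2(np.fft.ifftshift(blended, axes=(-2, -1))).real
```

With `ratio=0` the function returns the original noise unchanged, and larger ratios pull the initialization closer to the first frame's coarse structure, trading diversity for consistency.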
Beyond its core contributions to I2V generation, ConsistI2V supports two further applications: autoregressive long video generation and camera motion control. Reusing FrameInit to guide each subsequent clip lets extended video sequences be generated while maintaining visual coherence, and applying synthetic camera motions to the guidance frames produces controllable effects such as panning and zooming.
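A simple way to picture the synthetic camera motion idea is to build a per-frame guidance stack by geometrically transforming the first frame before the low-frequency blend. The sketch below approximates a horizontal pan with a wrapping shift; the function name, the use of `np.roll`, and the per-frame step size are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def synthetic_pan(first_frame, num_frames, dx_per_frame=2):
    """Stack shifted copies of the first frame to mimic a camera pan.
    first_frame: array of shape (..., H, W); returns (num_frames, ..., H, W).
    np.roll wraps at the border, which keeps the sketch dependency-free."""
    return np.stack([
        np.roll(first_frame, shift=t * dx_per_frame, axis=-1)
        for t in range(num_frames)
    ])
```

Feeding such a stack (instead of a static repeated first frame) into the low-frequency guidance biases each generated frame's layout toward the shifted position, which is the intuition behind pan and zoom control.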
To validate ConsistI2V, the researchers introduce I2V-Bench, an evaluation benchmark curated to assess key dimensions of I2V performance. In automatic and human evaluations on established datasets such as UCF-101 and MSR-VTT, ConsistI2V outperforms baseline models in visual quality, visual consistency, and adherence to user-provided text prompts.