Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation
Method for generating videos from images using pre-trained video latent diffusion models, focusing on improving fidelity and reducing noise biases
Researchers unveiled a groundbreaking method for transforming images into videos using pre-trained video latent diffusion models. Their approach aims to overcome persistent challenges in image-to-video generation, particularly concerning fidelity and noise prediction biases.
Central to the method is a noising-and-rectified-denoising strategy that uses the initial noise as a pivotal reference. In the noising step, this reference noise is added to the input image latent to form the starting point of sampling, so the image's detail information is carried into generation rather than lost. In the rectified denoising step, the noise predicted at each sampling step is corrected toward the pivotal reference noise, reducing noise prediction biases.
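A minimal sketch of these two operations is given below, assuming an epsilon-prediction diffusion model with the standard cumulative noise schedule; the names (make_initial_latent, rectify_noise, eps_ref) and the linear blending form are illustrative assumptions, not the paper's published code.

```python
import torch

def make_initial_latent(image_latent: torch.Tensor,
                        eps_ref: torch.Tensor,
                        alpha_bar_T) -> torch.Tensor:
    """Noising step: diffuse the (frame-repeated) image latent with the
    pivotal reference noise, following the forward process q(x_T | x_0).
    Because x_T is built from the image itself, its detail information is
    carried into the starting point of sampling."""
    return (alpha_bar_T ** 0.5) * image_latent + ((1 - alpha_bar_T) ** 0.5) * eps_ref

def rectify_noise(eps_pred: torch.Tensor,
                  eps_ref: torch.Tensor,
                  weight: float) -> torch.Tensor:
    """Rectified denoising: pull the model's predicted noise toward the
    pivotal reference noise. weight=0 keeps the raw prediction; weight=1
    replaces it entirely with the reference noise."""
    return eps_pred + weight * (eps_ref - eps_pred)

# Example shapes: a stack of frame latents (frames, channels, height, width).
image_latent = torch.randn(16, 4, 32, 32)   # stand-in for an encoded image, repeated per frame
eps_ref = torch.randn_like(image_latent)    # pivotal reference noise, sampled once
x_T = make_initial_latent(image_latent, eps_ref, alpha_bar_T=0.005)  # near-zero cumulative alpha at the last timestep
eps_pred = torch.randn_like(x_T)            # stand-in for a denoiser prediction
eps_rect = rectify_noise(eps_pred, eps_ref, weight=0.6)
```

Because the same eps_ref both constructs the initial latent and rectifies later predictions, the sampler is steered back toward a trajectory consistent with the input image.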
Building on noise rectification, the method adds a step-adaptive intervention strategy that varies the rectification strength across denoising steps, giving precise control over how strongly the reference image is retained in the resulting video. This intervention improves overall quality while keeping the generated video faithful to the input image.
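As a concrete illustration, the sketch below implements one plausible step-adaptive weighting: full rectification in the earliest, most structure-determining steps, a linear decay afterwards, and none at the end. The shape of this schedule and the cutoff fractions full_frac and decay_frac are assumptions for illustration, not the paper's reported settings; raising them retains more of the reference image.

```python
def rectification_weight(step: int,
                         num_steps: int,
                         full_frac: float = 0.3,
                         decay_frac: float = 0.4) -> float:
    """Step-adaptive intervention: rectification weight for a given
    denoising step (step 0 = most noisy). Early steps are fully rectified
    to anchor the result to the reference image; the weight then decays so
    later steps are free to synthesize motion and fine detail."""
    full_steps = int(full_frac * num_steps)
    decay_steps = int(decay_frac * num_steps)
    if step < full_steps:
        return 1.0
    if step < full_steps + decay_steps:
        return 1.0 - (step - full_steps) / decay_steps
    return 0.0
```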
The approach requires no additional training and integrates with existing pre-trained open-domain video diffusion models in a plug-and-play fashion, enabling high-fidelity video generation from images across diverse domains without retraining the underlying model.
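To make the plug-and-play claim concrete, here is a sketch of how the two modifications drop into an otherwise standard deterministic DDIM sampling loop, reusing the helpers above. denoise_fn stands in for the frozen denoiser of whatever pre-trained video model is used, and the loop itself is generic sampler code under these assumptions, not an excerpt from the authors' implementation.

```python
import torch

@torch.no_grad()
def sample_rectified(denoise_fn, image_latent, alphas_bar, num_steps):
    """Plug-and-play sampling: a standard deterministic DDIM loop whose only
    departures from vanilla sampling are (1) the noised initial latent and
    (2) per-step noise rectification. denoise_fn(x, t) -> predicted noise
    can be any frozen pre-trained video diffusion model."""
    eps_ref = torch.randn_like(image_latent)          # pivotal reference noise
    x = make_initial_latent(image_latent, eps_ref, alphas_bar[-1])
    timesteps = torch.linspace(len(alphas_bar) - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        w = rectification_weight(i, num_steps)
        eps = rectify_noise(denoise_fn(x, t), eps_ref, w)
        a_t = alphas_bar[t]
        a_prev = alphas_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # DDIM update
    return x

# Example with a dummy denoiser standing in for a real pre-trained model.
alphas_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
dummy_denoise = lambda x, t: torch.randn_like(x)
video_latent = sample_rectified(dummy_denoise, torch.randn(16, 4, 32, 32),
                                alphas_bar, num_steps=50)
```

Since the denoiser is only called, never updated, any open-domain backbone can be swapped in without fine-tuning, which is what makes the method tuning-free.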
Compared with conventional image-to-video methods, the proposed approach refines the denoising direction so that intermediate results stay close to the input image. By rectifying the predicted noise with the reference noise, it mitigates noise biases and preserves fine-grained content details in the generated videos.