StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
A framework for text-based story generation that enhances content consistency in generated images and videos while preserving rich content variation.
StoryDiffusion is a method for visual storytelling built on subject-consistent image generation and transition videos. At its core is Consistent Self-Attention, which enhances consistency across generated images by incorporating sampled tokens from the other (reference) images in a batch into the token similarity matrix calculation and token merging. This keeps the depicted subjects consistent across images, which is crucial for effective storytelling.
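The core idea can be illustrated with a short sketch. The module below is a simplified, hypothetical `ConsistentSelfAttention` layer (the class name, `sample_ratio`, and sampling scheme are illustrative, not the authors' implementation): queries come from each image alone, while keys and values are augmented with tokens randomly sampled from every image in the batch.

```python
# A minimal sketch of the Consistent Self-Attention idea, assuming a batch of
# per-image token sequences of shape (batch, tokens, dim). Names such as
# ConsistentSelfAttention and sample_ratio are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch import nn


class ConsistentSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, sample_ratio: float = 0.5):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        self.num_heads = num_heads
        self.sample_ratio = sample_ratio

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, n, d = hidden_states.shape
        h = self.num_heads

        # Randomly sample tokens from every image in the batch and share them
        # with each image, so keys/values contain cross-image reference tokens.
        num_sampled = int(n * self.sample_ratio)
        idx = torch.randint(0, n, (num_sampled,), device=hidden_states.device)
        shared = hidden_states[:, idx, :].reshape(1, b * num_sampled, d)
        shared = shared.expand(b, -1, -1)

        # Queries come from each image alone; keys and values also attend to
        # the shared reference tokens, which ties the images together.
        kv_input = torch.cat([hidden_states, shared], dim=1)
        q = self.to_q(hidden_states).reshape(b, n, h, d // h).transpose(1, 2)
        k = self.to_k(kv_input).reshape(b, -1, h, d // h).transpose(1, 2)
        v = self.to_v(kv_input).reshape(b, -1, h, d // h).transpose(1, 2)

        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

In the paper, this calculation replaces the diffusion U-Net's original self-attention, which is why the technique is training-free and pluggable into existing text-to-image models.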
Moreover, StoryDiffusion splits a story text into prompts, one per image, so that the generated images remain highly consistent while narrating the story. To support long stories, the method applies Consistent Self-Attention within a sliding window along the temporal dimension, so that peak memory consumption does not grow with the length of the input text.
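A rough sketch of how such a sliding window might be orchestrated is shown below. Here `generate_batch` is a hypothetical helper that runs Consistent Self-Attention over one window of prompts, and the window and overlap sizes are illustrative assumptions.

```python
# A minimal sketch of a temporal sliding window over story prompts. Only
# window_size prompts are processed together, so peak memory does not grow
# with the total number of prompts; overlapping prompts keep windows linked.
from typing import Callable, List


def generate_long_story(
    prompts: List[str],
    generate_batch: Callable[[List[str]], list],  # hypothetical helper
    window_size: int = 4,
    overlap: int = 1,
) -> list:
    images: list = []
    step = window_size - overlap
    for start in range(0, len(prompts), step):
        window = prompts[start : start + window_size]
        window_images = generate_batch(window)
        # The first `overlap` prompts of a window repeat the previous window,
        # so keep only the images that are new.
        new_images = window_images if start == 0 else window_images[overlap:]
        images.extend(new_images)
        if start + window_size >= len(prompts):
            break
    return images
```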
StoryDiffusion also prioritizes controllability. It integrates ControlNet and T2I-Adapter to introduce control conditions such as depth maps, pose images, or sketches that direct image generation, and it combines Consistent Self-Attention with PhotoMaker to generate consistent images of a specified identity, demonstrating its scalability and plug-and-play capability.
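The plug-and-play aspect can be sketched with the Hugging Face diffusers API: since Consistent Self-Attention only touches the U-Net's self-attention layers, it could in principle be installed as a custom attention processor alongside an off-the-shelf ControlNet pipeline. `ConsistentAttnProcessor` below is a hypothetical placeholder (it does not implement the token sharing), and the checkpoint names are public examples, not the authors' setup.

```python
# A hedged sketch of the plug-and-play idea with diffusers, assuming a custom
# attention processor (placeholder here) implements Consistent Self-Attention.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.models.attention_processor import AttnProcessor2_0


class ConsistentAttnProcessor(AttnProcessor2_0):
    """Hypothetical placeholder: would share sampled tokens across the batch
    (see the Consistent Self-Attention sketch above)."""


controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Swap in the consistent processor only for self-attention layers ("attn1"),
# leaving cross-attention and the ControlNet branch untouched.
processors = {
    name: ConsistentAttnProcessor() if "attn1" in name else proc
    for name, proc in pipe.unet.attn_processors.items()
}
pipe.unet.set_attn_processor(processors)
```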
Furthermore, the method introduces a Semantic Motion Predictor that predicts the transition between two images in a semantic space, yielding stable transition videos with smooth motion and physical plausibility. In the reported comparisons, these transition videos outperform state-of-the-art methods such as SEINE and SparseCtrl, highlighting StoryDiffusion's effectiveness at generating seamless and consistent videos.
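The following is a minimal sketch of the Semantic Motion Predictor idea under stated assumptions: the start and end frames are taken to be already encoded into semantic embeddings (e.g. by an image encoder), the class name and layer sizes are illustrative, and in the actual method the predicted embeddings condition a video diffusion decoder rather than being rendered directly.

```python
# A hypothetical sketch: predict intermediate frame embeddings between two
# endpoint embeddings in a semantic space, then refine them with a transformer.
import torch
from torch import nn


class SemanticMotionPredictor(nn.Module):
    def __init__(self, dim: int = 768, num_frames: int = 16, num_layers: int = 4):
        super().__init__()
        self.num_frames = num_frames
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, start_emb: torch.Tensor, end_emb: torch.Tensor) -> torch.Tensor:
        """start_emb, end_emb: (batch, dim) embeddings of the two end frames.
        Returns (batch, num_frames, dim) embeddings for the transition frames."""
        # Initialize each frame with a linear interpolation between the two
        # endpoints, then let the transformer refine the motion trajectory.
        t = torch.linspace(0, 1, self.num_frames, device=start_emb.device)
        interp = (1 - t)[None, :, None] * start_emb[:, None, :] \
            + t[None, :, None] * end_emb[:, None, :]
        tokens = interp + self.frame_queries[None, :, :]
        return self.transformer(tokens)


# Usage: the predicted embeddings would then condition a video diffusion
# decoder to render the actual transition frames.
predictor = SemanticMotionPredictor()
start, end = torch.randn(2, 768), torch.randn(2, 768)
frames = predictor(start, end)  # shape (2, 16, 768)
```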