Learning Long-form Video Prior via Generative Pre-Training
Uses generative pre-training to learn the implicit prior of long-form videos by representing them as a set of tokens extracted from visual locations like bounding boxes and keypoints
This method tackles the intricate nature of long-form videos by harnessing generative pre-training and a meticulously curated dataset named Storyboard20K. The approach aims to learn the implicit prior of lengthy video sequences, thereby unlocking new avenues for understanding complex visual narratives.
The method departs from pixel-based processing, opting instead to operate in a token space: key video elements such as bounding boxes and keypoints are discretized and tokenized for consumption by a generative pre-training model. The Storyboard20K dataset serves as the cornerstone of this endeavor, comprising textual synopses, shot-by-shot keyframes, and detailed annotations of characters and film sets, all structured to ensure coherence and richness of information.
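To make the idea concrete, here is a minimal sketch of how bounding boxes and keypoints could be discretized into a shared token vocabulary. The bin count, special tokens, and vocabulary layout are illustrative assumptions, not the paper's actual scheme.

```python
# Illustrative sketch (not the authors' code): discretizing bounding boxes and
# keypoints into tokens. Bin count and vocabulary layout are assumptions.
import numpy as np

NUM_BINS = 256            # assumed number of quantization bins per coordinate
BOX_TOKEN = NUM_BINS      # special token marking the start of a bounding box
KPT_TOKEN = NUM_BINS + 1  # special token marking the start of a keypoint list

def quantize(coords, width, height):
    """Map continuous (x, y) coordinates to integer bins in [0, NUM_BINS)."""
    xy = np.asarray(coords, dtype=np.float32)
    xy = xy / np.array([width, height], dtype=np.float32)  # normalize to [0, 1]
    return np.clip((xy * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def tokenize_box(box, width, height):
    """Turn one bounding box (x1, y1, x2, y2) into a short token sequence."""
    (x1, y1), (x2, y2) = quantize([(box[0], box[1]), (box[2], box[3])], width, height)
    return [BOX_TOKEN, int(x1), int(y1), int(x2), int(y2)]

def tokenize_keypoints(points, width, height):
    """Turn a list of (x, y) keypoints into a token sequence."""
    tokens = [KPT_TOKEN]
    for x, y in quantize(points, width, height):
        tokens += [int(x), int(y)]
    return tokens

# Example: one character box and two body keypoints on a 1920x1080 keyframe.
seq = tokenize_box((120, 340, 480, 900), 1920, 1080) \
    + tokenize_keypoints([(300, 420), (330, 600)], 1920, 1080)
print(seq)
```

In this kind of scheme, each keyframe becomes a flat sequence of discrete symbols that a standard sequence model can consume, which is what allows the video prior to be learned without touching raw pixels.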
Through a neural network trained on Storyboard20K, the method approximates a joint probabilistic prior distribution that captures the structure of long-form videos in token space. Training by likelihood maximization lets the model learn from relatively scarce data and sample diverse outputs, supporting applications such as creative video synthesis.
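The following is a minimal sketch of what such a likelihood-maximizing prior could look like: an autoregressive Transformer over the discretized storyboard tokens, trained with a next-token cross-entropy (negative log-likelihood) loss. The model size, vocabulary size, and sequence length below are placeholder assumptions, not the paper's configuration.

```python
# Sketch of a generative pre-training objective over storyboard tokens,
# assuming an autoregressive Transformer prior. Hyperparameters are placeholders.
import torch
import torch.nn as nn

VOCAB_SIZE = 512   # assumed: coordinate bins + special/text tokens
MAX_LEN = 1024     # assumed maximum token-sequence length

class TokenPrior(nn.Module):
    """Autoregressive model approximating the joint prior p(t_1, ..., t_N)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_emb = nn.Embedding(MAX_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        b, n = tokens.shape
        pos = torch.arange(n, device=tokens.device).unsqueeze(0)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(n).to(tokens.device)
        x = self.encoder(x, mask=mask)  # causal mask makes the model autoregressive
        return self.head(x)

def nll_loss(model, tokens):
    """Negative log-likelihood of each token given the tokens before it."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))

# One training step on a toy batch of token sequences.
model = TokenPrior()
batch = torch.randint(0, VOCAB_SIZE, (2, 128))
loss = nll_loss(model, batch)
loss.backward()
```

Once trained, the same model can be sampled token by token, which is how maximizing likelihood over the dataset translates into generating diverse storyboard-level outputs.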