UniVS: Unified and Universal Video Segmentation with Prompts as Queries
A comprehensive analysis of video segmentation tasks, introducing UniVS as a novel approach that accommodates all video segmentation tasks within a single model
The paper presents a comprehensive study of video segmentation (VS) tasks, dividing them into category-specified and prompt-specified tasks. Category-specified tasks segment and track entities from predefined categories; they include video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Prompt-specified tasks, by contrast, identify and segment specific targets throughout the video based on visual prompts or textual descriptions; they include video object segmentation (VOS), panoptic VOS (PVOS), and referring VOS (RefVOS). The paper also highlights the close relationship between video segmentation and image segmentation, noting the significant improvements in model performance and the variety of network architectures developed for these tasks.
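The two task groups described above can be written down as a small lookup structure. This is only an illustrative sketch of the taxonomy as summarized here: the names `TaskGroup`, `VSTask`, `TASKS`, and `tasks_in` are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass
from enum import Enum


class TaskGroup(Enum):
    """The two task groups named in the summary (labels are illustrative)."""
    CATEGORY_SPECIFIED = "category-specified"
    PROMPT_SPECIFIED = "prompt-specified"


@dataclass(frozen=True)
class VSTask:
    """One video segmentation task and the group it belongs to."""
    abbrev: str
    name: str
    group: TaskGroup


# Task list taken from the summary text; the structure itself is hypothetical.
TASKS = [
    VSTask("VIS", "video instance segmentation", TaskGroup.CATEGORY_SPECIFIED),
    VSTask("VSS", "video semantic segmentation", TaskGroup.CATEGORY_SPECIFIED),
    VSTask("VPS", "video panoptic segmentation", TaskGroup.CATEGORY_SPECIFIED),
    VSTask("VOS", "video object segmentation", TaskGroup.PROMPT_SPECIFIED),
    VSTask("PVOS", "panoptic video object segmentation", TaskGroup.PROMPT_SPECIFIED),
    VSTask("RefVOS", "referring video object segmentation", TaskGroup.PROMPT_SPECIFIED),
]


def tasks_in(group: TaskGroup) -> list[str]:
    """Return the abbreviations of all tasks in the given group, in listed order."""
    return [task.abbrev for task in TASKS if task.group is group]


if __name__ == "__main__":
    for group in TaskGroup:
        print(f"{group.value}: {', '.join(tasks_in(group))}")
```

A single model that covers both groups, as UniVS aims to, would need to accept every entry in `TASKS`, whereas the existing unified models discussed in the summary would cover only the category-specified half.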
The study compares unified video segmentation models quantitatively across the different video segmentation tasks. It discusses the limitations of existing unified models: they are often trained individually on specific datasets, which limits their generalization to other datasets, and they cannot handle prompt-specified VS tasks. The paper introduces UniVS as a novel approach that aims to accommodate all video segmentation tasks within a single model, and reports that it shows the highest generalization capability in universal segmentation. Specifically, UniVS performs well on tasks such as VOS, RefVOS, VIS, and VPS, which indicates its versatility across diverse video segmentation challenges.
The paper also covers the technical aspects of UniVS, including the implementation of separated self-attention types and their impact on visual prompt-guided video segmentation tasks. An ablation study on the prompt-specified VS tasks shows that UniVS can process multiple prompt-guided targets simultaneously while detecting newly appeared objects. The experimental results support the effectiveness of UniVS on prompt-specified VS tasks and its potential for efficient and accurate video segmentation.