Probing the 3D Awareness of Visual Foundation Models
Analyzes the 3D awareness of visual foundation models through a series of experiments that evaluate whether their representations encode the 3D structure of scenes and represent surfaces consistently across views.
The paper explores the 3D awareness of visual foundation models through a series of probing tasks. The study considers 26 publicly available checkpoints spanning a range of learning objectives and forms of supervision, chosen to have comparable architectures and training scale. Frozen representations are evaluated along two axes: single-image surface reconstruction (e.g., monocular depth and surface normal estimation via trainable probes) and multiview consistency, measured through keypoint matching and semantic correspondence.
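To make the probing setup concrete, below is a minimal sketch of a dense readout trained on frozen backbone features, in the spirit of the paper's single-image probes. The `DenseLinearProbe` name, the feature shapes, and the synthetic tensors are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseLinearProbe(nn.Module):
    """Per-patch linear readout on frozen features, upsampled to image size."""
    def __init__(self, feat_dim: int, out_channels: int, image_size: int):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, out_channels, kernel_size=1)
        self.image_size = image_size

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, Hp, Wp) patch features from a frozen backbone
        pred = self.head(feats)
        return F.interpolate(pred, size=(self.image_size, self.image_size),
                             mode="bilinear", align_corners=False)

# Synthetic stand-in for frozen ViT patch features: batch of 2, 768-dim, 16x16 grid.
feats = torch.randn(2, 768, 16, 16)
depth_gt = torch.rand(2, 1, 224, 224)  # dummy ground-truth depth maps

probe = DenseLinearProbe(feat_dim=768, out_channels=1, image_size=224)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

# Only the probe is optimized; the backbone itself would stay frozen (no grad).
optimizer.zero_grad()
loss = F.l1_loss(probe(feats), depth_gt)
loss.backward()
optimizer.step()
print(f"probe L1 loss: {loss.item():.4f}")
```

Because only the lightweight head is trained, probe performance reflects what the frozen features already encode rather than what fine-tuning could extract.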
The correspondence tasks assess the models' 3D understanding from two angles. Semantic keypoint matching requires finding corresponding semantic parts across images of different object instances, while multiview consistency measures whether features for the same physical surface point match across viewpoints of the same scene. Evaluation metrics include the percentage of keypoints matched within a pixel threshold (PCK) and matching accuracy under increasing viewpoint variation. The goal is to understand how well the models capture 3D properties and whether their representations exhibit genuine 3D awareness.
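A hedged sketch of how such correspondence metrics can be computed from dense features: keypoints from a source image are matched to nearest neighbors in a target feature map by cosine similarity, then scored with PCK. The function names, random feature maps, and fixed pixel threshold are illustrative (in practice PCK thresholds are often defined relative to object or image size).

```python
import torch
import torch.nn.functional as F

def match_keypoints(src_feats, tgt_feats, src_kps):
    """Nearest-neighbor keypoint transfer via dense features.

    src_feats, tgt_feats: (C, H, W) feature maps, assumed already
    upsampled to pixel resolution; src_kps: (N, 2) integer (x, y) pixels.
    """
    C, H, W = tgt_feats.shape
    x, y = src_kps[:, 0].long(), src_kps[:, 1].long()
    queries = src_feats[:, y, x].T               # (N, C) features at keypoints
    targets = tgt_feats.reshape(C, -1).T         # (H*W, C) all target locations
    sim = F.normalize(queries, dim=1) @ F.normalize(targets, dim=1).T
    idx = sim.argmax(dim=1)                      # best match per keypoint
    return torch.stack([idx % W, idx // W], dim=1).float()  # back to (x, y)

def pck(pred_kps, gt_kps, threshold_px: float) -> float:
    """Percentage of predicted keypoints within a pixel threshold of GT."""
    dist = (pred_kps - gt_kps).norm(dim=1)
    return (dist <= threshold_px).float().mean().item()

# Dummy example: two 64x64 feature maps and 5 annotated keypoint pairs.
src_feats, tgt_feats = torch.randn(128, 64, 64), torch.randn(128, 64, 64)
src_kps = torch.randint(0, 64, (5, 2)).float()
gt_tgt_kps = torch.randint(0, 64, (5, 2)).float()

pred = match_keypoints(src_feats, tgt_feats, src_kps)
print(f"PCK@5px: {pck(pred, gt_tgt_kps, threshold_px=5.0):.2f}")
```

The same matching routine serves both settings; only the image pairs differ (different instances for semantic matching, different viewpoints of one scene for multiview consistency).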
Performance correlation across tasks is analyzed to determine how the probing tasks and task domains relate to one another. The Pearson correlation coefficient is computed between the per-model scores on each pair of tasks, assessing whether the same models excel across them. Performance rankings are also aggregated across tasks to estimate the relative 3D awareness of the models. By comparing performance across tasks and domains, the paper uncovers patterns in how well visual foundation models perceive and infer 3D properties.
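As an illustration of this analysis, the sketch below computes a Pearson correlation matrix over per-model scores and a mean-rank aggregate. The model scores here are fabricated placeholders, not numbers from the paper.

```python
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation between two score vectors (one entry per model)."""
    x, y = x - x.mean(), y - y.mean()
    return float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Illustrative scores for 6 hypothetical models on three probing tasks.
scores = {
    "depth":          np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59]),
    "normals":        np.array([0.58, 0.50, 0.69, 0.45, 0.61, 0.56]),
    "mv_consistency": np.array([0.31, 0.52, 0.28, 0.49, 0.35, 0.44]),
}

tasks = list(scores)
corr = np.array([[pearson_r(scores[a], scores[b]) for b in tasks] for a in tasks])
print(tasks)
print(np.round(corr, 2))

# Aggregate relative 3D awareness by averaging per-task ranks (0 = best).
ranks = np.mean([(-scores[t]).argsort().argsort() for t in tasks], axis=0)
print("mean rank per model:", ranks)
```

A high correlation between two tasks suggests they tap a shared capability, while a low one (as contrived between the single-image and multiview scores above) suggests the tasks measure distinct aspects of 3D awareness.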