Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
Extends 2D vision models to make 3D-consistent predictions without task-specific training
The paper presents a method for lifting 2D vision operators to 3D representations across a range of computer-vision tasks. For interactive segmentation, the user draws positive strokes over a region of interest in a single view; feature matching then compares the stroke-marked features against the lifted semantics in the remaining views. Using a 2D DINO backbone together with this feature-matching strategy, the method segments the desired region directly in the 3D volume.
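To make the matching step concrete, here is a minimal sketch, assuming per-view DINO feature maps are already available. The tensor shapes, the cosine-similarity criterion, and the threshold value are my assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def stroke_match(ref_feats, stroke_mask, tgt_feats, thresh=0.6):
    """Segment a target view by matching its features against the mean
    feature under the user's positive strokes in the reference view.

    ref_feats:   (C, H, W) DINO features of the annotated view
    stroke_mask: (H, W) bool mask of positive stroke pixels
    tgt_feats:   (C, H, W) DINO features of another view (or a slice
                 of the lifted 3D feature volume)
    """
    C = ref_feats.shape[0]
    # Average the features under the positive strokes into one query vector.
    query = F.normalize(ref_feats[:, stroke_mask].mean(dim=1), dim=0)   # (C,)
    tgt = F.normalize(tgt_feats.reshape(C, -1), dim=0)                  # (C, H*W)
    # Cosine similarity of every target location to the query.
    sim = (query @ tgt).reshape(tgt_feats.shape[1:])                    # (H, W)
    return sim > thresh                                                 # boolean segmentation

# Toy usage with random tensors standing in for real DINO features.
ref = torch.randn(384, 64, 64)
mask = torch.zeros(64, 64, dtype=torch.bool)
mask[30:34, 30:34] = True
tgt = torch.randn(384, 64, 64)
print(stroke_match(ref, mask, tgt).sum().item(), "pixels matched")
```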
On semantic segmentation, the method is compared against recent scene-specific approaches and matches or outperforms them on individual scenes. By lifting 2D features directly into 3D volumes rather than relying on 2D-to-3D feature distillation, the method retains the full capabilities of the original 2D backbone. Qualitatively, it produces superior segmentations, with noticeably cleaner boundaries and finer detail in the segmented regions.
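The following is a rough sketch of what direct lifting can look like: per-view 2D feature maps are averaged into a voxel grid according to where each voxel projects. The pinhole camera model and plain visibility-weighted averaging are simplifying assumptions on my part; the paper's actual lifting operates on its own 3D representation.

```python
import torch
import torch.nn.functional as F

def lift_features(feat_maps, intrinsics, extrinsics, grid_xyz):
    """feat_maps:  (V, C, H, W) per-view 2D features
    intrinsics: (V, 3, 3) camera intrinsic matrices
    extrinsics: (V, 3, 4) world-to-camera [R|t]
    grid_xyz:   (N, 3) voxel-center coordinates in world space
    returns:    (N, C) lifted features, averaged over visible views
    """
    V, C, H, W = feat_maps.shape
    N = grid_xyz.shape[0]
    accum = torch.zeros(N, C)
    count = torch.zeros(N, 1)
    homog = torch.cat([grid_xyz, torch.ones(N, 1)], dim=1)        # (N, 4)
    for v in range(V):
        cam = (extrinsics[v] @ homog.T).T                         # (N, 3) camera coords
        z = cam[:, 2:3].clamp(min=1e-6)
        pix = (intrinsics[v] @ cam.T).T / z                       # perspective divide
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        u = pix[:, 0] / (W - 1) * 2 - 1
        w = pix[:, 1] / (H - 1) * 2 - 1
        grid = torch.stack([u, w], dim=-1).view(1, N, 1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)
        sampled = sampled.view(C, N).T                            # (N, C)
        # Only accumulate where the voxel lands in front of and inside the image.
        visible = ((cam[:, 2] > 0) & (u.abs() <= 1) & (w.abs() <= 1)).float().unsqueeze(1)
        accum += sampled * visible
        count += visible
    return accum / count.clamp(min=1)

# Toy usage: two identity-pose cameras and random "features".
V, C, H, W = 2, 16, 32, 32
K = torch.tensor([[32., 0., 16.], [0., 32., 16.], [0., 0., 1.]]).expand(V, 3, 3)
E = torch.cat([torch.eye(3), torch.zeros(3, 1)], dim=1).expand(V, 3, 4)
pts = torch.rand(100, 3) + torch.tensor([0., 0., 2.])            # points in front of camera
print(lift_features(torch.randn(V, C, H, W), K, E, pts).shape)   # torch.Size([100, 16])
```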
For style transfer, the method edits 3D implicit representations from text instructions: lifting a 2D diffusion model yields a view-consistent feature volume that supports text-based 3D editing. Compared with existing methods that fine-tune scene parameters against 2D supervision, it consistently achieves higher editing quality while preserving the original scene geometry across a variety of edits.
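One way to picture this pipeline, in a deliberately simplified sketch: each view is edited independently by a 2D text-conditioned model, the per-view edits are lifted into one shared volume, and re-rendering from that volume enforces cross-view consistency. Here `edit_view_2d`, `lift`, and `render` are hypothetical placeholders for the 2D diffusion editor, the lifting step, and the feature renderer, not the paper's API.

```python
import torch

def edit_view_2d(image, instruction):
    # Placeholder for a 2D instruction-guided diffusion editor
    # (an InstructPix2Pix-style model, say); identity here.
    return image

def edit_scene(views, instruction, lift, render):
    # 1. Run the 2D editor on every view; results may disagree across views.
    edited = torch.stack([edit_view_2d(v, instruction) for v in views])
    # 2. Lift the per-view edits into one shared 3D volume; aggregation in
    #    the lift is what reconciles inconsistent 2D edits.
    volume = lift(edited)
    # 3. Re-render every view from the shared volume: now multi-view consistent.
    return torch.stack([render(volume, i) for i in range(len(views))])

# Toy usage: view-averaging as a stand-in "volume" and an identity renderer.
views = torch.rand(4, 3, 64, 64)
out = edit_scene(views, "make it look like autumn",
                 lift=lambda e: e.mean(dim=0),
                 render=lambda vol, i: vol)
print(out.shape)  # torch.Size([4, 3, 64, 64])
```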
In scene editing, the method outperforms state-of-the-art techniques in both editing quality and multi-view consistency. Because predictions are decoded from the 3D lifted features, they are consistent across views and preserve scene geometry in greater detail than the corresponding 2D operators. The approach extends to any 2D vision operator without additional tuning, demonstrating its versatility across computer-vision tasks.
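A minimal sketch of that decode step follows: features are rendered from the lifted volume into a view and then passed through the unmodified 2D task head, which is what lets any 2D operator be swapped in without retraining. The renderer and the 1x1-conv head below are toy stand-ins of my own, not the paper's components.

```python
import torch
import torch.nn as nn

def predict_view(feature_volume, camera, render_features, head_2d):
    feats = render_features(feature_volume, camera)   # (C, H, W) view features
    return head_2d(feats.unsqueeze(0))                # reuse the frozen 2D head

# Toy usage: a random "volume", a renderer that just averages depth planes,
# and a 1x1-conv segmentation head standing in for the original 2D decoder.
vol = torch.randn(8, 384, 32, 32)                     # e.g. 8 depth planes
renderer = lambda v, cam: v.mean(dim=0)               # toy feature "rendering"
head = nn.Conv2d(384, 21, kernel_size=1)              # e.g. 21 semantic classes
logits = predict_view(vol, camera=None, render_features=renderer, head_2d=head)
print(logits.shape)                                   # torch.Size([1, 21, 32, 32])
```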