PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
End-to-end tile-based framework consisting of Coarse, Fine, and Guided Fusion Networks, incorporating a Global-to-Local module and employing Consistency-Aware Training
Contemporary depth-estimation architectures are constrained by the fixed input resolution of their backbones, which blurs fine detail in their predictions. In response, PatchFusion introduces a fresh approach to metric single-image depth estimation, specifically tailored for high-resolution inputs.
The framework comprises three key components: a Coarse Network for global scale-aware estimation that sacrifices high-frequency detail, a Fine Network for patch-wise depth prediction that recovers intricate detail but may be scale-inconsistent, and a Guided Fusion Network featuring a Global-to-Local (G2L) module that blends the strengths of both. Additionally, the authors advocate consistency-aware training and inference strategies to keep patch-wise predictions coherent.
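The three-stage pipeline above can be sketched as follows. This is a minimal NumPy illustration of the tiling-and-fusion control flow only: `coarse_net`, `fine_net`, and `fusion_net` are hypothetical placeholders standing in for the paper's trained networks, not the authors' code.

```python
import numpy as np

def coarse_net(image):
    # Placeholder for the Coarse Network: a global, scale-aware
    # depth estimate that lacks high-frequency detail.
    return image.mean(axis=-1)

def fine_net(patch):
    # Placeholder for the Fine Network: detailed but potentially
    # scale-inconsistent depth on a single patch.
    return patch.mean(axis=-1) * 1.1

def fusion_net(coarse_patch, fine_patch):
    # Placeholder for the Guided Fusion Network: blends global
    # scale (coarse) with local detail (fine).
    return 0.5 * (coarse_patch + fine_patch)

def patchfusion_sketch(image, patch=64):
    """Tile the image and fuse coarse and fine predictions per tile."""
    h, w, _ = image.shape
    coarse = coarse_net(image)            # one global pass
    depth = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = image[y:y + patch, x:x + patch]
            fine = fine_net(tile)         # one pass per tile
            depth[y:y + patch, x:x + patch] = fusion_net(
                coarse[y:y + patch, x:x + patch], fine)
    return depth
```

The key structural point is that the coarse pass runs once on the whole image while the fine pass runs per tile, with fusion reconciling the two at each tile.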
Fusing the coarse and fine depth maps inside the network, rather than in a post-optimization step, is what lets PatchFusion outperform traditional methods. The Global-to-Local module applies global self-attention to aggregate the information needed for patch-wise scale-consistent predictions, while the Guided Fusion Network uses a U-Net design enhanced with Swin Transformer layers to preserve global context while remaining GPU-memory efficient.
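To make the G2L idea concrete, here is a minimal NumPy sketch of attention that injects global context into patch features. The summary describes "global-wise self-attention"; this sketch phrases it as patch tokens attending over global tokens, and all names, dimensions, and weight initializations are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def g2l_attention(patch_tokens, global_tokens, dim=16, seed=0):
    """Patch tokens attend over global tokens (illustrative sketch).

    patch_tokens:  (P, d) features from the current patch
    global_tokens: (G, d) features from the whole image
    Returns (P, dim) context vectors carrying global scale cues.
    """
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices.
    Wq = rng.standard_normal((patch_tokens.shape[-1], dim))
    Wk = rng.standard_normal((global_tokens.shape[-1], dim))
    Wv = rng.standard_normal((global_tokens.shape[-1], dim))
    q = patch_tokens @ Wq
    k = global_tokens @ Wk
    v = global_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(dim))   # (P, G) attention weights
    return attn @ v                          # global context per patch token
```

The point of the mechanism is that each patch can "see" image-wide features, which is what makes scale-consistent per-patch predictions possible.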
Addressing boundary inconsistencies, the paper introduces Consistency-Aware Training (CAT) and Inference (CAI), which enforce consistent feature representations and depth predictions in the regions where patches overlap. By dynamically updating depth estimates during inference with a running mean, the model performs a local ensemble refinement that mitigates seams and improves prediction accuracy.
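The running-mean refinement over overlapping patches can be sketched as below. This is a simplified, assumed formulation (uniform overlapping tiling, per-pixel incremental mean); `predict` stands in for the full coarse-fine-fusion pipeline, and the details differ from the paper's actual inference procedure.

```python
import numpy as np

def consistency_aware_inference(image, predict, patch=64, stride=32):
    """Average overlapping patch predictions via a per-pixel running mean."""
    h, w = image.shape[:2]
    depth = np.zeros((h, w))   # running mean of predictions
    count = np.zeros((h, w))   # number of predictions seen per pixel
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            pred = predict(image[y:y + patch, x:x + patch])
            n = count[y:y + patch, x:x + patch] + 1
            # Incremental mean update: m <- m + (x - m) / n
            depth[y:y + patch, x:x + patch] += (
                pred - depth[y:y + patch, x:x + patch]) / n
            count[y:y + patch, x:x + patch] = n
    return depth
```

Because overlapping tiles vote on each pixel and the votes are averaged as they arrive, disagreements at patch boundaries are smoothed out, which is the "local ensemble" effect the summary describes.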