AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One
Amalgamates distinct visual foundation models like CLIP, DINOv2, and SAM into a unified model through multi-teacher distillation
To improve both the efficiency and the performance of visual foundation models (VFMs), the AM-RADIO framework consolidates multiple VFMs such as CLIP, DINOv2, and SAM into a single unified model through multi-teacher distillation. The approach, Agglomerative Model -- Reduce All Domains Into One (RADIO), surpasses the performance of the individual teacher models while combining their distinctive capabilities, including zero-shot vision-language comprehension, detailed pixel-level understanding, and open-vocabulary segmentation.
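To make the multi-teacher setup concrete, the sketch below shows one plausible training step: each frozen teacher is run on the batch, and the student is penalized for deviating from every teacher through per-teacher adaptor heads (described in the next paragraph). All interfaces here, including the tuple outputs, the adaptor objects, and the specific loss terms, are illustrative assumptions rather than the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teachers, adaptors, images):
    # Student produces a global summary vector and dense spatial features.
    summary, features = student(images)
    loss = 0.0
    for name, teacher in teachers.items():
        with torch.no_grad():  # teachers stay frozen throughout training
            t_summary, t_features = teacher(images)
        # Per-teacher adaptor heads map the student's outputs into
        # that teacher's feature space (hypothetical interface).
        s_summary = adaptors[name].summary(summary)
        s_features = adaptors[name].features(features)
        # Cosine distance on summary vectors, smooth-L1 on dense
        # features; the paper's exact loss formulation may differ.
        loss = loss + (1.0 - F.cosine_similarity(s_summary, t_summary, dim=-1).mean())
        loss = loss + F.smooth_l1_loss(s_features, t_features)
    return loss
```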
AM-RADIO attaches per-teacher adaptor heads, each a simple 2-layer MLP, to the student backbone for feature distillation. Effective knowledge transfer from multiple teachers hinges on choices such as the distillation dataset, the loss formulation, and how features are summarized. Through full feature distillation, the framework boosts performance on dense image-understanding tasks, underscoring the importance of efficient backbone design.
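The adaptor head described above can be pictured as follows. The summary only specifies a simple 2-layer MLP, so the hidden width, LayerNorm, and GELU choices here are assumptions for the sake of a runnable sketch.

```python
import torch.nn as nn

class AdaptorHead(nn.Module):
    """2-layer MLP mapping student embeddings (student_dim) into one
    teacher's feature space (teacher_dim). Normalization, activation,
    and hidden width are illustrative assumptions."""
    def __init__(self, student_dim, teacher_dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or max(student_dim, teacher_dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(student_dim),
            nn.Linear(student_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, teacher_dim),
        )

    def forward(self, x):
        return self.mlp(x)
```

One head of this form would be instantiated per teacher (and per output type, summary versus spatial), keeping the shared student backbone free of teacher-specific parameters.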
Efficient backbone design remains a focal point, with insights into training augmentation, the resolution mismatch between student and teacher models, and feature summarization techniques. The resulting model produces high-resolution, low-noise features, though the authors identify latent 'low resolution' and 'high resolution' modes as an open issue left for future work.
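One common way to bridge the student-teacher resolution mismatch mentioned above is to resample the student's spatial features onto the teacher's feature grid before computing the feature loss. The sketch below uses bilinear interpolation as one such option; the paper explores this design space rather than prescribing a single fix.

```python
import torch.nn.functional as F

def match_teacher_resolution(student_feats, teacher_hw):
    """Resample student spatial features (B, C, Hs, Ws) to the
    teacher's feature-grid size (Ht, Wt). Bilinear interpolation is
    an assumption here, not necessarily the paper's choice."""
    if student_feats.shape[-2:] == teacher_hw:
        return student_feats
    return F.interpolate(student_feats, size=teacher_hw,
                         mode="bilinear", align_corners=False)
```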

