Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
A feature extraction method that aligns with CLIP's pre-training objectives, leveraging its cross-modal capabilities to improve image representations
CLIP stands out for its exceptional image-text matching capabilities, owing to its training on image-text contrastive learning. However, when CLIP's image encoder is used directly for single-modality feature extraction, it can underperform, since it was never optimized for that purpose.
This project addresses the issue with the CrOss-moDal nEighbor Representation (CODER), which aligns better with CLIP's pre-training objectives. By encoding an image through its distance structure to neighboring texts, CODER exploits CLIP's strong cross-modal capabilities to improve the quality of its image features.
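The following is a minimal sketch of the idea, not the paper's exact pipeline: an image is represented by its cosine similarities to a bank of neighbor texts rather than by its raw CLIP embedding. The text bank, image path, and model choice here are illustrative assumptions.

```python
# Sketch of a CODER-style feature: the vector of similarities between one image
# and a bank of neighbor texts, computed with the OpenAI CLIP package.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical neighbor-text bank; in CODER these texts come from the ATG.
neighbor_texts = [
    "a photo of a golden retriever, a friendly dog breed",
    "a photo of a tabby cat resting indoors",
    "a close-up photo of a sports car on a road",
]

with torch.no_grad():
    text_tokens = clip.tokenize(neighbor_texts).to(device)
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_emb = model.encode_image(image)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # CODER-style representation: image-to-text similarity vector,
    # shape (1, num_neighbor_texts).
    coder_feature = image_emb @ text_emb.T
```

Classification can then operate on these similarity vectors (for example, comparing an image's CODER feature against per-class text-derived features), which keeps the representation in the cross-modal space CLIP was trained to structure.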
A critical component in constructing a high-quality CODER is generating diverse, high-quality texts to match against images. To achieve this, the paper presents the Auto Text Generator (ATG), which automatically generates class-specific texts adapted to the target task. ATG uses query prompts to extract knowledge from external experts such as ChatGPT, producing the various types of texts needed to build a high-quality CODER. A sketch of this querying step follows.
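The sketch below illustrates how an ATG-style component might assemble class-specific query prompts and collect texts from an external expert. The prompt templates and the `query_llm` helper are illustrative assumptions, not the paper's exact prompts or API.

```python
# Sketch of an Auto Text Generator (ATG)-style query loop: build prompts per
# class and text type, send each to an external expert (e.g., ChatGPT), and
# collect the returned texts for use in CODER.
from typing import Callable, Dict, List

# Hypothetical prompt templates for different text types.
PROMPT_TEMPLATES: Dict[str, str] = {
    "description": "Describe the visual appearance of a {cls} in one sentence.",
    "attributes": "List three visual attributes that distinguish a {cls}.",
    "scene": "Describe a typical scene in which a {cls} appears.",
}

def generate_class_texts(
    class_names: List[str],
    query_llm: Callable[[str], str],  # assumed wrapper around a ChatGPT call
) -> Dict[str, List[str]]:
    """Query the expert once per (class, template) pair and collect the texts."""
    class_texts: Dict[str, List[str]] = {}
    for cls in class_names:
        prompts = [template.format(cls=cls) for template in PROMPT_TEMPLATES.values()]
        class_texts[cls] = [query_llm(prompt) for prompt in prompts]
    return class_texts
```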