Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
1B-scale multimodal vision language model designed for efficient deployment on consumer GPU servers
Xmodel-VLM addresses a significant industry challenge: the high service costs that impede widespread adoption of large-scale multimodal systems. It is built around a 1B-scale language model and follows the LLaVA paradigm for modal alignment. Despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models across numerous classic multimodal benchmarks.
The architecture of Xmodel-VLM comprises three components: a vision encoder, a lightweight language model (LLM), and a projector that aligns the visual and textual spaces. The vision encoder is CLIP ViT-L/14, and the LLM, Xmodel-LM 1.1B, is designed to integrate seamlessly with the LLaMA framework. Text is tokenized with the unigram algorithm via the SentencePiece implementation.
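The following is a minimal sketch of this LLaVA-style pipeline in PyTorch, assuming the common arrangement in which projected visual tokens are prepended to the text embeddings before they enter the LLM. The 1024-dimensional vision features correspond to CLIP ViT-L/14; the 2048-dimensional LLM embedding size and the module names are illustrative assumptions, not the actual Xmodel-VLM implementation.

```python
# Minimal LLaVA-style pipeline sketch (assumptions noted in comments).
import torch
import torch.nn as nn

VISION_DIM = 1024   # CLIP ViT-L/14 patch-feature dimension
LLM_DIM = 2048      # assumed hidden size of Xmodel-LM 1.1B (illustrative)

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # CLIP ViT-L/14 (kept frozen)
        self.projector = projector            # maps visual features into the LLM space
        self.llm = llm                        # Xmodel-LM 1.1B (LLaMA-compatible)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # (B, N_patches, VISION_DIM) patch features from the vision tower
        visual_feats = self.vision_encoder(pixel_values)
        # project into the LLM embedding space: (B, N_tokens, LLM_DIM)
        visual_tokens = self.projector(visual_feats)
        # prepend visual tokens to the text embeddings and run the LLM
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```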
Training proceeds in two phases: pre-training and instruction tuning. In the first phase, the projector is trained while the vision encoder and LLM remain frozen; in the second, the projector and LLM are fine-tuned jointly to strengthen the model's visual understanding and language capabilities. The projector, called XDP, is a two-layer MLP with Mish activation that also acts as a downsampling mechanism, reducing the number of visual tokens by 75% for more efficient multimodal processing.
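As a rough illustration of such a projector, the sketch below combines a two-layer MLP with Mish activation and 2×2 average pooling over the patch grid, which keeps one in four visual tokens (a 75% reduction). The pooling operator, the dimensions, and the class name are assumptions for illustration; the paper's actual XDP design may differ in detail.

```python
# Hypothetical XDP-style projector: two-layer MLP with Mish, then a
# downsampling step that drops 75% of the visual tokens. The 2x2 average
# pooling used here is an assumption, not the confirmed Xmodel-VLM operator.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class XDPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.Mish(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, vision_dim); N is assumed to be a square patch grid
        x = self.mlp(visual_feats)                        # (B, N, llm_dim)
        b, n, d = x.shape
        side = int(math.isqrt(n))
        x = x.view(b, side, side, d).permute(0, 3, 1, 2)  # (B, D, H, W)
        x = F.avg_pool2d(x, kernel_size=2)                # (B, D, H/2, W/2): 75% fewer tokens
        return x.flatten(2).transpose(1, 2)               # (B, N/4, llm_dim)

# Example: 576 CLIP patch tokens -> 144 visual tokens fed to the LLM
proj = XDPProjector()
tokens = proj(torch.randn(2, 576, 1024))
assert tokens.shape == (2, 144, 2048)
```

In the two-phase recipe described above, only this projector would receive gradients during pre-training, with the vision encoder and LLM frozen; instruction tuning then unfreezes the LLM as well.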