CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
The paper introduces CuMo, an approach for scaling multimodal large language models (LLMs) with sparse Mixture-of-Experts (MoE) blocks, improving performance across benchmarks while adding only minimal activated parameters at inference time.
Concretely, CuMo integrates co-upcycled Top-K sparsely-gated MoE blocks into both the vision encoder and the MLP connector of the multimodal LLM, so model capacity grows without a matching increase in the parameters activated per token.
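To make the architecture concrete, below is a minimal PyTorch sketch of a Top-K sparsely-gated MoE block of the kind used to replace an MLP block; the expert count, Top-K value, layer sizes, and class names are illustrative assumptions, not the paper's exact configuration or code.

```python
# Sketch of a Top-K sparsely-gated MoE block replacing a standard MLP block.
# All hyperparameters here (num_experts, top_k, dims) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Standard two-layer MLP block, as found in the vision encoder / connector."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class TopKMoE(nn.Module):
    """Sparse MoE block: a router picks the top-k experts for each token."""

    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(MLP(dim, hidden_dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, dim)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Because only the top-k experts run per token, the number of activated parameters stays close to that of a single MLP even as total capacity grows with the number of experts.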
CuMo adopts a three-stage training recipe for co-upcycling MoE blocks in multimodal LLMs: pre-training the MLP connector, pre-finetuning all parameters on high-quality caption data, and visual instruction tuning with the upcycled MoE blocks. Rather than replacing MLP blocks with MoE blocks trained from scratch, which destabilizes training, the co-upcycling strategy initializes every expert in each sparse MoE block from the corresponding pre-trained MLP, consistently improving training stability and performance.
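A minimal sketch of this upcycling initialization, building on the illustrative `MLP` and `TopKMoE` classes from the sketch above (the function name and defaults are assumptions, not the paper's code):

```python
# Sketch: upcycle a pre-trained MLP block into a sparse MoE block by copying
# its weights into every expert; the router remains randomly initialized.
import torch.nn as nn


def upcycle_from_mlp(pretrained_mlp: nn.Module, num_experts: int = 4, top_k: int = 2) -> "TopKMoE":
    dim = pretrained_mlp.fc1.in_features
    hidden_dim = pretrained_mlp.fc1.out_features
    moe = TopKMoE(dim, hidden_dim, num_experts=num_experts, top_k=top_k)
    for expert in moe.experts:
        # Every expert starts from the pre-trained MLP weights rather than
        # from scratch, which is what stabilizes the subsequent fine-tuning.
        expert.load_state_dict(pretrained_mlp.state_dict())
    return moe
```

Starting all experts from the same pre-trained weights means the MoE block initially behaves like the original MLP, and the experts only diverge as the router learns to specialize them during visual instruction tuning.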
To keep the load balanced across the experts within each MoE block, auxiliary losses are added to the language-modeling cross-entropy loss: a load-balancing loss and a router z-loss. Together they encourage an even distribution of tokens across experts and stabilize the router, improving overall performance.
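A minimal sketch of these two auxiliary terms, in the style of the Switch Transformer load-balancing loss and the ST-MoE router z-loss; the reduction details and the loss coefficients are assumptions, not the paper's exact values.

```python
# Sketch of the auxiliary MoE losses added to the language-modeling loss.
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts)
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    _, selected = probs.topk(top_k, dim=-1)
    # Fraction of tokens dispatched to each expert.
    dispatch = F.one_hot(selected, num_experts).float().sum(dim=1).mean(dim=0)
    # Average routing probability assigned to each expert.
    importance = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch * importance)


def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    # Penalizes large router logits to keep the gating numerically stable.
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()


def total_loss(lm_loss, router_logits, top_k=2, alpha=0.01, beta=0.001):
    # Auxiliary losses are added on top of the language-modeling
    # cross-entropy loss; alpha and beta are illustrative coefficients.
    return (lm_loss
            + alpha * load_balancing_loss(router_logits, top_k)
            + beta * router_z_loss(router_logits))
```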