CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
The paper introduces CuMo, an approach for scaling multimodal large language models (LLMs) with sparse Mixture-of-Experts (MoE) blocks, improving performance across benchmarks while adding only minimal activated parameters at inference time.
Concretely, CuMo integrates co-upcycled Top-K sparsely-gated MoE blocks into both the vision encoder and the MLP connector of the multimodal LLM, so model capacity grows without a matching increase in the parameters activated per token.
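To make the architecture concrete, below is a minimal PyTorch sketch of a Top-K sparsely-gated MoE block of the kind used to replace an MLP block; the expert count, Top-K value, layer sizes, and class names are illustrative assumptions, not the paper's exact configuration or code.

```python
# Sketch of a Top-K sparsely-gated MoE block replacing a standard MLP block.
# All hyperparameters here (num_experts, top_k, dims) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Standard two-layer MLP block, as found in the vision encoder / connector."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class TopKMoE(nn.Module):
    """Sparse MoE block: a router picks the top-k experts for each token."""

    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(MLP(dim, hidden_dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, dim)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Because only the top-k experts run per token, the number of activated parameters stays close to that of a single MLP even as total capacity grows with the number of experts.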
CuMo adopts a three-stage training recipe for co-upcycling MoE blocks in multimodal LLMs: pre-training the MLP connector, pre-finetuning all parameters on high-quality caption data, and visual instruction tuning with the upcycled MoE blocks. Rather than replacing MLP blocks with MoE blocks trained from scratch, which destabilizes training, the co-upcycling strategy initializes every expert in each sparse MoE block from the corresponding pre-trained MLP, consistently improving training stability and performance.
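A minimal sketch of this upcycling initialization, building on the illustrative `MLP` and `TopKMoE` classes from the sketch above (the function name and defaults are assumptions, not the paper's code):

```python
# Sketch: upcycle a pre-trained MLP block into a sparse MoE block by copying
# its weights into every expert; the router remains randomly initialized.
import torch.nn as nn


def upcycle_from_mlp(pretrained_mlp: nn.Module, num_experts: int = 4, top_k: int = 2) -> "TopKMoE":
    dim = pretrained_mlp.fc1.in_features
    hidden_dim = pretrained_mlp.fc1.out_features
    moe = TopKMoE(dim, hidden_dim, num_experts=num_experts, top_k=top_k)
    for expert in moe.experts:
        # Every expert starts from the pre-trained MLP weights rather than
        # from scratch, which is what stabilizes the subsequent fine-tuning.
        expert.load_state_dict(pretrained_mlp.state_dict())
    return moe
```

Starting all experts from the same pre-trained weights means the MoE block initially behaves like the original MLP, and the experts only diverge as the router learns to specialize them during visual instruction tuning.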
To keep the load balanced across the experts within each MoE block, auxiliary losses are added to the language-modeling cross-entropy loss: a load-balancing loss and a router z-loss. Together they encourage an even distribution of tokens across experts and stabilize the router, improving overall performance.
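A minimal sketch of these two auxiliary terms, in the style of the Switch Transformer load-balancing loss and the ST-MoE router z-loss; the reduction details and the loss coefficients are assumptions, not the paper's exact values.

```python
# Sketch of the auxiliary MoE losses added to the language-modeling loss.
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts)
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    _, selected = probs.topk(top_k, dim=-1)
    # Fraction of tokens dispatched to each expert.
    dispatch = F.one_hot(selected, num_experts).float().sum(dim=1).mean(dim=0)
    # Average routing probability assigned to each expert.
    importance = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch * importance)


def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    # Penalizes large router logits to keep the gating numerically stable.
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()


def total_loss(lm_loss, router_logits, top_k=2, alpha=0.01, beta=0.001):
    # Auxiliary losses are added on top of the language-modeling
    # cross-entropy loss; alpha and beta are illustrative coefficients.
    return (lm_loss
            + alpha * load_balancing_loss(router_logits, top_k)
            + beta * router_z_loss(router_logits))
```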