Jiacheng Ruan
2024
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-Training
Tong Zhu | Xiaoye Qu | Daize Dong | Jiacheng Ruan | Jingqi Tong | Conghui He | Yu Cheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data hunger and instability problems. Motivated by this limitation, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain language abilities and route input tokens to specific experts with only part of the parameters activated. Empirically, after training on 200B tokens, the LLaMA-MoE-3.5B models significantly outperform dense models that contain a similar number of activated parameters.
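The first stage described above, expert construction, splits a dense FFN's intermediate neurons into disjoint groups, each becoming one expert of an MoE layer. Below is a minimal sketch of that idea, assuming a simplified two-matrix FFN (LLaMA's gate projection is omitted) and a random neuron partition; the function and variable names are illustrative and not the authors' implementation.

```python
# Minimal sketch (not the paper's code): partition a dense FFN's
# intermediate neurons into disjoint experts for an MoE layer.
import torch
import torch.nn as nn


def split_ffn_into_experts(up_proj: nn.Linear, down_proj: nn.Linear, num_experts: int):
    """Partition the intermediate neurons of a dense FFN into `num_experts` experts."""
    d_ff = up_proj.out_features
    assert d_ff % num_experts == 0, "intermediate size must divide evenly (assumption)"
    # Randomly assign each intermediate neuron to one expert (one possible split).
    perm = torch.randperm(d_ff)
    groups = perm.chunk(num_experts)

    experts = []
    for idx in groups:
        up = nn.Linear(up_proj.in_features, len(idx), bias=False)
        down = nn.Linear(len(idx), down_proj.out_features, bias=False)
        # Copy the rows/columns of the dense weights that belong to this expert.
        up.weight.data = up_proj.weight.data[idx, :].clone()
        down.weight.data = down_proj.weight.data[:, idx].clone()
        experts.append(nn.Sequential(up, nn.SiLU(), down))
    return nn.ModuleList(experts)


# Example with LLaMA-2 7B FFN sizes (4096 -> 11008 -> 4096) split into 8 experts.
dense_up = nn.Linear(4096, 11008, bias=False)
dense_down = nn.Linear(11008, 4096, bias=False)
experts = split_ffn_into_experts(dense_up, dense_down, num_experts=8)
```

After such a split, a gate network would be trained during continual pre-training to route each token to a subset of these experts, so only a fraction of the original FFN parameters is activated per token.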