Improved Sparse Upcycling for Instruction Tuning

Wangyi Jiang, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun


Abstract
The Mixture-of-Experts (MoE) architecture has demonstrated significant potential in both large-scale pre-training and instruction tuning by offering increased parameter capacity without additional inference costs. However, developing MoE models faces challenges including training instability and the need for substantial high-quality training data. While efficient methodologies like sparse upcycling exist, they often lead to performance degradation in instruction tuning scenarios. We introduce representation-based sparse upcycling, a straightforward yet effective technique for converting dense language models into sparsely activated ones while maintaining similar computational costs. Unlike conventional sparse upcycling, our approach leverages intermediate representations from language models to initialize router weights. This strategy addresses the mismatch between randomly initialized and well-trained parameters while providing prior knowledge to guide expert specialization during training. Extensive experiments across diverse benchmarks demonstrate significant improvements in both model capabilities and routing consistency compared to existing approaches.
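To make the abstract's core idea concrete, below is a minimal sketch of representation-based sparse upcycling. It is an illustrative assumption, not the authors' released implementation: the class name, the k-means-style seeding of router rows from intermediate representations, and the top-2 routing are all hypothetical choices consistent with the description above (every expert starts as a copy of the dense FFN; router weights are initialized from the dense model's hidden states rather than at random).

import copy

import torch
import torch.nn as nn


class UpcycledMoELayer(nn.Module):
    """Hypothetical MoE layer obtained by upcycling a dense FFN block."""

    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Sparse upcycling: every expert begins as a copy of the dense FFN.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    @torch.no_grad()
    def init_router_from_representations(self, hidden_states: torch.Tensor,
                                         iters: int = 10):
        """Seed router weight rows from centroids of intermediate
        representations collected from the dense model (one per expert).
        This is one plausible reading of 'representation-based' init."""
        x = hidden_states.reshape(-1, hidden_states.size(-1))
        num_experts = self.router.weight.size(0)
        # Simple k-means in torch; the centroids become the router rows.
        centroids = x[torch.randperm(x.size(0))[:num_experts]].clone()
        for _ in range(iters):
            assign = torch.cdist(x, centroids).argmin(dim=-1)
            for e in range(num_experts):
                members = x[assign == e]
                if members.numel() > 0:
                    centroids[e] = members.mean(dim=0)
        self.router.weight.copy_(centroids)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                        # (..., num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dispatch each token to its top-k experts and mix the outputs.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

The design intuition behind this sketch: seeding each router row with a centroid of the dense model's representations means each expert initially attracts a coherent cluster of tokens, which is one way the "prior knowledge to guide expert specialization" described in the abstract could be realized, while avoiding the mismatch between a randomly initialized router and the well-trained expert weights.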
Anthology ID:
2025.coling-main.636
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
9485–9498
URL:
https://aclanthology.org/2025.coling-main.636/
Cite (ACL):
Wangyi Jiang, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2025. Improved Sparse Upcycling for Instruction Tuning. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9485–9498, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Improved Sparse Upcycling for Instruction Tuning (Jiang et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.636.pdf