@inproceedings{qiu-etal-2025-demons,
title = "Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models",
author = "Qiu, Zihan and
Huang, Zeyu and
Zheng, Bo and
Wen, Kaiyue and
Wang, Zekun and
Men, Rui and
Titov, Ivan and
Liu, Dayiheng and
Zhou, Jingren and
Lin, Junyang",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.249/",
doi = "10.18653/v1/2025.acl-long.249",
pages = "5005--5018",
ISBN = "979-8-89176-251-0",
abstract = "This paper revisits the implementation of $\textbf{L}$oad-$\textbf{B}$alancing-$\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_ip_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of the expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a \textbf{micro-batch} and averaged across parallel groups.However, a micro-batch for training billion-scale LLMs typically contains very few sequences, leading to the micro-batch LBL being almost at the sequence level, and the router is pushed to distribute the token evenly within each sequence.Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization.In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loose this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, which will encourage load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL.Through experiments on training MoEs-based LLMs (up to $\textbf{42.8B}$ parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks.Our analysis reveals that the global-batch LBL greatly improves the domain specialization of experts. $\textit{Global-batch LBL is also used in Qwen3-MoEs.}$"
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="qiu-etal-2025-demons">
<titleInfo>
<title>Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zihan</namePart>
<namePart type="family">Qiu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zeyu</namePart>
<namePart type="family">Huang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bo</namePart>
<namePart type="family">Zheng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kaiyue</namePart>
<namePart type="family">Wen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zekun</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rui</namePart>
<namePart type="family">Men</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ivan</namePart>
<namePart type="family">Titov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dayiheng</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jingren</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junyang</namePart>
<namePart type="family">Lin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-251-0</identifier>
</relatedItem>
<abstract>This paper revisits the implementation of Load-Balancing-Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as N_E \sum_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency of expert i being selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ the parallel training strategy so that f_i and the LBL are calculated within a micro-batch and averaged across parallel groups. However, a micro-batch for training billion-scale LLMs typically contains very few sequences, leading to the micro-batch LBL being almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL greatly improves the domain specialization of experts. Global-batch LBL is also used in Qwen3-MoEs.</abstract>
<identifier type="citekey">qiu-etal-2025-demons</identifier>
<identifier type="doi">10.18653/v1/2025.acl-long.249</identifier>
<location>
<url>https://aclanthology.org/2025.acl-long.249/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>5005</start>
<end>5018</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
%A Qiu, Zihan
%A Huang, Zeyu
%A Zheng, Bo
%A Wen, Kaiyue
%A Wang, Zekun
%A Men, Rui
%A Titov, Ivan
%A Liu, Dayiheng
%A Zhou, Jingren
%A Lin, Junyang
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-251-0
%F qiu-etal-2025-demons
%X This paper revisits the implementation of Load-Balancing-Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as N_E \sum_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency of expert i being selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ the parallel training strategy so that f_i and the LBL are calculated within a micro-batch and averaged across parallel groups. However, a micro-batch for training billion-scale LLMs typically contains very few sequences, leading to the micro-batch LBL being almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL greatly improves the domain specialization of experts. Global-batch LBL is also used in Qwen3-MoEs.
%R 10.18653/v1/2025.acl-long.249
%U https://aclanthology.org/2025.acl-long.249/
%U https://doi.org/10.18653/v1/2025.acl-long.249
%P 5005-5018
Markdown (Informal)
[Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models](https://aclanthology.org/2025.acl-long.249/) (Qiu et al., ACL 2025)
ACL
- Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5005–5018, Vienna, Austria. Association for Computational Linguistics.
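The abstract describes the core change as an extra communication step that synchronizes the expert-selection frequencies f_i across micro-batches before computing LBL = N_E * sum_i f_i p_i. Below is a minimal, illustrative sketch of that idea, not the authors' released code: it assumes a PyTorch data-parallel setup, and the function and variable names (load_balancing_loss, router_probs, expert_mask) are hypothetical.

```python
# Minimal sketch (assumed names, not the paper's implementation) of the
# global-batch load-balancing loss described in the abstract: the expert
# selection counts behind f_i are summed across micro-batches / ranks, so
# the balance constraint applies to the global batch rather than to each
# micro-batch on its own.
import torch
import torch.distributed as dist


def load_balancing_loss(router_probs: torch.Tensor,
                        expert_mask: torch.Tensor,
                        global_batch: bool = True) -> torch.Tensor:
    """LBL = N_E * sum_i f_i * p_i.

    router_probs: (num_tokens, N_E) softmax gating scores for this micro-batch.
    expert_mask:  (num_tokens, N_E) one/multi-hot top-k selection mask.
    """
    num_experts = router_probs.size(-1)

    # p_i: average gating score per expert over the local micro-batch.
    p = router_probs.mean(dim=0)

    # f_i numerator/denominator: tokens routed to each expert and total tokens.
    counts = expert_mask.sum(dim=0).float()
    total = torch.tensor([float(expert_mask.size(0))],
                         device=router_probs.device)

    if global_batch and dist.is_available() and dist.is_initialized():
        # Extra communication step: accumulate counts over all micro-batches
        # (data-parallel ranks) so f_i reflects the global batch.
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
        dist.all_reduce(total, op=dist.ReduceOp.SUM)

    f = counts / total  # global (or local, if global_batch=False) frequencies

    return num_experts * torch.sum(f * p)
```

In this sketch only the counts behind f_i are all-reduced, while p_i stays local, following the abstract's description of synchronizing f_i across micro-batches and then using it to calculate the LBL.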