FoldMoE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining

Guichao Zhu; Lintian Lei; Yuhao Qing; Yichao Fu; Fanxin Li; Dong Huang; Zekai Sun; Heming Cui

doi:10.18653/v1/2025.acl-long.186

Abstract

Training LLMs with a Mixture-of-Experts (MoE) architecture on long sequences is challenging because of the all-to-all communication bottleneck in expert parallelism. Existing approaches try to hide this communication behind computation through token-level pipelining within MoE layers, but their effectiveness is limited because the computation inside MoE layers alone is insufficient to cover it. We present FoldMoE, a high-performance MoE training system that enables token-level overlapping across entire Transformer blocks through novel attention-MoE pipelining. We propose an efficient pipeline schedule and a novel token-buffering design that decouples the partitioning of attention and MoE layers, along with a time-uniform micro-batching strategy for further efficiency gains. Evaluations on GPT-MoE models with sequences of up to 32K tokens show that FoldMoE achieves up to 1.49x and 2.72x speedups over state-of-the-art token-level overlapping and non-overlapping baselines, respectively.
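The core argument above is that all-to-all communication hidden only behind expert computation leaves time exposed, while also overlapping it with attention hides more. A toy two-stream timing model can make that concrete. This is a hypothetical sketch, not FoldMoE's scheduler: it assumes one compute stream and one communication stream, equal per-chunk costs (a simplification), and that causal attention for chunk i needs only chunks up to i, so attention can be split into chunks. The names `Op`, `makespan`, and `build` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    stream: str       # "compute" or "comm" in this hypothetical two-stream model
    duration: float
    deps: tuple = ()  # names of ops that must finish before this one starts


def makespan(ops):
    """Issue ops in list order; each stream runs one op at a time."""
    stream_free = {"compute": 0.0, "comm": 0.0}
    end = {}
    for op in ops:
        ready = max((end[d] for d in op.deps), default=0.0)
        start = max(stream_free[op.stream], ready)
        end[op.name] = start + op.duration
        stream_free[op.stream] = end[op.name]
    return max(end.values())


def build(n, a, e, c, fold):
    """One Transformer block split into n token chunks.

    a: attention cost per chunk, e: expert cost per chunk,
    c: cost of one all-to-all (dispatch or combine) per chunk.
    fold=False: attention over the whole sequence finishes before the MoE
    layer starts; only dispatch/expert/combine are pipelined.
    fold=True: attention is chunked too, so its compute can overlap
    with all-to-all traffic of earlier chunks.
    """
    ops = []
    if not fold:
        ops.append(Op("attn", "compute", n * a))

    def moe_tail(j):
        # Expert compute and combine for chunk j, issued one step behind dispatch.
        ops.append(Op(f"exp{j}", "compute", e, (f"disp{j}",)))
        ops.append(Op(f"comb{j}", "comm", c, (f"exp{j}",)))

    for i in range(n):
        if fold:
            ops.append(Op(f"attn{i}", "compute", a))
            ops.append(Op(f"disp{i}", "comm", c, (f"attn{i}",)))
        else:
            ops.append(Op(f"disp{i}", "comm", c, ("attn",)))
        if i > 0:
            moe_tail(i - 1)
    moe_tail(n - 1)
    return ops


n, a, e, c = 4, 2.0, 2.0, 3.0  # made-up costs, not measurements
serial = n * (a + e + 2 * c)   # no overlap at all
print(serial, makespan(build(n, a, e, c, fold=False)), makespan(build(n, a, e, c, fold=True)))
# → 40.0 32.0 26.0
```

With these made-up costs the run is communication-bound. In the MoE-only schedule no dispatch can begin until attention over the whole sequence finishes, whereas in the folded schedule the first dispatch starts after a single attention chunk, so more of the all-to-all time sits under compute. The numbers are toy values and say nothing about the speedups reported in the paper.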

Anthology ID:: 2025.acl-long.186
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
Publisher:: Association for Computational Linguistics
Pages:: 3705–3717
URL:: https://aclanthology.org/2025.acl-long.186/
DOI:: 10.18653/v1/2025.acl-long.186
Cite (ACL):: Guichao Zhu, Lintian Lei, Yuhao Qing, Yichao Fu, Fanxin Li, Dong Huang, Zekai Sun, and Heming Cui. 2025. FoldMoE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3705–3717, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: FoldMoE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining (Zhu et al., ACL 2025)
PDF:: https://aclanthology.org/2025.acl-long.186.pdf