MoEP: Modular Expert Paths for Sample-Efficient Language Modeling

Joonas Tapaninaho


Abstract
Training language models under tight compute budgets and with small training datasets remains challenging for dense decoder-only Transformers, where every token activates the full stack of model parameters. We introduce MoEP (Modular Expert Paths), a sparse decoder-only architecture that activates parameters more selectively per token, improving model performance and accelerating learning without adding to the total parameter count. We show that combining model parallelism with Mixture-of-Experts (MoE) style linear projections and a lightweight top-k router outperforms the GPT-2 baseline and reaches stable evaluation performance more quickly.
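As a rough illustration of the kind of layer the abstract describes, the sketch below combines per-expert linear projections with a lightweight top-k router in PyTorch. All names and hyperparameters (TopKMoELinear, num_experts, top_k) are illustrative assumptions, not the authors' MoEP implementation; the sketch also computes every expert densely for clarity rather than dispatching tokens sparsely.

```python
# Minimal sketch of MoE-style linear projections with a top-k router.
# Illustrative assumption only; not the authors' MoEP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELinear(nn.Module):
    def __init__(self, d_model: int, d_out: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One linear projection per expert, plus a lightweight router.
        self.experts = nn.ModuleList(nn.Linear(d_model, d_out) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        logits = self.router(x)                            # (batch, seq, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the selected experts
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features,
                          device=x.device, dtype=x.dtype)
        for slot in range(self.top_k):
            idx = indices[..., slot]                       # (batch, seq) expert id per token
            w = weights[..., slot].unsqueeze(-1)           # (batch, seq, 1) routing weight
            for e, expert in enumerate(self.experts):
                # Keep only tokens routed to expert e in this slot.
                mask = (idx == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * w * expert(x)
        return out
```

A production MoE layer would gather the tokens assigned to each expert and run only those through it, so compute scales with top_k rather than num_experts; the dense loop above is kept purely for readability.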
Anthology ID: 2025.babylm-main.39
Volume: Proceedings of the First BabyLM Workshop
Month: November
Year: 2025
Address: Suzhou, China
Editors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue: BabyLM
Publisher: Association for Computational Linguistics
Pages: 540–547
URL: https://aclanthology.org/2025.babylm-main.39/
Cite (ACL): Joonas Tapaninaho. 2025. MoEP: Modular Expert Paths for Sample-Efficient Language Modeling. In Proceedings of the First BabyLM Workshop, pages 540–547, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): MoEP: Modular Expert Paths for Sample-Efficient Language Modeling (Tapaninaho, BabyLM 2025)
PDF: https://aclanthology.org/2025.babylm-main.39.pdf