MoEP: Modular Expert Paths for Sample-Efficient Language Modeling
Joonas Tapaninaho
Proceedings of the First BabyLM Workshop, 2025
Training language models under tight compute budgets and on small training datasets remains challenging for dense decoder-only Transformers, where every token activates the full stack of model parameters. We introduce MoEP (Modular Expert Paths), a sparse decoder-only architecture in which each token activates only a subset of parameters, improving performance and accelerating learning without increasing the total parameter count. We show that combining model parallelism with Mixture-of-Experts (MoE) style linear projections and a lightweight top-k router outperforms a GPT-2 baseline and stabilizes evaluation performance more quickly.
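The core mechanism named in the abstract, a lightweight top-k router gating MoE-style linear projections, can be sketched as follows. This is a minimal illustrative NumPy version, not the paper's implementation; the function names, shapes, and the per-token (unbatched) formulation are assumptions for clarity.

```python
import numpy as np

def top_k_router(x, w_gate, k):
    """Score each expert for one token and keep only the top-k.

    x: (d_model,) token representation
    w_gate: (d_model, n_experts) router weights (hypothetical shapes)
    Returns (indices, weights): the k selected experts and their
    softmax-normalized mixing weights.
    """
    logits = x @ w_gate                   # (n_experts,) expert scores
    top = np.argsort(logits)[-k:]         # indices of the k largest logits
    z = logits[top] - logits[top].max()   # numerically stable softmax over top-k
    w = np.exp(z) / np.exp(z).sum()
    return top, w

def moe_linear(x, w_gate, experts, k):
    """Sparse MoE-style linear projection: only the top-k expert
    projections are evaluated for the token, so compute per token
    stays fixed while total parameters can grow with n_experts."""
    idx, gate = top_k_router(x, w_gate, k)
    return sum(g * (x @ experts[i]) for g, i in zip(gate, idx))
```

With k=1 the softmax over a single logit is 1, so the output reduces to the best-scoring expert's projection alone, which makes the sparsity of the routing easy to check.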