Active Curriculum Language Modeling over a Hybrid Pre-training Method

Eleni Fysikoudi, Sharid Loáiciga, Asad B. Sayeed


Abstract
We apply the Active Curriculum Language Modeling (ACLM) method to the constrained pretraining setting of the 2025 BabyLM Challenge, where models are limited by both data and compute budgets. Using GPT-BERT (Charpentier and Samuel, 2024) as the base architecture, we investigate the impact of surprisal-based example selection for constructing a training curriculum. In addition, we conduct a targeted hyperparameter search over tokenizer size and batch size. Our approach yields stable pretrained models that surpass the official baseline on multiple evaluation tasks, demonstrating ACLM’s potential for improving performance and generalization in low-resource pretraining scenarios.
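The abstract describes surprisal-based example selection for building a training curriculum. As a rough illustration only (not the authors' implementation), the sketch below scores candidate examples by their mean token surprisal under a small reference language model and orders them from easiest to hardest; the choice of GPT-2 as the scoring model and the helper names are assumptions made for this example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_surprisal(text, model, tokenizer):
    """Average negative log-likelihood (nats per token) of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss is token-averaged cross-entropy
    return out.loss.item()

def build_curriculum(examples, model_name="gpt2"):
    """Return examples sorted from lowest to highest mean surprisal."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    scored = [(mean_surprisal(x, model, tokenizer), x) for x in examples]
    return [x for _, x in sorted(scored, key=lambda pair: pair[0])]

if __name__ == "__main__":
    batch = ["The cat sat on the mat.",
             "Colorless green ideas sleep furiously."]
    for example in build_curriculum(batch):
        print(example)

In an ACLM-style setup, an ordering like this would be recomputed or refined as training progresses, so that example selection tracks the model's current difficulty estimates rather than a fixed ranking.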
Anthology ID: 2025.babylm-main.34
Volume: Proceedings of the First BabyLM Workshop
Month: November
Year: 2025
Address: Suzhou, China
Editors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue: BabyLM
Publisher: Association for Computational Linguistics
Pages: 488–495
URL: https://aclanthology.org/2025.babylm-main.34/
Cite (ACL): Eleni Fysikoudi, Sharid Loáiciga, and Asad B. Sayeed. 2025. Active Curriculum Language Modeling over a Hybrid Pre-training Method. In Proceedings of the First BabyLM Workshop, pages 488–495, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Active Curriculum Language Modeling over a Hybrid Pre-training Method (Fysikoudi et al., BabyLM 2025)
PDF: https://aclanthology.org/2025.babylm-main.34.pdf