KidLM: Advancing Language Models for Children – Early Insights and Future Directions

Mir Tafseer Nayeem, Davood Rafiei


Abstract
Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children’s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.
Anthology ID:
2024.emnlp-main.277
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4813–4836
URL:
https://aclanthology.org/2024.emnlp-main.277
Cite (ACL):
Mir Tafseer Nayeem and Davood Rafiei. 2024. KidLM: Advancing Language Models for Children – Early Insights and Future Directions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4813–4836, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
KidLM: Advancing Language Models for Children – Early Insights and Future Directions (Nayeem & Rafiei, EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.277.pdf