Using Curriculum Masking Based on Child Language Development to Train a Large Language Model with Limited Training Data

Evan Lucas, Dylan Gaines, Tagore Rao Kosireddy, Kevin Li, Timothy C. Havens


Abstract
In this paper, we detail our submissions to the Strict and Strict-Small tracks of the 2024 BabyLM Challenge. We approach this challenge with two methodologies: i) use of a novel dataset, and ii) development of a pre-training technique based on the fusion of child language acquisition with traditional masked language modeling, which we call curriculum masking. The novel dataset used for this task is based on user submissions to the Reddit forum (i.e., subreddit) “Explain Like I’m Five”, in which diverse concepts are explained using simple language. Curriculum masking works by creating learning phases based on a standard child language development timeline, where the masked words learned by the model start with simple nouns and gradually expand to include more complex parts of speech. We show that the internet-based training data yields a small improvement in evaluation scores compared to the baseline training data. Our proposed pre-training method of curriculum masking is conceptually novel and also shows improved rates of learning over typical masked language modeling pre-training, potentially allowing for good performance with fewer total epochs on smaller training datasets. Code for the curriculum masking implementation is shared at https://github.com/evan-person/curriculumMaskingBabyLM2024.
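As a rough illustration of the curriculum masking idea described in the abstract, the sketch below masks only tokens whose part-of-speech tag is unlocked in the current curriculum phase, starting with nouns and later adding verbs and then all tags. This is not the authors' implementation (see the linked repository for that); the phase tag sets, masking probability, and the use of NLTK's Penn Treebank tagger are illustrative assumptions.

```python
# Minimal curriculum-masking sketch (illustrative only, not the paper's code).
# Assumes NLTK's Penn Treebank POS tagger; phase tag sets and masking
# probability are hypothetical, not values taken from the paper.
import random
import nltk

# Resource names differ across NLTK versions, so try both old and new ids.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    try:
        nltk.download(resource, quiet=True)
    except Exception:
        pass

# Hypothetical curriculum loosely mirroring a child language development
# order: nouns first, then verbs, then every part of speech (None = all tags).
PHASES = [
    {"NN", "NNS", "NNP", "NNPS"},                                   # phase 0: nouns
    {"NN", "NNS", "NNP", "NNPS",
     "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},                      # phase 1: + verbs
    None,                                                           # phase 2: all tags
]

def curriculum_mask(text, phase, mask_token="[MASK]", mask_prob=0.15):
    """Randomly mask tokens, but only those whose POS tag is allowed in `phase`."""
    tokens = nltk.word_tokenize(text)
    allowed = PHASES[phase]
    masked = []
    for word, tag in nltk.pos_tag(tokens):
        maskable = allowed is None or tag in allowed
        masked.append(mask_token if maskable and random.random() < mask_prob else word)
    return " ".join(masked)

if __name__ == "__main__":
    random.seed(0)
    sample = "The child explains a simple idea using everyday words."
    for phase in range(len(PHASES)):
        print(f"phase {phase}: {curriculum_mask(sample, phase, mask_prob=0.5)}")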
Anthology ID: 2024.conll-babylm.19
Volume: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month: November
Year: 2024
Address: Miami, FL, USA
Editors: Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues: CoNLL | BabyLM | WS
Publisher: Association for Computational Linguistics
Pages: 221–228
URL: https://aclanthology.org/2024.conll-babylm.19/
Cite (ACL): Evan Lucas, Dylan Gaines, Tagore Rao Kosireddy, Kevin Li, and Timothy C. Havens. 2024. Using Curriculum Masking Based on Child Language Development to Train a Large Language Model with Limited Training Data. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 221–228, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal): Using Curriculum Masking Based on Child Language Development to Train a Large Language Model with Limited Training Data (Lucas et al., CoNLL-BabyLM 2024)
PDF: https://aclanthology.org/2024.conll-babylm.19.pdf