Timothy C. Havens
2024
Using Curriculum Masking Based on Child Language Development to Train a Large Language Model with Limited Training Data
Evan Lucas
|
Dylan Gaines
|
Tagore Rao Kosireddy
|
Kevin Li
|
Timothy C. Havens
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
In this paper we detail our submissions to the Strict and Strict-Small tracks of the 2024 BabyLM Challenge. We approach this challenge with two methodologies: i) use of a novel dataset, and ii) development of a pre-training technique based on the fusion of child language acquisition with traditional masked language modeling, which we call curriculum masking. The novel dataset used for this task is based on user submissions to the Reddit forum (i.e., subreddit) “Explain Like I’m Five”, which explains diverse concepts using simple language. Curriculum masking works by creating learning phases based on a standard child language development timeline, where the masked words learned by the model start with simple nouns and gradually expand to include more complex parts of speech. We show that using internet-based training data shows a small improvement in evaluation scores as compared to baseline training data. Our proposed pre-training method of curriculum masking is conceptually novel and also shows improved rates of learning over typical masked language modeling pre-training, potentially allowing for good performance with fewer total epochs on smaller training datasets. Code for the curriculum masking implementation is shared at https://github.com/evan-person/curriculumMaskingBabyLM2024.