CLIMB – Curriculum Learning for Infant-inspired Model Building



Introduction
Children acquire language skills from being exposed to an estimated two to seven million words per year (Gilkerson et al., 2017). The current learning regimes of large language models require disproportionately larger amounts of training data to acquire linguistic generalization capabilities (Zhang et al., 2021). State-of-the-art LMs are typically trained on gigabytes of data gleaned from the World Wide Web, on multiple GPUs continuously for days at a time (Zhao et al., 2023). For example, the Chinchilla language model was trained on a dataset of 1.4 trillion tokens (Hoffmann et al., 2022). Such large-scale training regimes are economically and ecologically unsustainable, and access to the required computing resources remains out of reach for most academic groups and industry start-ups (Izsak et al., 2021).
To enable language models to still perform well with limited data, recent work has looked at utilizing smaller, well-curated, and representative corpora (Samuel et al., 2023; Gao et al., 2020) and careful selection of training and model hyper-parameters (Geiping and Goldstein, 2023). 'Zero-shot' and 'few-shot' learning are other data-efficient approaches which can perform well in certain settings but rely on large pre-trained language models (Brown et al., 2020; Wei et al., 2021). These approaches, however, provide engineering solutions to the problem rather than a cognitively-inspired, compute-efficient framework for training language models from scratch.
Conventional pre-training of large language models remains far removed from human language learning: models operate on a predetermined static vocabulary and optimize a monotonous training objective on a randomly shuffled dataset. We conducted experiments to explore more dynamic learning processes that are motivated by the psycholinguistic and language acquisition literature and are set within the machine learning paradigm of curriculum learning (Bengio et al., 2009). Our models are implemented and evaluated within the 'BabyLM Challenge' framework, a shared task in which the stated goal is "to incentivize researchers with an interest in pretraining and/or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development" (Warstadt et al., 2023). Our goal in participating in the BabyLM Challenge is twofold. First, we aim to contribute toward democratizing language modelling research by training smaller language models that still perform well on NLP tasks. Second, we establish a computational framework based on curriculum learning for simulating aspects of human language acquisition. We participate in the strictest track of the challenge, limiting the training data to only 10 million words of text extracted from various pre-existing corpora.
Initially, we train our own BabyBERTa-style vanilla model (Huebner et al., 2021) and find that simply tuning model size and vocabulary size in itself leads to substantial performance gains on some of the BabyLM test sets compared to the shared task baselines. We also carried out a number of pre-processing steps on the training data to further improve performance, including concatenating input sequences to make the most of the available input length. In our own approach, which we term CLIMB (Curriculum Learning for Infant-inspired Model Building), we explore three different curriculum strategies for language modelling: gradually increasing the size of the vocabulary (vocabulary curriculum), the difficulty of the training instances (data curriculum), or the specificity of the objective function (objective curriculum) over the course of training. Within the context of the BabyLM Challenge, curriculum learning establishes a framework through which we attempt to replicate key facets of child language acquisition. Counter-intuitively, we find that all of our curriculum learning approaches under-perform our BabyBERTa-style (non curriculum learning) vanilla models. Our contribution to the BabyLM Challenge builds upon this negative finding in three main ways:
1. Our paper establishes a novel framework through which to categorize and implement curriculum learning methods that simulate human language acquisition. We open-source our accompanying code-base for future research to study how curriculum learning replicates the language learning dynamics in humans.
2. We conduct a comprehensive evaluation of our three main curriculum approaches; our results show that the curriculum learning settings we tested did not provide consistent improvements over a baseline on linguistic benchmarks. Instead, we provide a set of recommendations for specific combinations of tasks and settings which may benefit from our proposed curricula.
3. We highlight the importance of careful data, model and hyper-parameter selection to establish a well performing fully supervised baseline for the BabyLM shared task. Our vanilla models outperform the shared task baseline models on tasks involving grammatical knowledge (BLiMP: The Benchmark of Linguistic Minimal Pairs (Warstadt et al., 2020a)) and all the shared-task baselines except RoBERTa (Liu et al., 2019) on tasks involving natural language understanding (SuperGLUE (Wang et al., 2019)).

Curriculum Learning
Curriculum learning (Bengio et al., 2009) is a machine-learning paradigm which optimizes a model's performance by gradually increasing the difficulty of training over time according to a set schedule (a 'curriculum'), based on the idea that learning should proceed from easy to hard, inspired by the way that humans learn (Elman, 1993). Within the context of curriculum learning, one of the central questions is how to define and manipulate the difficulty of the learning process over the course of training. In a recent survey, Soviany et al. (2022) decompose this challenge into two main sub-problems: determining a sorting mechanism to assess the difficulty of instances and developing a pacing function for increasing difficulty over time.

Determining Difficulty
Previous work in curriculum learning typically focuses on difficulty from a data-centric perspective; however, we note that difficulty can arise from (at least) three major elements of training a neural model: the input representation, the data sampling, and the training process. We explore curriculum learning strategies across three distinct dimensions: the vocabulary, the order of training data, and the objective function.
For machine learning models, instance difficulty is in part influenced by the choice of instance representation. For language models, the representational space is constrained by the vocabulary. We propose a new vocabulary curriculum inspired by Soviany et al. (2022), who discuss linking the curriculum criteria to the observed vocabulary sizes in child development. To the best of our knowledge, this is the first attempt at manipulating the vocabulary available to a language model through curriculum learning.
In natural language processing models, the order of the training instances can have a strong effect on performance (Schluter and Varab, 2018). Existing approaches to instance-level curriculum learning assign each instance a pre-defined, static difficulty score based on linguistic criteria (Campos, 2021; Kocmi and Bojar, 2017; Liu et al., 2018; Platanios et al., 2019). It has been shown that humans pay more attention to stimuli that are in just the right zone of difficulty for them: neither too easy nor too hard (Kidd et al., 2012). This so-called 'Goldilocks effect' can be modelled by assessing the difficulty of an instance dynamically based on model behaviour (Sachan and Xing, 2016; Lalor and Yu, 2020). Static and dynamic difficulty assessment can be mapped to teacher-centric and learner-centric educational approaches, and we compare both variants in our data curriculum experiments.
Human language learning is guided and enabled to some extent by other agents in the learner's environment (e.g., adult caregivers, siblings) who interact with the learner. In machine learning, such interactions are modelled by the objective function that guides the weight optimization process. The typical 'masked language modelling' (MLM) objective function requires that a model predicts a target token from a pre-defined vocabulary of size N given the surrounding context. Thus standard MLM defines an N-way token classification task.
Curriculum learning can be leveraged within this context to attenuate the difficulty of the classification task during training. One natural starting point for doing so is to redefine the classification task to be over a smaller set of items, K, such that K << N. Bai et al. (2022) map rare words to their hypernyms to simplify the classification task in training. A related line of research suggests replacing certain words with either part-of-speech tags (Wang et al., 2023) or syntactic dependency relations (Cui et al., 2022). Since the number of syntactic tags is substantially smaller than the number of vocabulary items, these approaches greatly reduce the difficulty of the objective. Moreover, by varying the number of syntactic tags that the model should classify over, the difficulty of the task can be dynamically adapted (Wang et al., 2023). We take inspiration from this latter line of work in defining our own objective curriculum.

Pacing Functions
Once a notion of difficulty is set, a pacing function is needed to govern how quickly the model will progress from training on easier examples to training on harder ones (Wu et al., 2021). We experiment with two different pacing functions: linear and logarithmic. Linear pacing functions involve a steady and consistent advancement through the curriculum. This approach ensures a gradual increase in difficulty over time. Logarithmic pacing functions, on the other hand, emphasize early exposure to "easier" concepts, with diminishing increments as the model's capabilities are assumed to increase. Both pacing functions have been proposed in the broader curriculum learning literature (Bai et al., 2022; Li et al., 2021; Wu et al., 2021).
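To make the two pacing shapes concrete, the following is a minimal sketch and not the implementation used in our code-base. The exact functional form of the logarithmic curve, the `base` parameter, and the function names are assumptions chosen for illustration; the only properties taken from the text are that linear pacing advances at a constant rate and that logarithmic pacing advances quickly at first with diminishing increments.

```python
import math


def _progress(step: int, start_step: int, end_step: int) -> float:
    """Fraction of the curriculum window elapsed, clipped to [0, 1]."""
    if step <= start_step:
        return 0.0
    if step >= end_step:
        return 1.0
    return (step - start_step) / (end_step - start_step)


def linear_pacing(step: int, start_step: int, end_step: int,
                  start_frac: float = 0.1) -> float:
    """Unlocked fraction of the curriculum, growing at a constant rate."""
    p = _progress(step, start_step, end_step)
    return start_frac + (1.0 - start_frac) * p


def log_pacing(step: int, start_step: int, end_step: int,
               start_frac: float = 0.1, base: float = 10.0) -> float:
    """Unlocked fraction growing quickly at first, then with diminishing increments.

    The curve log(1 + (base - 1) * p) / log(base) maps p in [0, 1] onto [0, 1];
    this particular form is an assumption made for this sketch.
    """
    p = _progress(step, start_step, end_step)
    shaped = math.log1p((base - 1.0) * p) / math.log(base)
    return start_frac + (1.0 - start_frac) * shaped
```

Under this sketch both functions start at `start_frac` and reach 1.0 at `end_step`, and the logarithmic curve lies above the linear one everywhere in between.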

Methodology
All of our models are based on an 8-layer Transformer language model (Section 3.2) comparable to the BabyBERTa model (Huebner et al., 2021). For all experiments, we use the Hugging Face Transformers library (Wolf et al., 2020), Weights & Biases for performance tracking (Biewald, 2020), Hydra to define experiment configurations (Yadan, 2019), and a high performance computing cluster.
We introduce curriculum learning to three of the primary components of language model pretraining: the vocabulary (Section 3.3), the data sampling approach (Section 3.4), and the selection of the objective function (Section 3.5). For each of these aspects, we attempt to simulate facets of human language learning by dynamically increasing the difficulty of the language modelling task over the course of training.

Training Data
We use only the training data provided in the STRICT-SMALL track of the BabyLM challenge, which is limited to 10 million words and combined from 10 individual corpora. Given the variety of data sources (including books, subtitles, transcripts and articles) we carefully curated the data to ensure consistency across corpora. These steps include lowercasing, normalizing punctuation, standardizing typographical conventions using regular expressions, and removing extraneous lines (such as page numbers, bibliography entries, plain text tables, and one-word on-screen actions). We also concatenated contiguous sections of five lines into a single data instance in the transcribed speech corpora (except the BNC) due to the relatively short sequence lengths. In addition, we join data at the point of passing input to the models, in order to make full use of the available input sequence length (128 subtokens).
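The two concatenation steps can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not our released preprocessing script: the helper names `group_lines` and `pack_token_ids` are hypothetical, lines are joined with a single space, and packed sequences are cut at exactly 128 subtokens without inserting separator tokens.

```python
from typing import List


def group_lines(lines: List[str], group_size: int = 5) -> List[str]:
    """Join every `group_size` contiguous lines into one data instance."""
    return [
        " ".join(lines[i:i + group_size])
        for i in range(0, len(lines), group_size)
    ]


def pack_token_ids(tokenized: List[List[int]], max_len: int = 128) -> List[List[int]]:
    """Concatenate tokenized instances and cut them into chunks of `max_len`.

    The final chunk may be shorter than `max_len` and would need padding
    before being passed to a model.
    """
    buffer: List[int] = []
    packed: List[List[int]] = []
    for ids in tokenized:
        buffer.extend(ids)
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    if buffer:
        packed.append(buffer)
    return packed
```

Packing in this way trades instance boundaries for fewer padding tokens, which matters when many instances are much shorter than the input length.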
According to the rules of the STRICT-SMALL track, we were not permitted to make use of external resources, including supervised part-of-speech (POS) taggers. Therefore, we attempted to cluster the words in the training data into word classes by applying the anchor-features algorithm of the unsupervised POS-tagger by Stratos et al. (2016) on our cleaned data. The algorithm yields 30 clusters which we manually mapped to the 12 universal part-of-speech tags (Petrov et al., 2012) by choosing the POS-tag that best represents the anchor word of each cluster. We were only able to identify 10 of the 12 universal POS tags in the 30 clusters: no cluster neatly coincided with 'ADV' or 'X' tags. We provide further detail on our data preprocessing and unsupervised POS-tagging in the Appendix.
We provide our cleaned and tagged versions of the 10M word dataset on Hugging Face, along with the scripts used (https://huggingface.co/cambridge-climb). Our pre-processing procedure reduces the data down to 335,858 instances (corresponding to roughly 9.4 million words) from the initial 1,058,740 newline-delineated samples. Our models, tokenizers and part-of-speech taggers were trained on this pre-processed data; however, we actually noticed an increase in performance when training on the raw data, as discussed in Section 5.

Vanilla Models
We investigate three different sizes of a vanilla Pre-Layer Norm RoBERTa model (Liu et al., 2019; Ott et al., 2019) based on the BabyBERTa model (Huebner et al., 2021): 'small', 'medium', and 'large'. Table 2 lists the model configurations and presents the results for the different model sizes evaluated by perplexity, on BLiMP (Warstadt et al., 2020a) and on the supplementary BLiMP-like tasks issued by the BabyLM organizers ('BLiMP.Supp').
We found the medium model with a small vocabulary size performed the best overall; however, the small model achieved similar results, and so to save on compute and keep to the restrained intentions of the STRICT-SMALL track, we used the small model in our curriculum learning experiments. We use Byte Pair Encoding (BPE) tokenization (Gage, 1994) with a vocabulary of 8,192 because it yields better overall performance compared to a larger vocabulary of 16,384. The tokenizers we use in our experiments were trained on the cleaned data that we processed using the steps outlined in Section 3.1. In pilot experiments, we did not observe the benefits reported by Huebner et al. (2021) from removing the unmasking procedure that is a standard component of the MLM objective (Devlin et al., 2019), and therefore did not investigate this option further. All of the curriculum learning methods in the following sections were applied on top of our small vanilla BabyBERTa-style baseline: to isolate the effect of the curriculum-learning training process, we fixed the architecture of the model and the model hyper-parameters. We use an AdamW optimizer with linear scheduling (Loshchilov and Hutter, 2019).

Vocabulary Curriculum
During the early stages of language acquisition, children start with a small vocabulary that rapidly expands at a rate of eight to ten words per day (Weizman and Snow, 2001). In this process, children prioritize learning verbs and nouns before progressing to other parts of speech (Bergelson and Swingley, 2015). Large language models, on the other hand, tend to begin training with a full, fixed vocabulary available to them.
To represent a child's growing vocabulary, we select a limited vocabulary in the initial stages of learning and map all other input tokens into the representation for the unknown token (UNK). We consider three strategies for selecting tokens. In the first strategy, tokens are selected according to frequency. We approximate the frequency of a token by the identifier the BPE tokenizer assigns to it, as lower IDs are assigned to tokens that are merged first (i.e., sequences of characters that occur more frequently in the corpus). In the second strategy, tokens are selected by their word class. We approximate the word class of a token by the cluster that the unsupervised POS-tagger assigns to it. We order the word classes as follows, progressing from lexical to functional classes per Bergelson and Swingley (2015): NOUN, VERB, ADJ, PRON, DET, ADP, NUM, CONJ, PRT, PNCT. In this strategy, all words with the respective part-of-speech tag are included in the vocabulary at the same step during learning. To smooth this process, we combine the frequency and the word class constraint in the third strategy. We sort words by their frequency (approximated by the token ID) within each part-of-speech category. Note that the same word may be available in some instances and not others if it is assigned a more difficult POS tag.
During the initial steps of training, only 10% of the tokens are available while the rest are replaced with UNK. The vocabulary curriculum regime begins after 25,000 training steps and ends at 350,000 steps, during which the vocabulary gradually increases according to a pacing function. We experiment with linear and logarithmic pacing functions. After the end of the curriculum regime, there remain 50,000 training steps before the end of training during which all of the vocabulary tokens are available to the model. Figure 5 in the Appendix shows a plot of the percentage of unmasked vocabulary over the course of training according to our pacing functions.
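As an illustration of the first (frequency-based) strategy only, the sketch below replaces every token whose ID lies above the currently available range with UNK. It is a simplified reading of the description above, not our released code: the function names are hypothetical, the pacing shown is linear for brevity, the handling of special tokens via `special_ids` is an assumption, and the POS-based strategies (which depend on per-instance tags) are not covered.

```python
from typing import Iterable, List


def available_vocab_size(step: int, vocab_size: int = 8192,
                         start_step: int = 25_000, end_step: int = 350_000,
                         start_frac: float = 0.1) -> int:
    """Number of token IDs visible to the model at `step` (linear pacing)."""
    if step <= start_step:
        frac = start_frac
    elif step >= end_step:
        frac = 1.0
    else:
        progress = (step - start_step) / (end_step - start_step)
        frac = start_frac + (1.0 - start_frac) * progress
    return int(frac * vocab_size)


def apply_vocab_curriculum(token_ids: List[int], n_available: int, unk_id: int,
                           special_ids: Iterable[int] = ()) -> List[int]:
    """Map token IDs outside the available range to UNK.

    Lower BPE token IDs are treated as a proxy for higher frequency, so the
    available vocabulary is simply the ID range [0, n_available).
    """
    keep = set(special_ids)
    return [t if (t < n_available or t in keep) else unk_id for t in token_ids]
```

With the default values, 819 of the 8,192 token IDs are visible before step 25,000 and all of them from step 350,000 onwards.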

Data Curriculum
Conventional masked language modelling approaches train a given neural network on a large amount of crawled internet data. The resulting text sequences are usually not curated beyond basic cleaning and are presented to the model in random order, in contrast to the way that human children learn a language.
We attempt to carefully optimize the way data is sampled and presented to the language model over the course of training. We experiment with theory-driven and model-driven approaches to determine the 'relative difficulty' of a certain example and train the model on instances with progressively increasing difficulty.
Source Difficulty. We order the available datasets based on their sources so that spoken samples are considered 'easier' and purely written texts 'harder', following the findings of Huebner et al. (2021). Within this ordering, we place the mostly child-directed speech from CHILDES before adult-to-adult dialogues in the Switchboard Corpus, and Simple Wikipedia before Wikipedia (see Table 3).
Model Difficulty. Determining the difficulty of an instance based on its data source is a relatively naive heuristic that ignores the variation of instance difficulty within one corpus. As a more fine-grained alternative, we determine the difficulty of each instance individually using the model-intrinsic metric of perplexity, which reflects how likely a sentence is under a given model. We experiment with two variants: a static unigram language model and a more dynamic self-evaluation. With the unigram model, perplexity for each instance is only determined once at the beginning of training. Alternatively, we evaluate the perplexity of the remaining training data using the model that has been trained so far, taken from model checkpoints saved at regular intervals in training (every 25K steps).
One challenge with the latter approach is the lack of exposure to training data at the beginning, leading to random perplexity scores for each sample. To address this, we propose two ideas: 1) using a separately trained unigram model to initially evaluate perplexity, or 2) initially sampling training instances randomly. After 25,000 training steps, we switch to using the current model for perplexity evaluation. Every 25,000 steps thereafter, we re-evaluate perplexity to identify samples categorized as relatively difficult or relatively easy by the model.
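The static variant can be sketched as below: a unigram model is estimated from token counts, each instance receives a perplexity score once, and instances are then ordered from low to high perplexity. This is a minimal illustration under assumptions, not our implementation: the function names are hypothetical, the unigram model uses unsmoothed maximum-likelihood estimates from the same data it scores, and every instance is assumed to be non-empty.

```python
import math
from collections import Counter
from typing import List, Sequence


def unigram_perplexities(instances: Sequence[Sequence[str]]) -> List[float]:
    """Per-instance perplexity under a unigram model estimated on `instances`."""
    counts = Counter(tok for inst in instances for tok in inst)
    total = sum(counts.values())
    scores = []
    for inst in instances:
        # Average negative log-likelihood per token, then exponentiate.
        nll = -sum(math.log(counts[tok] / total) for tok in inst) / len(inst)
        scores.append(math.exp(nll))
    return scores


def order_by_difficulty(instances: Sequence[Sequence[str]]) -> List[int]:
    """Indices of instances sorted from lowest to highest perplexity."""
    scores = unigram_perplexities(instances)
    return sorted(range(len(instances)), key=lambda i: scores[i])
```

In the dynamic variant, the scoring function would instead be the masked language model loaded from the most recent checkpoint, with the ordering recomputed every 25,000 steps.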

Objective Curriculum
The MLM objective has proven tremendously successful in training Transformer networks as language models (Devlin et al., 2019). Psycholinguistic research, however, suggests that MLM is not a cognitively plausible approximation of language acquisition processes in children (Caucheteux et al., 2023). Curriculum learning establishes a framework for varying the difficulty of the learning process over the course of training. The MLM objective is a very challenging discriminative classification task because the identity of the masked token needs to be determined over the entire vocabulary. We experiment with using more coarse-grained tasks at the initial stages of training to facilitate generalization and leverage syntactic information. Research in cognitive linguistics has shown that one-year-old infants are sensitive to distributional aspects of language and from two years of age begin to recognize lexical categories such as nouns and verbs (Alishahi, 2010; Gleitman, 1990). We therefore experiment with predicting only the word class of a masked token at the start of training rather than predicting its exact target token ID.
The psycholinguistic literature remains divided on the question of how exactly word learning proceeds from memorizing a small set of fixed lexical items to a more generalized representation of word classes (Clark and Casillas, 2015). Our framework provides a flexible approach to vary the difficulty of objective functions during the course of training, and to enable systematic studies of the effect of objective functions on the acquisition of linguistic knowledge by a model. Here we propose estimating the word class using the unsupervised POS tagger and we vary the number of POS tags which are being classified over. The masked word is classified into 1) one of VERB, NOUN, or OTHER, or 2) one of 10 universal POS tags.
We examine activating the tasks in sequential order (first word class prediction then MLM) or optimizing them in parallel, comparable to a multitask learning setting. For each objective function, we learn a separate task head with its own linear task classifier and separate optimizer.
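The difference between the sequential and the parallel setting can be illustrated with a small scheduling sketch. The names `coarse_label` and `active_objectives`, the schedule format, and in particular the 10% switch point in the sequential example are hypothetical values chosen for illustration; they are not the thresholds used in our experiments.

```python
from typing import List, Tuple

# (objective name, start fraction, end fraction) of total training steps
Schedule = List[Tuple[str, float, float]]


def coarse_label(pos_tag: str) -> str:
    """Collapse a universal POS tag into the 3-class VERB / NOUN / OTHER scheme."""
    return pos_tag if pos_tag in ("NOUN", "VERB") else "OTHER"


def active_objectives(step: int, total_steps: int, schedule: Schedule) -> List[str]:
    """Objectives whose task heads are optimized at `step` (0 <= step < total_steps)."""
    progress = step / total_steps
    return [name for name, start, end in schedule if start <= progress < end]


# Sequential: word class prediction first, then MLM (switch point is hypothetical).
sequential: Schedule = [("pos_10", 0.0, 0.1), ("mlm", 0.1, 1.0)]
# Parallel (multitask): both objectives are active from the outset.
multitask: Schedule = [("pos_10", 0.0, 1.0), ("mlm", 0.0, 1.0)]
```

At each training step, the loss of every active objective would be computed by its own task head and passed to that head's optimizer, in line with the separate heads and optimizers described above.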

Results
Multiple evaluation metrics are employed in BabyLM. In this paper we focus on BLiMP (Warstadt et al., 2020a) and the supplementary BLiMP-style tests provided by the shared task organizers. We also report our results on the natural language understanding benchmark, SuperGLUE (Wang et al., 2019), and the ambiguous subset of MSGS (the Mixed Signals Generalization Set) (Warstadt et al., 2020b). In brief, BLiMP evaluates specific linguistic abilities, MSGS evaluates linguistic preference over surface generalisation and SuperGLUE evaluates downstream task performance. For all scores, we report the average score across all categories, rather than test instances, as provided by the BabyLM evaluation pipeline. (For instance, there are 12 categories in BLiMP but 50+ individual tests; we average over the scores given for each category, rather than the scores given for each test.)
Figure 1: Comparison of the BabyLM baselines with our BabyBERTa-style vanilla models (left), and our vanilla models against our curriculum learning models (right), using BabyBERTa-small trained on clean data as a reference point (asterisked) to show the difference in scores on BLiMP and BLiMP-supplement tasks. For combination models, all pacing is logarithmic, and 'multitask' refers to the 2-task objective curriculum, with 10 POS tags and MLM from the outset. Absolute values may be found in Appendix Tables 5-9.
All of our curriculum learning models are small BabyBERTa-style ones using the parameters shown in Table 2 and the cleaned training dataset of 9.4M words (reduced from the 10M word dataset for the STRICT-SMALL track); their results can be found in Tables 5, 6 and 7. In the tables we compare to our small BabyBERTa-style vanilla model also trained on the clean data (Section 3.2). Figure 1 visualizes these comparisons for the BLiMP tasks; there are similar plots for SuperGLUE in the Appendix (Figure 4). Furthermore, we experimented with some combinations of different curricula to see how they would interact (Table 8), and compare the official BabyLM shared-task baselines with our shared task entries: a number of our own BabyBERTa-style vanilla models and curriculum learning models (Table 9). For all of our runs, we use the same set of hyper-parameters that we report in Table 10. We also report the average amount of compute used for each type of curriculum learning setting (Table 11).
We find notable gains for our own vanilla models over the shared-task baselines, and, while we do not identify further large improvements in our curriculum learning models, we do notice some modest gains which suggest possibilities for future research and experimentation over variables. While the differences in performance between most of our experimental conditions are small, the large number of ablations we run enables us to provide a comprehensive set of recommendations for how and when different curriculum learning strategies may offer improved performance on linguistic tasks. Below we summarize our observations over the full results tables.
In general, log pacing works at least as well as linear pacing across different curriculum learning strategies. In our data curriculum experiments, models using the log pacing function outperform their linear counterparts in 4/4 settings on BLiMP, and 3/4 settings for BLiMP-supplement and SuperGLUE (Table 6). This indicates that rapidly increasing the difficulty of training instances in the early stages brings downstream benefits on grammaticality and NLU tasks.
In our vocabulary curriculum experiments, on the other hand, the picture is less clear. Log pacing outperforms linear in 2/3 settings on BLiMP and 3/3 on SuperGLUE, but 0/3 for BLiMP-supplement (Table 5). Presumably this is a reflection of the different vocabulary required by each set of evaluation tasks, which could be a matter for future investigation but also indicates that we do not yet have a clear generalizable pacing function for the vocabulary curriculum. There are of course other pacing functions to be tried.
Different representations of vocabulary difficulty work better for different tasks. When representing difficulty in the vocabulary curriculum experiments, token ID (our proxy for frequency) appears to work better than word classes (POS tags) or a combination of token ID and POS tags on the BLiMP evaluation tasks, but worse than POS tags on SuperGLUE and MSGS (Table 5).
In multi-corpora datasets, ordering by difficulty is a good first step. Training data requirements have grown so much in modern NLP that training a language model from scratch will usually involve multiple datasets, or multiple domains. The results of our data curriculum experiments indicate that a good first step is to put these sub-corpora into some order of intuitive difficulty, as we did (Table 6). In the case of BLiMP this approach outperforms our perplexity-based data curricula and, with log pacing, our vanilla model. The same is true of MSGS (with log pacing), as well as BLiMP-supplement and SuperGLUE (though the last two do not beat our vanilla model). Amongst the perplexity-driven models, the picture is less positive: out of 24 tests, only one model outperforms our vanilla model (log pacing, random initialisation + model perplexity in Table 6).
Multitask learning holds sway over sequentially swapping objective functions for now. In our experiments with curricula for the objective function, we compare training on simultaneous tasks, known as multitask learning (Caruana, 1997), with predefined sequences of objective functions which swap from one to another at set thresholds in the training process. We set up two sequential curricula: one with 2 tasks (predicting the 10 universal POS tags found in our dataset, and MLM) and the other with 3 (like the 2-task curriculum, additionally with noun/verb/other prediction). We compare these against multitasking alternatives. In general the sequential curricula are outperformed by the multitasking ones, though the 3-task sequential curriculum outperforms our BabyBERTa-style vanilla model on SuperGLUE and is only marginally behind our best-performing multitask model (Table 7). The multitask learning model with 10-class universal POS-tag prediction and MLM in place from the outset performs best on BLiMP and SuperGLUE. However, our best model on BLiMP-supplement, a multitask one, has an element of sequential task scheduling in that the two POS-tag prediction tasks are lined up one after the other, with a switch from 3-class to 10-class after 6.25% of training steps. In Figure 2, we visualize this result for each task in BLiMP-supplement, illustrating that our curriculum learning model improves over our vanilla model in 5/6 tasks. Altogether, these results suggest that sequential objective function curricula do hold some potential for performance gains if further tuning of the tasks and scheduling can be carried out.
Combining all three curricula shows potential on BLiMP. While each individual curriculum learning experiment did not result in consistent improvements across tasks, we investigated whether combining aspects from the different curricula would, together, improve the model. We do find that a combination of all three curricula outperforms any single curriculum model on BLiMP, but the same is not true for BLiMP-supplement and SuperGLUE (Table 8). This is another matter for future investigation, as it seems that improving each of the three curricula we investigate may lead to further gains if they are all combined.
In small data settings, filtering data which we intuitively think is noisy is in fact counter-productive. Perhaps surprisingly, we find that the vanilla models trained on the raw data outperform those trained on the pre-processed data on BLiMP and MSGS. We surmise that models can learn even from linguistically non-standard datapoints.

Submitted models
Table 9 in the Appendix compares our submissions to the shared task baselines. We submitted our best curriculum learning models from each individual curriculum learning setting, and four different vanilla models: two small and two medium models, where each pair additionally varies by whether it was trained on the pre-processed dataset or the raw dataset. We find our curriculum learning models are comparable to our BabyBERTa-style vanilla models, and we think that in most cases some continued experimentation with configurations may yield larger gains for CL approaches.
For interest, we also trained a BabyBERTa-style large vanilla model on the 100M training set made available in the BabyLM STRICT track ('large-100M' in the table). The improvements over smaller models trained on less data are evident and finally provide an advantage over the RoBERTa baseline on SuperGLUE. It remains to be seen how well curriculum learning methods, and our preprocessing methods, would work with this larger dataset.

Discussion
We set out to investigate a number of curriculum learning approaches to language model training, motivated by findings from the human language acquisition process and by the wish to successfully train smaller models for smaller budgets. We first implemented a stronger model of our own, based on BabyBERTa (Huebner et al., 2021), and found that a small 8-layer vanilla model could outperform the provided BabyLM baselines on the BLiMP grammaticality tests and get close to the best RoBERTa shared-task baseline on SuperGLUE. This underlines the findings reported in the BabyBERTa paper: that with smaller datasets, it makes sense to use smaller models and a smaller vocabulary size.
The results of our curriculum learning experiments, trained with a small BabyBERTa-style vanilla model, suggest that we can further improve performance in certain linguistic tasks through careful choices of the pacing function, of how we represent and grow the model's vocabulary during training, of how we select the next training instances according to their difficulty, and of how we vary the objective function. Specifically, we find that a logarithmic pacing function works better for the data curriculum than a linear one, but the findings for the vocabulary curriculum are less clear. Other pacing functions might be tried in the future, including those that reflect acquisition theory around non-monotonic or 'U-shaped' development trajectories.
It is apparent that ordering the subcorpora within a training set may be worthwhile, and that perplexity-based approaches to data selection hold potential even though we have not found a clear-cut best method for perplexity calculation as yet.
As shown in other NLP work, multitask learning can be a beneficial approach, though MLM or next-word prediction remain preeminent as singular tasks used in language modelling. We find multitask learning models hard to beat in the objective curriculum, but do find good performance in our sequential settings. We believe that future work varying the timing of task switches and introducing more tasks could be worthwhile.
On a more general note, the BabyLM challenge evaluates a language model only on its final downstream performance on a set of tasks, i.e. at a finite point in time. The challenge does not directly measure whether a given model is learning in a 'human-like' fashion. Our contribution to the BabyLM challenge is to provide a set of curriculum learning strategies which are motivated by the language learning dynamics of infants and children.
We encourage future research to study how to quantitatively evaluate whether the learning trajectory of a model parallels that of a human language learner and how similarities to human language learning relate to downstream NLU performance.

Conclusions
We use child-like language learning as inspiration to investigate and implement three types of curriculum learning for language modelling: gradually increasing the size of the vocabulary (vocabulary curriculum), the difficulty of the training instances (data curriculum), or the specificity of the objective function (objective curriculum).
We find that our BabyBERTa-style vanilla models outperform the BabyLM baselines on BLiMP and MSGS, and get close on SuperGLUE. Our various curriculum learning models at times offer further gains over our vanilla models, and indicate the potential for curriculum learning methods given further exploration. We list out a set of recommendations for when and how to optimally apply our proposed curriculum learning strategies.
Additionally, our vanilla model trained on unprocessed data outperforms the version trained on 'cleaned' data, suggesting that retaining as much data as possible, in low-resource settings, is more important than standardizing it according to linguistic norms.
Finally, our work establishes a computational framework for how to categorise and implement curriculum learning strategies that simulate human language learning dynamics.

Table 1: Curriculum learning experiments overview. Table 1 provides an overview of our experiment variables.

Table 2: Our vanilla BabyBERTa-style models evaluated on original BLiMP and the BLiMP-like tasks prepared for BabyLM (BLiMP.Supp). Models are grouped by their vocabulary sizes.

Table 3: Difficulty level assigned to each dataset.