On the effect of curriculum learning with developmental data for grammar acquisition

This work explores the degree to which grammar acquisition is driven by language 'simplicity' and the source modality (speech vs. text) of data. Using BabyBERTa as a probe, we find that grammar acquisition is largely driven by exposure to speech data, and in particular through exposure to two of the BabyLM training corpora: AO-Childes and Open Subtitles. We arrive at this finding by examining various ways of presenting input data to our model. First, we assess the impact of various sequence-level complexity-based curricula. We then examine the impact of learning over 'blocks': spans of text that are balanced for the number of tokens in each of the source corpora (rather than number of lines). Finally, we explore curricula that vary the degree to which the model is exposed to different corpora. In all cases, we find that over-exposure to AO-Childes and Open Subtitles significantly drives performance. We verify these findings through a comparable control dataset in which exposure to these corpora, and speech more generally, is limited by design. Our findings indicate that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data. We hope this encourages future research into the use of more developmentally plausible linguistic data (which tends to be more scarce) to augment general purpose pre-training regimes.


Introduction
Pre-training modern LLMs has become an increasingly resource-intensive process, often requiring hundreds of GPU hours and enough electricity to power a small village. These requirements have increasingly restricted model creation to the few actors able to muster the necessary resources, excluding many from participating in research in the field.
On the other hand, recent work (Huebner et al., 2021; Mueller and Linzen, 2023) has shown that Transformer LLMs can acquire knowledge of grammar and syntax with less data than was previously thought necessary, provided that they are exposed to simpler forms of language. These findings offer hope that research on pre-training can once again become accessible to the community as a whole.
However, even if scale may not be such a strict requirement for the acquisition of linguistic knowledge, there are two tendencies exhibited by transformer models that may still be barriers to accessibility. Firstly, simply increasing the number of training steps generally yields better results. In fact, recent work by Murty et al. (2023) has shown that continuing training long past train loss saturation can lead to acquisition of a bias towards tree-likeness. While this is a fascinating finding in its own right (as hierarchical structure is considered a central feature of natural language), many groups simply won't have the GPU hours necessary to reach this point, so resources may remain a barrier. Secondly, it is often the case that simply increasing the complexity of a model can be beneficial (e.g. greater depth can aid syntactic generalisation (Mueller and Linzen, 2023)), but increasing complexity also increases cost.
This work investigates whether we can use the starting small approach to curriculum learning (Elman, 1993) combined with a small-scale, developmentally plausible pre-training set to aid model grammar acquisition without necessitating an increased budget of training steps. Our findings are mixed. We were unable to significantly outperform a random sampling baseline over all the pre-training corpora contained in the strict-small track. However, we are able to attribute this to the prevalence of high-utility simple speech data. We demonstrate through the use of a control corpus that in a setting where this high-utility data is more scarce, the benefits of developmentally ordered learning start to show themselves.
Elman (1993)'s seminal early work presented the idea of starting small, whereby a model is first exposed to simpler data before moving on to more complex types of input. The idea is that complex data might get the model to learn 'false friend' heuristics that are actually harmful in the long run, but simple data might get it to learn in a way that generalises well. However, this hypothesis is not without controversy. Rohde and Plaut (1999) found that networks trained on complex sentences from the start performed better than those trained on simpler sentences initially, contradicting the starting small hypothesis. They argue that previous studies supporting the starting small hypothesis may have terminated the training of complex networks too early. Bengio et al. (2009) train a language model using a curriculum learning strategy where only spans of text containing the first 5k most frequent words are included, then expanding to the first 10k and so on. They find that while a random sampling baseline initially achieves a superior loss, with sufficient updates the curriculum strategy comes to a better minimum and converges more stably.
These approaches have in common that they gradually reveal more and more of the dataset. An alternative approach is a single-phase curriculum where the input data is sorted by some criterion and then presented to the model in a fixed ordering. The model goes through the curriculum once, and does not revisit simpler data once it transitions to more complex data. The success of the single-phase approach depends heavily on how complexity is defined, and has shown dubious results when applied to NLP (Campos, 2021; Surkov et al., 2022). Even under a developmentally plausible setting, the efficacy of the single-phase approach has been shown to be mixed (Huebner et al., 2021).

Model and Training Details
The baseline model architecture we use in this work is an adaptation of BabyBERTa (Huebner et al., 2021). BabyBERTa is a variant of RoBERTa (Liu et al., 2019), with a few key differences:
No Unmasking: RoBERTa uses unmasking to minimise the disparity between pre-training and fine-tuning (where no mask tokens are used). BabyBERTa instead removes unmasking, following the finding that doing so substantially improves model grammar acquisition.
No length truncation: Sequences which exceed the max length set in BabyBERTa are excluded instead of truncated. This ensures the model is only provided with whole utterances that correspond to a coherent linguistic unit.
Smaller Size: BabyBERTa is both shallower (fewer layers) and narrower (lower hidden and feed-forward size) than the original RoBERTa.
Training Data and Vocab Size: BabyBERTa is pre-trained on child-directed speech and uses a substantially smaller vocabulary size in order to mimic that of a 6-year-old (theorised to be roughly 6k words).
We adopt this architecture for use in our paper with some alterations:
Increased Vocabulary: The BabyLM training corpora consist of more diverse data than AO-Childes, and encompass a wider range of developmental complexity. Consequently, a greater vocabulary size may be beneficial. We performed a grid search over vocabulary sizes 10k, 20k, 30k, 40k and 50k and found 30k to be optimal.
Increased Width: We double the hidden size and feed-forward network dimension of the original BabyBERTa from 256 to 512 and 1024 to 2048 respectively. These changes yielded slight improvements in BLiMP performance, but without them the model performed substantially worse on NLI tasks than the RoBERTa baseline provided for the challenge. However, increased width yields only minimal improvements in terms of grammar acquisition.
We tested increasing the depth of the model (more layers), but found this yielded no improvements within the pre-training step budget we had available, nor did increasing the number of attention heads.
Our remaining model parameters are the defaults for RoBERTa from the transformers library (Wolf et al., 2019). We use relative key query positional embeddings and set our max sequence length during training to 128 for efficiency reasons, and follow the no-truncation strategy. We set the learning rate to 1e-5 and the max number of steps to 120k using batch size 128. Unless stated otherwise, all our experiments utilise these same hyperparameters. We utilise dynamic masking as with the original RoBERTa, and no unmasking following BabyBERTa in all cases without exception. While the latter choice may impact downstream performance in the fine-tuning tasks, the focus of this paper is largely on grammar acquisition as measured by the zero-shot evaluation suite, and here removing unmasking proved beneficial.
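The no-truncation and no-unmasking choices described above can be illustrated with a minimal sketch. This is not the training code used for the experiments: the function names are hypothetical, and the 15% selection rate and 10% random-replacement rate are assumed illustrative defaults rather than values reported here. The only property the sketch is meant to show is that a selected token is never deliberately left unchanged.

```python
import random


def keep_whole_sequences(token_seqs, max_len=128):
    """Drop (rather than truncate) sequences longer than max_len."""
    return [seq for seq in token_seqs if len(seq) <= max_len]


def mask_without_unmasking(token_ids, mask_id, vocab_size, rng,
                           mask_prob=0.15, random_token_prob=0.1):
    """Dynamic masking in which no selected token is deliberately kept as-is.

    Each position is selected with probability mask_prob (assumed value).
    A selected position becomes a random token id with probability
    random_token_prob (assumed value) and mask_id otherwise. A random
    draw can coincide with the original token by chance.
    Returns (inputs, labels); labels are -100 at unselected positions.
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        if rng.random() < random_token_prob:
            inputs[i] = rng.randrange(vocab_size)
        else:
            inputs[i] = mask_id
    return inputs, labels
```

Calling `mask_without_unmasking` afresh each time a sequence is drawn gives dynamic masking, since the selected positions change between calls.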

Sequence Complexity Curricula
Our first point of investigation was to examine whether we could use sequence complexity based curricula to improve grammar acquisition. In the original BabyBERTa paper, the authors found that training on AO-Childes in its original ordering (which corresponds to age ordering, hence AO) led to better grammar acquisition than the reverse, but failed to outperform a random sampling baseline. They attribute this failure to a lack of vocabulary diversity in each batch when using age ordering. By contrast, the BabyLM pre-training corpora exhibit varying complexities (AO-Childes or Open Subtitles are on average much simpler than Wikipedia, see Figure 2), as well as variance in complexity within the corpora. Consequently, we hypothesised that we may be able to scaffold learning by presenting sequences to the model in order of complexity, while mitigating the potential issue of vocabulary and domain diversity by drawing these sequences from across all the source corpora.

Curriculum Types
We tested three kinds of curricula using different measures for complexity. As we were submitting to the strict small track, we only used sequence complexity metrics that could be easily inferred from the raw data. We call lines of the corpora 'sequences' for lack of a better term. Each corresponds to a linguistically coherent unit, but they can vary from short transcribed utterances to full articles. It is likely that better curricula can be created by using more complex and linguistically motivated metrics, but without the use of external resources this is difficult to achieve. The three types we tried are:
Entropy: Entropy favours highly likely sequences, but penalises based on length. This should order data such that the most likely shortest sequences appear first, allowing the model to learn simple local dependencies before moving to more complex data.
Unigram Probability: Orders sequences by the average unigram probability of their tokens. This is similar to entropy, except without penalising length directly. The idea here is that the model can learn good representations for highly likely tokens first and use that to inform its decision around more complicated/rarer tokens later down the line. The approach is similar to that of Bengio et al. (2009).
Block: Introduced by Nagatsuka et al. (2021)

Creation
We first tokenised all sequences using the model's tokeniser, then calculated probabilities for each token using MLE, scored each sequence, and subsequently re-ranked the data. The re-ranked sequences were then divided into different stages by chunking according to rank. We used 4 stages for all curricula, with each stage containing a roughly equal number of sequences. Increasing this number did not yield significant improvements.
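The scoring and staging pipeline can be sketched roughly as follows. All function names are hypothetical. The exact formula of the entropy score is not given in the text, so `total_surprisal` (the summed negative log unigram probability, which grows with both rarity and length) is an assumed stand-in rather than the measure actually used.

```python
import math
from collections import Counter


def unigram_mle(token_seqs):
    """Maximum-likelihood unigram probabilities over all tokens."""
    counts = Counter(tok for seq in token_seqs for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def avg_unigram_prob(seq, probs):
    """Mean unigram probability of a non-empty sequence (no length penalty)."""
    return sum(probs[t] for t in seq) / len(seq)


def total_surprisal(seq, probs):
    """Assumed stand-in for the entropy score: summed negative log probability."""
    return -sum(math.log(probs[t]) for t in seq)


def make_stages(token_seqs, difficulty, n_stages=4):
    """Rank sequences from easy to hard and cut the ranking into n_stages chunks."""
    ranked = sorted(token_seqs, key=difficulty)
    n = len(ranked)
    bounds = [i * n // n_stages for i in range(n_stages + 1)]
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(n_stages)]
```

For the unigram probability curriculum, the difficulty key would be the negated average, e.g. `make_stages(seqs, lambda s: -avg_unigram_prob(s, probs))`, so that sequences made of frequent tokens come first.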
In the original block curriculum, Nagatsuka et al. (2021) use block sizes 64, 128, 256 and 512, with the maximum batch size that could fit on their GPU at each step. We adopt this approach, but following initial findings that significantly smaller block sizes proved more beneficial than larger ones (potentially as a result of us limiting the max number of steps to 120k to enforce consistency across experiments), we instead switched to block sizes 16, 32, 64, 128. In some preliminary training runs, we tested both the single-phase and starting small approaches to curriculum learning. The single-phase approach proved significantly inferior and exhibited a tendency towards catastrophic forgetting. Instead, we used the following strategy: Each stage introduces new data for training, and the model is trained on the data in the current stage concatenated with that of all stages seen prior. This approach worked best for us. Each stage was trained on for 30k steps, totalling a combined 120k. As a baseline, we trained using random sampling over the whole data, also for 120k steps.
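The cumulative staging strategy, in which each stage trains on its own data plus everything seen before, can be sketched as below. The function names are hypothetical, and sampling each batch uniformly with replacement from the current pool is an assumption made for brevity, not a detail stated in the text.

```python
import random
from itertools import accumulate


def cumulative_pools(stages):
    """Pool k is the concatenation of stages 0..k."""
    return list(accumulate(stages, lambda acc, stage: acc + stage))


def curriculum_batches(stages, steps_per_stage, batch_size, seed=0):
    """Yield one batch per training step, stage by stage.

    Assumption: examples are drawn uniformly with replacement
    from the pool available at the current stage.
    """
    rng = random.Random(seed)
    for pool in cumulative_pools(stages):
        for _ in range(steps_per_stage):
            yield [rng.choice(pool) for _ in range(batch_size)]
```

With four stages and `steps_per_stage=30_000`, this produces the 120k-step schedule described above. The random sampling baseline corresponds to a single stage containing all the data.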

Summary
Figure 1 shows results. None of the curricula were able to outperform a baseline measure of simply sampling random sequences from the concatenation of all the datasets. Though the sequence complexity based curricula showed improvement throughout training, the block curriculum got worse with each stage. This raised two follow-up questions for us. First, what causes the random sampling baseline to do so well? Second, is using blocks as inputs rather than sequences causing the block curriculum to fail, or some other factor?

Investigating Random Sampling
Why might random sampling be successful? Let us begin by examining how we present our data. In terms of number of tokens, the BabyLM pre-training corpora are roughly equally divided between the source modalities, text and (transcribed) speech, though there is a slight weighting in favour of speech, which comprises 56% of total tokens. Now let us contrast this with the relative complexity of each corpus (see Figure 2). We can see that the speech corpora on average, across all metrics, contain far simpler language than the text corpora. Secondly, as we were submitting to the strict small track, we do not perform any augmentation on the data, including sentence tokenisation. This means that the random sampling baseline takes as input lines from each corpus. If we examine the distribution of number of lines between corpora, we find a very different division compared with the number of tokens. Figure 3 shows the breakdown. Looking at the number of lines, the balance between transcribed speech and text data becomes highly unequal, with transcribed speech now comprising a total of 80% of all examples. Moreover, the two corpora which contain on average the simplest language (AO-Childes and Open Subtitles) represent 59.8% of all lines, and these may be responsible for driving the majority of grammar acquisition. If this is the case, then it may explain the performance of the random sampling baseline, as it is more likely to see sequences from these two corpora than any others, while still being provided a degree of diverse examples in each batch. By contrast, when the input is treated as blocks rather than lines, the balance between speech and text inputs corresponds to the proportion of number of tokens. Alternatively, it may simply be that training on blocks requires more steps so that the model can identify linguistically coherent units.
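The gap between token share and line share is easy to reproduce on toy data. The sketch below uses a hypothetical helper and made-up corpora (not the BabyLM statistics) to show how a corpus of many short lines can dominate the sampled examples while holding a minority of the tokens.

```python
def line_and_token_shares(corpora):
    """corpora maps a corpus name to its lines, each line a list of tokens.

    Returns two dicts: each corpus's share of all lines and of all tokens.
    """
    line_counts = {name: len(lines) for name, lines in corpora.items()}
    token_counts = {name: sum(len(line) for line in lines)
                    for name, lines in corpora.items()}
    total_lines = sum(line_counts.values())
    total_tokens = sum(token_counts.values())
    line_share = {n: c / total_lines for n, c in line_counts.items()}
    token_share = {n: c / total_tokens for n, c in token_counts.items()}
    return line_share, token_share


# Toy example: many short speech lines versus a few long text lines.
toy = {
    "speech": [["hello"]] * 8,        # 8 lines, 8 tokens
    "text": [["word"] * 8] * 2,       # 2 lines, 16 tokens
}
line_share, token_share = line_and_token_shares(toy)
```

In this toy case, sampling lines uniformly would draw speech 80% of the time even though speech holds only a third of the tokens.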
To test this hypothesis, we train two models, taking either blocks or lines from the corpora (henceforth referred to as sequences) as input. We train for an equal number of steps (120k). We report results for block size 32, as when trained for the full number of steps, this worked best out of all the variations tested in the block curriculum.
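For contrast with line-based inputs, block inputs can be sketched as fixed-size windows over the concatenated token stream. The function name is hypothetical, and discarding the trailing partial block is an assumption made here for simplicity.

```python
def pack_into_blocks(token_seqs, block_size=32):
    """Concatenate all sequences and cut the stream into fixed-size blocks.

    Blocks ignore line boundaries, so a block may start or end
    mid-utterance. Assumption: a trailing partial block is discarded.
    """
    stream = [tok for seq in token_seqs for tok in seq]
    last_start = len(stream) - block_size
    return [stream[i:i + block_size]
            for i in range(0, last_start + 1, block_size)]
```

Because every block holds the same number of tokens, the number of blocks a corpus contributes is proportional to its token count rather than its line count.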

Summary
Even when trained for a greater number of steps, we find that sequences as input still quite substantially outperform blocks. Results are shown in Figure 4 and Table 1. The only exception is on the held out tasks. However, this is due to the block variant of the model essentially having random accuracy on the QA congruence tasks (close to 50%), while the sequence variants appear to have learned to solve the easy tasks but fail at the hard ones (see Table 7 for full results by task).
We can conclude from this that providing linguistically coherent units as input is beneficial for overall efficient grammar acquisition, despite the fact that the model is disproportionately being exposed to speech data, and therefore only a subset of the overall tokens throughout pre-training. However, we still need to disentangle whether it is speech that is driving this effect or the fact that the model is being presented linguistically coherent units.

Speech versus Text

Efficient Acquisition by Modality
Prior work examining the impact of pre-training on AO-Childes (Huebner et al., 2021; Mueller and Linzen, 2023) has shown that utilising this simpler form of language enables more efficient acquisition of grammatical knowledge and encourages a bias towards hierarchical generalisation in transformer language models. So in our case, over-exposing the model to simpler data such as speech may be driving performance. To test this, we perform two ablations. First, we compare the impact of training on only one source modality (for a reduced number of steps) to assess whether text-only or speech-only provides a better starting point for acquisition. This actually ought to favour the textual data in some respects because it contains longer sequences and therefore should provide more signal per step, as each input will contain more masks and contexts while still representing a linguistically coherent unit. Figure 5 shows results for the first ablation comparing the two modalities when trained for 40k steps each. Training on transcribed speech consistently outperforms training on text alone, and leads to more stable improvements than just text, indicating that speech is a better starting point. As a second follow-up investigation, we once again trained on two different settings. In the first, we train on speech first and then the concatenation of text and speech, for 60k steps each. This is to check whether we can build a foundation from speech data alone, and then transition to including both modalities. However, here text data only occupies 10% of the overall proportion of inputs, and is only observed in the later stages of training.
As a control, we also try the inverse, starting with text first and then transitioning to the concatenation of all the corpora. This means that the text data now provides 60% of all the total inputs and speech is only introduced to the model later in training, no longer acting as a foundation. Results are in Table 2. Weighting towards speech beats the text-first control in the original BLiMP tasks, and reaches similar performance to random sampling given that their standard deviations overlap. On the held out tasks, the highly variable text-first results are sometimes competitive.

Summary
We find that transcribed speech leads to improved BLiMP performance and lower variance compared with text-only data. Based on this finding, we investigated whether we could design a simple two-stage curriculum where we first train the model on speech only and then transfer to the full dataset. Under this setting, performance is roughly equal to random sampling, and shows some very slight improvements compared to the text-first reverse curriculum. This is despite withholding ≈ 50% of the total tokens (contained in the text portion of the data) until the latter half of training.

Corpora Complexity Curricula
Having found that speech data can provide a better foundation than text, and that over-exposure may be behind the random sampling baseline's performance, we conduct a follow-up investigation. How much exposure to more complex data is necessary in order to achieve grammar acquisition? To probe this question, we use the same strategy for our curriculum by training on a stage and the concatenation of all previous stages. This time we define our ordering using the average rank across our various corpus complexity measures as shown in Figure 2. So our ordering starts with AO-Childes and ends with Wikipedia. The curriculum is simply the corpus complexity ordering, with two caveats. We treat BNC spoken and Switchboard as one corpus, as Switchboard is too small to warrant a new stage. We also do the same for CBT and children's stories, as they are very similar in terms of complexity. Using this form of curriculum further increases the model's exposure to simple data, with AO-Childes and Open Subtitles now representing 72.2% of all total training examples, compared with 59.8% before, and Wikipedia representing only 0.3% (see Figure 6). We again implement the reverse curriculum as a control measure, starting with Wikipedia and finishing with AO-Childes, and compare results to the random sampling baseline (see Table 3). The simple to complex curriculum yields marginally better results overall compared to the random sampling baseline, and the gap with the reverse curriculum is wider here than for the previous speech versus text curricula. However, the marginality of the increase compared to the random sampling baseline makes it difficult to make any strong claims regarding the effect of ordering. We suspect this was because the BabyLM training data is already favourable for grammar acquisition and weighted towards speech, and expect we would observe greater benefits over random sampling in a setting where the data lacked these properties, as in many larger scale datasets where high-utility speech data is relatively scarce.
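Ordering corpora by their average rank across several complexity measures can be sketched as follows. The helper names, measure names and numbers are all hypothetical and for illustration only; they are not the measures or values behind Figure 2.

```python
def rank_positions(scores):
    """Map each corpus to its 1-based rank by ascending score.

    Ties are broken by the order in which corpora appear in the dict.
    """
    ordered = sorted(scores, key=lambda name: scores[name])
    return {name: i + 1 for i, name in enumerate(ordered)}


def order_by_average_rank(measures):
    """measures maps a measure name to {corpus: score}, higher = more complex.

    Returns corpus names sorted from simplest to most complex by mean rank.
    """
    corpora = list(next(iter(measures.values())))
    ranks = [rank_positions(scores) for scores in measures.values()]
    avg_rank = {c: sum(r[c] for r in ranks) / len(ranks) for c in corpora}
    return sorted(corpora, key=lambda c: avg_rank[c])


# Made-up scores for three corpora under two hypothetical measures.
toy_measures = {
    "mean_length": {"childes": 5.0, "subtitles": 7.0, "wikipedia": 40.0},
    "rare_token_rate": {"childes": 0.10, "subtitles": 0.09, "wikipedia": 0.50},
}
ordering = order_by_average_rank(toy_measures)
```

Averaging ranks rather than raw scores keeps measures on different scales from dominating the ordering. Reversing the returned list gives the complex-to-simple control ordering.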

Summary
We wanted to test whether we could design a curriculum based on the complexity of the various pre-training corpora (see Figure 2). We find that following this approach led to improvements over the reverse, especially on the original set of BLiMP tasks, but failed to show a significant difference over random sampling. We hypothesise that this is due to AO-Childes and Open Subtitles, two of the most high-utility corpora for grammar acquisition, already making up a large percentage of inputs in the random sampling setting. Thus, the introduction of a curriculum may have little impact.

Control Dataset
To test whether complexity ordering helped more when the training data was less optimal, we created a new dataset. It consists of the AO-Childes portion of BabyLM 10M, and the CBT and Wikipedia portions of BabyLM 100M, representing the simplest, middle, and most complex corpora respectively. We set max sequence length to 512 to allow training on as much of the data as possible. Combined, these three corpora have approximately 10 million tokens (similar to the 'strict-small' track), but with the vast majority of these now coming from text data. It also means that the number of inputs that come from simpler, more beneficial data is reduced. Descriptive statistics can be found in Table 4.
We train a new tokeniser on the data, and then compare results between a random sampling baseline and the corpus complexity curriculum approach described in the previous section. Both versions are trained for 120k steps, but we had to lower the batch size to 64 due to GPU memory constraints with longer sequences. Results are in Table 5, and we plot the by-task scores in Figure 8. Under this setting, the curriculum approach begins to demonstrate modest but visible improvements over random sampling, though this does not extend to the held out tasks. Figure 7 shows BLiMP performance as the number of steps increases. The curriculum consistently offers slight improvements over random sampling.

Summary
We wanted to test whether curriculum learning can be beneficial in a scenario where the majority of data is not as high-utility as simple transcribed speech. To do so, we created a control corpus where the majority of data comes from long-form text. Under this setting, we find a slight but discernible improvement from using the curriculum.
We began our exploration by attempting to design a learning curriculum to further grammar acquisition for the BabyLM strict-small track. We found that when the majority of the data is high-utility, as is the case here, curriculum learning shows no substantial benefits. However, such training data is not always available or may be dwarfed by the number of tokens of low-utility data available. In these settings, common for pre-training NLP models, our results indicate some promise in starting small after all. However, extensive further experimentation, most likely requiring larger scale corpora, is necessary to properly test and verify this claim.

Figure 2: Heatmap ranking of the BabyLM Strict Small training corpora according to complexity measures.

Figure 3: Distribution of line counts across the ten language corpora, with each line treated as a unique sequence. The percentages represent the proportion of total lines that each individual corpus contributes to the overall dataset.

Figure 4: By-task breakdown of zero-shot performance when input data is either a linguistically coherent sequence or a block. Results averaged over 3 seeds.

Figure 5: Zero-shot performance by step when the model is trained on either the transcribed speech or text portions of the pre-training corpora (over 3 seeds).

Figure 6: Proportion of total inputs comprised by each of the corpora using the corpus complexity curriculum.

Figure 7: Zero-shot performance by step when the model is trained using the curriculum or random sampling on our control dataset (over 3 seeds).

Figure 8: By-task breakdown of zero-shot performance on the control dataset, curriculum vs. random sampling. Results averaged across 3 seeds.

Table 1: By-task breakdown of zero-shot performance between models utilising random sampling strategies where inputs are either linguistically coherent sequences or blocks. Results averaged over 3 seeds.

Table 3: Comparison of performance by corpora complexity ordering. Results averaged across 3 seeds.

Table 4: Control dataset statistics.