Too Much Information: Keeping Training Simple for BabyLMs

This paper details the work of the University of Groningen for the BabyLM Challenge. We follow the idea that, like babies, language models should be introduced to simpler concepts first and build on that knowledge to understand more complex concepts. We examine this simple-then-complex strategy through a variety of lenses, namely context size, vocabulary, and overall linguistic complexity of the data. We find that only one, context size, is truly beneficial to training a language model. However, this simple change to context size gives us improvements of 2 points on average on (Super)GLUE tasks, 1 point on MSGS tasks, and 12% on average on BLiMP tasks. Our context-limited model outperforms the baseline that was trained on 10× the amount of data.


Introduction
The pretraining of language models has traditionally relied on large amounts of data, which, for many languages, is readily available. However, there exist several low-resource languages for which even unlabeled data is not so readily available. While transferring knowledge from other languages is often an effective way to achieve better performance, implicit biases may also be transferred from the text of the higher-resource language, which could be potentially harmful. Additionally, given that a 13-year-old sees fewer than 100 million words in their lifetime (orders of magnitude less than the amount used in LM pretraining), there ought to be methods that learn more efficiently from limited data.
Such is the motivation for the BabyLM Challenge and subsequently our work. We focus on the strict-small track, which limits the training data to only 10 million words, drawn from a selection of domains of varying complexity (from child-directed speech up to Wikipedia articles).
In our work, we investigate different methods for introducing the model to varying levels of complexity. Namely, we ramp up the difficulty of the pretraining along three avenues:
1. Context length
2. Dataset complexity
3. Vocabulary size
Concerning context length, we adopt the strategy of starting with a small number of tokens per input and increasing this over the course of training, with the intuition that a human typically learns a language starting with short sentences with limited cross-sentential context, and builds up from there to longer contexts.
In addition, the sentences initially learned by a human are also simpler conceptually, starting with frequently-used words and building up to rarer words. To this end, we develop a strategy to filter the dataset such that the model starts training on simpler data and later trains on more complex data.
Similarly, we follow the intuition that a human develops a vocabulary over time, built up by chunking characters into words. As such, we start with a character-level vocabulary and introduce a transfer method that gives a good initialization for a larger subword vocabulary.

Related Work
Concerning context size, prior work (Edman et al., 2022) has shown that in low-resource language modeling, using a smaller context size can greatly help with model convergence. The concept of increasing context size is not novel: BERT (Devlin et al., 2018) was initially trained on a smaller context size of 128 tokens before this was increased to 512, though, to our knowledge, this was done for efficiency reasons. There have been several works on internally reducing the scope of contextualization by limiting attention to local patches (Beltagy et al., 2020; Zaheer et al., 2020), thereby decreasing the complexity of self-attention. These works were designed with processing long documents in mind, however, and can have a negative impact on model speed given the extra layer of complexity in calculating self-attention.
Concerning vocabulary size, there is ample work on character-level models, which have been shown to require less data for pretraining while achieving the same or better performance, at the cost of training and inference speed (Xue et al., 2022). Character models can also greatly outperform subword models on out-of-domain tasks (Boukkouri et al., 2020), low-resource translation (Edman et al., 2023), and tasks which require morphology or involve character-level perturbations (Xue et al., 2022; Ingólfsdóttir et al., 2023). Their performance in these scenarios has been largely attributed to their non-static vocabulary, allowing for good initializations of unseen or rarely-seen words. All of this points to character-informed models being potentially useful for this shared task.
Concerning lexical complexity, Eldan and Li (2023) have shown that, using a synthetic dataset of children's stories written to be understood by a 3- or 4-year-old, one can train a small (<10M parameter) Transformer model that generates stories near the quality of much larger models.
Another group of NLP approaches that condition learning on linguistic complexity is a branch of curriculum learning, exploring potential benefits from exposing models to training samples in a meaningful order, from easy to hard (Bengio et al., 2009; Kocmi and Bojar, 2017; Zhang et al., 2018, among many others). These approaches show conceptual promise but are complicated by the choice of appropriate complexity measures and pacing functions.

Method

Model Choice
We opted to use encoder-only models. We initially experimented with encoder-decoder models, but found that, since the evaluation metrics for this shared task are non-generative, encoder-only models have an advantage, as they allow for full attention rather than only causal attention. In terms of specific model selection, we opted for RoBERTa-base (Liu et al., 2019) in order to directly compare with the provided baseline. We also experimented with (and ultimately submitted) DeBERTa-large (He et al., 2021), as it is a larger model and considered state-of-the-art among encoder-only models.

Training and Evaluation
Our pretraining uses the standard MLM scheme (Liu et al., 2019). We primarily evaluate with BLiMP (Warstadt et al., 2020a), due to its speed of evaluation and the fact that it does not require a fine-tuning step. We also report results of our best models on the BLiMP supplement, (Super)GLUE (Wang et al., 2018, 2019), and MSGS (Warstadt et al., 2020b) tasks.
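The standard MLM corruption can be sketched as follows; this is the usual BERT/RoBERTa recipe (15% of positions selected; of those, 80% replaced by the mask token, 10% by a random token, 10% left unchanged). `MASK_ID` and the function name are hypothetical, not taken from our actual implementation:

```python
import random

MASK_ID = 4  # hypothetical id of the [MASK] token


def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=0):
    """Standard MLM masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (inputs, labels) with labels = -100 at unsupervised positions."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # else: keep the original token (but still supervise it)
    return inputs, labels
```

In practice this is handled by standard library code (e.g. a data collator), so the sketch only illustrates the objective.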

Vocabulary size
We first experiment with vocabulary size. For creating the vocabulary, we use SentencePiece's Unigram model (Kudo and Richardson, 2018; Kudo, 2018). We found that a vocabulary size of 40k provided the best standalone performance on BLiMP (we report this in Appendix A).
We further experiment with a character-level vocabulary, transferring to a subword vocabulary (of size 40k) later in training. To enable this transfer, we copy over all character-only embeddings and initialize subword embeddings as the sum of their respective character embeddings. The main body of the Transformer model is also directly copied. The language modelling head is simply re-trained from scratch.
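The embedding transfer can be sketched as follows. This is a minimal NumPy version; the function name and the zero-vector fallback for characters unseen at the character stage are our assumptions:

```python
import numpy as np


def transfer_embeddings(char_emb, char_to_id, subword_vocab, dim):
    """Initialize a subword embedding matrix from a trained character-level
    one: each subword's vector is the sum of its characters' vectors.
    Characters missing from the character vocabulary contribute zeros."""
    new_emb = np.zeros((len(subword_vocab), dim), dtype=char_emb.dtype)
    for idx, piece in enumerate(subword_vocab):
        for ch in piece:
            if ch in char_to_id:
                new_emb[idx] += char_emb[char_to_id[ch]]
    return new_emb
```

The rest of the Transformer body is copied unchanged, and only the LM head is re-initialized.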

Context size
We also experiment with context sizes in powers of 2, from 16 to 256. To achieve a consistent and coherent context size, we split the data into n-token examples (with n being the context size) prior to shuffling. Our initial experiments with determining the optimal vocabulary size use a context size of 64, although we later find that a context size of 32 performs slightly better.

Curriculum learning
We explored potential gains from different orders of exposing the model to training data, inspired by curriculum learning approaches (see Bengio et al. 2009 and much subsequent work; for a recent comprehensive survey of the field, see Soviany et al. 2022).
The basic motivating intuition is to start the training with subsets of data that are 'simpler' than others in some relevant sense, gradually increasing the complexity of data the model is trained on. The hope is that simple data gives the model a head start that also forms a foundation for linguistic generalization. To try out this idea, we formulate a complexity measure that we use in data reordering. The measure is a combination of the following features:
• Type/token ratio: The number of unique words in a text divided by the length of the text in words. This feature targets the lexical diversity of the text.
• Mean word rarity: The mean of the rarity scores across all words in the text (the word rarity score is 1 minus the normalized log-frequency; it ranges from 0 to 1, the higher the rarer). This is another measure of text complexity via lexical diversity, this time based on how rare the words used in the text are, as judged against the whole training dataset.
• Max word rarity: The maximum of the word rarity scores in the text. Same as above, but picking out the maximum, i.e. the peak of complexity-as-rarity reached in the text.
• Punctuation density: The proportion of punctuation marks among all words and punctuation marks in the text, used as a proxy for syntactic complexity.
• Mean sentence length in the text, in words.
• Mean word length in the text, in characters.
The last two scores approximate syntactic and morphological/lexical complexity, respectively.
In our experiments, we scale all these features to fit into the [0, 1] interval (with a MinMax scaler) and use their mean as our complexity measure.
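A minimal sketch of this scoring, assuming a simple regex tokenizer and a precomputed normalized log-frequency table (both hypothetical simplifications of the actual pipeline; unseen words are treated as maximally rare):

```python
import re


def complexity_features(text, log_freq_norm):
    """Raw per-text complexity features (before MinMax scaling).
    log_freq_norm maps a word to its normalized log-frequency in [0, 1],
    computed over the whole training set; rarity = 1 - that value."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    words = [t for t in tokens if t[0].isalnum()]
    puncts = len(tokens) - len(words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    rarity = [1.0 - log_freq_norm.get(w, 0.0) for w in words]
    return {
        "type_token_ratio": len(set(words)) / max(1, len(words)),
        "mean_word_rarity": sum(rarity) / max(1, len(rarity)),
        "max_word_rarity": max(rarity, default=0.0),
        "punct_density": puncts / max(1, len(tokens)),
        "mean_sent_len": len(words) / sentences,
        "mean_word_len": sum(map(len, words)) / max(1, len(words)),
    }


def minmax_scale_and_average(feature_rows):
    """MinMax-scale each feature across the corpus, then average the six
    scaled features into a single complexity score per text."""
    keys = feature_rows[0].keys()
    lo = {k: min(r[k] for r in feature_rows) for k in keys}
    hi = {k: max(r[k] for r in feature_rows) for k in keys}
    scores = []
    for r in feature_rows:
        scaled = [(r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
                  for k in keys]
        scores.append(sum(scaled) / len(scaled))
    return scores
```

Texts are then sorted by this score to produce the curriculum (or reversed-curriculum) ordering.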
To assess the role of data ordering along this complexity scale, we trained triples of minimally different models, keeping everything apart from the data ordering fixed:
• Curriculum model: All training data is ordered by increasing complexity.
• No-curriculum model: No particular order is imposed on the training data.
• Reversed-curriculum model: Training data is ordered by decreasing complexity.
All models in this set of experiments are RoBERTa-base models trained following the two-stage procedure described in Section 4.1: first, the models are trained with context size 32, then the context is increased to 128. Unlike in other experiments, however, each of the stages was further divided into three consecutive phases:
• Phase 1: The first 1/3 of the data is used in training; the other 2/3 are withheld. The curriculum model sees only the 'easiest' data here; the reversed-curriculum model sees the 'most difficult' portion; the baseline no-curriculum model sees 1/3 of the data without any particular selection.
• Phase 2: Another 1/3 of the data is unlocked. Now all models are trained on 2/3 of the training data, and both the curriculum and reversed-curriculum models have access to the middle of the complexity range.
• Phase 3: The final 1/3 of the data is unlocked, and all models are trained on the whole range of complexity.
This data-unlocking procedure happens twice: first with the small context size (32 tokens), and again when the context size is increased (128 tokens).
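The phased data-unlocking schedule amounts to a simple prefix scheme over the ordered data; a sketch (the function name is ours):

```python
def phase_schedule(ordered_data, num_phases=3):
    """Iterative easy-then-hard schedule: phase k trains on the first
    k/num_phases of the complexity-ordered data. Passing a
    reverse-ordered list yields the reversed-curriculum model; a
    shuffled list yields the no-curriculum baseline."""
    n = len(ordered_data)
    for k in range(1, num_phases + 1):
        yield ordered_data[: n * k // num_phases]
```

The same schedule is run once per context-size stage (32, then 128).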
Using the taxonomy of curriculum learning of Soviany et al. (2022), we can describe our approach as vanilla data-level curriculum learning with an easy-then-hard iterative schedule.

Results

Context Size
The vast majority of our improvement comes from limiting the context size. We show this in Figures 1 and 2. Here we can see that a context size of 32 gives the best performance on BLiMP, whereas 64 gives the best performance on GLUE. The shift in trend between the two benchmarks fits with the fact that the average input length is longer in GLUE than in BLiMP. There is a substantial drop in performance when using a context size greater than 64. To our understanding, the baselines provided by the task organizers use a context size of 128, which may explain their relatively poorer performance (as shown later in Figure 4).
However, if we simply first train with a context size of 32, then increase the context size to 128, we see a substantial gain over training on 128 from the beginning.In the case of GLUE, we see that increasing the context size from 32 to 128 increases the performance beyond what simply training on 32 or 128 alone can accomplish.This suggests that a larger context size is indeed necessary for performance on (Super-)GLUE, but pretraining initially on a smaller context can guide the model to more efficient training on larger context sizes.

Vocabulary Expansion
Next, we look at the performance of our models which were initially trained on a character-level vocabulary, then transferred to our 40k subword vocabulary. We show the results in Table 2. As we can see, the performance is mixed and depends on the context size. For context size 64, there appears to be an improvement; however, for context size 32, the performance drops. The lack of improvement for context size 32 led us to leave this technique out of our final model, as the potential gains are inconsistent and training first on the character level adds a costly extra pretraining step.
As for the use of characters in low-resource pretraining, we suspect that there are better ways of integrating character-level information than via an extra initial pretraining step. With our method, the model is susceptible to forgetting what it learned during the character-level pretraining while it pretrains for the second time.
Additionally, the evaluation metrics chosen for this shared task do not stand out as tasks where character models would be particularly beneficial. Other tasks where character-level models have been shown to greatly outperform subword-level models, such as morphological inflection, would perhaps be more suitable for assessing the potential benefits of our character-informed model.

Curriculum learning
We evaluate the results of data ordering from simple to complex against two alternatives: no data ordering and reversed data ordering (from difficult to simple).We train triples of models that minimally differ from each other -everything apart from the order of data is kept constant.
Figure 3 shows the evaluation loss dynamics of one typical model triple during training (we set up several training experiments, varying the number of epochs per phase, without qualitative changes in results, so here we only report one of them).
While there are stages in training where the loss seems to indicate an advantage of the curriculum model against the baseline one, the no-curriculum model eventually catches up.Perhaps more surprisingly, the reversed-curriculum model shows systematically lower loss during longer phases of training.
Targeted linguistic evaluation also shows mixed results. Table 3 lists BLiMP scores for the model triple. We do not see a clear pattern in which types of linguistic phenomena benefit from a particular order of data exposure, and cannot conclude whether the observed effects are robust and systematic.
The lack of a clear benefit from the curriculum might be traced back to at least one of the following:
• Low quality of the complexity metric;
• A non-optimal pacing function;
• A genuine lack of advantage from data reordering.
To illustrate some of these considerations, we pick three typical samples from the ordered dataset for context size 32: from the 'simplest' end, from the middle, and from the 'most complex' end, respectively (samples (a), (b), and (c)). The easiest samples are indeed linguistically simple: they contain many repetitions, very simple syntactic structures, and very frequent words. At the same time, they are not very representative of the rest of the dataset, both grammatically and lexically. Typical samples from the main body of the dataset, like (b), do not show the characteristic repetitive pattern, and a large proportion of the lexical material across the dataset falls outside of what the simplest samples contain. The simplest data, defined the way we define it, is thus useful for generalization to the rest of the data only to a very limited degree: the model does see the most frequent words, but the contexts of their use are quite different from how they are typically used elsewhere. For a model with non-character-level tokenization, it might not be particularly helpful.
On the other side of the complexity scale, a lot of samples are indeed difficult, but in a way that does not necessarily reflect true linguistic complexity: vocabulary and punctuation features push up samples that contain elements of HTML, have collapsed space symbols, are lists or are written in languages that are not the main language of the dataset.
In a sense, both extreme ends of the complexity scale contain samples that are probably not good grounds for linguistic generalization given the MLM training objective, but in different ways.

Model Size
Table 4 shows the performance of the two models we used, as well as DeBERTa-base to control for the differences in model architecture between RoBERTa and DeBERTa. We can see that DeBERTa-large generally performs best. Interestingly, switching from RoBERTa to DeBERTa seems to account for the difference in GLUE scores, while scaling up to large accounts for the increase in BLiMP scores. This shows that when limiting the context size, we can potentially scale up to larger models even when data is scarce. We also experimented with training a DeBERTa-XL model, which is identical to DeBERTa-large except with 48 layers rather than 24. Our results on BLiMP were however not better (roughly 2% worse than the comparable large model), so it would seem that there is a limit to how much one can simply scale up model size and see performance improvements when pretraining on limited data.

Submission
In Figure 4, we show the overall results for our best models, compared to the baselines.We also report results on each individual sub-task in Appendix B.
Our final models include a model trained only on context size 32, and two trained again on context size 128, one for 10 epochs and one for 50 epochs.
As the model trained with 10 additional epochs performed best on average, this was our final submission. We can see the trade-off for context size between the GLUE and BLiMP scores, as BLiMP favors models trained on a shorter context while GLUE favors models trained on a longer context. MSGS also appears to have a slight preference for models trained on a shorter context, though the differences between the models are comparatively small. Interestingly, the 10M baseline is better on average than the 100M baseline on MSGS, as well as on the BLiMP supplement. We see the largest difference on the BLiMP supplement, where our models outperform the baselines by around 20 points on average. Much of this improvement comes from the qa_congruence_easy set, where our best model achieved a score of 81%, compared to the baseline score of 31%.

Conclusion
Our conclusion is very simple: if you want to pretrain a model on little data, train with a smaller context size.This can greatly aid in model convergence such that no specific hyperparameter tuning or complex methods need to be used for superior performance.
In fact, both of our more "complex" approaches, initialization with a character vocabulary and curriculum learning, proved to be unreliable, with gains that paled in comparison to those realized from simply lowering the context size.
If a larger context size is eventually needed, such as for some GLUE tasks, continuing training with a larger context size can provide some benefit. We do think that there may be a smarter way to control context size, such as gradually increasing it during training, which could lead to smoother and faster training. Additionally, we expect that there are other ways to implicitly limit context size, such as restricting self-attention, which may achieve a similar effect.

Figure 1 :
Figure 1: Average BLiMP score for models trained using various context sizes. 32→128 indicates a model trained initially on context size 32, then trained again on 128.

Figure 2 :
Figure 2: Average (Super)GLUE score for models trained using various context sizes. 32→128 indicates a model trained initially on context size 32, then trained again on 128.

Figure 3 :
Figure 3: Loss dynamics for three minimally different models: curriculum; no curriculum; reversed curriculum.
(a) down! up up up up up up up up up up up up up up down!
(b) the flared skirt of the cone yet to be combed, and this provide
(c) p;amp;gt;&amp;amp;gt;Exactly.&amp;amp;gt;&amp;amp;gt;Combining

Figure 4 :
Figure 4: Average scores for submitted models compared to baselines. 32→128 indicates a model trained initially on context size 32, then trained again on 128. The number in parentheses indicates the number of epochs trained for the second iteration of pretraining. MSGS scores are the average Matthews Correlation Coefficient, multiplied by 100.

Table 2 :
Average performance on BLiMP across context and vocabulary sizes.

Table 3 :
The effect of data ordering on linguistic generalization.

Table 4 :
RoBERTa-base versus DeBERTa-base and DeBERTa-large on all tasks. MSGS is the average Matthews Correlation Coefficient multiplied by 100. Best in bold.