Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways

We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences. Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance beyond the modest gains seen here. Our code is available at https://github.com/venkatasg/Lil-Bevo and our models at https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a


Introduction
Large Language Models (LLMs) generate complex and largely grammatical strings and display impressive performance with structures traditionally thought to require abstract and hierarchical syntax (Linzen et al., 2016; Linzen and Baroni, 2021; Wilcox et al., 2022; Futrell and Levy, 2019). They have achieved human-like performance at a wide range of natural language tasks (Bubeck et al., 2023; Frank, 2023), particularly those having to do with linguistic form (Mahowald et al., 2023). This state of affairs has led to claims that such models should be taken seriously as cognitive models of human language (Piantadosi, 2023; Baroni, 2022; Frank, 2023), in line with claims from the neuroscience literature to "take mechanistic abstraction seriously" (Cao and Yamins, 2021).
One reason that has been posited not to take LLMs seriously as cognitive models, though, is the immense amount of data they are trained on relative to what a human child is exposed to (Warstadt and Bowman, 2022; van Schijndel et al., 2019). Thus, it is possible that models memorize more than humans do and, relative to humans, over-rely on statistical heuristics and memorized chunks of language (Bender et al., 2021).
On the other hand, the quality of data that LLMs get during pretraining is, in many ways, much worse than what human learners get. Children get richly structured, interactive, multimodal input, tailored to their specific interests and needs. A baby might reach for a cup of water and be told "Water. You want some water?" Given that babies are known to conduct repeated experiments to learn about the world (Gopnik et al., 1999), the baby might try this again and again until mastering the concept of what water is. An LLM, meanwhile, might begin learning language by being asked to predict random tokens in the Wikipedia article on quantum mechanics.
In this paper, we describe our experiments with Lil-Bevo, a small language model trained on human-scale data for the BabyLM competition (Warstadt et al., 2023). The goal of the competition is to train a performant LM on a human-scale amount of data: 10M words for the small track, 100M for the larger track. We submitted to both strict tracks; however, we were notified through the meta-review that our models qualify only for the loose track due to the use of additional non-linguistic data (music from the MAESTRO dataset (Hawthorne et al., 2019)). The evaluation is on a set of natural language tasks including grammatical acceptability judgments via minimal pairs in the BLiMP benchmark (Warstadt et al., 2020a), language understanding tasks in SuperGLUE (Wang et al., 2019), and MSGS (the Mixed Signals Generalization Set) (Warstadt et al., 2020b).
We started with a baseline DeBERTa model, trained from scratch on BabyLM data using a custom unigram SentencePiece tokenizer (Kudo and Richardson, 2018). Our strategy was not focused on the architecture, but on ways in which we could adjust the training regime to improve performance above the baseline.
Specifically, our strategy targets 3 ways in which typical LLM training regimes lead to lower-quality data than humans have access to. Here, we describe those strategies and their motivation. We give detailed methods in Section 2 and then present results, including a number of ablation studies that attempt to partition out what strategies were successful.
We treated these studies as proof-of-concept and did not exhaustively test these strategies. Thus, we think that there is still room for improvement.
Training on Short Sequences
Unlike LLMs, babies do not start language by learning long complicated sequences all at once. Studies using databases of child and child-directed speech have shown that caregivers align with the child's level of linguistic complexity: they talk to younger children using shorter utterances, and use longer utterances as the children develop (Schwab and Lew-Williams, 2016; Kunert et al., 2011). To that end, Mueller and Linzen (2023) showed that training on simpler data first could induce a better hierarchical bias for learning language. We specifically take inspiration from Press et al. (2021), who showed that LLMs learn better when trained on shorter sequences before being trained on longer sequences.

Training on Music Before Training on Language
Unlike LLMs, babies are exposed to a wide range of input besides just text. Before and while learning language, they are also learning to map the visual world, to navigate the physical world, to process non-linguistic auditory stimuli, and to engage in a wide variety of cognitive operations. Thus, it is commonly observed that some of the machinery thought to be language-specific (e.g., hierarchical structure) might be induced in pre-linguistic infants through exposure to other kinds of stimuli. Papadimitriou and Jurafsky (2020) use this idea to show that training language models on structured data (e.g., music) can help models learn faster. We use a similar idea, with initial pretraining on a mix of music (piano performances) and text.

Targeted Masked Language Model
The role of child-directed speech in human language learning is controversial (see Consortium and et. al., 2020, for discussion and a large-scale replication of infant-directed speech preferences). It is generally agreed that parents do not correct a child every time they make a grammatical error (Marcus, 1993), but there is also evidence that social feedback acts as a signal (Tomasello, 1992) and that parents structure input to be helpful (Weisleder and Fernald, 2013). When a child says something wrong, a parent might "recast" the utterance or highlight grammatical features that children are struggling with (Nicholas et al., 2001). Inspired by this idea, targeting the BLiMP (Warstadt et al., 2020a) syntactic evaluations as well as more general tasks, we trained with a targeted MLM objective.
We considered some variations of the idea of learning with external feedback that distinguishes correct tokens from corrupted or noisy replacements. For example, ELECTRA (Clark et al., 2020) consists in learning to detect tokens which have been replaced by an auxiliary model. Unfortunately, replaced token detection approaches such as ELECTRA suffer from an inability to learn probability distributions over the entire vocabulary, and so cannot be used for (pseudo)-likelihood scoring (Salazar et al., 2020). Another related approach is Corrective Language Modeling (CLM) (Bajaj et al., 2022), in which the model is trained to correctly replace corrupted tokens; however, it is not clear how best to use these models for scoring sentences in BLiMP.
Given the problems outlined above, we decided to use masked language modeling (MLM) with targeted masks. The motivation is to make it easier for the model to learn syntactic phenomena that co-occur frequently with certain words. Other strategies for selecting masks were used in Sadeq et al. (2022) and Gu et al. (2020); unlike these works, we mask specific words which are essential to the phenomena in BLiMP. For example, to target the filler-gap dependency subtask in BLiMP, we go through the original data set and mask every occurrence of "that" and "what" in the corpus. By focusing on these words, we anticipate that the model will more quickly learn to score "I know what you did last summer." more highly than "I know that you did last summer."
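As an illustration of the (pseudo)-likelihood scoring mentioned above, the following minimal sketch masks each position in turn and sums the log-probabilities of the original tokens. It is not the evaluation code used for BLiMP. The names `pseudo_log_likelihood`, `prefers_grammatical`, and `toy_masked_log_prob` are hypothetical, and the toy scorer ignores context and stands in for a real masked language model.

```python
import math
from typing import Callable, Sequence

MASK = "<mask>"

# Signature of a masked-LM scorer: (masked tokens, masked position, original token) -> log-probability.
MaskedLogProb = Callable[[Sequence[str], int, str], float]


def pseudo_log_likelihood(tokens: Sequence[str], masked_log_prob: MaskedLogProb) -> float:
    """Sum the log-probability of each token when it alone is masked."""
    total = 0.0
    for position, target in enumerate(tokens):
        masked = list(tokens)
        masked[position] = MASK
        total += masked_log_prob(masked, position, target)
    return total


def prefers_grammatical(good: Sequence[str], bad: Sequence[str],
                        masked_log_prob: MaskedLogProb) -> bool:
    """A minimal pair counts as correct when the acceptable sentence scores higher."""
    return pseudo_log_likelihood(good, masked_log_prob) > pseudo_log_likelihood(bad, masked_log_prob)


def toy_masked_log_prob(masked: Sequence[str], position: int, target: str) -> float:
    """Hypothetical stand-in for a masked LM: a fixed lookup table that ignores context."""
    probabilities = {"what": 0.2, "that": 0.1}
    return math.log(probabilities.get(target, 0.05))


good = "I know what you did last summer .".split()
bad = "I know that you did last summer .".split()
print(prefers_grammatical(good, bad, toy_masked_log_prob))
```

With a real model, `toy_masked_log_prob` would be replaced by a function that runs the masked sentence through the network and reads off the log-probability of the original token at the masked position.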

Experiments & Methods
We report all experiments and results for Lil-Bevo in this paper, as it enabled quick prototyping, and because we find similar trends with our larger model Lil-Bevo-X. Lil-Bevo-X differs from Lil-Bevo in the model used (deberta-base rather than deberta-small), training data (100M versus 10M), and vocabulary size. Final results for Lil-Bevo-X are available on our online repository.
Tokenizer We trained a unigram SentencePiece tokenizer (Kudo and Richardson, 2018) from scratch on the BabyLM data combined with the MAESTRO (Hawthorne et al., 2019) dataset (described in detail below) using the sentencepiece library. Specifically, we trained tokenizers with vocabulary sizes of 16,640 and 33,280 for Lil-Bevo and Lil-Bevo-X respectively. <mask> and <cls> were included as control symbols in the vocabulary, along with an end-of-sequence token (</s>), a pad token (<pad>) and an unknown token (<unk>).
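The tokenizer setup described above might be configured roughly as follows. This is a sketch under stated assumptions, not our released training script. The corpus file names, the model prefix, and the choice of `pad_id=3` are hypothetical. Only the unigram model type, the vocabulary sizes, and the special symbols come from the description above.

```python
from typing import Dict, List


def tokenizer_training_args(corpus_files: List[str], vocab_size: int,
                            model_prefix: str = "lil_bevo_tokenizer") -> Dict[str, object]:
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": ",".join(corpus_files),         # text and MIDI-as-text files share one vocabulary
        "model_prefix": model_prefix,            # hypothetical output name
        "vocab_size": vocab_size,                # 16,640 (Lil-Bevo) or 33,280 (Lil-Bevo-X)
        "model_type": "unigram",
        "control_symbols": ["<mask>", "<cls>"],
        # <unk> and </s> exist by default in SentencePiece; <pad> is disabled unless given an id.
        "pad_id": 3,                             # assumption: any unused id would do
    }


def train_tokenizer(corpus_files: List[str], vocab_size: int) -> None:
    """Train the tokenizer; requires the third-party sentencepiece package."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**tokenizer_training_args(corpus_files, vocab_size))


# Hypothetical file names for the 10M-word text corpus and the converted MAESTRO data.
args = tokenizer_training_args(["babylm_10M.txt", "maestro_midi.txt"], vocab_size=16640)
print(args["input"], args["vocab_size"])
```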
Model We chose to use an encoder-based language model, specifically DeBERTa, since (a) encoder-based language models are known to capture many syntactic and semantic features in language when pretrained on relatively modest amounts of data (Zhang et al., 2021), and (b) there were a wide variety of off-the-shelf DeBERTa architectures available on HuggingFace for easy prototyping and use.
We trained the model in three phases: (1) pretraining on a combination of music and text for 5 epochs with a sequence length of 64 tokens, (2) continuing pretraining on text for 50 epochs with a sequence length of 128 tokens, and (3) finally pretraining on text using targeted MLM for 2 epochs with a sequence length of 512 tokens. Each of these is described in more detail below.
1. Music Pretraining Papadimitriou and Jurafsky (2020) find that pretraining on languages other than the target language (including music and code) leads to lower perplexities on the target language as compared to random distributions of tokens, or even Zipfian token distributions. Inspired by this idea, we explored whether supplementing the 10M linguistic tokens with non-linguistic musical tokens from the MAESTRO dataset (Hawthorne et al., 2019) could lead to noticeable improvements in LM learning. The impetus behind pretraining on music is two-fold: (a) it provides additional training data that nevertheless has structural biases that could help the model learn structural biases found in language, and (b) it may bring the model to a stable region in parameter space that enables it to learn desired linguistic properties much faster and/or better.
After several experiments, we found that pretraining on the combination of the strict-small data and the entire MAESTRO dataset for 5 epochs provided the best results. We use V3.0.0 of the MAESTRO dataset, which contains 85M tokens under our custom-trained tokenizer. The dataset consists of 200 hours of MIDI piano recordings, which we convert to text and tokenize with the shared unigram SentencePiece tokenizer. Our textual representation of MIDI consists of a chronological sequence of codes describing the channel and key of each note onset and release event (e.g., c0n71 for 'note on, channel 0, key 71'), delimited by spaces, and optional codes for time between events (e.g., t18 for 18 MIDI ticks). We chose a short sequence length of 64 tokens for pretraining, inspired by the Shortformer, which we now explain in further detail.
2. Shortformer Press et al. (2021) introduce a few innovations to the training regime. In particular, we focused on their idea of training on shorter sequence lengths before moving on to longer ones. We used a training regime similar to that of Press et al. (2021): we started with a training sequence length of 128 for 50 epochs, before moving to a training sequence length of 512. We initially experimented with training on longer subsequence length for 150 epochs as in Press et al. (2021), but discovered lower evaluation results on most BLiMP categories (albeit with improvements on some categories, such as Island Effects and Quantifiers). Results on BLiMP (Warstadt et al., 2020a) and SuperGLUE (Wang et al., 2019) saturated with as little as 2 epochs; we believe this is because our dataset is much smaller than that of Press et al. (2021), leading to overfitting on the dataset.
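The textual MIDI representation described under Music Pretraining can be sketched as below. This is an illustration, not our conversion script. The codes `c0n71` (note on, channel 0, key 71) and `t18` (18 ticks) follow the examples given above, while the `f` code for note releases, the `NoteEvent` type, and the function name are assumptions made for this sketch.

```python
from typing import Iterable, NamedTuple


class NoteEvent(NamedTuple):
    tick: int        # absolute time of the event in MIDI ticks
    channel: int
    key: int
    is_onset: bool   # True for note on, False for note release


def midi_events_to_text(events: Iterable[NoteEvent]) -> str:
    """Render events chronologically as space-delimited codes with optional time gaps."""
    codes = []
    previous_tick = None
    for event in sorted(events, key=lambda e: e.tick):
        # Emit a time code only when time has actually passed since the last event.
        if previous_tick is not None and event.tick > previous_tick:
            codes.append(f"t{event.tick - previous_tick}")
        kind = "n" if event.is_onset else "f"   # "f" for release is a hypothetical code
        codes.append(f"c{event.channel}{kind}{event.key}")
        previous_tick = event.tick
    return " ".join(codes)


example = [NoteEvent(0, 0, 71, True), NoteEvent(18, 0, 71, False)]
print(midi_events_to_text(example))  # prints "c0n71 t18 c0f71"
```

Because the output is plain space-delimited text, it can be concatenated with the linguistic data and passed through the same tokenizer.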
3. Targeted MLM We specifically masked out words which were essential to some of the BLiMP subtasks. Some of these, such as quantifier and negation words, are also important to some of the SuperGLUE tasks (e.g., textual entailment). For anaphor agreement, we masked the words "himself", "herself", "itself", "themselves". For NPI licensing, the masked words included "not", "often", and "probably". The list of words which were masked in each category is shown in Table 3 in Appendix A. We used a sequence length of 512 tokens, and additionally masked other random tokens in order to mask a total of 15% of tokens per sample.
The total number of words masked for each category across the 10M train set is given in Table 1.
The Animacy class consists of animate nouns, and was used to target the minimal pairs in the Argument Structure category with animate/inanimate subjects ("Amanda was respected by some waitresses." vs. "Amanda was respected by some picture"). To obtain a list of animate nouns, we used all the lemmas of (direct and indirect) hyponym synsets of person.n.01 in WordNet.
In addition to targeting the BLiMP categories of S-V agreement, quantifiers, NPI licensing, filler gap, argument structure, DN-agreement and anaphor agreement, we also included some modal verbs (e.g., can, might, shall) and certain adverbs (e.g., never, maybe, always, perhaps), since these are important for textual entailment.
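The masking procedure described in this section can be sketched as follows. This is a simplified illustration under several assumptions. It works on whitespace-separated words rather than subword tokens, always substitutes the mask symbol, and masks every target word even if that exceeds the 15% budget. The word list is a small subset of the examples given in the text, and the function name is hypothetical.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple

MASK = "<mask>"

# A few of the targeted words mentioned in the text (the full list is in Table 3).
TARGET_WORDS = {"himself", "herself", "itself", "themselves",
                "not", "often", "probably", "that", "what"}


def targeted_mask(tokens: Sequence[str], targets: Set[str] = TARGET_WORDS,
                  mask_rate: float = 0.15,
                  rng: Optional[random.Random] = None) -> Tuple[List[str], List[Optional[str]]]:
    """Mask all target words, then random extra tokens until mask_rate is reached."""
    rng = rng or random.Random(0)
    budget = max(1, round(mask_rate * len(tokens)))
    chosen = {i for i, tok in enumerate(tokens) if tok.lower() in targets}
    others = [i for i in range(len(tokens)) if i not in chosen]
    shortfall = budget - len(chosen)
    if shortfall > 0:
        chosen.update(rng.sample(others, min(shortfall, len(others))))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    # Labels hold the original token at masked positions and None elsewhere.
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


sentence = ["the", "cat", "did", "not"] + ["filler"] * 16
masked, labels = targeted_mask(sentence)
print(masked.count(MASK), labels[3])
```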

Ablations
We compare Lil-Bevo with ablations to explore how important our three strategies are for final performance. Specifically, we compare Lil-Bevo with the following:
Long-only Train DeBERTa with a sequence length of 512 tokens for 57 epochs.
Short-only Train DeBERTa with a sequence length of 128 tokens for 57 epochs.
Short+target Train DeBERTa with a sequence length of 128 tokens for 55 epochs. Then train with targeted MLM for 2 epochs.
Music+short Train DeBERTa on music and text for 5 epochs with a sequence length of 64 tokens. Then continue training on text with a sequence length of 128 tokens for 52 epochs.
Music+short+long Train DeBERTa on music and text for 5 epochs with a sequence length of 64 tokens. Then continue training on text with a sequence length of 128 tokens for 50 epochs, followed by training with a sequence length of 512 tokens for 2 epochs.
Lil-Bevo (music+short+target) This is the same as Music+short+long except that the final stage of pretraining for 2 epochs uses targeted MLM.
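For reference, the ablation schedules listed above can be written down as data. The sketch below does so and checks that every condition receives the same total of 57 epochs. The `Stage` type and the dictionary layout are hypothetical, and the 512-token length for the targeted stage of short+target is an assumption taken from the Targeted MLM description rather than stated for that ablation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    data: str                # "music+text" or "text"
    seq_len: int             # training sequence length in tokens
    epochs: int
    masking: str = "random"  # "random" or "targeted"


SCHEDULES: Dict[str, List[Stage]] = {
    "long-only": [Stage("text", 512, 57)],
    "short-only": [Stage("text", 128, 57)],
    # Assumption: the targeted stage uses 512 tokens, as in the Targeted MLM description.
    "short+target": [Stage("text", 128, 55), Stage("text", 512, 2, "targeted")],
    "music+short": [Stage("music+text", 64, 5), Stage("text", 128, 52)],
    "music+short+long": [Stage("music+text", 64, 5), Stage("text", 128, 50),
                         Stage("text", 512, 2)],
    "lil-bevo": [Stage("music+text", 64, 5), Stage("text", 128, 50),
                 Stage("text", 512, 2, "targeted")],
}


def total_epochs(stages: List[Stage]) -> int:
    """Total number of epochs across all stages of one schedule."""
    return sum(stage.epochs for stage in stages)


for name, stages in SCHEDULES.items():
    print(name, total_epochs(stages))
```

Holding the epoch total fixed means that differences between conditions are not attributable to the number of passes over the data, although the number of gradient updates may still differ with sequence length and batch size.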
Implementation We train all our models using the Trainer API from the Hugging Face transformers Python package. Models are trained on 4 Nvidia A40 GPUs, with the largest batch size that each experiment permitted. Apart from setting the initial learning rate to 6e-4, the weight decay to 0.1, and the warmup ratio to 0.0001, we use the default training arguments in the API (except for the final targeted MLM/long stage, where we used all default parameters). Models are evaluated on the validation split of the BabyLM dataset. We did not use the test split of the BabyLM data. We release all of the above pretrained models online on the Hugging Face Hub.

Results
Results for BLiMP, MSGS, SuperGLUE and the supplementary tasks are shown in Figure 1. The results are color-coded to represent each model's differences from the RoBERTa baseline results (obtained from the BabyLM GitHub). We highlight some results below.

Does pretraining on music help?
Comparing short-only with music+short, we see that pretraining on music helps slightly on 8 of the 12 BLiMP subtasks, and on 2 of the 5 supplementary tasks. However, it suffers from a large gap of 9.1 points on QA Congruence Tricky. On SuperGLUE, music+short outperforms short-only on 6 of the 11 subtasks, and only slightly. Thus, we do not think there is strong evidence that pretraining on music, in isolation, improves over the short-only condition.
Comparing Lil-Bevo (music+short+target) with short+target, we see that Lil-Bevo outperforms short+target on 69% of all tasks. Predicting the score for each task in a mixed-effects linear regression with a fixed-effect predictor for whether the model was Lil-Bevo or short+target, we found that Lil-Bevo was slightly better (β = 1.3, χ²(1) = 4.11, p < .05 by a likelihood ratio test). So, while music pretraining may help, the effect is small and inconsistent in our observed data.

What is the effect of targeted MLM?
We compare music+short+long with Lil-Bevo (music+short+target) and short-only with short+target to ascertain whether targeted MLM helps over random masking. Targeted MLM does not systematically improve performance, except for two BLiMP tasks: NPI Licensing and Argument Structure. For NPI Licensing, Lil-Bevo outperforms music+short+long by 14.8 points, and short+target outperforms short-only by 16.2 points. We suspect that this difference could be meaningful since our targeted MLM strategy specifically targets NPI terms that are substituted in BLiMP.

The effect of increasing sequence length
When comparing music+short with music+short+long, and short-only with long-only, we find that pretraining with 512-token sequence lengths generally underperforms pretraining with 128-token sequence lengths. In fact, the difference between the short-only and long-only conditions is quite large. A linear mixed-effects regression comparing the two using the same method as above found that performance was 1.8 points worse on average for the long-only method (β = 1.8, χ²(1) = 14.2, p < .001 by a likelihood ratio test). Thus, we believe pretraining with shorter sequences helps significantly compared to using longer sequences.
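The p-value thresholds reported for the two likelihood ratio tests can be checked from the χ² statistics alone. With one degree of freedom, the upper-tail probability of a χ² variable at x equals erfc(√(x/2)). The sketch below applies that identity. It only re-derives the tail probabilities from the reported statistics and does not refit the regressions. The function name is hypothetical.

```python
import math


def chi2_upper_tail_1df(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square(1), so
    # P(Z**2 > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


p_music = chi2_upper_tail_1df(4.11)    # Lil-Bevo vs. short+target
p_length = chi2_upper_tail_1df(14.2)   # short-only vs. long-only

print(f"music: p = {p_music:.4f}, below .05: {p_music < 0.05}")
print(f"length: p = {p_length:.5f}, below .001: {p_length < 0.001}")
```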

Discussion
Overall, we found that, for BabyLMs, sequence length matters, music pretraining may help a little (though the effect may be spurious), and targeted MLM training may help on specific tasks. These results are far from exhaustive, and we see a number of areas for future improvement using these methods. To fully understand the role of initial pretraining on music, one could construct a series of synthetically generated music datasets with varying degrees of complexity. Would pretraining on music that is more "language-like" (Lerdahl, 1996) in some sense improve performance on downstream tasks? Perhaps there is a principled way to interpolate between music and language, using the same kind of data format (MIDI). At one end of the spectrum one would have MAESTRO, and at the other end, text that has been encoded into MIDI events.

Related to the use of varying sequence lengths, future work could consider improvements in data preprocessing and batching; in particular, knowing the beginning and ending of coherent chunks of text (e.g., dialogues or documents) could help improve the model. Beyond this, Mueller and Linzen (2023) provide some evidence that curriculum learning approaches may be fruitful for improving low-resource language models.
Finally, a more thorough analysis is needed on when (and by how much) targeted MLM is able to boost model performance.Other strategies are also possible, such as combining targeted MLM with information-theoretic strategies for picking random masks (Sadeq et al., 2022).Beyond MLM, contrastive objectives could be used to encourage the model to score grammatical sentences more highly than ungrammatical sentences.

Conclusion
A big motivating question for training models on human-scale data is whether it is possible for models to attain linguistic competence without the massive amounts of data used to train the massive LLMs that dominate NLP leaderboards. If so, that would make it more plausible that we should take LLMs seriously as cognitive models. So can BabyLMs learn like grown-up ones? While we find some hints of directions to pursue for making small language models learn more from less, we did not come close to matching LLM performance from larger amounts of data. Of course, that does not mean it is not possible to do so, and other teams might have different experiences. We did not fully explore optimizing all of our methods, and we treated our manipulations largely as proof-of-concept. Aggregating methods and results from a wider variety of teams will make it possible to more fully explore these questions.

Figure 1: Results for each model, for each task.The color reflects the difference in score between the given model and the RoBERTa baseline results released by the organizers of BabyLM.

Table 2: Scores on Dynabench for different models.

Table 3: Words which were masked in targeted MLM in the 10M train set. For Animacy, only words appearing over 100 times are shown in the table.

Table 4: Age of Acquisition results.