Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora



Introduction
Although there have been massive improvements in the effectiveness of neural language models over the last decade, humans remain the state of the art in language learning. To achieve impressive results, language models must be trained on hundreds of times more language input than a typical human is exposed to in an entire lifetime. The BabyLM Challenge is a shared task that invites
members of the natural language processing, linguistics, and cognitive science communities to train language models in low-resource data settings, where the amount of linguistic input resembles the amount received by human language learners. In doing so, our motivations (Section 2) are to improve the relevance of language models as cognitive models of human language acquisition, find more effective and data-efficient training algorithms for language models, and democratize research on language model training by emphasizing research questions that can be addressed on a smaller training budget.
Participants in the shared task could submit to the Strict, Strict-Small, or Loose track, which, respectively, required models to be trained on corpora of 100 million words, 10 million words, or 100 million words plus an unlimited amount of additional non-linguistic data (Section 3). These corpora were constructed from a mixture of sources, including developmentally plausible domains such as child-directed speech, transcribed dialogue, and children's literature (Section 4). To enable standardized evaluation and easy comparison of the resulting models, we create a leaderboard and release an evaluation pipeline (Section 5) targeting zero-shot grammatical performance, finetunability on language understanding tasks, and model inductive bias. We also contribute a novel set of zero-shot evaluation tasks targeting semantic and discourse-level phenomena.
We received 31 papers making a variety of contributions, ranging from designing novel architectures and tuning hyperparameters to employing curriculum learning and training teacher-student model pairs (Section 6). We conduct a meta-analysis of the results, yielding several concrete recommendations and scientific conclusions (Section 7). The winners of the challenge's various tracks made contributions that led to impressive improvements in our evaluation over not just the BabyLM baselines, but also the massively pretrained Llama 2 model (Touvron et al., 2023). The best-performing models overall (Charpentier and Samuel, 2023) use the LTG-BERT architecture (Samuel et al., 2023), which synthesizes a number of recent optimizations of the Transformer architecture. The winner of the Loose track (Xiao et al., 2023) trains models on contiguous samples from one source dataset at a time, randomizing the order of the datasets in each training epoch. Other submissions did not achieve strong downstream results, but still provided valuable scientific contributions. We received many curriculum learning submissions, including one that systematically tested a variety of strategies (Martinez et al., 2023) and reported few improvements over non-curriculum baselines. Steuer et al. (2023) found that benchmark performance is not correlated with a greater ability to predict human psycholinguistic data.
We plan to organize future BabyLM Challenges that will build on the success of this first iteration (Section 8). The winning submission from this year sets a high baseline for next year. Future iterations will need harder and more varied evaluations, including ones that emphasize human-like processing and learning; they should emphasize new approaches that were not thoroughly explored this year, such as multimodality; and they should incentivize compute efficiency. Altogether, the first BabyLM Challenge has been a successful initiative, and we hope that it will continue to advance research on small-scale language models.

Motivation
The observation at the center of the BabyLM Challenge is this: children are incredibly data-efficient language learners, and language models are not. Children are exposed to fewer than 100 million word tokens by age 13 (Gilkerson et al., 2017), while modern language models are typically trained on 3 or 4 orders of magnitude more data (Figure 1). This discrepancy raises two important questions. First, how are humans able to learn language so efficiently? Second, what insights from human language learning can be used to improve language models?
A great deal of recent work in language model training seeks improvements by scaling up pretraining data and parameters (Raffel et al., 2020; Brown et al., 2020; Hoffmann et al., 2022; Touvron et al., 2023). Scaling is undoubtedly central to building deployable models (though see McKenzie et al. 2023 for counterexamples) and raises its own set of scientific questions, such as quantitative scaling laws (Kaplan et al., 2020) and the emergence of new abilities (Wei et al., 2022). However, an increased emphasis on scaling is unlikely to lead to answers to the two questions we raised, and it excludes researchers without access to massive computational resources.
Thus, there are three principal benefits to data-limited language model training which the BabyLM Challenge aims to highlight: 1. building more cognitively and developmentally plausible models of human language acquisition and processing; 2. optimizing training pipelines prior to scaling by allowing for faster iteration on architectures and hyperparameters; and 3. enabling research on language model training beyond highly funded industry groups.
Cognitive modeling. Language models have been used to model aspects of human language learning and processing for decades (Elman, 1990; Hale, 2001; Reali and Christiansen, 2005, a.o.). While many researchers continue to advocate for language models as cognitive models (Keller, 2010; Dupoux, 2018; Linzen, 2019; Baroni, 2022; Warstadt and Bowman, 2022; Piantadosi, 2023; Wilcox et al., 2023), most agree that it is critical to make LMs learn in more human-like ways. Warstadt and Bowman (2022) and Linzen (2020) point to data quantity as the most egregious advantage that modern language models have over humans. When restricted to developmentally plausible data volumes, language models no longer perform well on benchmarks for human-like syntactic and semantic behavior (van Schijndel et al., 2019; Zhang et al., 2021).
Working to close the data-efficiency gap between language models and humans will have two principal advantages for cognitive modeling. First, by reverse-engineering known and hypothetical aspects of the human learning scenario, from multimodal inputs and multi-agent interaction to innate linguistic structural biases, we can determine which factors are critical to our unique ability to learn language efficiently (Dupoux, 2018). Second, by minimizing differences between humans and models, we make results from controlled experiments carried out on models more likely to be applicable to humans (Warstadt and Bowman, 2022).
Faster iteration on architectures and hyperparameters for language modeling. The search space for design choices when training language models is enormous. Thus, it can be impractical, especially at large scales, to experiment with new model architectures, training objectives, or data preprocessing steps, in addition to necessary hyperparameter tuning. Reducing the scale of training provides researchers with a sandbox in which to more fully explore this design space and better optimize training pipelines. Models such as RoBERTa (Liu et al., 2019) have succeeded in making some optimizations to the BERT training pipeline, but more remain. Indeed, there are anecdotes of basic design choices for popular pipelines, such as the masking rate for BERT training (Wettig et al., 2023), being poorly tuned for years, despite hundreds or even thousands of papers using those pipelines.
There are numerous dimensions along which to scale down training. Some works seek to optimize pipelines for a limited amount of compute, time, or money. Notable examples of such pipelines for bidirectional encoder-only models include ELECTRA (Clark et al., 2020), 24-hour BERT (Izsak et al., 2021), and MosaicBERT (Portes et al., 2023). These pipelines typically combine multiple approaches, such as modifying training objectives to increase the number of supervised predictions per forward pass, using low-precision floating-point computation for certain components, reducing sequence length or padding, and altering the attention or feed-forward layers of the transformer block.
However, the objective of optimizing pipelines for a fixed data budget is relatively underexplored. This has begun to change in the last year, with new models optimized for small datasets, such as LTG-BERT (Samuel et al., 2023), and community-oriented events centered on data-limited training, such as the Learning from Small Data workshop (Breitholtz et al., 2023) and the MiniPile Challenge (Kaddour, 2023).
Democratizing language model training research. The third goal of the BabyLM Challenge is to democratize research on pretraining, typically thought to be practical only for large industry groups, by drawing attention to challenging and important open problems that can be explored on a university budget. In recent years, efforts aimed at widening participation in LM research have often taken different avenues from the one proposed here, including aggregation of distributed computation power (Diskin et al., 2021), reliance on public computing infrastructure (Scao et al., 2022), aggregation of expertise, data, and stepwise contributions (Don-Yehiya et al., 2023; Raffel, 2023), and modularity (Pfeiffer et al., 2023). These lines of pretraining research keep overall costs large but distribute them across many contributors and funding sources.
Other works on decentralizing computation (Diskin et al., 2021; Li et al., 2022; Lialin et al., 2023) or on model recycling generally take existing models and build upon them, proposing a single adaptation via finetuning (Choshen et al., 2022), a single knowledge edit (De Cao et al., 2021), combining several models (Yadav et al., 2023), or iterative approaches showing that stacking such improvements can continually improve models (Don-Yehiya et al., 2023). Recently, a framework for doing so was also released (Kandpal et al., 2023). In this context, one can see the BabyLM Challenge as a suggestion to persist with a centralized approach to pretraining while making it tractable, reducing costs through an increased focus on tractable research questions.

Guidelines and Timeline
Tracks. Submissions to BabyLM had to conform to one of three sets of guidelines, which we term tracks. In this section, we describe each competition track; for specific details about wording, see the original Call for Papers (Warstadt et al., 2023). The three tracks for the BabyLM Challenge were Strict, Strict-Small, and Loose. Participants in all tracks were allowed a fixed budget of English-language training tokens (100 million in Strict and Loose and 10 million in Strict-Small) to be used in total across all software in the pipeline. This data was released by the organizing committee and is described in detail in Section 4. Loose track submissions were encouraged to train on data beyond just the linguistic text data provided through the shared task (e.g., speech audio signal, code, music, or visual input). The Loose track also permitted the use of expert-annotated data, but any language data used to train the LM or auxiliary models counted towards the 100M word budget. Thus, for example, a Loose track submission could train a parser on the Penn Treebank (Marcus et al., 1993) and self-train to parse the pretraining corpus, as long as the number of words in the Penn Treebank plus the pretraining corpus totaled less than 100M.[1] In general, seeing the same data twice (e.g., across different epochs) did not count as seeing more text. While it is unlikely that humans process data iteratively in a manner similar to epoch-based training, there is evidence that humans do repeat some of the information they process (e.g., in memory replay; Carr et al., 2011). Furthermore, epochs are very useful for gradient-based methods.
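The budget accounting described above can be sketched as follows; the dataset names and word counts here are hypothetical, and only the 100M-word rule itself comes from the track guidelines.

```python
# Sketch of the Loose-track word-budget rule: every word of linguistic
# training data (for the LM and any auxiliary models) counts once toward
# the budget, while repeated epochs over the same data do not count again.
def within_budget(word_counts, budget=100_000_000):
    """word_counts: dict mapping each distinct linguistic dataset
    to its word count (each dataset counted once, regardless of epochs)."""
    return sum(word_counts.values()) <= budget

# Hypothetical submission: a pretraining corpus plus parser training data.
sources = {
    "babylm_pretraining_corpus": 98_000_000,   # hypothetical count
    "penn_treebank_for_parser": 1_000_000,     # auxiliary-model data counts too
}
print(within_budget(sources))
```

A submission that exceeded the budget with any combination of distinct linguistic sources would fail this check, no matter how many epochs it ran.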
Finally, participants across all tracks were encouraged to submit models and papers even if their work did not fit into any of the three tracks. As the goal of the shared task is to advance efficient and cognitively plausible LM training, we did not want to curtail participant creativity. While submissions using external linguistic data did not qualify to win any of the tracks, they still qualified to be presented in the competition and to be published in the proceedings.
Community building. Given that the BabyLM Challenge aims to encourage research in efficient and cognitively plausible model pretraining, one of our goals was to encourage the formation of a research community with shared interests. Towards that end, we hosted a public messaging forum on Slack and enabled participants to interact with each other and with the task organizers. At the time of writing, this forum had over 250 members, including many interested researchers who did not ultimately submit to the challenge. An interactive forum was useful for both establishing a community and building interest; it allowed the community to clarify the track rules, debug the evaluation pipeline, and receive announcements from the organizers.

[1] In our initial announcement, external software trained on linguistic input or expert annotations not included in our corpus (including taggers, parsers, tokenizers, or models) was not allowed. However, numerous questions from participants prompted an announcement in April 2023 that we were modifying the rules of the Loose track to allow such methods. We made this decision because we determined that the interests of the community were better served by emphasizing creativity and discovery in the Loose track. Text generated by a language model that was trained only on a BabyLM corpus was not counted towards the 100M word budget, nor was data bootstrapped by such models.
Timeline. Below, we replicate the timeline from the website.

Dataset

Today's largest language models (Touvron et al., 2023) are trained on trillions of words (Figure 1). Even BERT (Devlin et al., 2019), which is comparatively small by today's standards, was trained on over 3B words, well over the amount of input to a human in an entire lifetime. This discrepancy in input volume between LMs and humans is an oft-cited criticism of using these artifacts out-of-the-box as cognitive models (Warstadt and Bowman, 2022; Frank, 2023, a.o.).
Domain: Mostly transcribed speech. We source the majority (≈56%) of the pretraining corpus from transcribed or scripted speech. We made this choice because the majority of the input to a hearing child comes from speech (though this proportion decreases with age as consumption of written media increases). This contrasts with standard LM training corpora, which consist mostly of text that was intended to be read and potentially edited. This is particularly significant for studying grammar learning, as some grammatical constructions (such as nominalizations and passives) are far more frequent in writing, while others (such as first- and second-person pronouns) are more frequent in speech (Biber, 1991).
Domain: Child-directed language. About 40% of the data in the pretraining corpus comes from sources either intended for children or appropriate for children, including child-directed speech, children's books, educational videos, and simplified English. Child-directed speech has been used as the sole or primary data source in some previous work aiming to model child language acquisition with LMs (Reali and Christiansen, 2005; Perfors et al., 2011; Pannitto and Herbelot, 2020; Huebner et al., 2021; Yedetore et al., 2023). We chose to include data from other domains (both child-directed and not) for several reasons. First, fewer than 10M words of transcribed child-directed speech are available, far below our 100M word budget. Second, child-directed speech makes up only part of the input to children, and this amount can vary by a factor of 10 or more across cultures and socio-economic groups (Cristia et al., 2019). The estimate on which we base the 100M word budget (Gilkerson et al., 2017) counts all speech in the child's environment, including overheard speech.

Contents
The contents of the BabyLM pretraining dataset are summarized in Table 1. Descriptions of each data source are provided in Appendix A.

Preprocessing
We release Strict and Strict-Small train, development, and test splits of each of the ten data sources, split approximately 83.3%/8.3%/8.3%. The 10M-word Strict-Small training set is sampled randomly from the Strict training set. After any preprocessing, we downsample and split each source by randomly sampling chunks of 2000 lines or longer.
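The chunk-based splitting might look roughly like the following sketch; the function name, shuffling details, and handling of the final short chunk are our assumptions, while the chunk size and split proportions come from the text.

```python
import random

def split_source(lines, chunk_size=2000, dev_frac=1/12, test_frac=1/12, seed=0):
    """Split one data source into train/dev/test (~83.3%/8.3%/8.3%)
    by sampling contiguous chunks of roughly `chunk_size` lines.
    Sampling whole chunks (rather than individual lines) keeps local
    context, such as dialogue turns, intact within each split."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    rng = random.Random(seed)
    rng.shuffle(chunks)
    n_dev = max(1, round(len(chunks) * dev_frac))
    n_test = max(1, round(len(chunks) * test_frac))
    dev = [l for c in chunks[:n_dev] for l in c]
    test = [l for c in chunks[n_dev:n_dev + n_test] for l in c]
    train = [l for c in chunks[n_dev + n_test:] for l in c]
    return train, dev, test
```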
The code and instructions for downloading and preprocessing the raw data are publicly available.[3] We perform minimal preprocessing in terms of filtering and reformatting text. Notably, we generally preserve newlines in the original texts, meaning newlines do not consistently delimit documents, paragraphs, or sentences, as in some pretraining datasets. We use WikiExtractor (Attardi, 2015) to extract text from the XML Simple English Wikipedia dump dated 2022-12-01. We perform additional preprocessing on Simple English Wikipedia to remove <doc> tags. We select the spoken subset of the BNC by selecting only lines from the XML containing the <stext> tag and extracting only the text. We use code by Gerlach and Font-Clos (2020) to download and preprocess data from Project Gutenberg, which we additionally filter to contain only English texts by authors born after 1850. The OpenSubtitles and Wikipedia portions of the pretraining corpus were shared with us in preprocessed form, with duplicate documents removed from OpenSubtitles and preprocessing similar to our Simple English Wikipedia procedure applied to Wikipedia.[4] We use regular expressions to remove speaker and dialog-act annotations from the Switchboard Dialog Act Corpus. We perform no preprocessing on the remaining datasets.

Evaluation
To evaluate submissions, participants were asked to upload their model predictions to Dynabench, an online platform for dynamic data collection and model benchmarking.[5] Multiple submissions to the Dynabench platform were allowed, but at most one candidate per team could be chosen as a competitor.

Evaluation Tasks
The goal of the evaluation pipeline is to assess the extent to which submitted models have learned the latent syntactic and semantic structure of their pretraining language. To evaluate the grammatical abilities of LMs, we use BLiMP (Warstadt et al., 2020a). BLiMP consists of tasks that evaluate the ability of language models to behave in a manner consistent with the structure of English. Each example consists of a minimal pair of sentences, where one sentence is acceptable and the other is unacceptable (otherwise differing as minimally as possible from the acceptable sentence); a model is correct on a given example if it assigns higher probability to the acceptable sentence in the minimal pair. We also release a supplement to the BLiMP tasks, which tests for phenomena not captured by BLiMP (see §5.1.1).
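The minimal-pair scoring logic can be sketched as follows; the toy unigram scorer stands in for a real language model's sentence log-probability, and all names and probabilities here are illustrative.

```python
import math

def toy_logprob(sentence, unigram_logprobs):
    """Stand-in for an LM's sentence log-probability: sums per-word
    log-probabilities from a toy unigram table (OOV words get a floor)."""
    oov = math.log(1e-6)
    return sum(unigram_logprobs.get(w, oov) for w in sentence.lower().split())

def minimal_pair_accuracy(pairs, score):
    """A model is correct on a pair when the acceptable sentence
    receives a higher score than the unacceptable one."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)
```

With a real model, `score` would be the summed token log-probabilities of the sentence; the comparison logic is unchanged.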
To assess the abilities of LMs on more typical downstream NLP tasks, we evaluate on a mixture of tasks from a subsample of (Super)GLUE, which consists of text classification tasks. We include a variety of task types, including paraphrase detection (MRPC, QQP), sentiment classification (SST-2), natural language inference (MNLI, QNLI, RTE), question answering (BoolQ, MultiRC), acceptability judgments (CoLA), and commonsense reasoning (WSC).

Hidden Tasks
Two weeks before the results deadline, we released three hidden evaluation tasks: the Mixed Signals Generalization Set (MSGS), a supplement to BLiMP, and an age-of-acquisition (AoA) prediction task. MSGS and the BLiMP supplement were mandatory; AoA prediction was provided as an additional analysis point for participants to use when writing their papers. The motivation for using hidden tasks was to prevent our evaluations from rewarding submissions that overfit to the BLiMP and (Super)GLUE tasks.
The BLiMP supplement includes five test suites consisting of BLiMP-style minimal pairs that cover areas of linguistic knowledge not tested by BLiMP, namely dialogue and questions. The test suites are semi-automatically generated using manually filled templates. As with BLiMP, models are evaluated on the supplement in a zero-shot manner, by comparing the probabilities of the sequences in a minimal pair, under the assumption that the acceptable sequence will be more probable than its unacceptable counterpart.
HYPERNYMS. We evaluate LMs' knowledge of lexical entailment, i.e., hypernym-hyponym relationships. This task bears similarity to natural language inference (Dagan et al., 2006; Bowman et al., 2015; Williams et al., 2018), but we instead measure whether models assign a higher likelihood to valid statements of entailment compared to minimally differing invalid statements. The evaluation data is designed around manually written triples consisting of ⟨hypernym, base, hyponym⟩, for example, ⟨plant, herb, basil⟩. We also specify an other noun (for example, flower) which shares the hypernym but not the hyponym with the base noun. From these nouns, plus a set of manually written contexts, we generate six types of minimal pairs, shown in Table 5 in Appendix C. Additionally, we randomly vary the text used to convey entailment, e.g., If p then q, If p that means q, p therefore q, etc.
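One way such pairs could be generated from a triple is sketched below; the template and function name are our own, and the real task uses six pair types with randomly varied connectives (Table 5 in Appendix C).

```python
def entailment_pair(triple, other):
    """Build one illustrative minimal pair from a <hypernym, base, hyponym>
    triple plus an `other` noun that shares the hypernym but not the
    hyponym. Entailing from hyponym to base is valid; swapping in
    `other` makes it invalid. (This sketch shows a single pair type
    with a single fixed template.)"""
    hypernym, base, hyponym = triple
    valid = f"If that is a {hyponym}, then it is a {base}."
    invalid = f"If that is a {hyponym}, then it is a {other}."
    return valid, invalid

# e.g. from <plant, herb, basil> with other noun "flower":
good, bad = entailment_pair(("plant", "herb", "basil"), "flower")
```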

SUBJECT-AUXILIARY INVERSION. The subject-auxiliary inversion rule applies in question formation in English (e.g., relating "Logan will go." to "Will Logan go?"). This task has been used to evaluate language models' syntactic abilities and preferences (e.g., McCoy et al., 2020; Mueller et al., 2022; Yedetore et al., 2023; Mueller and Linzen, 2023). Our test data was created by Warstadt (2022, Ch. 6), where it is described in more detail.
TURN-TAKING. Comprehending dialogue requires tracking the grammatical properties of utterances from multiple speakers. Pronouns such as I, you, and she are indexicals, meaning their interpretation depends on the speaker's context and identity. This test suite evaluates whether LMs can predict which pronoun is appropriate when there is a change in speaker. For example, if person A asks person B a question of the form Can I ..., person B's response should begin with You, not I. Our tests include (i) cases where the pronoun is expected to change, and (ii) cases where it is not. We also vary the context length (and therefore the distance between the context pronoun and the target), and whether the context contains a distractor pronoun in an embedded position. Finally, for each example, we randomly select one from a set of formats for indicating the speaker, e.g., A: ..., B: ..., or "...," he asked. "...," she said., etc. Examples of each format can be found in Table 6 in Appendix C.
QUESTION-ANSWER CONGRUENCE. The syntax of a question constrains the acceptable responses. For example, a congruent answer to a who-question must be an animate noun (or contain one in a suitable context). This test suite evaluates whether LMs assign a higher likelihood to congruent answers than to incongruent ones, and have therefore learned the cross-sentential dependency between a wh-word and an answer. In addition to a set of EASY test cases, we construct a set of adversarial TRICKY test cases in which there is a highly salient distractor answer that is not congruent with the wh-word. We randomly vary whether the answer appears as a fragment or in a complete sentence, as well as the format for indicating the speaker. See Table 7 in Appendix C for examples.
Mixed Signals Generalization Set. The Mixed Signals Generalization Set (MSGS; Warstadt et al., 2020b) is a text classification task that evaluates the inductive biases of language models. For an MSGS subtask, models are finetuned on an ambiguous training set where the labels are consistent with both a syntactic generalization and a surface generalization, and then evaluated on examples that disambiguate which generalization the model converged on (if any).[6] Ideally, models would be more sensitive to linguistic features than surface features, as a systematic preference for abstract linguistic properties allows models to generalize more robustly to unseen structures. The metric for MSGS is the Matthews correlation coefficient between the model's predictions and the labels according to the linguistic generalization on the test set. A coefficient of 1 corresponds to a systematic linguistic generalization, and -1 to a systematic surface generalization. Indeed, Warstadt et al. (2020c) find that linguistic bias increases with the volume of pretraining data, and that models with RoBERTa-like architectures require more than a billion words of pretraining data to achieve an overall linguistic bias (i.e., a score greater than 0).

Age-of-acquisition prediction. Optionally, participants could evaluate on the age-of-acquisition (AoA) prediction task of Portelance et al. (2023). When humans are learning language, they tend to acquire certain words at specific ages; the age of acquisition of a word is the age at which humans acquire that word. The AoA prediction task compares LMs' word surprisals with children's AoA of the same words. A language model's average surprisals are converted into AoA predictions, and these are then compared to the actual average AoA (in months) of those words. Models achieving a lower mean absolute deviation between the actual and predicted ages are said to perform better on the task.[7] While we did not require participants to submit these scores as part of their predictions, we provided code to make evaluation on this task simple, so that they could include this score as an additional analysis point in their paper submissions. 7 teams (22.6%) evaluated on the AoA prediction task; see Appendix E for results and discussion.

[6] For example, one of the subtasks tests which of the following two generalizations the model's inductive bias favors: whether the word "the" is present (the surface generalization), or whether the sentence contains an adjective (the syntactic generalization). Thus, training examples include only ambiguous labeled pairs where these two properties are both perfectly correlated with each other and with the binary labels, such as (The big dog barked, 1) and (A dog barked, 0). At test time, the model must classify held-out sentences where the features are anti-correlated, such as A big dog barked and The dog barked. If the model predicts labels 1 and 0 respectively for these and analogous examples, we infer that it classifies examples based on the linguistic feature; if it predicts 0 and 1, it adopted the surface generalization.
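The MSGS metric can be computed directly from its definition; the following is a standard Matthews correlation implementation for binary labels, not the challenge's own evaluation code.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation between binary label sequences.
    For MSGS, y_true follows the linguistic generalization, so
    +1 means the model generalized linguistically in a fully systematic
    way, and -1 means it systematically adopted the surface generalization."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```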

Evaluation Pipeline
The organizers provided code to unify the evaluation setup across submissions. This was released as a public repository on GitHub.[8] The evaluation pipeline supports models implemented in HuggingFace, though we did not restrict the model submissions to HuggingFace-based models.[9] For model and result submissions, users were required to (i) upload a link to their model (on any file-hosting service), and (ii) provide model predictions for each example of each task (via Dynabench); we provided a template specifying the format of the predictions file.
Data preprocessing. The NLP tasks in our evaluation pipeline often contain vocabulary that is not present in the BabyLM pretraining corpora. To address this mismatch, we filtered each task according to its lexical content: if an example contained any word that appears fewer than twice in the Strict-Small training corpus, we filtered the example out. Otherwise, each dataset is presented in its original format. See Table 4 in Appendix B for details on the sizes of the filtered datasets.
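The filtering rule might be implemented roughly as follows; the naive whitespace tokenization and function name are our assumptions, not the pipeline's actual code.

```python
from collections import Counter

def filter_by_vocab(examples, training_corpus_words, min_count=2):
    """Drop any evaluation example containing a word that appears fewer
    than `min_count` times in the (Strict-Small) training corpus.
    `examples` are raw strings; tokenization is naive whitespace splitting."""
    counts = Counter(training_corpus_words)
    return [ex for ex in examples
            if all(counts[w] >= min_count for w in ex.lower().split())]
```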

Evaluation Paradigms
Zero-shot evaluation. For the zero-shot tasks (BLiMP and the BLiMP supplement), we modify the BigScience fork of the lm-eval-harness repository, originally by EleutherAI (Gao et al., 2021). This provides functionality for scoring autoregressive decoder-only LMs and encoder-decoder LMs. For encoder-only LMs, we modify the repository to support masked language model scoring as described in Salazar et al. (2020).[10]

Finetuning. We first attempted zero-shot learning and few-shot in-context learning for the (Super)GLUE and MSGS tasks. However, this often resulted in random-chance accuracies from each of our baselines; we therefore employ finetuning.[11] For the tasks requiring finetuning, (Super)GLUE (Wang et al., 2018, 2019) and MSGS (Warstadt et al., 2020b), we base our scripts on HuggingFace's example finetuning scripts for text classification.[12] We modified the script to support encoder-decoder models and to work for a wider variety of tasks. We provide a default set of hyperparameters that we found to work well across our baseline models, though participants were allowed to freely modify hyperparameters.
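The masked-LM scoring of Salazar et al. (2020) can be sketched independently of any particular model; here `masked_logprob` is a hypothetical stand-in for a real masked language model's prediction at the masked position.

```python
import math

def pseudo_log_likelihood(tokens, masked_logprob):
    """Masked-LM (pseudo-log-likelihood) scoring: mask each position in
    turn and sum the log-probability the model assigns to the held-out
    token given the rest of the sentence. `masked_logprob(context, i, token)`
    stands in for a real masked LM; `context` has '[MASK]' at position i."""
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += masked_logprob(context, i, tok)
    return total
```

Unlike left-to-right scoring, every position is conditioned on both left and right context, which is what makes this suitable for encoder-only models.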

Dynabench Leaderboard
Dynabench is an open-source platform for dynamic dataset creation, model evaluation, and leaderboard hosting (Kiela et al., 2021). In addition to open-sourcing datasets, including adversarial and human-in-the-loop datasets (Nie et al., 2020; Bartolo et al., 2021; Potts et al., 2021; Sheng et al., 2021; Vidgen et al., 2021; Kirk et al., 2022), Dynabench has offered leaderboard support for several community challenges in the past (Wenzek et al., 2021; Bartolo et al., 2022; Mazumder et al., 2022). Given that we desired a dynamic leaderboard that allows for submissions even after the end of the challenge, this platform was well-suited to the BabyLM Challenge. All model submissions to the challenge were submitted via the Dynabench platform, to the respective leaderboards for the Strict,[13] Strict-Small,[14] and Loose[15] tracks.
Each leaderboard presents aggregate scores across all tasks, which can be interactively broken down into more fine-grained scores per task and per subtask. To compute the aggregate score, we weigh BLiMP and the BLiMP supplement together at 50% (all subtasks weighted equally), (Super)GLUE at 30%, and MSGS at 20%. This weighting scheme was arrived at heuristically, though we did observe that the winners for each track were stable across a wide range of reasonable weightings. Dynabench allows users to specify a custom task weighting to compute an alternative aggregate score. The leaderboard for the BabyLM Challenge will continue to accept submissions indefinitely.
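The aggregate leaderboard score can be reproduced from the stated weights; the function and score values below are illustrative, not the Dynabench implementation.

```python
def aggregate_score(task_scores, weights=None):
    """Weighted aggregate used for the leaderboard: BLiMP plus its
    supplement at 50% (all subtasks weighted equally), (Super)GLUE at
    30%, and MSGS at 20%. `task_scores` maps each task group to a list
    of subtask scores, which are averaged before weighting."""
    if weights is None:
        weights = {"blimp": 0.5, "glue": 0.3, "msgs": 0.2}
    total = 0.0
    for task, w in weights.items():
        subscores = task_scores[task]
        total += w * sum(subscores) / len(subscores)
    return total
```

Passing a different `weights` dict mirrors Dynabench's custom-weighting feature for computing alternative aggregates.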

Baselines and Skylines
Baselines. To provide simple baselines for our evaluation tasks, we train multiple models on the data released for the Strict-Small and Strict tracks and evaluate them on the evaluation tasks. Three baseline models are provided: OPT-125M, RoBERTa-base, and T5-base. These models use the same objective functions and network architectures as in their original papers (OPT: Zhang et al., 2022; RoBERTa: Liu et al., 2019; T5: Raffel et al., 2020). The network architectures of these models cover encoder-decoder (T5-base), encoder-only (RoBERTa-base), and decoder-only (OPT-125M) designs. Their objective functions include next-token prediction (OPT-125M), masked-token prediction (RoBERTa-base), and sequence-to-sequence (T5-base) losses. The baseline models are trained using a fixed context length of 128, a constant learning rate of 1e-4 with a linear warmup from 0 over the first 5000 steps, a batch size of 128, and the AdamW optimizer (Loshchilov and Hutter, 2019). Although most of these hyperparameters are loosely inspired by Huebner et al., we expect that the specific choices can be further improved and leave such improvements as possible topics for submissions. We find that our baseline models achieve reasonable performance on the evaluation tasks, with clear improvement from the additional data in the Strict track over Strict-Small, and a notable gap relative to their counterparts pretrained on much larger datasets.
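The stated baseline hyperparameters could be collected into a config like the following; only the listed values come from the text, and anything else about the training loop is an assumption.

```python
# Baseline training hyperparameters as stated in the text; warmup shape
# and optimizer name are from the text, everything beyond these keys
# (e.g., number of steps, weight decay) is left unspecified on purpose.
baseline_config = {
    "context_length": 128,
    "learning_rate": 1e-4,   # constant after warmup
    "warmup_steps": 5000,    # linear warmup from 0
    "batch_size": 128,
    "optimizer": "AdamW",    # Loshchilov and Hutter (2019)
}
```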
Skylines. To get an approximation of how well larger models could, in principle, perform on our tasks and setting, we ran Llama 2 70B (Touvron et al., 2023) and the fully trained RoBERTa-base model through our evaluation pipeline. This is meant to provide a comparison point to the state of the art in 2023, as the Llama 2 model is pretrained on much more data (2T tokens) than the challenge allows, and it has far more parameters than we expect to find in submissions. We evaluate Llama 2 on (Super)GLUE using in-context learning, but it is fully finetuned on MSGS. BabyLM submissions that approach these scores can be considered to have greater sample efficiency than the skyline models, and may therefore provide stronger starting points for future research in sample-efficient NLP.

Submissions Summary
We received 31 papers and 162 models in total. Some participants submitted to multiple tracks; we show data for unique participants in Figure 2. We found that many submissions focused their efforts on similar techniques. To better quantify this, we devised a typology of the nine most common approaches and assigned each submitted model one or more labels. Figure 3 shows the number of submissions employing each approach. §7.3 provides more detailed descriptions of each approach, as well as results indicating which ones were most effective.
The official leaderboard is available on Dynabench.16 With the consent of participants, we release links to submitted models, their complete predictions for the evaluation tasks, their scores for each task and subtask, and metadata about each submission at the BabyLM GitHub at https://github.com/babylm/submissions2023. We provide a summary of each submission in Appendix F.

Overall Results & Track Winners
The results from all submissions are shown in Figure 4, with the scores of the top-performing models in each track detailed in Table 2. In the figure, dashed green lines show the performance of the Llama 2 skyline. Solid green lines show human performance on GLUE as reported in Nangia and Bowman (2019), and human performance on BLiMP as reported by Warstadt et al. (2020a). Before discussing the winning systems in each track, we note a few high-level takeaways from these results. The strongest results were achieved by models in the Strict track. Given the Strict track's larger training corpus relative to the Strict-Small corpus, it is not surprising that these models could outperform those in the Strict-Small track. However, there are two interesting trends. First, Strict models did not outperform those in Strict-Small by a large amount, even though the training data was an order of magnitude larger. For example, only two models in the Strict track achieve higher GLUE scores than the best-performing Strict-Small model. Second, models in the Loose track tended to perform worse in the aggregate than those in the Strict-Small track, even though they potentially had access to additional (non-linguistic) data. One conclusion we can draw from this is that learning from multiple modalities of data presents a challenge in its own right, and that current model architectures are not optimized to efficiently utilize multiple types of inputs during training.
The other important high-level takeaway is that many BabyLM models come very close to the Llama 2 skyline, and to achieving human-level performance on BLiMP and GLUE (i.e., they are near the green lines in Figure 4). Strong performance could be expected in the case of (Super)GLUE, where models were finetuned with additional data, but we note that even for BLiMP, the top-performing model is only about 3% shy of human performance. Note that prior to the start of the challenge, we explored the possibility of measuring zero-shot performance on the (Super)GLUE test sets, and found zero-shot performance to be at or below chance for our baselines. This fact, as well as the consideration that GLUE has traditionally been evaluated using finetuning, led us to select finetuning evaluations for the (Super)GLUE benchmark(s).
Given that successful training on developmentally plausible corpora could have ramifications for cognitive and linguistic theories of learnability (Wilcox et al., 2023; Warstadt and Bowman, 2022), these results point to two important takeaways: (1) Human-level results have not yet been achieved. However, (2) given the strong performance of the top-scoring models, human-level results appear likely to be achieved very soon, possibly within the next few years. Of course, one possible concern is the following: current models may not be close to human-level performance; rather, current performance metrics, like BLiMP, might not accurately measure human-level linguistic competence. We are sympathetic to such concerns, but we also note that BLiMP, and other related syntactic benchmarks such as those presented in Marvin and Linzen (2018) and Gauthier et al. (2020), were specifically designed to mimic the types of tests invented by linguists and cognitive scientists to reveal syntactic competence, i.e., they are all based on minimal-pair sentences. Thus, while it is imperative to continue building more comprehensive and larger datasets, we believe it is fair to say that the close-to-human scores observed in the BabyLM Challenge on BLiMP reflect genuine grammatical generalizations learned by the models.

Winning Submissions
Below, we discuss the winning submissions from each track in greater detail. We also mention the winners of our "Most Interesting Paper" awards and provide a brief justification for each.
Strict track. The winner of the Strict track is ELC-BERT, submitted by Charpentier and Samuel (2023). This model, as well as the runner-up submission Boot-BERT (Samuel, 2023), used as its starting point the LTG-BERT architecture from Samuel et al. (2023). Although these submissions make additional incremental improvements to the LTG-BERT training regime, their own baselines suggest that the backbone architecture plays a large role in the submissions' successes. LTG-BERT's main contribution is a synthesis of several optimizations to the Transformer architecture, namely: (1) additional layer normalization, following Shleifer et al. (2021); (2) GEGLU feedforward modules (Shazeer, 2020); (3) disentangled attention following DeBERTa (He et al., 2021); and (4) scaled weight initialization following Nguyen and Salazar (2019). ELC-BERT modifies this backbone such that the input to each layer is a weighted sum of the outputs of all previous layers. Another notable property of LTG-BERT is that all models with this architecture so far have been trained for a large number of epochs: Charpentier and Samuel (2023) train for over 450 epochs for their Strict submission, and over 2000 epochs for their Strict-Small submission. LTG-BERT models performed exceptionally well on our evaluations, outperforming not only every other submission to the shared task but also the Llama 2 and RoBERTa-base skylines on overall score and on all test suites except (Super)GLUE (Table 2). The second runner-up for this track was McGill-BERT (Cheng et al., 2023).
Strict-Small track. The winner of the Strict-Small track is, again, ELC-BERT (Charpentier and Samuel, 2023). This double win demonstrates that the model's architectural choices work well at multiple scales of pretraining data. The runners-up were MLSM (Berend, 2023b) and McGill-BERT (Cheng et al., 2023).

Most interesting paper awards. These awards are given to papers that go beyond achieving high scores on a leaderboard, and instead demonstrate contributions to the shared task based on interesting analyses, useful negative results, creative modeling choices, or a combination thereof. We awarded two most interesting paper awards, in two different categories.
Outstanding evaluation. The most interesting paper award for outstanding evaluation was given to "Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures" (Steuer et al., 2023). This work goes beyond the BabyLM evaluation tasks: the authors use measures of human cognitive processing effort and linguistic competence and correlate these with BabyLM task performance. Their work assesses BabyLM submissions as models of human language processing, thus contributing to our understanding of how to better train cognitive models.
Compelling negative results. The most interesting paper award for compelling negative results was given to "CLIMB: Curriculum Learning for Infant-inspired Model Building" (Martinez et al., 2023). This work proposes a typology of common curriculum learning approaches and performs a thorough and principled evaluation exploring this design space. Although the authors find that none of the tested approaches leads to widespread improvements across the evaluation tasks, the exhaustiveness of this search and the careful controls and baselines in the study make this negative result a valuable contribution.

Common Methods
One of the main objectives of the BabyLM Challenge is to compare and contrast methodological choices for sample-efficient pretraining. To do so, we hand-coded each submission based on the method(s) it employs. Figure 3 shows the number of submissions using each approach, and we visualize the performance of different methods in Figure 5. We also present a similar figure separated by the underlying architecture (Figure 6). Each of these approaches is discussed in further detail below. We highlight two high-level takeaways to start: First, curriculum learning, which was the most popular approach, did not tend to produce high scores (although one curriculum learning model did perform well). Second, the highest-performing models were ones that made architectural modifications, namely those based on the LTG-BERT architecture.
Teacher-student or auxiliary model. Many papers trained their submitted models with the aid of additional models. According to our rules, this was permissible as long as any auxiliary models were trained on the BabyLM corpus. Knowledge distillation using auxiliary models was often a successful approach: Samuel (2023) considered an exponential moving average teacher model (Tarvainen and Valpola, 2017), while Berend (2023b) modeled a latent semantic feature distribution from a teacher model. Timiryasov and Tastet (2023) performed distillation from an ensemble of teachers. Others used auxiliary models to select appropriate training examples for a curriculum (Chobey et al., 2023; Hong et al., 2023), or trained a reward model for use in reinforcement learning (Zhao et al., 2023).
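The exponential-moving-average teacher of Tarvainen and Valpola (2017) can be sketched as follows. This is a simplified illustration over flat parameter lists, with an illustrative decay value; it is not the submission's actual implementation.

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """EMA teacher update: each teacher parameter tracks an exponential
    moving average of the corresponding student parameter, so the teacher
    is a slowly moving, smoothed copy of the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# With decay=0.9, the teacher moves 10% of the way toward the student
new_teacher = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)  # → [0.9, 1.8]
```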
Data preprocessing. Many submissions modified the format of the pretraining corpus. When controlled comparisons were performed, these preprocessing steps often led to improvements. In §7.2 we discuss the successful Contextualizer method for constructing new training samples. Other successful approaches used short sequences or individual sentences as training samples, rather than long portions of documents (Govindarajan et al., 2023; Cheng et al., 2023; Edman and Bylinina, 2023). Among the more unique approaches in this space was Baby's CoThought (Zhang et al., 2023), which used an LLM to reformat unrelated sentences from the corpus into coherent paragraphs.
Hyperparameter tuning and model scaling. This was a relatively common approach. Many submissions performed extensive hyperparameter searches, producing hard-won hyperparameter settings that work well on smaller datasets. While extensive hyperparameter search can be expensive and challenging when scaling up to full-sized pretraining, in our limited-data regime, consistently successful modifications include reducing context length (see "Data preprocessing" above) and training for more epochs, or for longer epochs with data augmentation (Jumelet et al., 2023; Bhardwaj et al., 2023; Yang et al., 2023; Xiao et al., 2023; Samuel, 2023; Charpentier and Samuel, 2023). However, results are mixed when modifying model size: some participants achieved better results when scaling model sizes up (Çagatan, 2023), while others were able to perform well using very small models (Proskurina et al., 2023). More controlled studies using a variety of architectures and datasets are needed to determine whether scaling up or down is the better solution.
Multimodal learning. Multimodal learning was one of the directions where we expected the most interest and the most submissions; however, we received few submissions based on multimodal inputs, and the multimodal submissions did not reliably achieve higher overall accuracy. One submission used music (Govindarajan et al., 2023), another used vision-and-language data (Amariucai and Warstadt, 2023), a third explored text and audio (Wolf et al., 2023), and a fourth incorporated text-and-image data and lexical sensorimotor data as part of the embedding process using multiplex networks (Stella et al., 2017; Ciaglia et al., 2023).
Music training produced minor improvements on some subtasks, while the vision-and-language system marginally improved over the baselines in the Strict-Small track. The multiplex network did not produce performance gains, though it did allow the participants to reduce the number of parameters while preserving performance relative to the baselines. WhisBERT was reported to be undertrained, making its results difficult to interpret.
Architecture modifications. The winning submission made architectural modifications: Charpentier and Samuel (2023) made slight improvements to LTG-BERT (see §7.2 for more on this architecture) by taking a weighted sum over the outputs of all previous layers. Momen et al. (2023) used the relatively novel StructFormer architecture (Shen et al., 2021), which encourages tree-structured representations of inputs.
Training objectives. Some submissions trained language models using a mixture of a language modeling objective and some other objective. Knowledge distillation from teacher models (see the paragraph titled "Teacher-student or auxiliary model" above) was the most common modification. Martinez et al. (2023) simplified the masked language modeling objective by coarse-graining the output classes, with little effect. Govindarajan et al. (2023) achieved improvements on specific BLiMP subtasks by modifying the masking procedure to preferentially mask specific words thought to be relevant to a particular phenomenon tested by BLiMP.
Linguistic bias. Some submissions tried to impart human linguistic biases to models. Such approaches discussed above include curriculum learning based on linguistically motivated data-sorting methods and architectures like StructFormer that encourage hierarchical analyses of inputs. Chen and Portelance (2023) also pretrained with token embeddings obtained via grammar induction, and Thoma et al. (2023) built their tokenizer vocabulary with a cognitively inspired model of word segmentation.

Future BabyLM Challenges
The first iteration of the BabyLM Challenge yielded many successes, but also some organizational and scientific challenges. The lessons learned from our findings can improve future iterations of this challenge.
We were surprised that there were significantly more submissions to the Strict-Small track than to the other two tracks combined, considering that the Loose track allows for a much wider variety of methods. However, this is understandable from the perspective of compute: training on Strict-Small is the least computationally expensive of the tracks, and it constrains the model search space enough that ideas are perhaps easier to define and execute. In future iterations of the BabyLM Challenge, it could be interesting to provide more specific and constrained Loose tracks that focus on particular research directions, for example, LLM-assisted low-resource pretraining, allowing expert annotations during pretraining, or joint text and audio modeling.
We can also draw insights from the data preprocessing and hyperparameter tuning submissions in particular, and standardize them into the dataset and evaluation pipeline. For example, we could preprocess the data in ways the present challenge has shown to be effective. This could include sorting the data according to the curriculum learning method that yielded performance gains, providing better starting hyperparameters, and training a baseline with the best architecture.
Although data quantity was the main focus of this iteration, we may also consider rewarding compute efficiency in the future. Many of the most successful submissions consumed a great deal of compute by training for many epochs. Indeed, the winning submission trained on about as many samples as BERT, despite having a training set only about 3% as large. While this finding is interesting, it does little to help achieve our goals in §2: training for hundreds of epochs is not cognitively plausible, and it does not make it easier or more accessible to test novel training approaches or to train models on a university budget.
The evaluation pipeline was built on the existing lm-evaluation-harness repository,17 but maintaining and updating it for this challenge was no small feat for a single organizer. In future iterations of the challenge, it would be beneficial to have a larger dedicated support team for the evaluations. A dedicated team could also allow us to handle a greater variety of submissions, including those not supported by HuggingFace.

Conclusions
The BabyLM Challenge encouraged participants to think small. We asked: can we improve language modeling on smaller and more cognitively plausible datasets? The submitted systems employed diverse methods, but the most consistent gains came from modified model architectures, new training objectives, principled preprocessing of the pretraining corpora, and hyperparameter searches. In one case, a curriculum learning method resulted in significant improvements. Future work can build on these findings to further improve language modeling for low-resource settings and for cognitive modeling research.
Table 11: Mean average deviation (MAD) in months across cross-validation folds when predicting the age of acquisition of words. Lower MAD scores are better. We present all systems that evaluated on AoA prediction, as well as the baseline model with the best scores per track. We bold the highest-scoring system for each task within each track.

F Summary of Each Submission
GPT-wee (Bunzeck and Zarrieß, 2023). This paper tests various approaches to reordering the training examples based on word and sentence statistics. The motivation comes from usage-based linguistics and the idea that frequent lexical items, such as phrases or common groups of words, are learned early (rather than individual words, for instance). The authors also find that training for more epochs (up to 10) helps, and that a medium-sized model can be as good as larger models.
Tiny Language Models with Multiplex Networks (Fields et al., 2023). This approach leverages multimodal data (including text/visual data and sensorimotor data) as part of the embeddings of an ELECTRA language model. The proposed models are very small (as few as 7M parameters) and perform well on BLiMP. For reference, the baseline models contain 125M to 220M parameters.
Mini Minds (Proskurina et al., 2023). This submission explores how scaling down models (in terms of number of parameters) can help in low-data settings. The authors conduct a parameter search for scaled-down versions of GPT-2 and RoBERTa, and find that optimal models have around a 2-to-1 ratio of attention heads to layers. They train two models and find that they perform about as well as higher-parameter-count models on GLUE. Furthermore, the authors test their models on an ethical reasoning benchmark and find that the small models perform about as well as models with roughly ten times the parameters.
Grammar induction pretraining (Chen and Portelance, 2023). This submission introduces syntactic bias into the static token embeddings of an LM. An unsupervised grammar induction system is trained on a 1-million-word subset of the Strict-Small corpus, and the resulting static token embeddings are used to initialize the LM token embeddings. Although the results improve over the BabyLM Strict-Small baseline, similar improvements are observed with a custom baseline model using randomly initialized token embeddings. Thus, there is no evidence that the grammar induction step had a positive impact on LM results.
ChapGTP (Jumelet et al., 2023). This work explores how targeted data augmentation can improve the performance of masked language models in the Strict-Small track. The authors used regex patterns to extract common phrases from the GLUE tasks and then used these patterns to generate follow-up questions that served as additional training data. They also found that increasing the number of training epochs, up to 200, continues to help performance.
BabyBerta+ (Yang et al., 2023). This submission replicates the BabyBERTa training setup (Huebner et al., 2021) and tests its ability after pretraining on the Strict-Small corpus. The authors find that a small model trained for many epochs keeps improving and surpasses the baseline models on grammatical evaluations, but not on downstream tasks.
Keeping Training Simple for BabyLMs (Edman and Bylinina, 2023). This paper proposes a variety of complexity metrics for reordering the BabyLM Strict-Small data from simple to complex. Compared to no curricula and reversed curricula, the proposed curricula do not result in consistent performance improvements on the BabyLM evaluation tasks. However, reducing the context length to 32 (from the baselines' 128) results in significant and consistent performance improvements.
Can Training Neural Language Models on a Curriculum with Developmentally Plausible Data Improve Alignment with Human Reading Behavior? (Chobey et al., 2023). This paper explores surprisal-based curricula for pretraining on the Strict-Small dataset of the BabyLM Challenge. The authors use an ensemble of LSTM "teacher" models to rank sentences by average surprisal, on which a final OPT model is trained. Results are mixed: the authors find that their model does not outperform a random baseline. However, when this model is further trained on the randomly ordered training dataset after training on the curriculum-ordered data, it does beat the baseline. As an additional analysis, the authors investigate the ability of their model to predict human reading times for syntactically complex sentences, finding that the model is not particularly good at the task, but that it is about equivalent to baselines trained on much larger datasets.
CLIMB (Martinez et al., 2023). This submission presents a thorough comparison of different approaches to curriculum learning in the Strict-Small setting. The authors consider three main criteria for curriculum design: the size of the input vocabulary, the difficulty of the training sample, and the size of the output space for MLM prediction. They conduct experiments exploring eight different curricula sorted into these three main approaches. While there are many small differences in performance among these settings, curricula provide no consistent improvements over more naive training algorithms.
Acquiring Linguistic Knowledge from Multimodal Input (Amariucai and Warstadt, 2023). The authors explored whether vision-language co-training helps the learning of linguistic knowledge. They trained models on Wiki texts with images using a state-of-the-art multimodal model (FLAVA). After varying the amount of training data and the number of images used, the authors found that visual input provides only a slight improvement on grammar benchmarks for 10M-word training, but not for 100M-word training.
GPT-like Models are Bad Babies (Steuer et al., 2023). This paper trains a decoder-only model, trying different hyperparameters, including reordering the training data in different ways (based on cues which did not improve over regular shuffling), different model sizes, and layer widths, among other features. The main focus of the paper is to test whether models that perform better on BabyLM evaluation tasks are also better at modeling reading difficulty in humans. Surprisingly, models performing better on BabyLM tasks performed worse at modeling reading difficulty.
Baby's CoThought (Zhang et al., 2023). This system leverages a large language model, GPT-3.5-Turbo, to reformat semantically unrelated sentences into cohesive paragraphs. In low-data settings, this approach can form better training examples for language models; the proposed approach results in improvements across BLiMP tasks, though performance is not significantly different on (Super)GLUE or MSGS. Note that the LLM is trained on far more than 100M words, so this submission technically does not qualify under any track. However, this method does improve the sample efficiency of the student model, and it aids our understanding of what types of data are best for supervising smaller language models.

ToddlerBERTa (Çagatan, 2023). This paper conducts a thorough hyperparameter investigation of the BabyBERTa model, exploring different options for model sizes and training algorithms. The author finds that larger models tend to perform better.
CogMemLM (Thoma et al., 2023). This work explores an approach to word segmentation and tokenization that is intended to model vocabulary growth during learning. A vocabulary is cumulatively built using a cognitively inspired model of word segmentation, in which strings are split into chunks based on an activation weight that changes throughout training depending on how often the chunks are observed together. While the approach achieves consistent improvements over the BabyLM Strict baseline results, it is not clear whether these improvements are due to the segmentation scheme or to other hyperparameter modifications.
BabyStories (Zhao et al., 2023). This paper investigates how reinforcement learning from human feedback (RLHF) improves the performance of causal language models pretrained on small datasets. The authors report that models finetuned with RLHF on short stories yield better performance on language understanding benchmarks, though this improvement is only observed in larger models. Their findings suggest that benefiting from RLHF requires a large number of trainable parameters.

Byte-ranked Curriculum Learning (DeBenedetto, 2023). This paper proposes a curriculum learning approach for reordering data based on non-linguistic metrics. Specifically, the author chooses the order in which datasets are shown to the model, starting from the minimal number of bytes per sentence and going up. This happens to also start from spoken data and proceed to text data later. The paper also shows that a larger model, as well as more epochs, improves the results.

McGill BabyLM Submission (Cheng et al., 2023). This paper finds that changes to the data format have large positive impacts. Specifically, not using sequence packing, using sentences rather than documents as examples, not truncating, and reducing maximum sequence length are each highly effective. By contrast, adding supervision from POS tags and using unsupervised syntactic induction have negligible impact.
Mean BERTs make erratic language teachers (Samuel, 2023). This submission presents Boot-BERT, a latent bootstrapping approach to language modeling in low-resource settings. In the latent bootstrapping setup, a student model is trained both to produce predictions over words and to match contextualized embeddings from a teacher model. In turn, the teacher's embeddings are obtained via a moving average of the student's. The authors use LTG-BERT (Samuel et al., 2023) as an encoder backbone, as well as for a baseline.21 They find that Boot-BERT outperforms LTG-BERT on some of the BabyLM tasks, including GLUE, for both the Strict and Strict-Small tracks.
Every Layer Counts BERT (ELC-BERT) (Charpentier and Samuel, 2023). This submission takes as its starting point the very effective LTG-BERT architecture from Samuel et al. (2023) and modifies it such that the input to each layer is a weighted sum of the outputs of all previous layers, where the weights can be learned but also biased by initialization. Several variations are explored, including equal initial weights and initial weights biased towards the previous layer. Results on the BabyLM evaluations do not strongly suggest that any one variant is clearly better than the LTG-BERT baseline, though all models perform significantly better than the BabyLM RoBERTa baseline. Additionally, inspection of the learned weights for combining previous layer outputs suggests that the most important outputs are from the previous few layers and the static embedding layer.
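The core ELC-BERT modification, in which each layer consumes a weighted sum of all previous layers' outputs, can be sketched as follows. This is a simplified, framework-free illustration: in the actual model the weights are learned jointly with the network and the sum is taken over hidden-state tensors, not plain lists.

```python
def elc_layer_input(prev_outputs, weights):
    """Compute the input to a layer as a weighted sum of the outputs of
    all previous layers (including the static embedding layer).
    `prev_outputs` is a list of equal-length vectors; `weights` holds one
    scalar per previous layer."""
    dim = len(prev_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, prev_outputs))
            for d in range(dim)]

# Two previous layers, equal weights: the input is their elementwise mean
layer_in = elc_layer_input([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])  # → [2.0, 3.0]
```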
WhisBERT (Wolf et al., 2023). In this submission, the authors explore whether text-and-audio co-training helps model performance on BLiMP tasks. After pretraining a multimodal model (FLAVA) on 100M words with or without their corresponding word-aligned speech, they find that the speech-augmented model outperforms the text-only model on 11 out of 17 grammatical tasks.
Surprisal-based active curriculum learning (Hong et al., 2023). This submission combines curriculum and active learning to schedule the training order for models. The authors use n-gram surprisal to determine the sentences with the highest surprisal and then train their models on examples structurally similar to these high-surprisal sentences. Models with active curriculum learning show noticeable performance gains on (Super)GLUE but underperform models without such learning on MSGS.
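A minimal sketch of surprisal-based example ranking, reduced to unigram surprisal for brevity (the submission uses n-gram models plus a structural-similarity selection step, neither of which is reproduced here):

```python
import math
from collections import Counter

def rank_by_surprisal(sentences):
    """Rank sentences by average unigram surprisal (in bits), highest
    first. Surprisal of a word w is -log2 P(w), with P(w) estimated by
    relative frequency over the whole corpus."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())

    def avg_surprisal(sentence):
        words = sentence.split()
        return sum(-math.log2(counts[w] / total) for w in words) / len(words)

    return sorted(sentences, key=avg_surprisal, reverse=True)

# Rare words yield high surprisal, so "zebra quokka" ranks first
ranked = rank_by_surprisal(["the cat", "the the", "zebra quokka"])
```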
Linguistically Motivated Curriculum Learning (Mi, 2023). This submission tests six linguistic metrics of complexity as curriculum learning approaches. On the Strict-Small track, this approach succeeds in finding improvements over training on the whole corpus in a random order.
Baby Llama (Timiryasov and Tastet, 2023). This submission proposes a knowledge distillation approach with two teacher models (a 300M-parameter Llama model and a 700M-parameter GPT-2 model) trained on the Strict-Small corpus. These are distilled into a 58M-parameter Llama model called Baby Llama. The proposed model outperforms the BabyLM baselines, the teacher LMs, and a 58M-parameter Llama model trained from scratch on the Strict-Small data without distillation.
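Distillation of this kind typically minimizes the cross-entropy between temperature-softened teacher and student output distributions. The sketch below shows this generic soft-target loss; it is not the authors' exact objective, which additionally combines an ensemble of two teachers with a standard language modeling loss.

```python
import math

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation loss: cross-entropy (in nats) between the
    temperature-scaled teacher distribution and the student distribution.
    Minimized when the student matches the teacher."""
    def softmax(logits, t):
        exps = [math.exp(l / t) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# A student that matches the teacher incurs a lower loss than one that
# puts its probability mass on the wrong class
matched = distillation_loss([2.0, 0.0], [2.0, 0.0])
mismatched = distillation_loss([0.0, 2.0], [2.0, 0.0])
```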
Curriculum learning based on sentence complexity approximating language acquisition (Oba et al., 2023). This submission assesses the impact of curriculum learning based on sentence complexity within the context of the Strict-Small task. The authors order the training data based on three sentence-level complexity metrics: number of tokens, number of constituents, and maximum depth of the sentence's dependency parse. They find that the dependency-based ranking leads to better models; however, all curriculum-based models underperform a random baseline.
Masked Latent Semantic Modeling (Berend, 2023b).This paper adopts a method from Berend (2023a) called Masked Latent Semantic Modeling (MLSM) in which the target output distribution can be transformed from a one-hot distribution over the vocabulary into a sparse distribution over latent "semantic property" vectors.Then, the same kind of student-teacher optimization as in knowledge distillation is applied using this modified output distribution instead of the full vocabulary.MLSM on its own is found to lead to degradation in BLiMP performance, although combining MLSM with typical MLM training in a multitask setting leads to similar performance as MLM training alone.-Bevo (Govindarajan et al., 2023).This paper offered submissions to both Strict-Small and Strict tracks and used three design choices for LM training: (i) initially pretraining on music data, following work on transfer learning (Papadimitriou and Jurafsky, 2020), which suggested that musical structure may form a reasonable basis upon which to learn language structure; (ii) subsequently using a training curriculum starting from shorter sequences (128) before moving to longer ones (512), following insights from Press et al. (2021), and (iii) masking critical tokens necessary to perform some of the BLiMP subtasks (e.g., masking "not" for NPI-licensing).Taking final results into consideration alongside ablations, this team found that sequence length matters, music pretraining may help a little, and targeted MLM training seems to help (but only for some BLiMP subtasks, including NPI licensing and Argument Structure).

Contextualizer (Xiao et al., 2023). This paper sorts the corpora in the training dataset loosely by age of acquisition and reading difficulty. The authors then introduce techniques to begin and end training with padding-separated datasets sorted from easy to hard, while the middle of training employs a noisier padding and sorting strategy to improve the model's robustness. The final model performs similarly to a counterpart pretrained with thousands of times more data.
Implicit Structure Building (Momen et al., 2023). This submission introduces an unsupervised hierarchical bias into the transformer, showing that such a structural bias, implemented with StructFormer, improves over the classic MLM Transformer approach. Improvements are not consistent across scenarios: the model excels in single-sentence and syntactic evaluation tasks, but less so in semantic tasks with multi-sentence inputs.
Pretraining LLMs using human-like development data (Bhardwaj et al., 2023). This submission trains RoBERTa, DistilBERT, and GPT-2 models on the Strict and Strict-Small data. The authors find that training DistilBERT for 60 epochs is better than for 20 epochs. They also claim that the performance of the baseline RoBERTa model may not be replicable across random initializations, and that hyperparameter searches should be more thorough to hedge against such outlier models.
On the Effect of Curriculum Learning with Developmental Data for Grammar Acquisition (Opper et al., 2023). This submission explores the effect of curriculum learning, using BabyBERTa models, on the Strict-Small data track. The authors contrast three types of curriculum learning: one that orders input by word frequency, one by sequence entropy, and one by increasing context length. They find that none of these methods produces results above a random-presentation baseline. In a series of follow-up experiments, the authors verify that model performance is linked to the amount of exposure to transcribed speech data and suggest that speech data is a good foundation for curriculum learning.
Difficulty-based Sentence Reordering (Borazjanizadeh, 2023). This study explores two broad approaches to dataset preprocessing to improve LM training in the 10M-word setting: data reordering (curriculum learning) and data cleaning. Results show that reordering a subset of the data by sentence difficulty may lead to marginal improvements, as long as the local coherence of the samples is not damaged too greatly. However, the clearest improvements come from cleaning the data of incoherent, ungrammatical, or non-linguistic strings.
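The data-cleaning step, removing incoherent or non-linguistic strings, amounts to a heuristic line filter. A toy sketch, with thresholds chosen for illustration rather than taken from the paper:

```python
import re

def looks_linguistic(line, min_alpha_ratio=0.6):
    """Heuristically decide whether a line is natural-language text.

    Keep a line only if (a) it is non-empty, (b) a sufficient fraction of
    its characters are alphabetic, and (c) it contains at least one
    multi-letter word. The 0.6 threshold is an assumption, not a value
    reported by the submission.
    """
    stripped = line.strip()
    if not stripped:
        return False
    alpha = sum(ch.isalpha() for ch in stripped)
    if alpha / len(stripped) < min_alpha_ratio:
        return False  # mostly digits, markup, or symbols
    return re.search(r"[A-Za-z]{2,}", stripped) is not None

cleaned = [ln for ln in ["The cat sat.", "=== 1234 %%% $$$"] if looks_linguistic(ln)]
```

In practice such a filter would be tuned on samples of the corpus, since transcribed speech legitimately contains fragments and fillers that a naive threshold might discard.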

Figure 1: Data Scale: Modern language models are trained on multiple orders of magnitude more word tokens than the amount available to a typical child. This image is based on Fig. 1 from Warstadt and Bowman (2022).

Figure 2: Number of participants who submitted to each track, with multiple submissions counted once.

Figure 3: Total number of submitted models that used each of the nine approaches in our typology. We count at most one submitted model per participant per track.
Loose track. The winner of the Loose track is the Contextualizer model of Xiao et al. (2023), which used a data processing scheme in which extra training samples are created by combining chunks of text from different contexts. Repeating this process 40 times for each chunk yields as many training samples as a 4B-word dataset, while drawing on only 100M words. This augmentation technique outperforms training for 40 epochs on the same training samples. Runners-up for this track were McGill-BERT (Cheng et al., 2023) and the BabyStories model of Zhao et al. (2023).
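The augmentation scheme described above can be sketched as sampling and concatenating chunks drawn from different contexts. This is a loose illustration in the spirit of the Contextualizer approach; the chunk size, sampling scheme, and number of chunks per sample are assumptions, not the winning system's actual settings.

```python
import random

def recombine_chunks(chunks, n_samples, chunks_per_sample=3, seed=0):
    """Create extra training samples by concatenating text chunks
    drawn from different contexts.

    Each output sample joins ``chunks_per_sample`` distinct chunks;
    repeating the draw many times multiplies the number of training
    samples without adding new words to the corpus.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        picked = rng.sample(chunks, k=min(chunks_per_sample, len(chunks)))
        samples.append(" ".join(picked))
    return samples
```

With 100M words split into chunks, drawing roughly 40 recombined samples per chunk would produce a training set of the scale described above, while every individual word still comes from the original 100M-word corpus.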

Figure 5: Effect of Training Strategy and Backbone Architecture: Each point represents a submission. Some submissions may appear more than once if they use multiple strategies. Shapes show the challenge track to which the model was submitted. Colors show the backbone architecture on which the model is based. Gray bars show within-category aggregates.

Figure 6: Effect of Backbone Architecture: Each point represents a submission. Shape indicates the challenge track. Gray bars show within-category aggregates.

Figure 7: Submission Results by GLUE subtask: Points show the performance of each submission. Gray bars show the across-submission average in each category.

Figure 8: Submission Results by BLiMP subtask: Points show the performance of each submission. Gray bars show the across-submission average in each category.

Table 1: The datasets we release for the Strict and Strict-Small tracks of the BabyLM Challenge. We report the number of words in the training set of each corpus that we include. 1 http://www.natcorp.ox.ac.uk 2 https:

Table 2: Top 3 systems for each track, as well as the baseline model with the highest aggregate score. We also show "skyline" models: RoBERTa-base and Llama 2 trained on their full pre-training corpora. Each task score is the mean score across that task's subtasks. The aggregate score is a weighted average of the task scores. We bold the highest-scoring system for each task within each track.

Table 3: Total number of models and participants per track. Participants who submitted to multiple tracks are counted once in the total.
iteratively updated the vocabulary of the LM based on word simplicity measures (motivated by human age-of-acquisition analyses).