Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings

This paper explores the use of latent bootstrapping, an alternative self-supervision technique, for pretraining language models. Unlike the typical practice of using self-supervision on discrete subwords, latent bootstrapping leverages contextualized embeddings for a richer supervision signal. We conduct experiments to assess how effective this approach is for acquiring linguistic knowledge from limited resources. Specifically, our experiments are based on the BabyLM shared task, which includes pretraining on two small curated corpora and an evaluation on four linguistic benchmarks.


Introduction
All modern language models are trained with a general self-supervised learning (SSL) paradigm (Radford et al., 2018; Devlin et al., 2019; Raffel et al., 2020). Recently, the field of visual representation learning has seen a growing usage of self-supervision on latent embeddings (Grill et al., 2020; Chen et al., 2020; Chen and He, 2020; Assran et al., 2023). While this type of self-supervision has been recently proposed as an integral part of a human-like machine intelligence system (LeCun, 2022), language models are still mostly self-supervised on hard targets, typically on subword tokens.
The concept of latent bootstrapping (Grill et al., 2020) offers a promising alternative, as the latent vectors provide a deep and semantically rich representation of the input. This, in turn, delivers a more valuable supervision signal compared to the conventional method of supervision on discrete subword indices. Data2vec (Baevski et al., 2022) showed that latent bootstrapping performs on par with traditional self-supervised language modeling when pretrained on a large text corpus. We argue that, intuitively, the rich training signal from contextualized embeddings should be particularly effective in low-resource data settings. In this paper, our aim is to test this hypothesis and identify possible drawbacks of the bootstrapping method. We base our experiments on the BabyLM challenge (Warstadt et al., 2023b), a shared task that uses two carefully curated, sample-efficient pretraining corpora, mimicking the English language exposure of young children. In addition, this challenge employs four benchmarks to evaluate different aspects of linguistic knowledge and understanding learned by language models.
We introduce BootBERT, a novel masked autoencoder language model (Meng et al., 2023) that harnesses latent bootstrapping (Grill et al., 2020) between a mean teacher (Tarvainen and Valpola, 2017) and its student. Through a positive feedback loop, the student and the teacher iteratively learn from each other, as illustrated in Figure 1. The student is trained to match the teacher's outputs while the mean teacher is defined as the exponential moving average of the student. Once pretraining is complete, only the student language model is used for evaluation and the teacher is discarded. We assess its performance on the BabyLM challenge, contrasting it with conventional language models. The source code and pretrained models are available online at https://github.com/ltgoslo/boot-bert.

Method
In this section, we outline our proposed model, BootBERT, delving into its neural architecture and the latent bootstrapping training objective. In order to allow for language-modeling-based evaluation, the bootstrapping objective operates alongside conventional masked language modeling. The diagram in Figure 2 illustrates the general idea of this approach.
Masked autoencoder architecture. BootBERT diverges slightly from the standard 'encoder-only' architecture often found in masked language models (Devlin et al., 2019). Instead, following the method of Meng et al. (2023), we employ a masked autoencoder (MAE; He et al., 2022) framework for the text domain. This approach distinguishes the encoding of contextualized embeddings from the decoding of masked subwords. These two functionalities are separated by dividing the model into an encoder and a decoder module, as illustrated in Figure 2 on the left. The encoder's role is to create a bidirectional contextualized embedding of input tokens. Unlike traditional masked language models, the encoder does not process any [MASK] tokens, thus eliminating the need to allocate parameters for representing them (Meng et al., 2023).
The [MASK] tokens are processed and denoised by the decoder module. The decoder is supplied with the full input: the unmasked tokens are represented by their contextualized embeddings (provided by the encoder) and the masked tokens are represented by a static [MASK] embedding. Note that the decoder in this type of model is bidirectional and purely self-attentive, differing from the original definition of a transformer decoder by Vaswani et al. (2017).
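The encoder/decoder split above can be sketched as a simple data-flow function. This is an illustrative sketch of the routing logic only, not the actual implementation; the `encode`, `decode` and `mask_embedding` arguments stand in for the real transformer modules and learned embedding.

```python
def mae_forward(tokens, mask, encode, decode, mask_embedding):
    """Masked-autoencoder flow: the encoder only sees the unmasked tokens;
    the decoder receives the full sequence, with the static [MASK] embedding
    substituted at every masked position."""
    visible = [t for t, m in zip(tokens, mask) if not m]
    encoded = encode(visible)  # one contextual embedding per visible token
    it = iter(encoded)
    full = [mask_embedding if m else next(it) for m in mask]
    return decode(full)  # the decoder denoises the masked positions
```

Because the encoder never allocates capacity to [MASK] positions, its sequence is shorter than the decoder's, which is where the MAE framework saves compute.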
Teacher-student feedback loop. Conceptually, the training process can be divided into optimization of a student model and optimization of a teacher model. Here, the masked student autoencoder model is trained to match the contextualized embeddings of the unmasked tokens, produced by the mean teacher network. In line with Tarvainen and Valpola (2017), the teacher parameters ϕ are not optimized via gradient descent, but rather through a slow exponential moving average (EMA) of the student parameters θ:

ϕ ← τ · ϕ + (1 − τ) · θ.

This moving average not only stabilizes the latent targets but also prevents representation collapse (Grill et al., 2020).
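The EMA update can be sketched as follows; the flat parameter lists are a simplification of real model parameter tensors, and `tau` is the EMA decay (a value close to 1 keeps the teacher slow-moving).

```python
def ema_update(teacher_params, student_params, tau):
    """Move the teacher towards the student: phi <- tau*phi + (1 - tau)*theta.
    The teacher receives no gradients; it is updated only through this average."""
    return [tau * phi + (1.0 - tau) * theta
            for phi, theta in zip(teacher_params, student_params)]
```

In practice this update is applied after every optimizer step on the student, so the teacher lags behind as a smoothed copy of the student's training trajectory.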
Loss. We optimize two objectives during training of the student model: a traditional masked language modeling objective with hard targets, symbolized by L_LM, and a latent bootstrapping objective using the teacher's latent targets, L_LB. The final loss function combines these objectives with a weighted sum:

L = L_LM + λ · L_LB,

where λ is a weighting hyperparameter. Here, L_LM is calculated simply as the negative log-likelihood of the true targets. Its purpose is twofold: allowing for an MLM-based evaluation (for example on BLiMP), and preventing the representation collapse of unconstrained latent bootstrapping (Grill et al., 2020).
The second objective is computed as a smooth L1 loss between student predictions y_s and the teacher's contextualized embeddings y_t:

L_LB = (y_s − y_t)² / (2β)   if |y_s − y_t| ≤ β,
L_LB = |y_s − y_t| − β/2     otherwise,

where β is a threshold hyperparameter. This works mostly like a standard mean-squared error but prevents exploding gradients from outliers (Girshick, 2015).

LTG-BERT transformer backbone. As for more low-level architectural and training choices, we adopt the approach of LTG-BERT by Samuel et al. (2023a). This method was optimized for low-resource masked language modeling on a corpus similar to the corpora provided in BabyLM. The key improvements of the LTG-BERT transformer architecture include the use of the NormFormer layer normalization (Shleifer and Ott, 2022), an alternative disentangled attention mechanism with relative positions (He et al., 2021) and a gated-linear activation function (GEGLU; Shazeer, 2020), as illustrated in Figure 3. On top of these architectural changes, the authors also employ masking of random subword spans (Joshi et al., 2020). More details about these choices can be found in Samuel et al. (2023a).
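The two objectives above can be sketched in a few lines of scalar Python; the `beta` and `lam` defaults are illustrative placeholders, not the values used in our experiments.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-like) loss, averaged over elements: quadratic for
    small errors, linear for large ones, so outliers cannot explode gradients."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def combined_loss(loss_lm, loss_lb, lam=1.0):
    """Weighted sum of the masked-LM loss and the latent bootstrapping loss."""
    return loss_lm + lam * loss_lb
```

Note the continuity at |y_s − y_t| = β: both branches evaluate to β/2, so the loss and its gradient transition smoothly between the quadratic and linear regimes.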

Experiments
The main goal of this paper is to evaluate how well language models trained with latent bootstrapping acquire language and whether it makes a viable training objective for language representation learning. We base the experiments on the BabyLM challenge (Warstadt et al., 2023b). First, we describe the pretraining process for the two BabyLM tracks and second, the evaluation of pretrained models using the BabyLM evaluation pipeline.

BabyLM challenge. This challenge provides a shared ground for experiments on small-scale language modeling. It consists of three tracks: STRICT, STRICT-SMALL and LOOSE. For the first two tracks, the submissions have to be pretrained solely on a fixed corpus provided by the organizers. This corpus contains about 100M words in the STRICT track and about 10M words in the STRICT-SMALL track. As for the LOOSE track, the submissions are still limited to pretraining on 100M words, but this data can come from any source and the models can utilize an unlimited amount of non-linguistic data in addition. As detailed in Section 3.2, the submissions are compared on a shared evaluation set consisting of syntactic and natural language understanding tasks.

Pretraining
The pretraining is done on corpora provided by the BabyLM challenge. These texts are curated specifically to be of the same type and quantity that children learn from. Thus, it allows us to assess (to some degree) whether latent bootstrapping is a more plausible cognitive model of human language acquisition (Warstadt et al., 2023b).
Training corpus. Specifically, we consider the STRICT and STRICT-SMALL tracks and pretrain the models on their respective 100-million-word and 10-million-word corpora. Both datasets contain child-directed speech, transcribed speech, children's books and Wikipedia, among other sources. The content of these datasets is detailed in Appendix B, together with our simple preprocessing pipeline, which unifies the typographical features of the BabyLM subcorpora.
Pretraining process. Generally speaking, we adopt the training recipe of LTG-BERT (Samuel et al., 2023a), which was optimized for pretraining on another low-resource 100-million-word English corpus. The pretraining process is the same for both tracks, except for using a smaller vocabulary and a smaller model for the STRICT-SMALL track.
As for the STRICT track, we use a BASE-size language model: 12 encoder layers and 4 decoder layers with a hidden size of 768 and with 12 attention heads. We train a case-sensitive WordPiece tokenizer (Wu et al., 2016) with a vocabulary size of 2^14 = 16 384, using solely texts from the STRICT corpus. As per Samuel et al. (2023a), we pretrain the models with 1/2 of the BERT training budget, as it has been shown to be sufficient for a relatively small 100-million-word corpus. The tokens are masked with continuous span masking (Joshi et al., 2020; Raffel et al., 2020). In particular, the masks are iteratively sampled until 15% of tokens are masked and the length of each span is sampled from the geometric distribution Geo(p), with p = 1/3.
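The span-masking procedure can be sketched as follows; this is a simplified illustration of the sampling loop (the function name and the absence of a maximum span length are our simplifications).

```python
import random

def sample_span_mask(n_tokens, mask_ratio=0.15, p=1 / 3, rng=None):
    """Iteratively sample masked spans until ~15% of token positions are
    covered. Each span length is drawn from Geo(p) over {1, 2, ...} and each
    span starts at a uniformly random position."""
    rng = rng or random.Random()
    masked = set()
    while len(masked) < mask_ratio * n_tokens:
        length = 1
        while rng.random() > p:  # geometric: stop with probability p each step
            length += 1
        start = rng.randrange(n_tokens)
        masked.update(range(start, min(start + length, n_tokens)))
    return sorted(masked)
```

With p = 1/3, the expected span length is 1/p = 3 subwords, so masked regions typically cover short multi-token units rather than isolated subwords.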
The STRICT-SMALL track is tackled by a SMALL-size language model: 12 encoder layers and 4 decoder layers with a hidden size of 384 and with 6 attention heads. The subword vocabulary is reduced to 2^12 = 4 096 items. The full list of hyperparameters and implementation details are provided in Appendix C and in the released source code.

Evaluation
We utilize the language modeling benchmark suite from the BabyLM challenge (Gao et al., 2021; Warstadt et al., 2023b), which relies on three conceptually different evaluation tasks:

1. The GLUE and SuperGLUE datasets test the ability of a pretrained model to adapt to various language understanding tasks.
2. The BLiMP and BLiMP supplement tasks test the affinity of a model towards grammatical sentences in a completely zero-shot manner.
3. MSGS measures how much a pretrained model prefers linguistic generalizations (over surface ones) during finetuning.
We further elaborate on each of these evaluation suites below.
(Super)GLUE benchmark. The General Language Understanding Evaluation benchmarks (GLUE and SuperGLUE; Wang et al., 2018, 2019) are arguably the most common ways of evaluating the language-understanding and transfer-learning capabilities of language models. The BabyLM challenge uses a subset of 10 (Super)GLUE tasks, detailed in Appendix F. We employ the standard way of finetuning masked language models on these datasets, as introduced in BERT (Devlin et al., 2019). More details about the finetuning processes are given in Appendix C.
As we use the BabyLM version of GLUE, our results cannot be directly compared with previous literature: the dataset samples are filtered to not contain out-of-vocabulary words and some of the employed metrics differ from the original recommendations (Wang et al., 2018, 2019). We opted to adhere to the BabyLM version to be compatible with other works in this challenge. However, in order to reliably compare our models, we decided to depart from BabyLM and to divide the training set in a 90:10 ratio into new training and development splits; the former validation set is then used as a held-out split.

BLiMP. When using any finetuning approach, it is unclear how to disentangle the innate language understanding from the knowledge learned during the second-stage supervised finetuning (Belinkov, 2022). In contrast, the Benchmark of Linguistic Minimal Pairs (BLiMP; Warstadt et al., 2020a) attempts to measure the linguistic knowledge of a language model in a zero-shot manner, without any additional training. Each pair of sentences in BLiMP differs minimally on the surface level, but only one of the sentences is grammatically valid. We can use the intrinsic ability of language models to assign a probability to every sentence and test how often a language model assigns a higher probability to the correct sentence (Wang and Cho, 2019; Salazar et al., 2020).
As detailed in Appendix A, the results on BLiMP greatly depend on temperature scaling (Guo et al., 2017a). Thus, to fairly compare different types of language models, we employ an alternative approach to evaluating BLiMP: we report the accuracies that are achieved with the optimal temperature for every language model; the reasoning is explained in Appendix A.
The BabyLM challenge also comes with an additional 'BLiMP supplement' held-out set with five additional diagnostic tasks. To comply with the held-out spirit of these tasks, we keep the temperature values calibrated for BLiMP, even though this results in suboptimal performance (Appendix A).

MSGS. The diagnostic dataset called Mixed Signals Generalization Set (MSGS; Warstadt et al., 2020b) measures whether a pretrained model prefers linguistic or surface generalizations. The experiments follow the poverty-of-the-stimulus design (Wilson, 2006): first finetune a model on ambiguous data (consistent with both linguistic and surface explanations) and then test it on non-ambiguous data to see if it prefers the linguistic generalization.

We use the filtered MSGS datasets with no inoculation in the training set, as provided by the BabyLM challenge. Similarly to (Super)GLUE, we avoid the BabyLM approach that validates and tests on the same split; instead, to obtain a reliable comparison, we roughly follow the original work (Warstadt et al., 2020b) and use three learning rates (1·10^−5, 2·10^−5 and 3·10^−5), five random seeds, a batch size of 16 and finetune for 5 epochs without early stopping; then we report the mean and standard deviation statistics on the 6 non-ambiguous and non-control test datasets, measuring the Matthews correlation coefficient (Matthews, 1975), which is renamed to the Linguistic Bias Score (LBS) in MSGS.
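The LBS is a Matthews correlation coefficient computed over binary predictions on the disambiguated test items; a minimal sketch for binary labels follows (the function name is ours).

```python
import math

def matthews_cc(preds, golds):
    """Matthews correlation coefficient for 0/1 labels, the basis of the MSGS
    Linguistic Bias Score: +1 means a fully linguistic generalization,
    -1 a fully surface one, 0 no association."""
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, golds))
    tn = sum(p == 0 and g == 0 for p, g in zip(preds, golds))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC is symmetric in the two classes and centered at zero, which is what makes the sign of the score interpretable as a linguistic-versus-surface preference.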

Results
The overall averaged results for all four evaluation suites are given in Table 1. Apart from evaluating masked autoencoders trained with latent bootstrapping (BootBERTs), as described in Section 2, we evaluate the three baseline language models provided by the organizers of the BabyLM challenge: the decoder-only OPT (Zhang et al., 2022), the encoder-decoder T5 (Raffel et al., 2020) and the encoder-only RoBERTa language models (Liu et al., 2019). As we base our models on the LTG-BERT architecture (Samuel et al., 2023a), we follow the recommendations of the authors and also pretrain LTG-BERTs to get a strong and comparable baseline.

Discussion
LTG-BERT performance. Our results confirm the findings by Samuel et al. (2023a), who introduced the improved language modeling architecture called LTG-BERT. These models perform drastically better than the OPT, RoBERTa and T5 baselines pretrained on the same low-resource BabyLM corpus; the performance is improved across all evaluation suites (GLUE, MSGS, BLiMP as well as the BLiMP supplement) and across both STRICT and STRICT-SMALL tracks. LTG-BERT has also been used as the backbone of recent Norwegian language models trained on large amounts of data (Samuel et al., 2023b), which demonstrates that LM methods developed for efficient training are also beneficial for large-scale training.
Self-supervised learning. When we compare BootBERT to the LTG-BERT baseline, we can see that the latent bootstrapping approach leads to substantially better performance when finetuned on (Super)GLUE in the STRICT track and to slightly better performance in the STRICT-SMALL track. Specifically, on the biggest and arguably most robust GLUE task, MNLI, the accuracy is better by 1.3/1.6 percentage points in the STRICT track and by 1.2/1.2pp in the STRICT-SMALL track. The overall average (Super)GLUE score is better by 1.4pp and by 0.4pp, respectively. This shows that language models pretrained with this approach are a good option for downstream tasks.

The ability of linguistic generalization, as measured by the linguistic bias scores in MSGS, is substantially worse in BootBERT than in the LTG-BERT baseline, as evident from Figure 4. A more detailed analysis in Appendix D reveals that this holds for both BabyLM tracks, but the difference is mainly due to the fact that LTG-BERT reliably prefers the linguistic feature 'is the main verb in "ing" form?'; the other tests are relatively similar for both types of models. It is unclear what part of latent bootstrapping causes this difference.
The results on the BLiMP-based benchmarks are mixed but overall worse when comparing BootBERT with the LTG-BERT baseline. This is possibly because of the utilization of two conflicting training objectives in BootBERT; intuitively, pure language-modeling-based training should have an advantage on benchmarks that rely on sentence likelihood.
In conclusion, these low-resource experiments suggest that the advantage of latent bootstrapping for natural language is not as great as the advantage that has been previously demonstrated for computer vision. We believe that this is because the atomic units of text, subword tokens, provide a much more semantically rich signal than the atomic units of images, pixels. Thus, there is not a large need for bootstrapping a rich signal from a teacher; instead, standard language modeling comes with a training objective that is simple and provides enough signal, without suffering from issues like representation collapse.
The shared task results. The official DynaBench results for BabyLM can be found in Appendix E. Our system ranks high when evaluated on GLUE (first and second place) and on BLiMP (second and first place) in the STRICT and STRICT-SMALL tracks, respectively. As discussed earlier, BootBERT strongly prefers the surface features over the linguistic features and thus places low on the MSGS benchmark (third and last place), which also hurts the overall ranking of our system (third and seventh). Note, however, that this evaluation does not use a proper train/development/test split and does not account for the high variation of some metrics (MSGS in particular), which is why we have used an alternative evaluation in the rest of this paper.
Computational cost of latent bootstrapping. It is important to note that latent bootstrapping comes with an increased computational cost because of an additional forward pass through the mean teacher, which roughly equates to a 50% increase in pretraining time. Thus, it should be carefully considered whether the potential benefits of bootstrapping are worth this cost. That being said, this method does not bear any additional cost during finetuning or inference, which might justify it in some cases.

Related work
Self-supervised learning. Our work is greatly inspired by the 'bootstrap your own latent' approach (BYOL; Grill et al., 2020), which introduced the bootstrapping feedback loop between a student and a mean teacher network. BYOL by itself can be considered an example of contrastive learning (Hjelm et al., 2019; van den Oord et al., 2019; Chen et al., 2020; He et al., 2020) without negative instances. Another important aspect of BYOL is the usage of a 'mean teacher', a slow-moving average of a student network, a term coined by Tarvainen and Valpola (2017).
Many methods of visual representation learning adopted the bootstrapping approach and further improved its parts (Chen and He, 2021; Zbontar et al., 2021; Bardes et al., 2022; He et al., 2022). In particular, our work bears similarities with the recently introduced 'image-based joint-embedding predictive architecture' (I-JEPA; Assran et al., 2023), which also trains a masked autoencoder student network to predict the contextualized embeddings of an unmasked mean teacher. While mostly used for the image domain, the data2vec method showed that latent bootstrapping can also be successfully applied to text (Baevski et al., 2022).

MSGS: Linguistic Bias Scores (LBS)
Figure 4: The Linguistic Bias Scores (LBS) of language models pretrained on the STRICT dataset. These plots show the distribution of the LBS scores across 15 evaluation runs (3 learning rates × 5 random seeds) for each of the 6 non-ambiguous test datasets (90 values in total for each model). The red horizontal lines highlight the first, second (median) and third quartiles. The overall negative scores show that none of the tested models prefers linguistic features over the surface ones.
Efficient language modeling. The necessity of pretraining modern language models on large corpora was questioned in CamemBERT (Martin et al., 2020) and the effect of corpus size has since been thoroughly studied in Micheli et al. (2020), Zhang et al. (2021) as well as in Hoffmann et al. (2022). Samuel et al. (2023a) introduced LTG-BERT, an improved language model optimized for pretraining on a low-resource corpus. They showed that a well-tuned language model can match the performance of BERT even when it is pretrained only on the small 100-million-word British National Corpus (BNC). We base our approach on this model due to the apparent similarity of the BabyLM training corpus to the BNC.

Conclusion
In this paper, we presented a masked autoencoder language model trained with latent bootstrapping, an alternative self-supervised learning method. We showed that when pretrained on a low-resource corpus, the results of this method are varied: compared to a masked language modeling baseline, the performance is clearly better on (Super)GLUE, but worse on MSGS and mixed on BLiMP. We believe that it makes a promising alternative to traditional language modeling methods, but its reliable and effective utilization requires future work.

A The Effect of temperature scaling on BLiMP
Our preliminary experiments with calibrating language models via temperature scaling (Guo et al., 2017b) revealed that the BLiMP scores are hugely dependent on the scalar temperature parameter when these are calculated with the standard method of Salazar et al. (2020). This single temperature value can increase the accuracy on some BLiMP subtasks by more than 10% (Figure 6), which challenges the usage of BLiMP as an appropriate evaluation tool. It is especially problematic when comparing different types of language models (Figure 5a) and different sizes of language models (Figure 5b,c).
Background. To better understand this problem, this section describes how the BLiMP scores are traditionally computed for masked language models. These models can estimate P(s_t | s_\t), the likelihood of a token s_t given its bidirectional context s_\t = (s_i | i ≠ t). This probability distribution P is given by a softmax transformation of the output logits z, where τ is the temperature:

P(s_t = w | s_\t) = exp(z_w / τ) / Σ_w' exp(z_w' / τ).
A large temperature yields a more even distribution and a low temperature gives a more 'peaky' distribution. Salazar et al. (2020) proposed to use these probability estimates (with τ = 1) to infer a score for each BLiMP sentence, with a higher score corresponding to a more likely sentence. Then, the BLiMP accuracy measures how many times the score of a grammatically correct sentence is greater than the score of an incorrect sentence. Specifically, we use the pseudo-log-likelihood score (PLL) by Wang and Cho (2019). The PLL score of a sentence s is defined as:

PLL(s) = Σ_t log P(s_t | s_\t).

Proposed solution. BLiMP should measure the linguistic knowledge of language models and we believe that this metric should be independent of the prediction confidence of these models. Formally speaking, the BLiMP score should be invariant to temperature scaling. Therefore, we propose to use the maximal average accuracy across all possible temperature values, instead of simply using the average accuracy at temperature equal to 1. As apparent from Figure 5b, such a formulation can better reflect the difference in linguistic knowledge between BERT-base and BERT-large. There, the accuracy measured at temperature 1 is at odds with other measures that show substantially better linguistic knowledge of BERT-large (Devlin et al., 2019; Tenney et al., 2019; Ettinger, 2020). Note that our approach bears only a negligible compute cost because the temperature modification is done ex post, i.e., it does not require any additional passes through the language model.
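The temperature-scaled PLL computation can be sketched as follows, assuming one list of vocabulary logits per masked position (the helper names are ours, not from the released code).

```python
import math

def pll(token_logits, token_ids, tau=1.0):
    """Pseudo-log-likelihood: sum over positions t of log P(s_t | context),
    where each P is a temperature-scaled softmax over the vocabulary logits."""
    score = 0.0
    for logits, tok in zip(token_logits, token_ids):
        scaled = [z / tau for z in logits]
        m = max(scaled)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        score += scaled[tok] - log_norm
    return score

def blimp_pair_correct(good, bad, tau):
    """A minimal pair is scored correctly if the grammatical sentence
    receives the higher PLL at the given temperature."""
    return pll(*good, tau) > pll(*bad, tau)
```

Because the logits are computed once and only rescaled, sweeping τ over a grid and keeping the maximal average accuracy requires no additional model forward passes.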
Using one temperature for all subtasks does not account for the severe differences between the accuracy scores on these tasks (Figure 6), but it is a simple solution that also allows us to evaluate models on a held-out set, such as the BLiMP supplement. We believe that a scoring function that is (i) unified, (ii) invariant to temperature and (iii) fair to all subtasks is an interesting direction for future work.

B Data preprocessing
The pretraining datasets for the STRICT and STRICT-SMALL tracks are a mix of 10 different corpora, as shown in Table 3. We applied light preprocessing and normalization to these corpora in order to cast them into a unified format. In particular, we applied these modifications:

• CHILDES: We capitalize the first letter of each line, normalize punctuation with whitespaces (essentially detokenization) and put every line between double quotes (as directed speech).
• British National Corpus: Capitalization, normalization and double quotes.
• Children's Book Test: This corpus contains some remnants of the Penn Treebank format where, for example, -LRB- and -RRB- tokens are used instead of '(' and ')'. We normalize all unnatural symbols and whitespaces.
• Children's Stories Text Corpus: We try to conserve the formatting with a special [TAB] symbol and apply whitespace normalization.
• Standardized Project Gutenberg Corpus: The text file is aligned into blocks by inserting a newline symbol after at most 70 characters, which ruins the sentence structure. We restore the original paragraphs by removing these additional newline symbols and apply whitespace normalization.
• OpenSubtitles: Some lines arbitrarily start with a dash symbol, which we remove. Then whitespace normalization is applied and every line is cast as directed speech with double quotes.
• QED: This corpus contains some incorrectly parsed HTML symbols, which we tried to clean up with some simple heuristics. The whitespace normalization is applied and every line is cast as directed speech with double quotes.
• Wikipedia: This dataset also needed to be cleaned of incorrectly parsed Wikipedia tags and hyperlinks. Whitespace normalization is applied.
• Simple Wikipedia: Heuristic HTML clean-up and whitespace normalization.
• Switchboard: The same as OpenSubtitles: removed leading dashes, whitespace normalization and added double quotes.
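Two of the recurring steps above (restoring hard-wrapped paragraphs and normalizing dialogue lines) can be sketched as follows; this is a simplified illustration with our own function names, not the released preprocessing scripts.

```python
import re

def unwrap_hard_lines(text):
    """Undo hard wrapping (e.g., Gutenberg's ~70-character lines): blank lines
    delimit paragraphs; single newlines inside a paragraph become spaces."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

def as_directed_speech(line):
    """Dialogue normalization shared by several subcorpora: drop a leading
    dash, collapse whitespace and wrap the line in double quotes."""
    line = re.sub(r"^\s*-\s*", "", line)
    return '"' + " ".join(line.split()) + '"'
```

Splitting on blank lines before collapsing newlines is what preserves the paragraph boundaries while removing the artificial line breaks inside them.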
Note that the preprocessed corpora and the preprocessing scripts are released alongside the training scripts.

C Hyperparameters

The implementation of latent bootstrapping mainly follows I-JEPA (Assran et al., 2023). We also adopt their usage of a linearly increasing schedule for the EMA decay hyperparameter τ and a cosine schedule for weight decay.
The hyperparameters for pretraining are given in Table 5. Table 6 shows the finetuning hyperparameters.

D Fine-grained MSGS scores
This section shows the full score distribution over all MSGS subtasks, including the control subtasks. This gives a better view of the behavior of different language models than the aggregated scores in Figure 4 and Table 1.

Figure 1 :
Figure 1: The self-supervision feedback loop of latent bootstrapping: a student model improves by aligning with its teacher's latent outputs and the teacher improves by maintaining the exponential moving average of the student.

Figure 2 :
Figure 2: A detailed overview of the self-supervised feedback loop. The left side illustrates the student language model, a masked autoencoder network, that targets two training objectives: 1) conventional masked language modeling, aiming to predict the masked token (e.g., the word 'world'), and 2) aligning the contextualized embedding of the masked tokens to their unmasked counterparts. The embeddings for the unmasked tokens are produced by a mean teacher network (on the right), computed as an exponential moving average of the student parameters.

Figure 3 :
Figure 3: We base our model on LTG-BERT. This simplified diagram shows one layer from that transformer architecture; it illustrates the self-attention module (bottom) and the feed-forward module (top). Both modules utilize a modified NormFormer-like layer normalization placement and the feed-forward module contains a gated-linear activation function.

Figure 5 :
Figure 5: These plots show the BLiMP 'confidence profiles' of several language models, i.e., the influence of temperature scaling on the average BLiMP accuracy. (a) Models trained with different training objectives show different confidence profiles; judging their linguistic knowledge from BLiMP accuracy can be misleading. Here, we compare the three baselines from the BabyLM challenge trained on the STRICT track. (b) The linguistic knowledge of BERT-base and BERT-large appears comparable when judging from performance at temperature 1, but the potential of the larger model is much greater. (c) We train four sizes of LTG-BERT on the STRICT track and plot their confidence profiles. Larger models tend to be more confident and, therefore, measuring them at temperature 1 is more misleading.

Figure 6 :
Figure 6: The confidence profile of our proposed BootBERT-base model pretrained on the STRICT track. Apart from the average BLiMP accuracy (in red) and the average BLiMP supplement accuracy (in blue), this plot shows fine-grained BLiMP accuracies on all subtasks.

Figure 7 :
Figure 7: The MSGS linguistic bias scores of the control tasks (in blue) and non-control disambiguated tasks (in red). Values close to 1 indicate a preference for linguistic explanations (columns) while values close to −1 indicate a preference for surface explanations.

Table 3 :
The pretraining corpora for the STRICT and STRICT-SMALL tracks; the table is taken from Warstadt et al. (2023b).

In order to reduce training time, pretraining is parallelized over multiple GPUs with a global batch size of 4 096. The number of GPUs used depends on the size of the pretrained language models, ranging from 32 to 128 AMD Instinct MI250X GPUs, each with 64GB memory. The number of training steps is 62 500, reducing the training budget of the original BERT model by 50%. Unlike the BERT and LTG-BERT training recipes, we use the same sequence length, 256, throughout the whole training. This decision is necessary for keeping a reasonable exponential moving average of the parameters (it could be corrupted when switching to a longer sequence length in the middle of training).