Not all layers are equally as important: Every Layer Counts BERT

This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the STRICT and STRICT-SMALL tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally as important.


Introduction
Modern large language models (LLMs), with their deep architectures and large parameter counts, have displayed outstanding performance on a wide range of tasks. Their ability to understand, generate, and manipulate human language has been groundbreaking (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, this success largely relies on the vast amounts of unsupervised data that these models need for pretraining, requiring extensive computational power and time. While this is feasible for high-resource languages like English, it becomes a bottleneck for languages with limited data resources (Joshi et al., 2020). Moreover, the environmental and economic costs of such massive training regimens are growing concerns (Strubell et al., 2019; Thompson et al., 2020).
The BabyLM challenge tries to address these concerns by providing a shared experimental ground for efficient language modelling (Warstadt et al., 2023). All models submitted to this shared task have to be trained on a restricted text corpus of 10M or 100M words, in the STRICT-SMALL and STRICT tracks, respectively. The challenge pushes the boundaries of what is possible with data-efficient language model pretraining.
In response to this challenge, we present a novel modification to the well-established transformer architecture (Vaswani et al., 2017). Instead of traditional residual connections, our model allows each layer to selectively process outputs from the preceding layers. This flexibility leads to an intriguing finding: not every layer is of equal significance to the following layers. Thus, we call it the 'Every Layer Counts' BERT (ELC-BERT).

Table 1: The Dynabench scores of the BabyLM challenge (Warstadt et al., 2023); the table shows the top 5 submissions in the STRICT-SMALL and STRICT tracks. Higher scores are better; the best results in each evaluation suite are boldfaced.
The BabyLM challenge provided us with a robust benchmark to evaluate the efficacy of ELC-BERT. Our approach emerged as the winning submission in both the STRICT and STRICT-SMALL tracks (Table 1), which highlights the potential of layer weighting for future low-resource language modelling.
Transparent and open-source language modelling is necessary for the safe future development of this field. We release the full source code, together with the pre-trained ELC-BERT models, online.1

Related work
Residual and highway networks. While the predecessor of residual models, highway networks, used a conditional gating mechanism to weigh layers (Srivastava et al., 2015), modern residual networks (including transformers) simply weigh all layers equally (He et al., 2016; Vaswani et al., 2017). Our work reintroduces layer weights into residual models, but without the computational cost of a gating mechanism.
Layer importance. The differences between the various layers inside pre-trained language models have been extensively studied (Jawahar et al., 2019; Tenney et al., 2019; Niu et al., 2022). Different layers process different linguistic phenomena, and thus their importance for downstream tasks varies; this has been successfully utilized by learning layer weights during finetuning, for example in ULMFiT (Howard and Ruder, 2018) or UDify (Kondratyuk and Straka, 2019). Following this direction, our system uses layer weights in the finetuning as well as in the pretraining phase.
ReZero transformer. An approach related to ours was proposed by Bachlechner et al. (2021). In that paper, the authors experimented with scaling the output of each layer. They showed that by initializing the scaling parameter to zero, their 'ReZero transformer' tends towards setting the scale to 1/N (where N is the number of layers). Our approach can be considered a generalization of this method: in ELC-BERT, every layer weights the outputs of previous layers individually.
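To make the relationship concrete, here is a minimal, self-contained sketch of a ReZero-style residual step (our own toy code, not the original implementation; `rezero_block` and its toy linear map are illustrative names):

```python
def rezero_block(x, weights, scale):
    """Minimal ReZero-style residual step: out = x + scale * f(x),
    where f is a toy linear map defined by `weights`."""
    fx = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return [xi + scale * fi for xi, fi in zip(x, fx)]

# ReZero initializes the per-layer scale to zero, so every block starts
# as the identity function and learns how much of f(x) to mix back in.
x = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0], [0.4, 0.4, 0.4]]
out = rezero_block(x, weights, scale=0.0)
assert out == x  # identity at initialization
```

Since the scale is a single scalar per layer, ReZero can only decide *how much* of the residual stream each layer contributes overall; ELC-BERT instead learns one weight per predecessor layer.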

ELC-BERT layer weighting
We modify the residual connections inside the transformer architecture so that every layer can select which outputs from previous layers it wants to process, instead of always taking a simple sum of all preceding layers, as done in the original transformer (Vaswani et al., 2017) and in most works that use a variant of this architecture. This modification allows the model to form a complex inter-layer structure, as shown in Figure 1.
Transformer definition. To be more specific, we first formally define a transformer encoder as a function that maps subword indices x onto subword probabilities y. First, x is embedded into a vector representation h_out^0, which is then processed by N layers consisting of attention and multi-layer-perceptron (MLP) modules. Finally, y is produced by processing the final hidden representation with a language-modelling head. Formally, for n ∈ {1, . . . , N}:

    h_out^0 ← embedding(x),                        (1)
    h_out^n ← mlp(attention(h_in^n) + h_in^n),     (2)
    y ← lm_head(h_out^N).                          (3)

The original residual connection. The original transformer definition by Vaswani et al. (2017) can be recovered by simply assigning

    h_in^n ← h_in^{n-1} + h_out^{n-1},  with h_in^1 ← h_out^0.   (4)

This recurrent assignment can also be rewritten as h_in^n ← Σ_{i=0}^{n-1} h_out^i, which highlights the implicit assumption of residual models that the output from every previous layer is equally important.
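The equivalence between the recurrent assignment and the sum form can be checked with a small numeric sketch (our own illustration, using toy scalar layer outputs):

```python
# Toy layer outputs h_out^0 .. h_out^4 (scalars for simplicity;
# layer 0 stands for the embedding).
outs = [1.0, 2.0, 4.0, 8.0, 16.0]
N = len(outs) - 1

# Recurrent form: h_in^n = h_in^{n-1} + h_out^{n-1}, with h_in^1 = h_out^0.
recurrent = []
h_in = outs[0]
recurrent.append(h_in)
for n in range(2, N + 1):
    h_in = h_in + outs[n - 1]
    recurrent.append(h_in)

# Sum form: h_in^n = sum_{i=0}^{n-1} h_out^i.
summed = [sum(outs[:n]) for n in range(1, N + 1)]

assert recurrent == summed  # both give [1.0, 3.0, 7.0, 15.0]
```

Both forms weigh every previous output identically, which is exactly the assumption ELC-BERT relaxes.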
Layer weighting. In our formulation, we make two changes to the original definition: (i) the residual connections in all MLP modules are removed, and (ii) the input to every layer is a convex combination of outputs from previous layers. Specifically, we replace Equation (2) and Equation (4) by:

    h_out^n ← mlp(attention(h_in^n)),              (5)
    h_in^n ← Σ_{i=0}^{n-1} α_{i,n} h_out^i,        (6)

where Σ_{i=0}^{n-1} α_{i,n} = 1. This constraint is satisfied by a softmax transformation of the raw learnable layer weights α'_{*,n} ∈ R^n into α_{*,n}. α'_{*,n} is initialized as a zero vector, except for α'_{n-1,n}, which is set to one to bias the weights towards the input from the previous layer.
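A minimal sketch of this weighting scheme (our own illustrative code, not the released ELC-BERT implementation; all function names are ours):

```python
import math

def softmax(raw):
    exps = [math.exp(r) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def init_raw_weights(n):
    """Zeros except the entry for layer n-1, biasing the mix
    towards the immediately preceding layer's output."""
    raw = [0.0] * n
    raw[n - 1] = 1.0
    return raw

def layer_input(raw, prev_outputs):
    """h_in^n = sum_i alpha_{i,n} * h_out^i, a convex combination
    of all previous outputs (layer 0 is the embedding)."""
    alpha = softmax(raw)
    dim = len(prev_outputs[0])
    return [sum(a * h[d] for a, h in zip(alpha, prev_outputs)) for d in range(dim)]

# Three previous outputs h_out^0, h_out^1, h_out^2 (2-dimensional toys).
prev = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
raw = init_raw_weights(3)
alpha = softmax(raw)

assert abs(sum(alpha) - 1.0) < 1e-12  # convex combination
assert alpha[2] == max(alpha)         # biased towards layer n-1 at init
h_in = layer_input(raw, prev)
```

The softmax guarantees the convexity constraint for free, and the initialization makes the model start close to a plain previous-layer connection before training moves the weights.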

Training
LTG-BERT backbone. We base our models on LTG-BERT (Samuel et al., 2023). This model has been specifically optimized for pretraining on small text corpora, similar to the one provided by BabyLM. We adopt all of their architectural modifications, their language modelling objective, as well as all other pretraining settings. We also use the raw LTG-BERT (without our layer weighting) as a strong baseline in the following evaluation. Details on the pretraining hyperparameters can be found in Table 4.
BabyLM pretraining corpus. We pretrain all language models on the corpus from the BabyLM challenge (Warstadt et al., 2023). The goal of this challenge is to shed more light on data-efficient language modelling and on the question of human language acquisition. Thus, the organizers have constructed a small-scale text corpus of the same type and quantity that children learn from.
Specifically, the shared task consists of three tracks: STRICT, STRICT-SMALL and LOOSE. We participate in the first two tracks, where the submissions have to be pre-trained only on the BabyLM corpus, which contains about 100M words in the STRICT track and about 10M words in the STRICT-SMALL track. We adopt the preprocessing pipeline from Samuel (2023) to unify the format of texts from this corpus.

Table 2: Results for the BabyLM challenge suite of evaluation datasets: BLiMP, the supplemental dataset to BLiMP, MSGS and (Super)GLUE. We compare the results of our submitted model (ELC-BERT_biased) to the backbone model (LTG-BERT_base) and the baselines given by the organizers of the challenge on the STRICT dataset. On the STRICT-SMALL dataset, we compare a small-sized variation (ELC-BERT_zero) to the backbone model and the baselines.

Results
This section provides the results of the empirical evaluation of ELC-BERT. First, we compare our method to the baselines; then, we perform an ablation study of different ELC-BERT variations; and finally, we take a deeper look into the learnt layer weights.

BabyLM challenge evaluation
We adopt the BabyLM evaluation pipeline for all comparisons.2 The pipeline itself is an adaptation of Gao et al. (2021), and it aims to provide a robust evaluation of syntactic and general language understanding. The syntactic understanding is measured by the Benchmark of Linguistic Minimal Pairs (BLiMP & BLiMP supplemental; Warstadt et al., 2020a) and the Mixed Signals Generalization Set (MSGS; Warstadt et al., 2020b). The general natural language understanding is measured by GLUE and SuperGLUE (Wang et al., 2018, 2019). All of these benchmarks use filtered subsets of the original datasets (provided by the organizers), which means that they are not directly comparable to previous literature. If applicable, we divide the training set into a train-development split and report the mean/std statistics over multiple runs on the former validation split.
BLiMP. This benchmark tests the zero-shot preference of grammatical sentences. From the STRICT results in Table 2, we see that ELC-BERT outperforms the baseline models by a fair margin on this task. However, if we look at the LTG-BERT baseline, we see that our model slightly underperforms it (by 0.5 percentage points). Table 7 provides a more in-depth comparison of the models.
If we now look at the supplemental scores, we see a very similar trend to the BLiMP results: our model outperforms the baseline RoBERTa model by 24.4 p.p. while slightly underperforming the LTG-BERT model, by 0.2 p.p. Table 8 shows a breakdown of the aggregated scores.

GLUE.
A standard LM benchmark that tests the ability to be finetuned for general language understanding tasks. Focusing on the results in Table 2, we see that our model outperforms both the encoder baseline and the LTG-BERT model in the STRICT and STRICT-SMALL tracks. The improvement over LTG-BERT is rather modest and could be caused by random variation. If we look at Table 9, we see that the variation is greatly affected by the WSC task; ignoring it, we get a score of 80.49 ±1.44 for our model and 79.52 ±1.13 for LTG-BERT.
MSGS. Finally, this benchmark evaluates the preference towards linguistic explanations over spurious surface explanations. For the aggregated STRICT MSGS results of Table 2, the comparison appears unclear due to the large standard deviation. However, a closer inspection reveals that ELC-BERT significantly outperforms LTG-BERT, by 0.16 LBS points.3 Figure 2 and Table 10 show a detailed view of the score distribution.
Shared task results. The official Dynabench results for the top-5 models in the STRICT and STRICT-SMALL tracks can be found in Table 1. Looking first at the STRICT track results, we see that our model achieves the highest total score and BLiMP score, while we are second for GLUE and MSGS. On the STRICT-SMALL track, our model performs best on all benchmarks, by a substantial margin.

Model variations
We compare the following modifications of the ELC-BERT architecture from Section 3:

1. Zero initialization: The layer weights are all initialized as zeros, without any bias towards the previous layer. This model also uses the residual MLP input from Equation (2). This variation is used in the STRICT-SMALL track.

2. Strict normalization: This follows the previous variant, with every h_out^i normalized to a unit vector.

3. Weighted output: Follows the first variant, and the input to the LM head is a weighted sum of all layers. To be more concrete, we replace Equation (3) by y ← lm_head(Σ_{i=0}^{N} α_{i,N+1} h_out^i).

Evaluation. Based on Table 3, we see that the different variations have varying effects on the evaluation scores.
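The strict-normalization and weighted-output variants can be sketched as follows (again our own illustrative code with hypothetical names, not the released implementation):

```python
import math

def softmax(raw):
    exps = [math.exp(r) for r in raw]
    s = sum(exps)
    return [e / s for e in exps]

def unit(v):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def normalized_mix(raw, prev_outputs):
    """Strict normalization: mix unit-normalized previous outputs."""
    alpha = softmax(raw)
    units = [unit(h) for h in prev_outputs]
    dim = len(prev_outputs[0])
    return [sum(a * u[d] for a, u in zip(alpha, units)) for d in range(dim)]

def weighted_head_input(raw, all_outputs):
    """Weighted output: the LM head reads a learned weighted sum of
    all layer outputs h_out^0 .. h_out^N, not only the last one."""
    beta = softmax(raw)
    dim = len(all_outputs[0])
    return [sum(b * h[d] for b, h in zip(beta, all_outputs)) for d in range(dim)]

outs = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
mixed = normalized_mix([0.0, 0.0, 0.0], outs)        # uniform mix of unit vectors
head_in = weighted_head_input([0.0, 0.0, 0.0], outs)  # uniform mix of raw outputs
assert abs(head_in[1] - 2.0) < 1e-9                  # (4 + 2 + 0) / 3
```

Normalization prevents a single large-magnitude layer output from dominating the mix; the weighted output lets the LM head draw on intermediate representations directly.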
When changing the α initialization to zero, we see a significant increase in performance on both the BLiMP Supplemental and the GLUE benchmarks.4 However, the model suffers in performance on both BLiMP and MSGS.5 Overall, we see that this variation leads to better zero-shot and fine-tuning results, while biasing the model more towards spurious surface features rather than linguistic features, as can be seen in Figure 3.
If we then focus on the normalization variation, we see that it underperforms on all benchmarks but one, MSGS, where it performs significantly better, by 0.13 LBS points,6 as can be seen in more detail in Figure 3.
Finally, when looking at our weighted output variation, we see a substantial gain in performance on the BLiMP benchmark, while the results on MSGS and GLUE are similar and the results on Supplemental BLiMP slightly decrease. More detailed results on all these benchmarks can be found in Appendix D.

Layer importance
The empirical evaluation suggests that learnable layer weights are a simple but effective architectural change, but what do these learnt weights look like? In this section, we investigate the α values of the normalized ELC-BERT variant. Looking at the importance matrix of ELC-BERT in Figure 1, we posit that the first 5 layers focus on surface-level information found in the embedding layer, which explains the embedding layer's enhanced importance for them. The next 5 layers (6-10) focus on more linguistic features: they virtually ignore the first 4 layers (0-3) and attend primarily to the previous three layers, as well as to layers 4 and 5, to get some transformed information from the embedding layer. Layer 11 does much the same but focuses more on layer 4, potentially trying to obtain some surface knowledge found in it. Finally, layer 12 behaves similarly to layer 11 but also puts high importance (3rd most) on the embedding layer, most likely to recuperate some surface information lost in previous layers before passing it to the language-modelling head.

Conclusion
In this paper, we proposed a novel and simple modification of the transformer architecture for language modelling. We empirically tested the efficacy of our approach by participating in the BabyLM challenge, a shared task for data-efficient language modelling. Our submission ranked first in both tracks that we participated in. A more detailed evaluation shows that, when compared to a strong baseline, our approach reliably performs better on (Super)GLUE tasks. The evaluation on MSGS suggests that our approach is more likely to prefer linguistic features over spurious surface features, and the BLiMP benchmarks show comparable performance to the baseline. Finally, our proposed modification shows that the assumption that all layers are equally important is incorrect, and that a more complex layer structure helps the model.

B Fine-tuning details
For the fine-tuning experiments, we run multiple seeds and (for MSGS) multiple learning rates to obtain a more robust comparison of model performance. The detailed hyperparameters for fine-tuning can be found in Table 5.

B.0.1 GLUE
To finetune, we use 5 different seeds: 12, 642, 369, 1267, and 2395. We use a validation set to find our best model with early stopping, and then test the model on a test set (here, the validation set is 10% of the training sets from https://github.com/babylm/evaluation-pipeline and the test set is their validation set).
Hyperparameter | GLUE | SuperGLUE | MSGS
Adam β1        | 0.9  | 0.9       | 0.9
Adam β2        | 0.999 | 0.999    | 0.999

Table 5: Hyperparameters for fine-tuning the GLUE, SuperGLUE and MSGS tasks. We use the same hyperparameters for all ELC-BERT models, not performing any per-model hyperparameter search. The values for MSGS are adopted from Warstadt et al. (2020b). For all models, we measure the statistics over 5 random seeds for the GLUE tasks (12, 642, 369, 1267, and 2395) and 3 seeds for the MSGS tasks (12, 369, and 2395).

D Detailed Results
This section breaks down the aggregate scores of the benchmarks into their composing tasks. It also describes or names each task.

D.1 BLiMP
The BabyLM challenge uses the BLiMP benchmark (Warstadt et al., 2020a) to evaluate the syntactic understanding of the models. Our detailed results can be found in Table 7. Its composing tasks are as follows (with descriptions taken from Warstadt et al. (2020a)):

• ANAPHOR AGREEMENT (AA): the requirement that reflexive pronouns like herself (also known as anaphora) agree with their antecedents in person, number, gender, and animacy.
• ARGUMENT STRUCTURE (AS): the ability of different verbs to appear with different types of arguments. For instance, different verbs can appear with a direct object, participate in the causative alternation, or take an inanimate argument.
• BINDING (B): the structural relationship between a pronoun and its antecedent.
• CONTROL/RAISING (CR): syntactic and semantic differences between various types of predicates that embed an infinitival VP. This includes control, raising, and tough-movement predicates.
• ELLIPSIS (E): the possibility of omitting expressions from a sentence. Because this is difficult to illustrate with sentences of equal length, our paradigms cover only special cases of noun phrase ellipsis that meet this constraint.
• ISLAND EFFECTS (IE): restrictions on syntactic environments where the gap in a filler-gap dependency may occur.
• NPI LICENSING (NL): restrictions on the distribution of negative polarity items like any and ever, limited to, for example, the scope of negation and only.
• QUANTIFIERS (Q): restrictions on the distribution of quantifiers. Two such restrictions are covered: superlative quantifiers (e.g., at least) cannot be embedded under negation, and definite quantifiers and determiners cannot be subjects in existential-there constructions.
• SUBJECT-VERB AGREEMENT (SVA): subjects and present-tense verbs must agree in number.

Table 8: BLiMP supplemental results for models trained both on the 100M (above the mid-horizontal line) and the 10M (below the mid-horizontal line) BabyLM dataset. The bold results represent the best model for the task. The metric is accuracy, and the results are in percentage.

D.3 GLUE
The BabyLM challenge involves slightly modified GLUE and SuperGLUE benchmarks. It uses only a subset of the subtasks, the datasets are filtered so that they do not contain out-of-vocabulary words, and it sometimes uses non-standard metrics. Our detailed results can be found in Table 9. We list all subtasks and their metrics below:

• Boolean Questions (BoolQ; Clark et al., 2019), a yes/no question-answering dataset, evaluated with accuracy.
• The Multi-Genre Natural Language Inference Corpus (MNLI; Williams et al., 2018). Its development set consists of two parts: matched, sampled from the same data source as the training set, and mismatched, sampled from a different domain. Both parts are evaluated with accuracy.
• The Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), evaluated with the F1-score (originally also evaluated with accuracy).
• Multi-Sentence Reading Comprehension (MultiRC; Khashabi et al., 2018), a multiple-choice question-answering dataset, evaluated with accuracy (originally evaluated with exact-match accuracy (EM) and F1-score over all answer options).

D.4 MSGS
The BabyLM challenge uses a reduced set of the MSGS benchmark (Warstadt et al., 2020b) to evaluate whether the model is biased towards linguistic features or surface features. A score of 1 means the model uses only the linguistic features, while a score of -1 means it uses only the surface features. The surface features are (definitions taken from Warstadt et al. (2020b)):

• LEXICAL CONTENT (LC): This feature is 1 iff the sentence contains the.
• RELATIVE TOKEN POSITION (RTP): This feature is 1 when the precedes a, and 0 when a precedes the.
The linguistic features are (definitions taken from Warstadt et al. (2020b)):

• MAIN VERB (MV): This feature is 1 iff the sentence's main verb is in the -ing form.
• CONTROL/RAISING (CR): This feature has value 1 iff the sentence contains the control construction.
The results are reported in percentage. The bold result indicates the best model for each dataset.
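For concreteness, the LBS/MCC scoring described above can be sketched as follows (our own illustration; the official evaluation pipeline computes this for you):

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels in {0, 1}.
    Used as the Linguistic Bias Score (LBS): +1 means the model follows
    the linguistic feature perfectly, -1 means the exact opposite."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

linguistic = [1, 0, 1, 0, 1, 0]  # labels assigned by the linguistic feature
assert mcc(linguistic, linguistic) == 1.0                    # fully linguistic
assert mcc(linguistic, [1 - y for y in linguistic]) == -1.0  # fully surface
```

A model whose predictions track the surface feature rather than the linguistic one lands at the negative end of this scale.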

Figure 1 :
Figure 1: Every layer can select which outputs from previous layers it wants as its input; these heatmaps show the weights given to each previous layer's output. The unit weights of the BERT model (and of any standard transformer-based model) are inferred from Equation (4). The right heatmap shows the α weights of the normalized ELC-BERT variant; for a clear visual comparison between the two models, we rescale the α weights so that the kth row sums to k. Note that layer 0 is the embedding layer, as in Equation (1).

Figure 2 :
Figure 2: Violin plots of the Linguistic Bias Scores (LBS) of each model and the base model. The white dot shows the median LBS, and the edges of the boxes mark the 1st and 3rd quartiles. The width of the violins shows the density of results at that score.

Figure 3 :
Figure 3: Detailed LBS for each model and each combination of surface and linguistic features. The Y-axis (Main Verb, Syntactic Category, and Control Raising) shows the linguistic features, while the X-axis (Lexical Content, Relative Token Position) represents the surface features. Each dot represents a different fine-tuned model.

Table 3 :
Results for the BabyLM challenge suite of evaluation datasets. We compare the performance of different variants of our model to the one submitted to the BabyLM challenge, as well as to the backbone model LTG-BERT, on the STRICT dataset.

Table 4 :
Pre-training hyperparameters for the small-sized models (trained on STRICT-SMALL) and for the base-sized models (trained on the STRICT track).
Table 6 gives a detailed overview of the BabyLM dataset:

Table 7 :
BLiMP results for models trained both on the 100M (above the mid-horizontal line) and the 10M (below the mid-horizontal line) BabyLM dataset. The bold results represent the best model for the task. The metric is accuracy, and the results are in percentage.

Table 9 :
A subset of GLUE results (defined by the BabyLM challenge) for the models trained on 100M and 10M words. All results indicate model accuracy for the task, except for MRPC and QQP, where the results are based on the F1-score of the positive class. To obtain the standard deviation, each model is trained with 5 seeds, and the average accuracy/F1-score is reported. The results are reported in percentage. The bold result indicates the best model for each dataset.
Table 10: Detailed results of the reduced MSGS benchmark. The first 5 results (MVC to RTPC) are controls, checking whether the model can recognize the feature, while the next six evaluate whether the model is biased towards linguistic or surface features. To evaluate the performance we use the Matthews Correlation Coefficient (MCC), also called the Linguistic Bias Score (LBS) for the last six tasks.