BERT Has More to Offer: BERT Layers Combination Yields Better Sentence Embeddings



Introduction
Learning sentence vector representations is a crucial problem in natural language processing (NLP) and has been widely studied in the literature (Conneau et al., 2017; Cer et al., 2018; Li et al., 2020). Given a sentence, the goal is to acquire a vector that semantically and/or syntactically represents it. BERT (Devlin et al., 2019) has set new state-of-the-art records on many NLP tasks (Madabushi et al., 2020; Hu et al., 2022; Skorupa Parolin et al., 2022; Wang and Kuo, 2020). However, this is achieved by fine-tuning all of BERT's layers. The disadvantage of fine-tuning is that it is computationally expensive, as even BERT-base has 110M parameters; hence, pre-computing a representation of the data and using it for the downstream task is much less computationally expensive (Devlin et al., 2019).
Moreover, for sentence-pair tasks, BERT uses a cross-encoder; nonetheless, this setup is inappropriate for certain pair regression tasks, such as finding the most similar sentence in a data set to a specific sentence, due to the large number of possible combinations (Reimers and Gurevych, 2019).
Considering the aforementioned drawbacks, researchers have tried to derive fixed-sized sentence embeddings from BERT or proposed new BERTs with the exact same architecture but different ways of training (Reimers and Gurevych, 2019; Li et al., 2020). After training, the resultant BERT is used in a feature-based manner by passing the sentence to it and obtaining its embedding vector in different ways, such as averaging the last layer of BERT.
In this paper, we present a simple, yet effective and novel method called BERT-LC (BERT Layers Combination). BERT-LC combines certain layers of BERT in order to obtain the representation of a sentence. As we will show, this model significantly outperforms its corresponding BERT baseline with no need for any further training. Our work was inspired by Jawahar et al. (2019), who show that different layers of BERT carry different features, such as surface, syntactic, and semantic features. We argue that each data set, with its unique distribution, might need a different set of features for its sentences, which can only be fully exploited by combining different layers of BERT in an unsupervised way.
Our contributions are as follows: (1) We propose a new method called BERT-LC that is capable of acquiring superior results by combining certain layers of BERT instead of just the last layer, in an unsupervised manner. We also include the embedding layer, which, to our knowledge, was ignored in previous works. (2) We additionally show that our method improves SBERT (Reimers and Gurevych, 2019) and SimCSE (Gao et al., 2021), which were specifically designed for obtaining sentence representations (as opposed to BERT and RoBERTa (Liu et al., 2019)). (3) We developed an algorithm that speeds up the process of finding the best layer combination (among $2^{13}$ layer combinations in the base cases) by a factor of 189. (4) We propose an innovative method that integrates the layer combination method with the CLS pooling head, improving the performance metrics for certain models.
(5) We achieve state-of-the-art performance on the transfer tasks using layer combination.
We demonstrate the superiority of our approach by conducting extensive experiments on seven standard semantic textual similarity (STS) data sets and eight transfer tasks. On the STS data sets, our method outperforms its corresponding baseline by up to 25.75% and on average 16.32% for BERT-large-uncased. We also achieve state-of-the-art performance on the transfer tasks, reducing the previous best model's relative error rate by an average of 17.92% and up to 37.41%.

Related Work
Learning sentence embeddings is a well-studied realm in NLP. There are mainly two families of methods used for this purpose: those that use unlabeled data and those that use labeled data. Although the latter use labeled data, the target data sets and tasks on which they are tested are different from the training data set and task. Early work on sentence embedding utilized the distributional hypothesis by predicting the surrounding sentences of a sentence (Kiros et al., 2015; Hill et al., 2016; Logeswaran and Lee, 2018). Pagliardini et al. (2018) build on the idea of word2vec (Mikolov et al., 2013) using n-gram embeddings. Some researchers simply use the average of BERT's last-layer embeddings as the sentence embedding (Reimers and Gurevych, 2019). Recently, contrastive learning has proven to be very powerful in many domains (Khorram et al., 2022; Munia et al., 2021; Hu et al., 2021); thus, some recent methods exploit contrastive learning (Gao et al., 2021; Zhang et al., 2022; Yan et al., 2021) and utilize the same sentence by looking at it from multiple angles. For instance, the popular method SimCSE leverages different outputs of the same sentence obtained from BERT's standard dropout.
There are some previous works (Ethayarajh, 2019;Jawahar et al., 2019;Bommasani et al., 2020) that investigate the impact of different BERT layers.However, they either investigate each layer individually or are restricted to considering a set of consecutive layers strictly starting from layer 1.
We apply our method to different variations of BERT, RoBERTa, SBERT, and SimCSE. To the best of our knowledge, no previous work either combines different, arbitrary layers of these models or further integrates layer combination with the CLS pooling head.

Approach
Given an input sentence $S = (s_1, s_2, \ldots, s_N)$, the goal of a sentence embedding model is to output a vector $E_S \in \mathbb{R}^d$ which carries the semantic and/or syntactic information of the sentence. In order to obtain a sentence embedding, we first pass the sentence to a BERT-like model, which outputs the tensor $H \in \mathbb{R}^{L' \times N \times d}$, in which $d$ is the dimension of the token vectors in each layer, and $L' = L + 1$ is the number of layers (including Layer 0) in BERT.
We then apply to this tensor a pooling function p, which can be max or mean, but we choose mean as it yielded better results in our experiments.
The pooling is done across all the tokens in all the layers of the desired layer set, D. For example, for mean pooling, p is defined as

$$p(H, D) = \frac{1}{|D| \cdot N} \sum_{l \in D} \sum_{n=1}^{N} H_{l,n,:}, \qquad (1)$$

where $H_{l,n,:} \in \mathbb{R}^d$ is the Transformer vector at layer $l$ corresponding to the $n$th token.
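The snippet below is a minimal sketch of Eq. 1 using a Hugging Face BERT-style encoder; the checkpoint name and the embed() helper are illustrative and not part of the paper's released code.

```python
# Minimal sketch of Eq. 1: mean pooling over all tokens of all layers in D.
# Assumes a Hugging Face BERT-style encoder; the checkpoint name is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence, D):
    """Return p(H, D): the mean over all tokens of all layers in D."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of L + 1 tensors of shape (1, N, d);
    # index 0 is the embedding layer (Layer 0).
    H = torch.stack(out.hidden_states, dim=0).squeeze(1)  # (L + 1, N, d)
    return H[sorted(D)].mean(dim=(0, 1))                  # (d,)

# Example: combine the embedding layer with layers 1 and 12.
sentence_vector = embed("BERT has more to offer.", {0, 1, 12})
```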
Previous works usually set $D$ to $\{L\}$ or use the CLS pooler, the output of the MLP layer attached to the last layer's first token ([CLS]), to obtain the sentence embedding. Using the CLS pooler was shown to underperform last-layer averaging (Reimers and Gurevych, 2019). Further, as we will empirically show, choosing $D = \{L\}$ leads to significant underperformance as well; hence, in this work, we iterate through all possible $D$s ($D \in \mathcal{P}(A) \setminus \{\emptyset\}$, where $A = \{0, \ldots, L\}$) and choose the best-performing $D$ as our layers to pool from.
As iterating through all possible $D$s is very time-consuming, we propose an algorithm that speeds up the process of finding the best layer combination ($2^{13}$ layer combinations in the base cases) by a factor of 189, which can be found in Appendix A.
We further propose an extension to our method: exploiting the MLP head and layer combination simultaneously. The idea is to pass $p(H, D)$ from Eq. 1 to the MLP head and use the output of the MLP head as the new $p$. We spot that this method works better than merely using layer combination on SimCSE. We conjecture that this is because SimCSE's MLP head was trained for learning sentence embeddings, as opposed to the heads of the other methods, such as BERT, SBERT, and RoBERTa; hence, it carries important information to be utilized. Consequently, we use the two-step pipeline of layer combination and MLP for the SimCSE models, and we propose that the two-step pipeline be utilized for any other BERT-based sentence embedding model whose MLP head has been trained for sentence embedding.
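As an illustration of the two-step pipeline, the sketch below feeds the pooled vector through an MLP head. Reusing the dense-plus-tanh pooler of the BERT encoder from the previous snippet is only an assumption for demonstration; for SimCSE, the MLP head trained with the model would be used instead.

```python
# Two-step pipeline sketch: layer combination followed by an MLP head.
# Reusing BERT's pooler weights here is illustrative only; the paper applies this
# pipeline with SimCSE's own trained MLP head.
def embed_with_mlp(sentence, D):
    pooled = embed(sentence, D)              # p(H, D) from Eq. 1
    with torch.no_grad():
        dense = model.pooler.dense           # MLP head (dense layer)
        return torch.tanh(dense(pooled))     # dense + tanh, used as the new embedding
```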

Experiments and Results
We carry out our experiments on two different tasks: transfer and STS tasks. In this section, we discuss the transfer tasks, while the STS tasks are discussed in Appendix B.
For each data set, we combine all of its data subsets and randomly split them into training-development (train-dev) and test data with ratios of 85% and 15%. This outer cross-validation is done randomly 10 times, and the average accuracy results on the test data are reported. We further utilize a 10-fold inner cross-validation on the train-dev data set, with ratios of 82% (train) and 18% (dev).
The sentence embeddings produced by our method or the baselines are used as feature vectors for a logistic regression classifier. Note that even though there exists a training data set, it is only utilized by the logistic regression classifier and not by our method.
For the base and large models, we consider layer combinations with up to four and three layers, respectively, as we did not spot much difference when combining more layers.
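For concreteness, candidate layer sets of bounded size can be enumerated as in the sketch below (a base model with 13 layers, including the embedding layer, and a cap of four layers, matching the setting above; the variable names are our own):

```python
# Enumerate all candidate layer sets D with 1 to 4 layers for a 13-layer base model.
from itertools import chain, combinations

layers = range(13)          # layers 0..12, where 0 is the embedding layer
max_size = 4                # up to four layers for base models, as described above
candidate_Ds = list(chain.from_iterable(
    combinations(layers, k) for k in range(1, max_size + 1)))
```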
For the logistic regression, we use SAGA (Defazio et al., 2014) as the optimizer, 0.01 as the tolerance level, 200 as the maximum number of iterations, and 10 as the regularization parameter.
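A sketch of this classifier configuration in scikit-learn is shown below; we interpret the regularization parameter of 10 as the inverse regularization strength C, which is an assumption on our part.

```python
# Logistic regression on top of the sentence embeddings (configuration as stated above).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver="saga", tol=0.01, max_iter=200, C=10)
# clf.fit(train_embeddings, train_labels)
# accuracy = clf.score(test_embeddings, test_labels)
```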

Results and Discussion
The results for our proposed method and the baselines on the transfer tasks are shown in Table 1. The results are statistically significant (p-value < 0.01). Note that our methods are denoted by an LC suffix (e.g., RoBERTa-L-LC means RoBERTa-L when layer combination is used).
Since all models have accuracies close to 100% on most of the transfer data sets, we believe it is more reasonable to see how much improvement we are acquiring by considering the relative error reduction. For a model and its baseline with respective accuracies $A_M$ and $A_{MB}$, the relative error reduction is defined as

$$\mathrm{RER} = \frac{A_M - A_{MB}}{100 - A_{MB}} \times 100.$$

From Table 1, we observe that our method significantly outperforms its corresponding baselines for all the models on all the data sets. For example, RoBERTa-B-LC, SBERT-L-LC, and UnSupSimCSE-RL-LC improve on their baselines by reducing the relative errors by up to 36.19%, 45.93%, and 40.80% and on average by 19.89%, 14.95%, and 27.31%, respectively.
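A small helper corresponding to the relative error reduction defined above (accuracies given in percent; the function name is ours):

```python
def relative_error_reduction(acc_model, acc_baseline):
    """Relative error reduction of a model over its baseline (accuracies in percent)."""
    return (acc_model - acc_baseline) / (100.0 - acc_baseline) * 100.0
```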
Furthermore, we spot that our method is more effective on the unsupervised versions of the SimCSE models. For instance, the relative error improvement over UnSupSimCSE-RL is 27.31%, whereas it is 14.61% for SupSimCSE-RL. However, there is still important information to be extracted from all the layers of the supervised SimCSE models as well.
We see that even an unsupervised model such as RoBERTa-L, when upgraded with LC, can achieve 84.58%, beating SupSimCSE-RL, a supervised model which held the previous best average result (83.73%). RoBERTa-L-LC's average accuracy does not fall very short of the new state-of-the-art model, SupSimCSE-RL-LC (84.68% vs. 85.89%). Finally, we have improved on the previous state-of-the-art models on all of the transfer data sets by reducing the relative (absolute) errors by up to 37.41% (3.67%) and on average 17.92% (1.90%), respectively. We achieve a state-of-the-art average accuracy of 86.36% on the transfer data sets. We have provided more baselines, which can be found in Appendix D.

Ablation Studies
While more ablation studies, such as the effect of the number of layers in layer combination and the effect of excluding a particular layer from layer combination, can be found in Appendix E, in this section we discuss the performance of layer combination in limited-data settings. For this, we reduce the size of the training data for MR, SUBJ, and TREC, while keeping the test data untouched. Fig. 1 shows the results for RoBERTa-B (RB), BERT-B-cased (BBC), and SRoBERTa-B (SRB) on these data sets. We report the average results over 10 different seeds. As we can see, for all the models on all the data sets, the more the training size is reduced, the larger the difference between the LC version and the last-layer average. For instance, for RB on SUBJ, on the original training data, the accuracies for RB-LC and RB are 94.77% and 93.73%, respectively, while when reducing the size to 1/256th, the respective accuracies are 78.99% and 50.83%. This suggests that it is even more beneficial to use LC when one has small training data.

Conclusion and Future Work
In this paper, we proposed a new method called BERT Layer Combination, a simple, yet effective framework which, when applied to various BERT-based models, significantly improves them for the downstream tasks of STS and transfer learning. Further, it achieves state-of-the-art performance on eight transfer tasks. Our method combines certain layers of BERT-based models in an unsupervised manner, which shows that different layers of BERT hold important information that was previously ignored. We demonstrated the effectiveness of our approach by conducting comprehensive experiments on various BERT-based models and on a host of different tasks and data sets.
As future work, we would like to apply the technique of layer combination to other NLP tasks (e.g., punctuation insertion (Hosseini and Sameti, 2017) and question answering (Qu et al., 2019)) and also utilize it in other domains where deep learning models are used (e.g., biosignal analysis (Munia et al., 2020, 2023) and image captioning (Huang et al., 2019)).

Limitation
One limitation of our work is that though we applied our layer combination technique to a raft of different models, all of them are BERT based; ergo, whether or not this technique works on other deep learning NLP models remains unsettled.We leave this experimentation as future work.

A Optimized Algorithm for Finding the Best Layer Combination
For a data set $X = \{(x_1, y_1), \ldots, (x_{|X|}, y_{|X|})\}$, where each sample consists of a list of sentences $x_i = \{S_i^1, \ldots, S_i^n\}$ ($n = 2$ for sentence-pair tasks such as STS) and a label $y_i$, the best-performing layer set ($D^*$) is calculated by the following equation:

$$D^* = \operatorname*{arg\,max}_{D \in \mathcal{P}^+(A)} \; m\big(\{(f_i, y_i)\}_{i=1}^{|X|}\big), \qquad (2)$$

where $m$ is the desired metric function, and $f_i$ is

$$f_i = f\big(p(H_i^1, D), \ldots, p(H_i^n, D)\big), \qquad (3)$$

where $H_i^j$ is the tensor of the $i$th sample's $j$th sentence, and $f$ is a function, such as cosine similarity.
Eq. 3 requires calculating $p(H, D)$ for each sentence in $X$ and for every possible $D$, which can be very time-consuming. In this subsection, we discuss our proposed algorithm for efficiently calculating all $p(H, D)$s in a step-by-step manner. Each method improves over the previous one, with Method 1 being the naive algorithm. The algorithms discussed here utilize mean as the pooling function, yet they can be easily modified to use other functions such as max.

Method 1: For every $D \in \mathcal{P}^+(A)$, calculate $p(H, D)$ directly by averaging the token vectors of all the layers in $D$.

Method 2: First calculate $p_l = p(H, \{l\})$ for every layer $l \in A$. Then, for every other $D \in \mathcal{P}^+(A)$, calculate $p(H, D)$ as $p(H, D) = \frac{1}{|D|} \sum_{l \in D} p_l$.

Method 3: This method is explained in Algos. 1 and 2 (maxOptim=False). The idea is that for calculating every $p(H, D)$, we can exploit other $p(H, D')$s that we have calculated before. For instance, $p(H, \{1, 3, 6, 8, 10\}) = \frac{3\,p(H, \{1, 3, 6\}) + 2\,p(H, \{8, 10\})}{5}$. This method uses bottom-up dynamic programming to store the previous $p(H, D)$s. powerset(A, i) in Algo. 1 is equal to $\{D \mid D \in \mathcal{P}(A) \wedge |D| = i\}$. mem is a hash map with key-value pairs of the form $(D, p(H, D))$. greedyPart(D, mem) returns an array of partitions of $D$ such that all partitions are present in mem and each partition has the highest length possible. This is done in a greedy manner, by first dividing $D$ into two partitions, then three partitions, and so on. If no such partitions exist, the resultant array is empty.

Method 4: This method further improves over Method 3 by avoiding certain multiplications (Algo. 2, maxOptim=True). For instance, if $\alpha = p(H, \{1, \ldots, 4\})$ and $\beta = p(H, \{5, \ldots, 8\})$, then $p(H, \{1, \ldots, 8\})$ can be calculated as $\frac{\alpha + \beta}{2}$ instead of $\frac{4\alpha + 4\beta}{8}$, saving two vector multiplications. getPartLens(parts) in Algo. 2 returns a set of the lengths of parts' elements.

To show the effectiveness of our algorithm, we carried out an experiment. We randomly select 1000 samples from the SICK data set and try to find the best layer combination among all possible 8192 layer combinations, once with our algorithm and once with the naive algorithm (Method 2). We repeat this experiment five times and report the average results. Our algorithm takes 5.65 seconds to find the best layer combination after the tensor of all layer and token vectors is obtained by BERT-base-uncased, while Method 2 takes 1067 seconds. The one-time forward pass of BERT takes 10 seconds.
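The sketch below illustrates the memoization idea behind Methods 2-4 on the stacked hidden-state tensor H of shape (L + 1, N, d); it reuses previously computed set means instead of re-averaging token vectors, but it does not reproduce the full greedy partitioning of Algos. 1 and 2, so it is an assumption-laden simplification rather than the paper's exact algorithm.

```python
# Simplified memoized computation of p(H, D) for every D with |D| <= max_size.
# Each set reuses the mean already stored for its first |D| - 1 layers, so token
# vectors are averaged only once per layer (the core idea of Methods 2-4).
from itertools import combinations
import torch

def all_pooled(H, max_size):
    num_layers = H.shape[0]
    per_layer = H.mean(dim=1)                          # p_l for every single layer
    mem = {(l,): per_layer[l] for l in range(num_layers)}
    for size in range(2, max_size + 1):
        for D in combinations(range(num_layers), size):
            head, last = D[:-1], D[-1:]
            mem[D] = (mem[head] * (size - 1) + mem[last]) / size
    return mem
```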

B.1 Data Sets and Evaluation Setup
For the STS tasks, we use the seven standard STS data sets: STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017), and SICK Relatedness (Marelli et al., 2014). These data sets consist of sentence pairs, each assigned a similarity score from 1 to 5. For each data set, we first combine all of its data subsets, shuffle it, and choose a small portion of it (the first 350 sentence pairs) as our development data set and the rest as our test data set. We do this splitting 5 times randomly and report the average test performance. Please note that our method does not require training data. Nonetheless, our method still needs a development data set in order for it to be bootstrapped. However, the required data set can be as small as 350 samples. Furthermore, as opposed to the other models, this data is not leveraged for fine-tuning, which is very time-consuming. We use the sentence embeddings obtained by our method or the baselines to calculate the cosine similarity between two sentences and then use Spearman's correlation (ρ) to evaluate the models' performances, as suggested by Reimers and Gurevych (2019).
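A hedged sketch of this evaluation loop, reusing the embed() helper sketched in the Approach section (the helper name and the use of SciPy's spearmanr are our own choices):

```python
# Score a layer set D on an STS data set: cosine similarity between the two
# sentence embeddings, evaluated with Spearman's rho against the gold scores.
import torch
from scipy.stats import spearmanr

def evaluate_sts(pairs, gold_scores, D):
    sims = []
    for a, b in pairs:
        ea, eb = embed(a, D), embed(b, D)
        sims.append(torch.nn.functional.cosine_similarity(ea, eb, dim=0).item())
    return spearmanr(sims, gold_scores).correlation
```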

B.2 Baselines and Hyper-parameters
We use the same baselines that we use for the transfer tasks, except for the MLM versions of the SimCSE models, as those were proposed in the SimCSE paper to strengthen their models for the transfer tasks.
For the base models with 13 layers, we iterate through all the possible layer combinations for choosing the best one, yet for the large models with 25 layers, we only iterate through all the layer combinations with up to eight layers since in our experiments, we observed that combining more than eight layers hurts the performance.

B.3 Results and Discussion
The results for the STS tasks are shown in Table 2. From this table, we observe that our method significantly outperforms the corresponding BERT, RoBERTa, SBERT, and SRoBERTa baselines on all the STS data sets. For instance, for BERT-L-uncased, RoBERTa-L, and SRoBERTa-B, we obtain ρ improvements of up to 25.75%, 13.94%, and 3.95% and on average 16.32%, 9.52%, and 2.75%, respectively, compared to their best baselines (last-layer average). The SimCSE-LC versions also outperform most of their baselines by a decent margin. For example, UnSupSimCSE-BL-LC improves its baseline's ρ by up to 2.30% and on average 0.70%. Finally, the performance of the best SimCSE (Sup-RL) is also improved by up to 1.14%.
We can also see that our method is more effective on unsupervised models such as BERT and RoBERTa, as we spot a higher improvement compared with the supervised models such as SBERT or SupSimCSE.This suggests that the unsupervised models' layers hold more important information than the supervised models for STS tasks.However, there is still crucial information to be exploited from all the layers of the supervised models as well.
We also observe higher improvements over the baseline for the uncased BERT versions compared to the cased versions. Nonetheless, after applying our method to either one, the final ρs are within a 1.25% range of each other. This suggests that a lot of BERT-uncased's information is carried in its layers, and when exploited, it can perform more closely to BERT-cased-LC.

C Computation Setup
To conduct the experiments discussed in this paper, we used a computer with 128GB of RAM and one Intel Core i9-9980XE CPU. Our code uses PyTorch (Paszke et al., 2019), Hugging Face Transformers, and scikit-learn (Pedregosa et al., 2011).

D More Baselines
In this section, we compare the results of the layer combination technique with more baselines on both the STS and transfer tasks. We use the same baseline models as shown in Table 2, but instead of only comparing with the last layer for the BERT/RoBERTa models and the original SimCSE and SBERT/SRoBERTa models (which use the CLS pooler and last-layer average, respectively), we also compare the results with two new baselines: the average of the last four layers and random layers. The results are shown in Table 3 and Table 4. They show that our method performs better than the new baselines as well.

E Ablation Studies

E.1 Effect of the Number of Layers
In this subsection, we show the effect of combining only N layers of a model, with N varying from 1 to 13 for the base models. To this end, we show this effect on the data sets STS16, STSB, and SICK, and for the models BERT-B-cased, RoBERTa-B, SBERT-B, and SRoBERTa-B. From Figs. 2 and 3, we can see that (1) for all the data sets and all the models, adding more layers shows an upward trend in performance up to some peak point (e.g., N = 4 for BERT-B-cased on STS16), and adding more layers after that point shows a downward trend in performance; this point differs across models and data sets, yet it is always at most 6.
(2) For BERT-B-cased, RoBERTa-B, and SBERT-B, moving from one layer to two layers leads to a huge increase in performance for all but one of the data set-model pairs, yet combining more than two layers always leads to a further substantial increase in performance. For instance, on STSB, an absolute ρ improvement of 1.22% can be obtained when moving from two layers to six layers. We, nonetheless, do not see drastic changes for SRoBERTa-B when moving from one layer to two layers.

E.2 Effect of Excluding a Particular Layer from Layer Combination
In this section, we discuss how excluding one particular layer from layer combination affects the performance. For this, we show the results for the models BERT-B-cased, RoBERTa-B, and SBERT-B on the data sets STS16, STSB, and SICK in Fig. 4. We can make the following observations: 1. Most of the time, excluding layer 11 hurts the performance, showing that this is an important layer.
2. Excluding either of the layers 4, 5, or 6 does not hurt the performance, suggesting that these layers are not important for these models and data sets.
3. The last layer (L12) is always important for SBERT-B, is important for BERT-B-cased in two cases (STS16 and STSB), and is only important for RoBERTa-B in one case (STSB).
4. Layer 0, which is the embedding layer, proves to be important in more than half of the cases here, showing that it carries important information to be considered for STS tasks, and it cannot be ignored.
5. Excluding any particular layer in any of the nine model-data set pairs shown in Fig. 4 leads to a maximum decrease of 0.97% (L12 for SBERT-B on STS16) in the performance. However, as shown in Table 2, in each of these nine cases, we obtain an improvement of at least 1.86% by combining certain layers (considering all the layers). This suggests that layer combination still outperforms the baselines even if any one layer is not considered at all.

Figure 1: Varying the size of the training data for transfer tasks.

Algorithm 2: Layer Combination Pooler. Input: tensor H, set D, hash map mem, Boolean maxOptim, integer maxMem; output: array P, computed by the function pool(H, D, mem, maxOptim, maxMem), which combines memoized partition averages from mem to obtain p(H, D).

Table 1: Transfer task results for different sentence embedding models (measured as accuracy × 100).