Self-Guided Contrastive Learning for BERT Sentence Representations

Although BERT and its variants have reshaped the NLP landscape, it still remains unclear how best to derive sentence embeddings from such pre-trained Transformers. In this work, we propose a contrastive learning method that utilizes self-guidance for improving the quality of BERT sentence representations. Our method fine-tunes BERT in a self-supervised fashion, does not rely on data augmentation, and enables the usual [CLS] token embeddings to function as sentence vectors. Moreover, we redesign the contrastive learning objective (NT-Xent) and apply it to sentence representation learning. We demonstrate with extensive experiments that our approach is more effective than competitive baselines on diverse sentence-related tasks. We also show it is efficient at inference and robust to domain shifts.


Introduction
Pre-trained Transformer (Vaswani et al., 2017) language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have been integral to achieving recent improvements in natural language understanding. However, it is not straightforward to directly utilize these models for sentence-level tasks, as they are basically pre-trained to focus on predicting (sub)word tokens given context. The most typical way of converting the models into sentence encoders is to fine-tune them with supervision from a downstream task. In the process, as initially proposed by Devlin et al. (2019), the embedding of a pre-defined token (a.k.a. [CLS]) from the last layer of the encoder is deemed the representation of an input sequence. This simple but effective method is possible because, during supervised fine-tuning, the [CLS] embedding functions as the only communication gate between the pre-trained encoder and a task-specific layer, encouraging the [CLS] vector to capture the holistic information.

Figure 1: BERT(-base)'s layer-wise performance with different pooling methods on the STS-B test set. We observe that the performance can vary dramatically according to the selected layer and pooling strategy. Our self-guided training (SG / SG-OPT) assures much improved results compared to those of the baselines.
On the other hand, in cases where labeled datasets are unavailable, it is unclear what the best strategy is for deriving sentence embeddings from BERT. In practice, previous studies (Reimers and Gurevych, 2019; Li et al., 2020; Hu et al., 2020) reported that naïvely (i.e., without any processing) leveraging the [CLS] embedding as a sentence representation, as is the case in supervised fine-tuning, results in disappointing outcomes. Currently, the most common rule of thumb for building BERT sentence embeddings without supervision is to apply mean pooling on the last layer(s) of BERT.
Yet, this approach can still be sub-optimal. In a preliminary experiment, we constructed sentence embeddings by employing various combinations of different BERT layers and pooling methods, and tested them on the Semantic Textual Similarity (STS) benchmark dataset (Cer et al., 2017). We discovered that BERT(-base)'s performance, measured in Spearman correlation (× 100), can range from as low as 16.71 ([CLS], the 10th layer) to 63.19 (max pooling, the 2nd layer) depending on the selected layer and pooling method (see Figure 1). This result suggests that the current practice of building BERT sentence vectors is not solid enough, and that there is room to bring out more of BERT's expressiveness.
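To make the layer/pooling combinations in this preliminary experiment concrete, the following is a minimal sketch of the three pooling strategies applied to the token states of one chosen layer. It uses plain Python lists instead of a real BERT model; the names `sentence_vector`, `POOLERS`, and `toy_layers`, as well as the toy numbers, are illustrative assumptions and not part of the original experiment.

```python
from typing import Dict, Callable, List

Matrix = List[List[float]]  # one row per token, one column per hidden dimension


def cls_pooling(hidden: Matrix) -> List[float]:
    """Return the first token's vector (the [CLS] position in BERT inputs)."""
    return list(hidden[0])


def mean_pooling(hidden: Matrix) -> List[float]:
    """Average each hidden dimension over all tokens."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]


def max_pooling(hidden: Matrix) -> List[float]:
    """Take the per-dimension maximum over all tokens."""
    return [max(col) for col in zip(*hidden)]


POOLERS: Dict[str, Callable[[Matrix], List[float]]] = {
    "cls": cls_pooling,
    "mean": mean_pooling,
    "max": max_pooling,
}


def sentence_vector(layers: List[Matrix], layer: int, pooling: str) -> List[float]:
    """Pool the token states of one chosen layer into a single sentence vector."""
    return POOLERS[pooling](layers[layer])


# Toy input (made-up values): 2 layers, 3 tokens, hidden size 2.
toy_layers = [
    [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]],
    [[0.5, 0.5], [1.5, 2.5], [1.0, 0.0]],
]
print(sentence_vector(toy_layers, layer=0, pooling="mean"))  # [2.0, 2.0]
print(sentence_vector(toy_layers, layer=0, pooling="max"))   # [3.0, 4.0]
```

Sweeping `layer` and `pooling` over all available values and scoring each resulting embedding on STS-B would yield a grid of the kind summarized in Figure 1.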
In this work, we propose a contrastive learning method that makes use of a newly proposed self-guidance mechanism to tackle the aforementioned problem. The core idea is to recycle intermediate BERT hidden representations as positive samples to which the final sentence embedding should be close. As our method does not require data augmentation, which is essential in most recent contrastive learning frameworks, it is much simpler and easier to use than existing methods (Fang and Xie, 2020; Xie et al., 2020). Moreover, we customize the NT-Xent loss (Chen et al., 2020), a contrastive learning objective widely used in computer vision, for better sentence representation learning with BERT. We demonstrate that our approach outperforms competitive baselines designed for building BERT sentence vectors (Li et al., 2020; Wang and Kuo, 2020) in various environments. With comprehensive analyses, we also show that our method is more computationally efficient than the baselines at inference in addition to being more robust to domain shifts.

Related Work
Contrastive Representation Learning. Contrastive learning has long been considered effective in constructing meaningful representations. For instance, Mikolov et al. (2013) propose to learn word embeddings by framing words near a target word as positive samples and others as negative ones. Logeswaran and Lee (2018) generalize the approach of Mikolov et al. (2013) to sentence representation learning. More recently, several studies (Fang and Xie, 2020; Giorgi et al., 2020; Wu et al., 2020) suggest utilizing contrastive learning for training Transformer models, similar to our approach. However, they generally require data augmentation techniques, e.g., back-translation (Sennrich et al., 2016), or prior knowledge on training data such as order information, while our method does not. Furthermore, we focus on revising BERT for computing better sentence embeddings rather than training a language model from scratch.
On the other hand, contrastive learning has also been receiving much attention from the computer vision community (Chen et al. (2020); Chen and He (2020); He et al. (2020), inter alia). We improve the framework of Chen et al. (2020) by optimizing its learning objective for pre-trained Transformer-based sentence representation learning. For extensive surveys on contrastive learning, refer to Le-Khac et al. (2020) and Jaiswal et al. (2020).
Fine-tuning BERT with Supervision. It is not always trivial to fine-tune pre-trained Transformer models of gigantic size with success, especially when the amount of target-domain data is limited (Mosbach et al., 2020). To mitigate this training instability problem, several approaches (Aghajanyan et al., 2020; Jiang et al., 2020; Zhu et al., 2020) have been recently proposed. In particular, Gunel et al. (2021) propose to exploit contrastive learning as an auxiliary training objective when fine-tuning BERT with supervision from target tasks. In contrast, we deal with the problem of adjusting BERT when such supervision is not available.
Sentence Embeddings from BERT. Since BERT and its variants are originally designed to be fine-tuned on each downstream task to attain their optimal performance, it remains ambiguous how best to extract general sentence representations from them, which are broadly applicable across diverse sentence-related tasks. Following Conneau et al. (2017), Reimers and Gurevych (2019) (SBERT) propose to compute sentence embeddings by conducting mean pooling on the last layer of BERT and then fine-tuning the pooled vectors on the natural language inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018). Meanwhile, some other studies concentrate on more effectively leveraging the knowledge embedded in BERT to construct sentence embeddings without supervision. Specifically, Wang and Kuo (2020) propose a pooling method based on linear algebraic algorithms to draw sentence vectors from BERT's intermediate layers. Li et al. (2020) suggest learning a mapping from the average of the embeddings obtained from the last two layers of BERT to a spherical Gaussian distribution using a flow model, and leveraging the redistributed embeddings in place of the original BERT representations. We follow the setting of Li et al. (2020) in that we only utilize plain text during training. However, unlike all the others, which rely on a certain pooling method even after training, we directly refine BERT so that the typical [CLS] vector can function as a sentence embedding. Note also that there exists concurrent work (Carlsson et al., 2021; Gao et al., 2021; Wang et al., 2021) whose motivation is analogous to ours, attempting to improve BERT sentence embeddings in an unsupervised fashion.

Method
As BERT mostly requires some type of adaptation to be properly applied to a task of interest, it might not be desirable to derive sentence embeddings directly from BERT without fine-tuning. While Reimers and Gurevych (2019) attempt to alleviate this problem with typical supervised fine-tuning, we restrict ourselves to revising BERT in an unsupervised manner, meaning that our method only requires a collection of raw sentences for training.
Among possible unsupervised learning strategies, we concentrate on contrastive learning, which can inherently motivate BERT to be aware of similarities between different sentence embeddings. Considering that sentence vectors are widely used in computing the similarity of two sentences, the inductive bias introduced by contrastive learning can be helpful for BERT to work well on such tasks. The problem is that sentence-level contrastive learning usually requires data augmentation (Fang and Xie, 2020) or prior knowledge on training data, e.g., order information (Logeswaran and Lee, 2018), to make plausible positive/negative samples. We attempt to circumvent these constraints by utilizing the hidden representations of BERT, which are readily accessible, as samples in the embedding space.

Contrastive Learning with Self-Guidance
We aim at developing a contrastive learning method that is free from external procedures such as data augmentation. A possible solution is to leverage (virtual) adversarial training (Miyato et al., 2018) in the embedding space. However, there is no assurance that the semantics of a sentence embedding would remain unchanged when random noise is added to it. As an alternative, we propose to utilize the hidden representations from BERT's intermediate layers, which are conceptually guaranteed to represent corresponding sentences, as pivots that BERT sentence vectors should be close to or be away from. We call our method self-guided contrastive learning since we exploit internal training signals made by BERT itself to fine-tune it.

We describe our training framework in Figure 2. First, we clone BERT into two copies, BERT_F (fixed) and BERT_T (tuned) respectively. BERT_F is fixed during training to provide a training signal while BERT_T is fine-tuned to construct better sentence embeddings. The reason why we differentiate BERT_F from BERT_T is that we want to prevent the training signal computed by BERT_F from degenerating as the training procedure continues, which often happens when BERT_F = BERT_T. This design decision also reflects our philosophy that our goal is to dynamically conflate the knowledge stored in BERT's different layers to produce sentence embeddings, rather than introducing new information via extra training. Note that in our setting, the [CLS] vector from the last layer of BERT_T, i.e., $c_i$, is regarded as the final sentence embedding we aim to optimize/utilize during/after fine-tuning.
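Second, given $b$ sentences in a mini-batch, say $s_1, s_2, \cdots, s_b$, we feed each sentence $s_i$ into BERT_F and compute token-level hidden representations $H_{i,k} \in \mathbb{R}^{\mathrm{len}(s_i) \times d}$:

$$[H_{i,0}; H_{i,1}; \cdots; H_{i,k}; \cdots; H_{i,l}] = \mathrm{BERT}_F(s_i),$$

where $0 \le k \le l$ (0: the non-contextualized layer), $l$ is the number of hidden layers in BERT, $\mathrm{len}(s_i)$ is the length of the tokenized sentence, and $d$ is the size of BERT's hidden representations. Then, we apply a pooling function $p$ to $H_{i,k}$ for deriving diverse sentence-level views $h_{i,k} \in \mathbb{R}^{d}$ from all layers, i.e., $h_{i,k} = p(H_{i,k})$. Finally, we choose the final view to be utilized by applying a sampling function $\sigma$:

$$h_i = \sigma(\{h_{i,k} \mid 0 \le k \le l\}).$$

As we have no specific constraints in defining $p$ and $\sigma$, we employ max pooling as $p$ and a uniform sampler as $\sigma$ for simplicity, unless otherwise stated. This simple choice for the sampler implies that each $h_{i,k}$ has the same importance, which is reasonable given that different BERT layers are known to specialize in capturing disparate linguistic concepts (Jawahar et al., 2019).

Third, we compute our sentence embedding $c_i$ for $s_i$ as follows:

$$c_i = \mathrm{BERT}_T(s_i)_{[CLS]},$$

where $\mathrm{BERT}(\cdot)_{[CLS]}$ corresponds to the [CLS] vector obtained from the last layer of BERT. Next, we collect the set of the computed vectors into $X = \{x \mid x \in \{c_i\} \cup \{h_i\}\}$, and for all $x_m \in X$, we compute the NT-Xent loss (Chen et al., 2020):

$$L^{base}_m = -\log \frac{\phi(x_m, \mu(x_m))}{Z}, \quad \text{where } \phi(u, v) = \exp(g(f(u), f(v)) / \tau) \text{ and } Z = \sum_{n=1, n \neq m}^{2b} \phi(x_m, x_n).$$

Note that $\tau$ is a temperature hyperparameter, $f$ is a projection head consisting of MLP layers, $g(u, v) = u \cdot v / (\lVert u \rVert \lVert v \rVert)$ is the cosine similarity function, and $\mu(\cdot)$ is the matching function defined as follows:

$$\mu(x) = \begin{cases} h_i & \text{if } x \text{ is equal to } c_i, \\ c_i & \text{if } x \text{ is equal to } h_i. \end{cases}$$

Lastly, we sum all $L^{base}_m$ divided by $2b$, and add a regularizer $L^{reg} = \lVert \mathrm{BERT}_F - \mathrm{BERT}_T \rVert_2^2$ to prevent BERT_T from being too distant from BERT_F. As a result, the final loss $L^{base}$ is:

$$L^{base} = \frac{1}{2b} \sum_{m=1}^{2b} L^{base}_m + \lambda \cdot L^{reg},$$

where the coefficient $\lambda$ is a hyperparameter.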
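The procedure above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the projection head f is taken to be the identity, the regularizer weighted by λ is omitted, the default temperature value is arbitrary, and the names `sample_view` and `nt_xent_base` are hypothetical. The sentence vectors and views are passed in as lists of floats rather than produced by BERT.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def max_pool(token_states: Sequence[Vector]) -> Vector:
    """Pooling function p: per-dimension maximum over tokens."""
    return [max(col) for col in zip(*token_states)]


def sample_view(layer_states: Sequence[Sequence[Vector]], rng: random.Random) -> Vector:
    """Pool every layer into a view h_{i,k}, then pick one uniformly (sampler sigma)."""
    views = [max_pool(states) for states in layer_states]
    return rng.choice(views)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity g(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_base(c: List[Vector], h: List[Vector], tau: float = 0.1) -> float:
    """Average NT-Xent loss over X = {c_i} U {h_i}.

    Assumptions: projection head f is the identity; no regularizer term.
    """
    b = len(c)
    x = c + h  # x[i] is c_i and x[b + i] is h_i
    total = 0.0
    for m in range(2 * b):
        match = (m + b) % (2 * b)  # matching function mu: c_i <-> h_i
        phi = [math.exp(cosine(x[m], x[n]) / tau) for n in range(2 * b)]
        z = sum(phi) - phi[m]  # the normalizer excludes n = m
        total += -math.log(phi[match] / z)
    return total / (2 * b)


# Toy check: the loss is lower when each c_i is paired with a similar view h_i.
c = [[1.0, 0.0], [0.0, 1.0]]
h_aligned = [[1.0, 0.1], [0.1, 1.0]]
h_swapped = [[0.1, 1.0], [1.0, 0.1]]
print(nt_xent_base(c, h_aligned) < nt_xent_base(c, h_swapped))  # True
```

In an actual training loop, only the vectors coming from the tuned copy would receive gradients; the views from the fixed copy act as constants.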
To summarize, our method refines BERT so that the sentence embedding $c_i$ has a higher similarity with $h_i$, which is another representation for the sentence $s_i$, in the subspace projected by $f$ while being relatively dissimilar to $c_{j, j \neq i}$ and $h_{j, j \neq i}$. After training is completed, we remove all the components except BERT_T and simply use $c_i$ as the final sentence representation.

Learning Objective Optimization
In Section 3.1, we relied on a simple variation of the general NT-Xent loss, which is composed of four factors. Given sentences $s_i$ and $s_j$ without loss of generality, the factors are as follows (Figure 3):
(1) $c_i$ →← $h_i$ (or $c_j$ →← $h_j$): The main component that mirrors our core motivation that a BERT sentence vector ($c_i$) should be consistent with intermediate views ($h_i$) from BERT.
(2) $c_i$ ←→ $c_j$: A factor that forces sentence embeddings ($c_i$, $c_j$) to be distant from each other.
(3) $c_i$ ←→ $h_j$ (or $c_j$ ←→ $h_i$): An element that makes $c_i$ inconsistent with views for other sentences ($h_j$).
(4) $h_i$ ←→ $h_j$: A factor that causes a discrepancy between views of different sentences ($h_i$, $h_j$).
Even though all four factors play a certain role, some components may be useless or even have a negative influence on our goal. For instance, Chen and He (2020) have recently reported that in image representation learning, only (1) is vital while the others are nonessential. Likewise, we customize the training loss with three major modifications so that it is better suited to our purpose. First, as our aim is to improve $c_i$ with the aid of $h_i$, we re-define our loss focusing more on $c_i$ rather than considering $c_i$ and $h_i$ as equivalent entities:

$$L^{opt1}_i = -\log \frac{\phi(c_i, h_i)}{\hat{Z}},$$

where $\hat{Z} = \sum_{j=1, j \neq i}^{b} \phi(c_i, c_j) + \sum_{j=1}^{b} \phi(c_i, h_j)$. In other words, $h_i$ only functions as a point that $c_i$ is encouraged to be close to or away from, and is not deemed a target to be optimized. This revision naturally results in removing (4). Furthermore, we discover that (2) is also insignificant for improving performance, and thus derive $L^{opt2}_i$:

$$L^{opt2}_i = -\log \frac{\phi(c_i, h_i)}{\sum_{j=1}^{b} \phi(c_i, h_j)}.$$

Lastly, we diversify signals from (1) and (3) by allowing multiple views $\{h_{i,k}\}$ to guide $c_i$:

$$L^{opt3}_{i,k} = -\log \frac{\phi(c_i, h_{i,k})}{\sum_{j=1}^{b} \sum_{m=0}^{l} \phi(c_i, h_{j,m})}.$$

We expect with this refinement that the learning objective can provide more precise and fruitful training signals by considering the additional (and freely available) samples. The final form of our optimized loss is:

$$L^{opt} = \frac{1}{b(l+1)} \sum_{i=1}^{b} \sum_{k=0}^{l} L^{opt3}_{i,k} + \lambda \cdot L^{reg},$$

where $L^{reg}$ is the regularizer introduced in Section 3.1. In Section 5.1, we show that the decisions made in this section contribute to improvements in performance.
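As a rough illustration of the optimized objective, the sketch below keeps only factors (1) and (3) and lets every layer-wise view guide each sentence vector. It assumes φ(u, v) = exp(g(u, v)/τ) with the projection head taken as the identity, averages the per-view terms over all sentences and views, and leaves out the regularizer; the function name `sg_opt_loss`, the default temperature, and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity g(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def phi(u: Vector, v: Vector, tau: float) -> float:
    """Exponentiated, temperature-scaled similarity (projection head omitted)."""
    return math.exp(cosine(u, v) / tau)


def sg_opt_loss(c: List[Vector], views: List[List[Vector]], tau: float = 0.1) -> float:
    """Mean over (i, k) of -log(phi(c_i, h_{i,k}) / sum_{j,m} phi(c_i, h_{j,m})).

    views[j][k] holds the k-th layer view of sentence j. Only the sentence
    vectors c_i appear as anchors, so no c-c or h-h terms are computed.
    """
    b = len(c)
    n_views = len(views[0])
    total = 0.0
    for i in range(b):
        # Normalizer: views of every sentence in the batch, from every layer.
        denom = sum(phi(c[i], h, tau) for per_sentence in views for h in per_sentence)
        for k in range(n_views):
            total += -math.log(phi(c[i], views[i][k], tau) / denom)
    return total / (b * n_views)


# Toy batch: 2 sentences, 2 views each (made-up values).
c = [[1.0, 0.0], [0.0, 1.0]]
views_aligned = [[[1.0, 0.1], [1.0, 0.2]], [[0.1, 1.0], [0.2, 1.0]]]
views_swapped = views_aligned[::-1]
print(sg_opt_loss(c, views_aligned) < sg_opt_loss(c, views_swapped))  # True
```

Because each sentence's own views all compete in the same normalizer, the averaged loss in this sketch cannot fall below the logarithm of the number of views per sentence.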

General Configurations
In terms of pre-trained encoders, we leverage BERT (Devlin et al., 2019) for English datasets and MBERT, which is a multilingual variant of BERT, for multilingual datasets. We also employ RoBERTa (Liu et al., 2019) and SBERT (Reimers and Gurevych, 2019) in some cases to evaluate the generalizability of tested methods. We use the suffixes '-base' and '-large' to distinguish small and large models. Every trainable model's performance is reported as the average of 8 separate runs to reduce randomness. Hyperparameters are optimized on the STS-B validation set using BERT-base and utilized across different models. See

Semantic Textual Similarity Tasks
We first evaluate our method and baselines on Semantic Textual Similarity (STS) tasks. Given two sentences, we derive their similarity score by computing the cosine similarity of their embeddings.
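The scoring protocol can be sketched as follows: each sentence pair receives the cosine similarity of its two embeddings, and the predicted similarities are compared with gold scores by Spearman correlation (× 100), the metric used throughout the paper. This is a stdlib-only illustration; the helper names and the toy numbers are assumptions, and a real evaluation would typically rely on an established statistics library.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (undefined if either input is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy example (made-up numbers): predicted similarities vs. gold STS scores.
predicted = [0.91, 0.12, 0.55]
gold = [5.0, 0.0, 3.2]
print(round(100 * spearman(predicted, gold), 2))  # 100.0
```

Since Spearman correlation depends only on rank order, any monotonic rescaling of the cosine scores leaves the reported number unchanged.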
Baselines and Model Specification. We first prepare two non-BERT approaches as baselines, i.e., GloVe (Pennington et al., 2014) mean embeddings and Universal Sentence Encoder (USE; Cer et al. (2018)). In addition, various methods for BERT sentence embeddings that do not require supervision are also introduced as baselines:
• CLS token embedding: It regards the [CLS] vector from the last layer of BERT as a sentence representation.
• Mean pooling: This method conducts mean pooling on the last layer of BERT and uses the output as a sentence embedding.
• WK pooling: This follows the method of Wang and Kuo (2020), which exploits QR decomposition and extra techniques to derive meaningful sentence vectors from BERT.
• Flow: This is BERT-flow proposed by Li et al. (2020), which is a flow-based model that maps the vectors made by taking mean pooling on the last two layers of BERT to a Gaussian space.
• Contrastive (BT): Following Fang and Xie (2020), we revise BERT with contrastive learning. However, this method relies on back-translation to obtain positive samples, unlike ours. Details about this baseline are specified in Appendix A.2.
We make use of plain sentences from STS-B to fine-tune BERT using our approach, identical to Flow. We name the BERT instances trained with our self-guided method Contrastive (SG) and Contrastive (SG-OPT), which utilize $L^{base}$ and $L^{opt}$ in Section 3 respectively.
Results. We report the performance of different approaches on STS tasks in Table 1 and Table 11 (Appendix A.6). From the results, we confirm that our methods (SG and SG-OPT) mostly outperform other baselines in a variety of experimental settings. As reported in earlier studies, the naïve [CLS] embedding and mean pooling turn out to be inferior to sophisticated methods.
To our surprise, WK pooling's performance is even lower than that of mean pooling in most cases, and the only exception is when WK pooling is applied to SBERT-base. Flow shows its strength by outperforming the simple strategies. Nevertheless, its performance is shown to be worse than that of our methods (although some exceptions exist in the case of SBERT-large). Note that contrastive learning becomes much more competitive when it is combined with our self-guidance algorithm rather than back-translation. It is also worth mentioning that the optimized version of our method (SG-OPT) generally shows better performance than the basic one (SG), supporting the efficacy of learning objective optimization (Section 3.2). To conclude, we demonstrate that our self-guided contrastive learning is effective in improving the quality of BERT sentence embeddings when tested on STS tasks.

Multilingual STS Tasks
We expand our experiments to multilingual settings by utilizing MBERT and cross-lingual zero-shot transfer. Specifically, we refine MBERT using only

From Table 2, we see that MBERT with mean pooling already outperforms the best system (at the time the competition was held) on SemEval-2014 and that our method further boosts the model's performance. In contrast, in the case of SemEval-2017 (Table 3), MBERT with mean pooling even fails to beat the strong Cosine baseline. However, MBERT becomes capable of outperforming (in English/Spanish) or being comparable with (Arabic) the baseline by adopting our algorithm. We observe that while cross-lingual transfer using MBERT looks promising for the languages analogous to English (e.g., Spanish), its effectiveness may shrink on distant languages (e.g., Arabic). Compared against the best system, which is trained on task-specific data, MBERT shows reasonable performance considering that it is never exposed to any labeled STS datasets. In summary, we demonstrate that MBERT fine-tuned with our method has the potential to be used as a simple but effective tool for multilingual (especially European) STS tasks.

SentEval and Supervised Fine-tuning
We also evaluate BERT sentence vectors using the SentEval (Conneau and Kiela, 2018) toolkit. Given sentence embeddings, SentEval trains linear classifiers on top of them and estimates the quality of the vectors via their performance (accuracy) on downstream tasks. Among available tasks, we employ 7: MR, CR, SUBJ, MPQA, SST2, TREC, MRPC.

In Table 4, we compare our method (SG-OPT) with two baselines. We find that our method is helpful over usual mean pooling in improving the performance of BERT-like models on SentEval. SG-OPT also outperforms WK pooling on BERT-base/large while being comparable on SBERT-base. From the results, we conjecture that self-guided contrastive learning and SBERT training suggest a similar inductive bias in a sense, as the benefit we earn by revising SBERT with our method is relatively lower than the gain we obtain when fine-tuning BERT. Meanwhile, it seems that WK pooling provides an orthogonal contribution that is effective in the focused case, i.e., SBERT-base.

In addition, we examine how our algorithm affects supervised fine-tuning of BERT, although it is not the main concern of this work. In brief, we find that the original BERT(-base) and one tuned with SG-OPT show comparable performance on the GLUE (Wang et al., 2019) validation set, implying that our method has little influence on BERT's supervised fine-tuning. We refer readers to Appendix A.4 for more details.

Analysis
We here further investigate the working mechanism of our method with supplementary experiments. All the experiments conducted in this section follow the configurations stipulated in Sections 4.1 and 4.2.

Ablation Study
We conduct an ablation study to justify the decisions made in optimizing our algorithm. To this end, we evaluate each possible variant on the test sets of STS tasks. From Table 5, we confirm that all our modifications to the NT-Xent loss contribute to improvements in performance. Moreover, we show that correct choices for hyperparameters are important for achieving the optimal performance, and that the projection head ($f$) plays a significant role as in Chen et al. (2020).

Robustness to Domain Shifts
Although our method can in principle accept any sentences in training, its performance might vary with the training data it employs (especially depending on whether the training and test data share the same domain). To explore this issue, we apply SG-OPT on BERT-base by leveraging the mix of NLI datasets (Bowman et al., 2015; Williams et al., 2018) instead of STS-B, and observe the difference. From Figure 4, we confirm that no matter which test set is utilized (STS-B or all the seven STS tasks), our method clearly outperforms Flow in every case, showing its relative robustness to domain shifts. SG-OPT only loses 1.83 (on the STS-B test set) and 1.63 (on average when applied to all the STS tasks) points respectively when trained with NLI rather than STS-B, while Flow suffers considerable losses of 12.16 and 4.19 in each case. Note, however, that follow-up experiments in more diverse conditions might be desired as future work, as the NLI dataset inherently shares some similarities with STS tasks.

Computational Efficiency
In this part, we compare the computational efficiency of our method to that of other baselines. For each algorithm, we measure the time elapsed during training (if required) and inference when tested on STS-B. All methods are run on the same machine (an Intel Xeon CPU E5-2620 v4 @ 2.10GHz and a Titan Xp GPU) using batch size 16. The experimental results specified in Table 6 show that although our method demands a moderate amount of time (< 8 min.) for training, it is the most efficient at inference, since our method is free from any post-processing such as pooling once training is completed.
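A measurement of this kind can be approximated with a small wall-clock harness like the one below. It is a generic sketch rather than the authors' benchmarking script: `time_inference` and `toy_embed` are hypothetical names, and the stand-in encoder simply maps each sentence to a one-dimensional vector. With a real GPU model, the device would normally need to be synchronized before reading the clock, which this toy does not cover.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple

Vector = List[float]


def batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def time_inference(
    embed_fn: Callable[[Sequence[str]], List[Vector]],
    sentences: Sequence[str],
    batch_size: int = 16,
) -> Tuple[float, List[Vector]]:
    """Return (elapsed seconds, embeddings) for encoding all sentences in batches."""
    outputs: List[Vector] = []
    start = time.perf_counter()
    for batch in batches(sentences, batch_size):
        outputs.extend(embed_fn(batch))
    elapsed = time.perf_counter() - start
    return elapsed, outputs


def toy_embed(batch: Sequence[str]) -> List[Vector]:
    """Stand-in encoder (assumption): one number per sentence, its length."""
    return [[float(len(sentence))] for sentence in batch]


sentences = [f"sentence number {i}" for i in range(35)]
elapsed, vectors = time_inference(toy_embed, sentences, batch_size=16)
print(len(vectors), elapsed >= 0.0)  # 35 True
```

Any pooling or other post-processing that a method requires would have to run inside `embed_fn` to be counted in the inference time.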

Representation Visualization
We visualize a few variants of BERT sentence representations to gain intuition about why our method is effective in improving performance. Specifically, we sample 20 positive pairs (red, whose similarity scores are 5) and 20 negative pairs (blue, whose scores are 0) from the STS-B validation set. Then we compute their vectors and draw them in 2D space with the aid of t-SNE. In Figure 5, we confirm that our SG-OPT encourages BERT sentence embeddings to be better aligned with their positive pairs while still being relatively far from their negative pairs. We also visualize embeddings from SBERT (Figure 6 in Appendix A.5), and identify that our approach and the supervised fine-tuning used in SBERT provide a similar effect, making the resulting embeddings more suitable for calculating correct similarities between them.

Discussion
In this section, we discuss a few weaknesses of our method in its current form and look into some possible avenues for future work.
First, while defining the proposed method in Section 3, we have made decisions on some parts without much consideration about their optimality, prioritizing simplicity instead. For instance, although we proposed utilizing all the intermediate layers of BERT and max pooling in a normal setting (indeed, it worked well in most cases), a specific subset of the layers or another pooling method might bring better performance in a particular environment, as we observed in Section 4.4 that we could achieve higher numbers by employing mean pooling and excluding lower layers in the case of SentEval (refer to Appendix A.3 for details). Therefore, future work is encouraged to develop a systematic way of making more optimized design choices in specifying our method by considering the characteristics of target tasks.
Second, we expect that the effectiveness of contrastive learning in revising BERT can be improved further by properly combining different techniques developed for it.As an initial attempt towards this direction, we conduct an extra experiment where we test the ensemble of back-translation and our self-guidance algorithm by inserting the original sentence into BERT T and its back-translation into BERT F when running our framework.In Table 7, we show that the fusion of the two techniques generally results in better performance, shedding some light on our future research direction.

Conclusion
In this paper, we have proposed a contrastive learning method with self-guidance for improving BERT sentence embeddings. Through extensive experiments, we have demonstrated that our method can enjoy the benefit of contrastive learning without relying on external procedures such as data augmentation or back-translation, succeeding in generating higher-quality sentence representations compared to competitive baselines. Furthermore, our method is efficient at inference because it does not require any post-processing once its training is completed, and is relatively robust to domain shifts.

We here investigate the impact of our method on typical supervised fine-tuning of BERT models. Concretely, we compare the original BERT with one fine-tuned using our SG-OPT method on the GLUE (Wang et al., 2019) benchmark. Note that we use the first 10% of the GLUE validation set as the real validation set and the last 90% as the test set, as the benchmark does not officially provide its test data. We report experimental results tested on 5 sub-tasks in Table 10. The results show that our method brings performance improvements for 3 tasks (QNLI, SST2, and RTE). However, it seems that SG-OPT has little influence on supervised fine-tuning results, considering that the absolute performance gap between the two models is not significant. We leave more analysis on this part as future work.

A.5 Representation Visualization (SBERT)

A.6 RoBERTa's Performance on STS Tasks
In Table 11, we additionally report the performance of sentence embeddings extracted from RoBERTa using different methods. Our methods, SG and SG-OPT, demonstrate their competitive performance

Figure 2: Self-guided contrastive learning framework. We clone BERT into two copies at the beginning of training. BERT_T (except Layer 0) is then fine-tuned to optimize the sentence vector $c_i$ while BERT_F is fixed.

Figure 3: Four factors of the original NT-Xent loss. Green and yellow arrows represent the force of attraction and repulsion, respectively. Best viewed in color.

Figure 4: Domain robustness study. The yellow bars indicate the performance gaps each method has according to which data it is trained with: in-domain (STS-B) or out-of-domain (NLI). Our method (SG-OPT) clearly shows its relative robustness compared to Flow.

Figure 5: Sentence representation visualization. (Top) Embeddings from the original BERT. (Bottom) Embeddings from the BERT instance fine-tuned with SG-OPT. Red numbers correspond to positive sentence pairs and blue to negative pairs.

Table 6: Computational efficiency tested on STS-B.

Table 10: Experimental results on a portion of the GLUE validation set.