ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

Learning high-quality sentence representations benefits a wide range of natural language processing tasks. Though BERT-based pre-trained language models achieve high performance on many downstream tasks, the native derived sentence representations are proved to be collapsed and thus produce a poor performance on the semantic textual similarity (STS) tasks. In this paper, we present ConSERT, a Contrastive Framework for Self-Supervised SEntence Representation Transfer, that adopts contrastive learning to fine-tune BERT in an unsupervised and effective way. By making use of unlabeled texts, ConSERT solves the collapse issue of BERT-derived sentence representations and make them more applicable for downstream tasks. Experiments on STS datasets demonstrate that ConSERT achieves an 8% relative improvement over the previous state-of-the-art, even comparable to the supervised SBERT-NLI. And when further incorporating NLI supervision, we achieve new state-of-the-art performance on STS tasks. Moreover, ConSERT obtains comparable results with only 1000 samples available, showing its robustness in data scarcity scenarios.


Introduction
Sentence representation learning plays a vital role in natural language processing tasks (Kiros et al., 2015;Hill et al., 2016;Conneau et al., 2017;Cer et al., 2018).Good sentence representations benefit a wide range of downstream tasks, especially for computationally expensive ones, including largescale semantic similarity comparison and information retrieval.
Recently, BERT-based pre-trained language models have achieved high performance on many downstream tasks with additional supervision.However, the native sentence representations derived from BERT1 are proved to be of low-quality (Reimers and Gurevych, 2019;Li et al., 2020).As shown in Figure 1a, when directly adopt BERTbased sentence representations to semantic textual similarity (STS) tasks, almost all pairs of sentences achieved a similarity score between 0.6 to 1.0 , even if some pairs are regarded as completely unrelated by the human annotators.In other words, the BERT-derived native sentence representations are somehow collapsed (Chen and He, 2020), which means almost all sentences are mapped into a small area and therefore produce high similarity.
Such phenomenon is also observed in several previous works (Gao et al., 2019;Wang et al., 2019;Li et al., 2020).They find the word representation space of BERT is anisotropic, the high-frequency words are clustered and close to the origin, while low-frequency words disperse sparsely.When averaging token embeddings, those high-frequency words dominate the sentence representations, inducing biases against their real semantics 2 .As a result, it is inappropriate to directly apply BERT's native sentence representations for semantic matching or text retrieval.Traditional methods usually fine-tune BERT with additional supervision.However, human annotation is costly and often unavailable in real-world scenarios.
To alleviate the collapse issue of BERT as well as reduce the requirement for labeled data, we propose a novel sentence-level training objective based on contrastive learning (He et al., 2020;Chen et al., 2020a,b).By encouraging two augmented views from the same sentence to be closer while keeping views from other sentences away, we reshape the BERT-derived sentence representation space and successfully solve the collapse issue (shown in Figure 1b).Moreover, we propose multiple data augmentation strategies for contrastive learning, including adversarial attack (Goodfellow et al., 2014;Kurakin et al., 2016), token shuffling, cutoff (Shen et al., 2020) and dropout (Hinton et al., 2012), that effectively transfer the sentence representations to downstream tasks.We name our approach ConSERT, a Contrastive Framework for SEntence Representation Transfer.
ConSERT has several advantages over previous approaches.Firstly, it introduces no extra structure or specialized implementation during inference.The parameter size of ConSERT keeps the same as BERT, making it easy to use.Secondly, compared with pre-training approaches, ConSERT is more efficient.With only 1,000 unlabeled texts drawn from the target distribution (which is easy to collect in real-world applications), we achieve 35% relative performance gain over BERT, and the training stage takes only a few minutes (1-2k steps) on a single V100 GPU.Finally, it includes several effective and convenient data augmentation methods with minimal semantic impact.Their effects are validated and analyzed in the ablation studies.
Our contributions can be summarized as follows: 1) We propose a simple but effective sentence-level training objective based on contrastive learning.It mitigates the collapse of BERT-derived representations and transfers them to downstream tasks.2) We explore various effective text augmentation strategies to generate views for contrastive learning and analyze their effects on unsupervised sentence representation transfer.3) With only fine-tuning on unsupervised target datasets, our approach achieves significant improvement on STS tasks.When further incorporating with NLI supervision, our ap-proach achieves new state-of-the-art performance.We also show the robustness of our approach in data scarcity scenarios and intuitive analysis of the transferred representations.(Bowman et al., 2015) and Multi-Genre NLI (MNLI) (Williams et al., 2018).Universal Sentence Encoder (Cer et al., 2018) adopts a Transformer-based architecture and uses the SNLI dataset to augment the unsupervised training.SBERT (Reimers and Gurevych, 2019) proposes a siamese architecture with a shared BERT encoder and is also trained on SNLI and MNLI datasets.
Self-supervised Objectives for Pre-training BERT (Devlin et al., 2019) proposes a bidirectional Transformer encoder for language model pre-training.It includes a sentence-level training objective, namely next sentence prediction (NSP), which predicts whether two sentences are adjacent or not.However, NSP is proved to be weak and has little contribution to the final performance (Liu et al., 2019).After that, various self-supervised objectives are proposed for pre-training BERT-like sentence encoders.Cross-Thought (Wang et al., 2020) and CMLM (Yang et al., 2020) are two similar objectives that recover masked tokens in one sentence conditioned on the representations of its contextual sentences.SLM (Lee et al., 2020) proposes an objective that reconstructs the correct sentence ordering given the shuffled sentences as the input.However, all these objectives need document-level corpus and are thus not applicable to downstream tasks with only short texts.
Unsupervised Approaches BERT-flow (Li et al., 2020) proposes a flow-based approach that maps BERT embeddings to a standard Gaussian latent space, where embeddings are more suitable for comparison.However, this approach introduces extra model structures and need specialized implementation, which may limit its application.

Contrastive Learning
Contrastive Learning for Visual Representation Learning Recently, contrastive learning has become a very popular technique in unsupervised visual representation learning with solid performance (Chen et al., 2020a;He et al., 2020;Chen et al., 2020b).They believe that good representation should be able to identify the same object while distinguishing itself from other objects.Based on this intuition, they apply image transformations (e.g.cropping, rotation, cutout, etc.) to randomly generate two augmented versions for each image and make them close in the representation space.Such approaches can be regarded as the invariance modeling to the input samples.Chen et al. (2020a) proposes SimCLR, a simple framework for contrastive learning.They use the normalized temperature-scaled cross-entropy loss (NT-Xent) as the training loss, which is also called InfoNCE in the previous literature (Hjelm et al., 2018).
Contrastive Learning for Textual Representation Learning Recently, contrastive learning has been widely applied in NLP tasks.Many works use it for language model pre-training.IS-BERT (Zhang et al., 2020) proposes to add 1-D convolutional neural network (CNN) layers on top of BERT and train the CNNs by maximizing the mutual information (MI) between the global sentence embedding and its corresponding local contexts embeddings.CERT (Fang and Xie, 2020) adopts a similar structure as MoCo (He et al., 2020) and uses back-translation for data augmentation.However, the momentum encoder needs extra memory and back-translation may produce false positives.BERT-CT (Carlsson et al., 2021) uses two individual encoders for contrastive learning, which also needs extra memory.Besides, they only sample 7 negatives, resulting in low training efficiency.De-CLUTR (Giorgi et al., 2020) adopts the architecture of SimCLR and jointly trains the model with contrastive objective and masked language model objective.However, they only use spans for contrastive learning, which is fragmented in semantics.CLEAR (Wu et al., 2020) uses the same architecture and objectives as DeCLUTR.Both of them are used to pre-train the language model, which needs a large corpus and takes a lot of resources.

Approach
In this section, we present ConSERT for sentence representation transfer.Given a BERT-like pretrained language model M and an unsupervised dataset D drawn from the target distribution, we aim at fine-tuning M on D to make the sentence representation more task-relevant and applicable to downstream tasks.We first present the general framework of our approach, then we introduce several data augmentation strategies for contrastive learning.Finally, we talk about three ways to further incorporate supervision signals.

General Framework
Our approach is mainly inspired by SimCLR (Chen et al., 2020a).As shown in Figure 2, there are three major components in our framework: • A data augmentation module that generates different views for input samples at the token embedding layer.
• A shared BERT encoder that computes sentence representations for each input text.During training, we use the average pooling of the token embeddings at the last layer to obtain sentence representations.
• A contrastive loss layer on top of the BERT encoder.It maximizes the agreement between one representation and its corresponding version that is augmented from the same sentence while keeping it distant from other sentence representations in the same batch.For each input text x, we first pass it to the data augmentation module, in which two transformations T 1 and T 2 are applied to generate two versions of token embeddings: e i = T 1 (x), e j = T 2 (x), where e i , e j ∈ R L×d , L is the sequence length and d is the hidden dimension.After that, both e i and e j will be encoded by multi-layer transformer blocks in BERT and produce the sentence representations r i and r j through average pooling.
Following Chen et al. (2020a), we adopt the normalized temperature-scaled cross-entropy loss (NT-Xent) as the contrastive objective.During each training step, we randomly sample N texts from D to construct a mini-batch, resulting in 2N representations after augmentation.Each data point is trained to find out its counterpart among 2(N − 1) in-batch negative samples: , where sim(•) indicates the cosine similarity function, τ controls the temperature and 1 is the indicator.Finally, we average all 2N in-batch classification losses to obtain the final contrastive loss L con .

Data Augmentation Strategies
We explore four different data augmentation strategies to generate views for contrastive learning, including adversarial attack (Goodfellow et al., 2014;Kurakin et al., 2016), token shuffling, cutoff (Shen et al., 2020) and dropout (Hinton et al., 2012), as illustrated in Figure 3. Adversarial Attack Adversarial training is generally used to improve the model's robustness.They generate adversarial samples by adding a worst-case perturbation to the input sample.We implement this strategy with Fast Gradient Value (FGV) (Rozsa et al., 2016), which directly uses the gradient to compute the perturbation and thus is faster than two-step alternative methods.Note that this strategy is only applicable when jointly training with supervision since it relies on supervised loss to compute adversarial perturbations.
Token Shuffling In this strategy, we aim to randomly shuffle the order of the tokens in the input sequences.Since the bag-of-words nature in the transformer architecture, the position encoding is the only factor about the sequential information.Thus, similar to Lee et al. (2020), we implement this strategy by passing the shuffled position ids to the embedding layer while keeping the order of the token ids unchanged.
Cutoff Shen et al. (2020) proposes a simple and efficient data augmentation strategy called cutoff.They randomly erase some tokens (for token cutoff), feature dimensions (for feature cutoff), or token spans (for span cutoff) in the L × d feature matrix.In our experiments, we only use token cutoff and feature cutoff and apply them to the token embeddings for view generation.
Dropout Dropout is a widely used regularization method that avoids overfitting.However, in our experiments, we also show its effectiveness as an augmentation strategy for contrastive learning.For this setting, we randomly drop elements in the token embedding layer by a specific probability and set their values to zero.Note that this strategy is different from Cutoff since each element is considered individually.

Incorporating Supervision Signals
Besides unsupervised transfer, our approach can also be incorporated with supervised learning.We take the NLI supervision as an example.It is a sentence pair classification task, where the model are trained to distinguish the relation between two sentences among contradiction, entailment and neutral.The classification objective can be expressed as following: , where r 1 and r 2 denote two sentence representations.
We propose three ways for incorporating additional supervised signals: • Joint training (joint) We jointly train the model with the supervised and unsupervised objectives L joint = L ce + αL con on NLI dataset.α is a hyper-parameter to balance two objectives.
• Supervised training then unsupervised transfer (sup-unsup) We first train the model with L ce on NLI dataset, then use L con to finetune it on the target dataset.
• Joint training then unsupervised transfer (joint-unsup) We first train the model with the L joint on NLI dataset, then use L con to fine-tune it on the target dataset.

Experiments
To verify the effectiveness of our proposed approach, we conduct experiments on Semantic Textual Similarity (STS) tasks under the unsupervised and supervised settings.
For supervised experiments, we use the combination of SNLI (570k samples) (Bowman et al., 2015) and MNLI (430k samples) (Williams et al., 2018) to train our model.In the joint training setting, the NLI texts are also used for contrastive objectives.
Baselines To show our effectiveness on unsupervised sentence representation transfer, we mainly select BERT-flow (Li et al., 2020) for comparison, since it shares the same setting as our approach.For unsupervised comparison, we use the average of GloVe embeddings, the BERT-derived native embeddings, CLEAR (Wu et al., 2020) (trained on BookCorpus and English Wikipedia corpus), IS-BERT (Zhang et al., 2020) (trained on unlabeled texts from NLI datasets), BERT-CT (Carlsson et al., 2021) (trained on English Wikipedia corpus).For comparison with supervised methods, we select In-ferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), SBERT (Reimers and Gurevych, 2019) and BERT-CT (Carlsson et al., 2021) as baselines.They are all trained with NLI supervision.
Evaluation When evaluating the trained model, we first obtain the representation of sentences by averaging the token embeddings at the last two layers4 , then we report the spearman correlation between the cosine similarity scores of sentence representations and the human-annotated gold scores.When calculating spearman correlation, we merge all sentences together (even if some STS datasets have multiple splits) and calculate spearman correlation for only once 5 .Implementation Details Our implementation is based on the Sentence-BERT 6 (Reimers and Gurevych, 2019).We use both the BERT-base and BERT-large for our experiments.The max sequence length is set to 64 and we remove the default dropout layer in BERT architecture considering the cutoff and dropout data augmentation strategies used in our framework.The ratio of token cutoff and feature cutoff is set to 0.15 and 0.2 respectively, as suggested in Shen et al. (2020).The ratio of dropout is set to 0.2.The temperature τ of NT-Xent loss is set to 0.1, and the α is set to 0.15 for the joint training setting.We adopt Adam optimizer and set the learning rate to 5e-7.We use a linear learning rate warm-up over 10% of the training steps.The batch size is set to 96 in most of our experiments.We use the dev set of STSb to tune the hyperparameters (including the augmentation strategies) and evaluate the model every 200 steps during training.The best checkpoint on the dev set of STSb is saved for test.We further discuss the influence of the batch size and the temperature in the subsequent section.

Unsupervised Results
For unsupervised evaluation, we load the pretrained BERT to initialize the BERT encoder in our framework.Then we randomly mix the unlabeled texts from 7 STS datasets and use them to fine-tune our model.
SentEval toolkit, which calculates spearman correlation for each split and reports the mean or weighted mean scores.
6 https://github.com/UKPLab/sentence-transformers The results are shown in Table 2.We can observe that both BERT-flow and ConSERT can improve the representation space and outperform the GloVe and BERT baselines with unlabeled texts from target datasets.However, ConSERT large achieves the best performance among 6 STS datasets, significantly outperforming BERT large -flow with an 8% relative performance gain on average (from 70.76 to 76.45).Moreover, it is worth noting that ConSERT large even outperforms several supervised baselines (see Figure 3) like InferSent (65.01) and Universal Sentence Encoder (71.72), and keeps comparable to the strong supervised method SBERT large -NLI (76.55).For the BERT base architecture, our approach ConSERT base also outperforms BERT base -flow with an improvement of 3.17 (from 69.57 to 72.74).

Supervised Results
For supervised evaluation, we consider the three settings described in Section 3.3.Note that in the joint setting, only NLI texts are used for contrastive learning, making it comparable to SBERT-NLI.We use the model trained under the joint setting as the initial checkpoint in the joint-unsup setting.We also re-implement the SBERT-NLI baselines and use them as the initial checkpoint in the sup-unsup setting.
The results are illustrated in Table 3.For the models trained with NLI supervision, we find that ConSERT joint consistently performs better than SBERT, revealing the effectiveness of our proposed contrastive objective as well as the data augmentation strategies.On average, ConSERT base joint Method STS12 STS13 STS14 STS15 STS16 STSb SICK-R Avg.achieves a performance gain of 2.88 over the reimplemented SBERT base -NLI, and ConSERT large joint achieves a performance gain of 2.70.When further performing representation transfer with STS unlabeled texts, our approach achieves even better performance.On average, ConSERT large joint-unsup outperforms the initial checkpoint ConSERT large joint with 1.84 performance gain, and outperforms the previous state-ofthe-art BERT large -flow with 2.92 performance gain.The results demonstrate that even for the models trained under supervision, there is still a huge potential of unsupervised representation transfer for improvement.

Using NLI supervision
5 Qualitative Analysis

Analysis of BERT Embedding Space
To prove the hypothesis that the collapse issue is mainly due to the anisotropic space that is sensitive to the token frequency, we conduct experiments that mask the embeddings of several most frequent tokens when applying average pooling to calculate the sentence representations.The relation between the number of removed top-k frequent tokens and the average spearman correlation is shown in Figure 4. We can observe that when removing a few top frequent tokens, the performance of BERT improves sharply on STS tasks.When removing 34 most frequent tokens, the best performance is achieved (61.66), and there is an improvement of 7.8 from the original performance (53.86).For ConSERT, we find that removing a few most frequent tokens only results in a small improvement of less than 0.3.The results show that our approach reshapes the BERT's original embedding space, reducing the influence of common tokens on sentence representations.

Effect of Data Augmentation Strategy
In this section, we study the effect of data augmentation strategies for contrastive learning.We consider 5 options for each transformation, including None (i.e.doing nothing), Shuffle, Token Cutoff, sentations of augmented views are exactly the same.
On the contrary, it tunes the representation space by pushing each representation away from others.We believe that the improvement is mainly due to the collapse phenomenon of BERT's native representation space.To some extent, it also explains why our method works.

Performance under Few-shot Settings
To validate the reliability and the robustness of ConSERT under the data scarcity scenarios, we conduct the few-shot experiments.We limit the number of unlabeled texts to 1, 10, 100, 1000, and 10000 respectively, and compare their performance with the full dataset.Figure 6 presents the results.For both the unsupervised and the supervised settings, our approach can make a huge improvement over the baseline with only 100 samples available.When the training samples increase to 1000, our approach can basically achieve comparable results with the models trained on the full dataset.The results reveal the robustness and effectiveness of our approach under the data scarcity scenarios, which is common in reality.With only a small amount of unlabeled texts drawn from the target data distribution, our approach can also tune the representation space and benefit the downstream tasks.

Influence of Temperature
The temperature τ in NT-Xent loss (Equation 1) is used to control the smoothness of the distribution normalized by softmax operation and thus influences the gradients when backpropagation.A large temperature smooths the distribution while a small temperature sharpens the distribution.In our experiments, we explore the influence of temperature   and present the result in Figure 7.
As shown in the figure, we find the performance is extremely sensitive to the temperature.Either too small or too large temperature will make our model perform badly.And the optimal temperature is obtained within a small range (from about 0.08 to 0.12).This phenomenon again demonstrates the collapse issue of BERT embeddings, as most sentences are close to each other, a large temperature may make this task too hard to learn.We select 0.1 as the temperature in most of our experiments.

Influence of Batch Size
In some previous works of contrastive learning, it is reported that a large batch size benefits the final performance and accelerates the convergence of the model since it provides more in-batch negative samples for contrastive learning (Chen et al., 2020a).Those in-batch negative samples improve the training efficiency.We also analyze the influence of the batch size for unsupervised sentence representation transfer.
The results are illustrated in Table 4.We show both the spearman correlation and the corresponding training steps.We find that a larger batch size does achieve better performance.However, the improvement is not so significant.Meanwhile, a larger batch size does speed up the training process, but it also needs more GPU memories at the same time.

Conclusion
In this paper, we propose ConSERT, a selfsupervised contrastive learning framework for transferring sentence representations to downstream tasks.The framework does not need extra structure and is easy to implement for any encoder.We demonstrate the effectiveness of our framework on various STS datasets, both our unsupervised and supervised methods achieve new state-of-the-art performance.Furthermore, few-shot experiments suggest that our framework is robust in the data scarcity scenarios.We also compare multiple combinations of data augmentation strategies and provide fine-grained analysis for interpreting how our approach works.We hope our work will provide a new perspective for future researches on sentence representation transfer.

Broader Impact
Sentence representation learning is a basic task in natural language processing and benefits many downstream tasks.This work proposes a contrastive learning based framework to solve the collapse issue of BERT and transfer BERT sentence representations to target data distribution.Our approach not only provides a new perspective about BERT's representation space, but is also useful in practical applications, especially for data scarcity scenarios.When applying our approach, the user should collect a few unlabeled texts from target data distribution and use our framework to finetune BERT encoder in a self-supervised manner.Since our approach is self-supervised, no bias will be introduced from human annotations.Moreover, our data augmentation strategies also have little probability to introduce extra biases since they are all based on random sampling.However, it is still possible to introduce data biases from the unlabeled texts.Therefore, users should pay special attention to ensure that the training data is ethical, unbiased, and closely related to downstream tasks.

Figure 1 :
Figure 1: The correlation diagram between the gold similarity score (x-axis) and the model predicted cosine similarity score (y-axis) on the STS benchmark dataset.

Figure 2 :
Figure 2: The general framework of our proposed approach.

Figure 3 :
Figure 3: The four data augmentation strategies used in our experiments.

Figure 4 :
Figure 4: The average spearman correlation on STS tasks w.r.t. the number of removed top-k frequent tokens.Note that we also considered the [CLS] and [SEP] tokens and they are the 2 most frequent tokens.The frequency of each token is calculated through the test split of the STS Benchmark dataset.

Figure 6 :
Figure6: The few-shot experiments under the unsupervised and supervised settings.We report the average spearman correlation on STS datasets with 1, 10, 100, 1,000, and 10,000 unlabeled texts available, respectively.The full dataset indicates all 89192 unlabeled texts from 7 STS datasets.

Figure 7 :
Figure 7: The influence of different temperatures in NT-Xent.The best performance is achieved when the temperature is set to 0.1. 3

Table 1 :
The statistics of STS datasets.

Table 2 :
The performance comparison of ConSERT with other methods in an unsupervised setting.We report the spearman correlation ρ × 100 on 7 STS datasets.Methods with † indicate that we directly report the scores from the corresponding paper, while methods with ‡ indicate our implementation.

Table 3 :
The performance comparison of ConSERT with other methods in a supervised setting.We report the spearman correlation ρ × 100 on 7 STS datasets.Methods with † indicate that we directly report the scores from the corresponding paper, while methods with ‡ indicate our implementation.

Table 4 :
The average spearman correlation as well as the training steps of our unsupervised approach with different batch sizes.