PromptBERT: Improving BERT Sentence Embeddings with Prompts

We propose PromptBERT, a novel contrastive learning method for learning better sentence representations. We first analyze the drawbacks of sentence embeddings from the original BERT and find that they are mainly due to static token embedding bias and ineffective BERT layers. We then propose the first prompt-based sentence embedding method and discuss two prompt representation methods and three prompt search methods to help BERT achieve better sentence embeddings. Moreover, we propose a novel unsupervised training objective based on template denoising, which substantially shortens the performance gap between the supervised and unsupervised settings. Extensive experiments show the effectiveness of our method. Compared to SimCSE, PromptBERT achieves 2.29 and 2.58 points of improvement based on BERT and RoBERTa respectively in the unsupervised setting.


Introduction
In recent years, we have witnessed the success of pre-trained language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in sentence embeddings (Gao et al., 2021b; Yan et al., 2021). However, original BERT still shows poor performance in sentence embeddings (Reimers and Gurevych, 2019; Li et al., 2020). The most commonly cited example is that it underperforms traditional word embedding methods like GloVe (Pennington et al., 2014).
Previous research has invoked anisotropy to explain the poor performance of original BERT (Li et al., 2020; Yan et al., 2021; Gao et al., 2021b). Anisotropy makes the token embeddings occupy a narrow cone, resulting in high similarity between any sentence pair (Li et al., 2020). Li et al. (2020) proposed a normalizing flows method to transform the sentence embedding distribution into a smooth and isotropic Gaussian distribution, and Yan et al. (2021) presented a contrastive framework to transfer sentence representations. The goal of these methods is to eliminate anisotropy in sentence embeddings. However, we find that anisotropy may not be the primary cause of poor semantic similarity. For example, averaging the last layer of original BERT is even worse than averaging its static token embeddings on semantic textual similarity tasks, yet the sentence embeddings from the last layer are less anisotropic than the static token embeddings.
Following this result, we find that the original BERT layers actually damage the quality of sentence embeddings. However, if we treat the static token embeddings as word embeddings, they still yield unsatisfactory results compared to GloVe. Inspired by Li et al. (2020), who found that token frequency biases the embedding distribution, we find the distribution of token embeddings is biased not only by frequency but also by case and by subwords in WordPiece (Wu et al., 2016). We design a simple experiment to test our conjecture by simply removing these biased tokens (e.g., high-frequency subwords and punctuation) and using the average of the remaining token embeddings as the sentence representation. This can outperform GloVe and even achieve results comparable to the post-processing methods BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021).
These findings suggest that avoiding embedding bias can improve the performance of sentence representations. However, manually removing embedding biases is labor-intensive and may result in the omission of meaningful words if the sentence is too short. Inspired by Brown et al. (2020), which reformulated different NLP tasks as fill-in-the-blanks problems via different prompts, we propose a prompt-based method that uses a template to obtain sentence representations from BERT. The prompt-based method can avoid embedding bias and utilize the original BERT layers. We find original BERT can achieve reasonable performance in sentence embeddings with the help of the template, even outperforming some BERT-based methods that fine-tune BERT on downstream tasks.
Our approach is equally applicable to fine-tuned settings. Current methods utilize contrastive learning to help BERT learn better sentence embeddings (Gao et al., 2021b; Yan et al., 2021). However, the unsupervised methods still struggle to construct proper positive pairs. Yan et al. (2021) discuss four data augmentation methods, but their performance seems worse than directly using the dropout in BERT as noise (Gao et al., 2021b). We find that prompts can provide a better way to generate positive pairs: different templates yield different viewpoints of the same sentence. To this end, we propose a prompt-based contrastive learning method with template denoising to leverage the power of BERT in an unsupervised setting, which significantly shortens the gap between supervised and unsupervised performance. Our method achieves state-of-the-art results in both unsupervised and supervised settings.

Related Work
Learning sentence embeddings, as a fundamental NLP problem, has been studied extensively. Currently, leveraging the power of BERT for sentence embeddings has become a new trend. Many works (Li et al., 2020; Gao et al., 2021b) achieved strong performance with BERT in both supervised and unsupervised settings. Among these works, contrastive learning based methods achieve state-of-the-art results. Gao et al. (2021b) proposed a novel contrastive training objective that directly uses inner dropout as noise to construct positive pairs. Yan et al. (2021) discussed four methods to construct positive pairs. Although BERT has achieved success in sentence embeddings, original BERT shows unsatisfactory performance (Reimers and Gurevych, 2019; Li et al., 2020). One explanation is the anisotropy in original BERT, which causes sentence pairs to have high similarity; some works (Li et al., 2020; Su et al., 2021) therefore focused on reducing anisotropy by post-processing the sentence embeddings.

Rethinking the Sentence Embeddings of the Original BERT
Previous works (Yan et al., 2021; Gao et al., 2021b) explained the poor performance of the original BERT as mainly due to the learned anisotropic token embedding space, where the token embeddings occupy a narrow cone. However, by examining the relationship between anisotropy and performance, we find that anisotropy is not a key factor in inducing poor semantic similarity. We think the main reasons are ineffective BERT layers and static token embedding biases.
Observation 1: Original BERT layers fail to improve the performance. In this section, we analyze the influence of BERT layers by comparing two sentence embedding methods: averaging static token embeddings (the input of the BERT layers) and averaging the last layer (the output of the BERT layers). We report the sentence embedding performance and its sentence-level anisotropy.
To measure anisotropy, we follow Ethayarajh (2019) and measure the sentence-level anisotropy of sentence embeddings. Let s_i be a sentence that appears in the corpus {s_1, ..., s_n}. The anisotropy can be measured as follows:

$$\text{anisotropy} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \cos\big(M(s_i), M(s_j)\big) \quad (1)$$

where M denotes the sentence embedding method, which maps a raw sentence to its embedding, and cos is the cosine similarity. In other words, the anisotropy of M is measured by the average cosine similarity of a set of sentences. If sentence embeddings are isotropic (i.e., directionally uniform), then the average cosine similarity between uniformly randomly sampled sentences would be 0 (Arora et al., 2017). The closer it is to 1, the more anisotropic the sentence embeddings. We randomly sample 100,000 sentences from the Wikipedia corpus to compute the anisotropy. We compare different pre-trained models (bert-base-uncased, bert-base-cased and roberta-base) in combination with different sentence embedding methods: last layer average (averaging the tokens of the last hidden layer) and static token embedding average (directly averaging the static token embeddings). We list the spearman correlation and sentence-level anisotropy of each combination in Table 1. The BERT layers in bert-base-uncased and roberta-base significantly harm the sentence embedding performance. Even in bert-base-cased, the gain from the BERT layers is trivial, with only a 0.28 improvement. We also show the sentence-level anisotropy of each method. The performance degradation caused by the BERT layers seems unrelated to sentence-level anisotropy. For example, the last layer average is more isotropic than the static token embedding average in bert-base-uncased, yet the static token embedding average achieves better sentence embedding performance.
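The sentence-level anisotropy of Eq. 1 is just the mean pairwise cosine similarity over all ordered sentence pairs. A minimal sketch, assuming the embeddings M(s_i) are already computed as rows of a NumPy array (the function name is ours, not the paper's):

```python
import numpy as np

def anisotropy(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all ordered pairs i != j (Eq. 1).

    embeddings: (n, d) array, one row per sentence embedding M(s_i).
    Returns a value near 0 for isotropic embeddings, near 1 for anisotropic ones.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # (n, n) cosine similarity matrix
    n = len(embeddings)
    # Exclude the diagonal (cos(x, x) = 1 contributes n) from the average.
    return (sims.sum() - n) / (n * (n - 1))
```

Two orthogonal embeddings give an anisotropy of 0, while collinear embeddings give 1, matching the interpretation in the text.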
Observation 2: Embedding biases harm the sentence embedding performance. Li et al. (2020) found that token embeddings are biased by token frequency. Similar problems have been studied by Yan et al. (2021): the anisotropy in BERT static token embeddings is sensitive to token frequency. Therefore, we investigate whether embedding biases yield the unsatisfactory performance of sentence embeddings. We observe that the token embeddings are biased not only by token frequency, but also by subwords in WordPiece (Wu et al., 2016) and by case.
As shown in Figure 1, we visualize these biases in the token embeddings of bert-base-uncased, bert-base-cased and roberta-base. The token embeddings of all three pre-trained models are highly biased by token frequency, subword and case. The token embeddings can be roughly divided into three regions according to the subword and case biases: 1) the lowercase begin-of-word tokens, 2) the uppercase begin-of-word tokens and 3) the subword tokens. For the uncased pre-trained model bert-base-uncased, the token embeddings can be roughly divided into two regions: 1) the begin-of-word tokens and 2) the subword tokens.
For frequency bias, we can observe that high-frequency tokens are clustered, while low-frequency tokens are dispersed sparsely in all three models (Yan et al., 2021). The begin-of-word tokens are more vulnerable to frequency than subword tokens in BERT. However, the subword tokens are more vulnerable in RoBERTa.
Previous works (Yan et al., 2021; Li et al., 2020) often link token embedding bias to token embedding anisotropy and argue that anisotropy is the main reason for the bias. However, we believe the anisotropy is unrelated to the bias. Bias means the embedding distribution is disturbed by irrelevant information like token frequency, which can be directly visualized via PCA. Anisotropy means the whole embedding space occupies a narrow cone in the high-dimensional vector space, which cannot be directly visualized. Table 2 shows the static token embedding anisotropy of the three pre-trained models in Figure 1, computed as the average cosine similarity between any two token embeddings. Contrary to the previous conclusion (Yan et al., 2021; Li et al., 2020), we find only bert-base-uncased's static token embeddings are highly anisotropic. The static token embeddings of roberta-base are isotropic, with 0.0235 average cosine similarity. However, all of these models suffer from biases in their static token embeddings, which is irrelevant to the anisotropy.
To prove the negative impact of biases, we show the influence of biases on sentence embeddings obtained by averaging static token embeddings (without BERT layers). The results of eliminating embedding biases are quite impressive on all three pre-trained models in Table 3. Simply by removing a set of tokens, the results can be improved by 9.22, 7.08 and 11.76 respectively. The final result of roberta-base can outperform post-processing methods such as BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021) while using only static token embeddings.

Table 3: The influence of static embedding biases in spearman correlation. The spearman correlation is the average of STS12-16, STS-B and SICK. Cased, uncased and roberta represent bert-base-cased, bert-base-uncased and roberta-base. For Freq., Sub., Case. and Pun., we remove the top-frequency tokens, subword tokens, uppercase tokens and punctuation respectively. More details can be found in Appendix A.
Manually removing embedding biases is a simple way to improve sentence embedding performance. However, it is not an adequate solution if the sentence is too short, since it may result in the omission of meaningful words.
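The token-removal baseline described above can be sketched as follows. This is a toy illustration, assuming WordPiece-style "##" subword markers; the helper name and the fallback for very short sentences are our own choices, not the paper's:

```python
import string
import numpy as np

def debiased_static_embedding(tokens, static_emb, top_freq_tokens):
    """Average static token embeddings after dropping biased tokens.

    tokens:          WordPiece tokens of one sentence, e.g. ['the', 'guitar', '##ist', '.']
    static_emb:      dict mapping token -> embedding vector (the model's input embeddings)
    top_freq_tokens: set of the most frequent tokens to drop (e.g. the top 36,
                     following the setting of Yan et al., 2021)
    """
    kept = [t for t in tokens
            if t not in top_freq_tokens                           # frequency bias
            and not t.startswith("##")                            # subword bias
            and not all(c in string.punctuation for c in t)]      # punctuation
    if not kept:
        # Fall back to all tokens so very short sentences still get a vector.
        kept = tokens
    return np.mean([static_emb[t] for t in kept], axis=0)
```

The fallback branch is exactly the weakness the text points out: when every token of a short sentence is "biased", there is nothing meaningful left to average.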

Prompt Based Sentence Embeddings
Inspired by Brown et al. (2020), we propose a prompt-based method to obtain sentence embeddings. By reformulating the sentence embedding task as a masked language modeling task, we can effectively use the original BERT layers and leverage their large-scale pre-trained knowledge. We also avoid the embedding biases by representing sentences with the [MASK] token.
However, unlike text classification and question-answering tasks, the output in sentence embeddings is not a label token predicted by the MLM classification head, but a vector representing the sentence. We discuss the implementation of prompt-based sentence embeddings by addressing two questions: 1) how to represent sentences with the prompt, and 2) how to find a proper prompt for sentence embeddings. Based on these, we propose a prompt-based contrastive learning method to fine-tune BERT on sentence embeddings.

Represent Sentence with the Prompt
In this section, we discuss two methods to represent a sentence with a prompt. For example, we have a template "[X] means [MASK]", where [X] is a placeholder for the sentence and [MASK] represents the [MASK] token. Given a sentence x_in, we map x_in to x_prompt with the template. Then we feed x_prompt to a pre-trained model to generate the sentence representation h.
One method is to use the hidden vector of the [MASK] token as the sentence representation:

$$h = h_{\text{[MASK]}} \quad (2)$$

For the second method, as in other prompt-based tasks, we get the top-k tokens according to h_[MASK] and the MLM classification head, then calculate the weighted average of these tokens' static embeddings according to the probability distribution:

$$h = \sum_{v \in \mathcal{V}_{\text{top-}k}} W_v\, P\big(v = \text{[MASK]} \mid h_{\text{[MASK]}}\big) \quad (3)$$

where v is a BERT token in the top-k token set V_top-k, W_v is the static token embedding of v, and P(v = [MASK] | h_[MASK]) denotes the probability that token v is predicted at the [MASK] position by the MLM head. The second method, which maps the sentence to tokens, is more conventional than the first. But its disadvantages are obvious: 1) as previously noted, because the sentence embedding is an average of static token embeddings, it still suffers from biases; 2) weighted averaging makes BERT hard to fine-tune on downstream tasks. For these reasons, we represent the sentence with the prompt by the first method.
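With the HuggingFace transformers library, the first method (Eq. 2) amounts to reading the hidden state at the [MASK] position. A sketch (the template string follows the paper's manual template; batching and device handling are simplified, and the function names are ours):

```python
import torch
from transformers import AutoTokenizer, AutoModel

TEMPLATE = 'This sentence : "{}" means [MASK] .'

def fill_template(sentence: str) -> str:
    """Wrap a raw sentence into the manual template (x_in -> x_prompt)."""
    return TEMPLATE.format(sentence)

def mask_embeddings(sentences, model_name="bert-base-uncased"):
    """Sentence embeddings = hidden state at the [MASK] position (Eq. 2)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tok([fill_template(s) for s in sentences],
                padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, L, H)
    mask_pos = batch["input_ids"] == tok.mask_token_id   # (B, L) boolean
    return hidden[mask_pos]                              # (B, H), one row per sentence
```

Because each templated input contains exactly one [MASK], the boolean indexing on the last line selects exactly one hidden vector per sentence.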

Prompt Search
For prompt-based tasks, one key challenge is to find templates. We discuss three methods to search for templates in this section: manual search, template generation based on T5 (Gao et al., 2021a) and OptiPrompt (Zhong et al., 2021). We use the spearman correlation on the STS-B development set as the main metric to evaluate different templates.
For manual search, we hand-craft templates that give a strong hint that the whole sentence is represented by h_[MASK]. To search for templates, we divide the template into two parts: relationship tokens, which denote the relationship between [X] and [MASK], and prefix tokens, which wrap [X]. We then greedily search for templates over the relationship tokens and prefix tokens. For template generation based on T5, Gao et al. (2021a) proposed a novel method to automatically generate templates by using T5, conditioned on the sentences and their corresponding labels. The generated templates can outperform manually searched templates on the GLUE benchmark (Wang et al., 2018).
However, the main obstacle to implementing it here is the lack of label tokens. Tsukagoshi et al. (2021) successfully transformed the sentence embedding task into a text classification task by classifying a definition sentence to its word according to a dictionary. Inspired by this, we use words and their corresponding definitions (e.g., orange: a large round juicy citrus fruit with a tough bright reddish-yellow rind) to generate 500 templates. We then evaluate these templates on the STS-B development set; the best spearman correlation is 64.75, with the template "Also called [MASK]. [X]". Perhaps due to the gap between sentence embeddings and word definitions, this method cannot generate better templates than manual search.
OptiPrompt (Zhong et al., 2021) replaces the discrete template with a continuous template. To optimize the continuous template, we use unsupervised contrastive learning as the training objective, following the settings in Gao et al. (2021b), while freezing all BERT parameters; the continuous template is initialized with the manual template's static token embeddings. Compared to the input manual template, the continuous template increases the spearman correlation from 73.44 to 80.90 on the STS-B development set.

Prompt Based Contrastive Learning with Template Denoising
Recently, contrastive learning successfully leverages the power of BERT in sentence embeddings.
A challenge in contrastive learning for sentence embeddings is how to construct proper positive instances. Gao et al. (2021b) directly used the dropout inside BERT to produce positive instances. Yan et al. (2021) discussed four data augmentation strategies, such as adversarial attack, token shuffling, cutoff and dropout on the input token embeddings, to construct positive instances. Motivated by prompt-based sentence embeddings, we propose a novel method to generate positive instances based on prompts. The idea is to use different templates to represent the same sentence from different points of view, which helps the model produce more reasonable positive pairs. To reduce the influence of the template itself on the sentence representation, we propose a novel way to denoise the template information. Given a sentence x_i, we first calculate the corresponding sentence embedding h_i with a template. Then we calculate the template bias ĥ_i by feeding BERT only the template, with the same template position ids. For example, if x_i has 5 tokens, then the position ids of the template tokens after [X] are increased by 5, so that the template tokens keep the same position ids as in the filled template. Finally, we directly use h_i − ĥ_i as the denoised sentence representation. More details on template denoising can be found in the Discussion.
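The position-id trick for computing the template bias ĥ_i can be sketched as follows. The helper names and the model interface (a HuggingFace-style model accepting `position_ids`) are our assumptions; the key point is shifting the post-[X] template positions by the sentence length:

```python
import torch

def shifted_position_ids(n_prefix: int, n_rest: int, sentence_len: int):
    """Position ids for the bare template: tokens after the [X] slot are
    shifted by sentence_len so they sit at the same positions they would
    occupy in the sentence-filled template."""
    return list(range(n_prefix)) + [n_prefix + sentence_len + i for i in range(n_rest)]

def template_bias(model, template_ids, n_prefix, sentence_len, mask_idx):
    """h_hat: the [MASK] hidden state of the empty template under shifted
    positions. The denoised sentence representation is then h - h_hat."""
    pos = torch.tensor([shifted_position_ids(
        n_prefix, len(template_ids) - n_prefix, sentence_len)])
    hidden = model(input_ids=torch.tensor([template_ids]),
                   position_ids=pos).last_hidden_state
    return hidden[0, mask_idx]
```

For a 5-token sentence with 3 template tokens before [X] and 4 after, the template's position ids become [0, 1, 2, 8, 9, 10, 11], matching the worked example in the text.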
Formally, let h′_i and h_i denote the sentence embeddings of x_i under the two different templates, and let ĥ′_i and ĥ_i denote the corresponding template biases of x_i. The final training objective is:

$$\ell_i = -\log \frac{e^{\cos(h_i - \hat{h}_i,\; h'_i - \hat{h}'_i)/\tau}}{\sum_{j=1}^{N} e^{\cos(h_i - \hat{h}_i,\; h'_j - \hat{h}'_j)/\tau}}$$

where τ is a temperature hyperparameter in contrastive learning and N is the size of the mini-batch.
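This objective is a standard InfoNCE loss over the two denoised views. A sketch in PyTorch (tensor shapes and the τ default are our assumptions, not values prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(h: torch.Tensor, h_prime: torch.Tensor,
                            tau: float = 0.05) -> torch.Tensor:
    """InfoNCE loss over denoised embeddings from two templates.

    h, h_prime: (N, d) tensors; h[i] and h_prime[i] are the *denoised*
    embeddings (h_i - h_hat_i) of sentence i under the two templates.
    """
    h = F.normalize(h, dim=-1)
    h_prime = F.normalize(h_prime, dim=-1)
    sims = h @ h_prime.T / tau             # (N, N) cosine similarities / tau
    labels = torch.arange(h.size(0))       # positives lie on the diagonal
    # Cross-entropy over rows implements -log(exp(pos) / sum(exp(all))).
    return F.cross_entropy(sims, labels)
```

Each row's softmax treats the same-index pair as the positive and all other in-batch pairs as negatives, exactly as in the formula above.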

Experiments
We conduct experiments on STS tasks in both non fine-tuned and fine-tuned BERT settings. In the non fine-tuned setting, we explore the performance of original BERT in sentence embeddings, which connects to the previous findings on the poor performance of original BERT. In the fine-tuned setting, we report unsupervised and supervised results obtained by fine-tuning BERT on downstream tasks. The results on transfer tasks are in Appendix C.

Baselines
We compare our method with both enlightening and state-of-the-art methods. To validate the effectiveness of our method in the non fine-tuned setting, we use GloVe (Pennington et al., 2014) and the post-processing methods BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021) as baselines. For the fine-tuned setting, we compare our method with IS-BERT (Zhang et al., 2020), InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), SBERT (Reimers and Gurevych, 2019) and the contrastive learning based methods SimCSE (Gao et al., 2021b) and ConSERT (Yan et al., 2021).

Implementation Details
For the non fine-tuned setting, we report the results of BERT to validate the effectiveness of our representation method. For the fine-tuned setting, we use BERT and RoBERTa with the same unsupervised and supervised training data as Gao et al. (2021b). Our models are trained with prompt-based contrastive learning with template denoising. The templates used in both settings are manually searched according to Table 4. More details can be found in Appendix B.

Non Fine-Tuned BERT Results
To connect with the previous analysis of the poor performance of original BERT, we report our prompt-based methods with non fine-tuned BERT in Table 5. Using templates substantially improves the results of original BERT on all datasets. Compared to pooling methods like averaging the last layer or averaging the first and last layers, our methods improve the spearman correlation by more than 10%. Compared to the post-processing methods BERT-flow and BERT-whitening, even using only the manual template surpasses these methods. Moreover, we can use the continuous template found by OptiPrompt to help original BERT achieve much better results, which even outperform unsupervised ConSERT in Table 6.

Fine-Tuned BERT Results
The results of fine-tuned BERT are shown in Table 6. Following previous works (Reimers and Gurevych, 2019), we run unsupervised and supervised methods respectively.

Table 5: The performance comparison of our non fine-tuned BERT method on STS tasks. †: results from (Gao et al., 2021b). The BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021)

Table 6: The performance comparison of our fine-tuned BERT methods on STS tasks. For unsupervised models, we found the result of unsupervised contrastive learning is unstable, and we train our model with 10 random seeds. †: results from (Gao et al., 2021b). ‡: results from (Yan et al., 2021). §: results from (Reimers and Gurevych, 2019). ¶: results from (Zhang et al., 2020).

Although the current
contrastive learning based methods (Gao et al., 2021b; Yan et al., 2021) achieved significant improvements over previous methods, our method still outperforms them. The prompt-based contrastive learning objective significantly shortens the gap between the unsupervised and supervised methods. It also shows that our method can leverage the knowledge in unlabeled data by using different templates as positive pairs. Moreover, we report the unsupervised performance with 10 random seeds to obtain more accurate results. In the Discussion, we also report the result of SimCSE with 10 random seeds; compared to SimCSE, our method shows more stable results.

Effectiveness of Prompt Based Contrastive Learning with Template Denoising
We report the results of different unsupervised training objectives for prompt-based BERT. We use the following training objectives: 1) the same template with inner dropout noise as data augmentation (Gao et al., 2021b); 2) different templates as positive pairs; 3) different templates with template denoising (our default method).

Template Denoising
We find that template denoising efficiently removes the bias from templates and improves the quality of the top-k tokens predicted by the MLM head in original BERT. As Table 7 shows, we predict some sentences' top-5 tokens at the [MASK] position. Template denoising removes unrelated tokens like "nothing, no, yes" and helps the model predict more related tokens. To quantify this, we also represent sentences via Eq. 3, using the weighted average of the top-200 tokens as the sentence embedding. The results are shown in Table 9. Template denoising significantly improves the quality of the tokens predicted by the MLM head. However, it cannot improve the performance of our default representation method in Eq. 2 ([MASK] token in Table 9).

Stability in Unsupervised Contrastive Learning
To demonstrate the instability of unsupervised contrastive learning in sentence embeddings, we also reproduce the results of unsupervised SimCSE-BERT_base with 10 random seeds in Table 10. Our results are more stable than SimCSE's: the difference between the best and worst results can be up to 3.14 points in SimCSE, while the gap in our method is only 0.53.

Conclusion
In this paper, we analyzed the poor performance of original BERT for sentence embeddings and found that original BERT is underestimated in sentence embeddings due to inappropriate sentence representation methods, which suffer from static token embedding biases and do not effectively use the original BERT layers. To better leverage BERT for sentence embeddings, we proposed a prompt-based sentence embedding method, which helps original BERT achieve impressive performance. To further improve our method in fine-tuning, we proposed a contrastive learning method based on template denoising. Our extensive experiments demonstrate the effectiveness of our method on STS tasks and transfer tasks.

Limitation
While our method achieves reasonable performance in both unsupervised and supervised settings, the templates used are still manually generated. Although we have tried automatic templates generated by T5, these templates still underperform manual templates. Furthermore, we also show the performance with continuous templates, which verifies the effectiveness of prompts in sentence embeddings. We expect that a carefully designed automatic template generation mechanism can lead to further improvement, and we leave this to future work.

A Static Token Embeddings Biases
A.1 Eliminating Biases by Removing Tokens

We report the detailed implementation of eliminating static token embedding biases by deleting tokens for bert-base-uncased, bert-base-cased and roberta-base. For Freq. tokens, we follow the settings in (Yan et al., 2021) and remove the 36 most frequent tokens; the removed tokens are shown in Table 11. For Sub. tokens, we directly remove all subword tokens (yellow tokens in Figure 2). For Case. tokens, only SICK (Marelli et al., 2014) has sentences with upper and lower case, and we lowercase these sentences to remove the uppercase tokens (red tokens in Figure 2). For Pun., we remove tokens that contain only punctuation.

A.2 Eliminating Biases by Pre-training
Following (Gao et al., 2019), we find that most of the biases in static token embeddings come from the gradients of the MLM classification head weight, which transforms the last hidden vector of [MASK] into a probability over all tokens. Tying the weights of the static token embeddings and the MLM classification head causes the static token embeddings to suffer from these bias problems.
We pre-trained two BERT-like models with the MLM pre-training objective. The only difference between the two models is whether the weights of the static token embeddings and the MLM classification head are tied or untied. We pre-trained both models for 125k steps with a batch size of 2k.
As shown in Figure 2, we visualize the static token embeddings of the untying model, the MLM head weight of the untying model, and the static token embeddings (equal to the MLM head weight) of the tying model. The distributions of the tying model and of the untying model's head weight are the same as bert-base-cased in Figure 1, which severely suffers from embedding biases. However, the distribution of the token embeddings in the untying model is much less influenced by these biases. We also report the average spearman correlation of the three embeddings on STS tasks in Table 12. The static token embeddings of the untying model achieve the best correlation among the three embeddings.

B Training Details

For the non fine-tuned setting, the manual template we used is This sentence : "[X]" means [MASK] . For OptiPrompt, we first initialize the template embeddings with the manual template and then train these template embeddings by freezing BERT with the unsupervised training objective following (Gao et al., 2021b); the batch size, learning rate, epochs and validation steps are 256, 3e-5, 5 and 1000 respectively.
For the fine-tuned setting, all training data is the same as (Gao et al., 2021b). The max sentence sequence length is set to 32. For templates, we only use the manual templates, which were manually searched according to the STS-B development set with non fine-tuned models; they are shown in Table 13. For the unsupervised method, we use two different templates for training with template denoising, according to our prompt-based training objective. At prediction time, we directly use one template without template denoising. For the supervised method, we use template denoising with the same template for contrastive learning, because we already have supervised negative samples. Other training details are reported in Table 14.

C Transfer Tasks
We also evaluate our models on the following transfer tasks: MR, CR, SUBJ, MPQA, SST-2, TREC and MRPC. We follow the default configurations in SentEval. The results are shown in Table 15. Compared to SimCSE, our RoBERTa-based method improves by 2.52 and 0.92 points on the unsupervised and supervised models respectively.

Figure 1: 2D visualization of token embeddings with different biases. For frequency bias, the darker the color, the higher the token frequency. For subword and case bias, yellow represents subword tokens and red represents tokens containing capital letters.
(a) Frequency bias in static token embeddings of the untying-weights pre-trained model. (b) Frequency bias in the MLM head of the untying-weights pre-trained model. (c) Frequency bias in the tying-weights pre-trained model. (d) Subword and case biases in static token embeddings of the untying-weights pre-trained model. (e) Subword and case biases in the MLM head of the untying-weights pre-trained model. (f) Subword and case biases in the tying-weights pre-trained model.

Figure 2: 2D visualization of static token embeddings in the untying and tying weights pre-trained models. For frequency bias, the darker the color, the higher the token frequency. For subword and case bias, yellow represents subword tokens and red represents tokens containing capital letters.
… of untying model: 49.41
Static token embeddings of tying model: 45.68

Table 2: The average cosine similarity in static token embeddings.
Some results of the greedy search are shown in Table 4. When it comes to sentence embeddings, different templates produce extremely varied results. Compared to simply concatenating [X] and [MASK], complex templates like This sentence : "[X]" means [MASK] . can improve the spearman correlation by 34.10.
(Su et al., 2021) setting. All BERT based methods use bert-base-uncased. Last avg. denotes averaging the last layer of BERT. Static avg. denotes averaging the static token embeddings of BERT. First-last avg. (Su et al., 2021) uses the first and last layers. Static remove biases avg. means removing biased tokens in static avg., as introduced before.
Moreover, we use the same template and setting for prediction and only change the way positive pairs are generated in the training stage. All results are from 10 random

sentence | top-5 tokens w/o denoising | top-5 tokens w/ denoising
… | … | …, unhappy, upset, angry
the man is playing the guitar. | guitar, song, music, guitarist, bass | guitar, guitarist, guitars, playing, guitarists
the man is playing the piano. | piano, music, no, yes, bass | piano, pianist, pianos, playing, guitar

Table 7: The top-5 tokens predicted by the manual template with original BERT.

runs. The results are shown in Table 8. We observe that our method achieves the best and most stable results among the three training objectives.

Table 8: Comparison of different unsupervised training objectives.
In this work, we only use template denoising in our contrastive training objective, which helps eliminate the biases of different templates.

Table 10: Results in unsupervised contrastive learning.

Table 12: The average spearman correlation of the three embeddings.

Table 13: Templates for our method in the fine-tuned setting.

Table 14: Hyperparameters for our method in the fine-tuned setting.

Table 15: Transfer task results of different sentence embedding models.