SimCSE: Simple Contrastive Learning of Sentence Embeddings

This paper presents SimCSE, a simple contrastive learning framework that greatly advances the state of the art in sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework, by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT-base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to previous best results. We also show, both theoretically and empirically, that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.


Introduction
Learning universal sentence embeddings is a fundamental problem in natural language processing and has been studied extensively in the literature (Kiros et al., 2015; Hill et al., 2016; Conneau et al., 2017; Logeswaran and Lee, 2018; Cer et al., 2018; Reimers and Gurevych, 2019, inter alia).
In this work, we advance state-of-the-art sentence embedding methods and demonstrate that a contrastive objective can be extremely effective when coupled with pre-trained language models such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). We present SimCSE, a simple contrastive sentence embedding framework, which can produce superior sentence embeddings from either unlabeled or labeled data.
Our unsupervised SimCSE simply predicts the input sentence itself with only dropout (Srivastava et al., 2014) used as noise (Figure 1(a)). In other words, we pass the same sentence to the pre-trained encoder twice: by applying the standard dropout twice, we can obtain two different embeddings as "positive pairs". Then we take other sentences in the same mini-batch as "negatives", and the model predicts the positive one among the negatives. Although it may appear strikingly simple, this approach outperforms training objectives such as predicting next sentences (Logeswaran and Lee, 2018) and discrete data augmentation (e.g., word deletion and replacement) by a large margin, and even matches previous supervised methods. Through careful analysis, we find that dropout acts as minimal "data augmentation" of hidden representations, while removing it leads to a representation collapse.
Our supervised SimCSE builds upon the recent success of using natural language inference (NLI) datasets for sentence embeddings (Conneau et al., 2017; Reimers and Gurevych, 2019) and incorporates annotated sentence pairs in contrastive learning (Figure 1(b)). Unlike previous work that casts it as a 3-way classification task (entailment, neutral, and contradiction), we leverage the fact that entailment pairs can be naturally used as positive instances. We also find that adding corresponding contradiction pairs as hard negatives further improves performance. This simple use of NLI datasets achieves a substantial improvement compared to prior methods using the same datasets. We also compare to other labeled sentence-pair datasets and find that NLI datasets are especially effective for learning sentence embeddings.
To better understand the strong performance of SimCSE, we borrow the analysis tool from Wang and Isola (2020), which takes alignment between semantically-related positive pairs and uniformity of the whole representation space to measure the quality of learned embeddings. Through empirical analysis, we find that our unsupervised SimCSE essentially improves uniformity while avoiding degenerated alignment via dropout noise, thus improving the expressiveness of the representations. The same analysis shows that the NLI training signal can further improve alignment between positive pairs and produce better sentence embeddings. We also draw a connection to the recent findings that pre-trained word embeddings suffer from anisotropy (Ethayarajh, 2019; Li et al., 2020) and prove, from a spectrum perspective, that the contrastive learning objective "flattens" the singular value distribution of the sentence embedding space, hence improving uniformity.
We conduct a comprehensive evaluation of SimCSE on seven standard semantic textual similarity (STS) tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017; Marelli et al., 2014) and seven transfer tasks (Conneau and Kiela, 2018). On the STS tasks, our unsupervised and supervised models achieve a 76.3% and 81.6% averaged Spearman's correlation respectively using BERT-base, a 4.2% and 2.2% improvement compared to previous best results. We also achieve competitive performance on the transfer tasks. Finally, we identify an incoherent evaluation issue in the literature and consolidate results of different settings for future work on the evaluation of sentence embeddings.

Background: Contrastive Learning
Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Hadsell et al., 2006). It assumes a set of paired examples D = {(x_i, x_i^+)}_{i=1}^m, where x_i and x_i^+ are semantically related. We follow the contrastive framework in Chen et al. (2020) and take a cross-entropy objective with in-batch negatives (Chen et al., 2017; Henderson et al., 2017): let h_i and h_i^+ denote the representations of x_i and x_i^+; the training objective for (x_i, x_i^+) with a mini-batch of N pairs is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}}, \qquad (1)$$

where τ is a temperature hyperparameter and sim(h_1, h_2) is the cosine similarity h_1^T h_2 / (||h_1|| ||h_2||). In this work, we encode input sentences using a pre-trained language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019): h = f_θ(x), and then fine-tune all the parameters using the contrastive learning objective (Eq. 1).
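The in-batch objective in Eq. 1 is straightforward to implement. Below is a minimal PyTorch sketch (not the authors' released code); the function name `info_nce_loss` and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(h, h_pos, temperature=0.05):
    """In-batch contrastive loss (Eq. 1): for each h[i], h_pos[i] is the
    positive and all other h_pos[j] in the batch act as negatives."""
    h = F.normalize(h, dim=-1)          # normalize so dot product = cosine similarity
    h_pos = F.normalize(h_pos, dim=-1)
    sim = h @ h_pos.T / temperature     # (N, N) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)  # positives on the diagonal
    return F.cross_entropy(sim, labels)

# Example with random embeddings (batch of 8, hidden size 768)
h = torch.randn(8, 768)
h_pos = torch.randn(8, 768)
print(info_nce_loss(h, h_pos))
```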
Positive instances. One critical question in contrastive learning is how to construct (x_i, x_i^+) pairs. In visual representations, an effective solution is to take two random transformations of the same image (e.g., cropping, flipping, distortion and rotation) as x_i and x_i^+ (Dosovitskiy et al., 2014). A similar approach has been recently adopted in language representations (Wu et al., 2020; Meng et al., 2021) by applying augmentation techniques such as word deletion, reordering, and substitution. However, data augmentation in NLP is inherently difficult because of its discrete nature. As we will see in §3, simply using standard dropout on intermediate representations outperforms these discrete operators.
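For concreteness, a discrete augmentation such as word deletion can be sketched as below. This is an illustrative toy implementation, not the augmentation pipeline evaluated in Table 1.

```python
import random

def word_deletion(sentence, p=0.1, seed=None):
    """Randomly delete each word with probability p (a discrete augmentation)."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [w for w in words if rng.random() > p] or words[:1]  # keep at least one word
    return " ".join(kept)

print(word_deletion("The pets are sitting on a couch.", p=0.2, seed=0))
```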
In NLP, a similar contrastive learning objective has been explored in different contexts (Henderson et al., 2017; Gillick et al., 2019; Karpukhin et al., 2020). In these cases, (x_i, x_i^+) are collected from supervised datasets such as question-passage pairs. Because of the distinct nature of x_i and x_i^+, these approaches always use a dual-encoder framework, i.e., two independent encoders f_θ1 and f_θ2 for x_i and x_i^+. For sentence embeddings, Logeswaran and Lee (2018) also use contrastive learning with a dual-encoder approach, by forming the current sentence and the next sentence as (x_i, x_i^+).

Alignment and uniformity. Recently, Wang and Isola (2020) identify two key properties related to contrastive learning, alignment and uniformity, and propose to use them to measure the quality of representations. Given a distribution of positive pairs p_pos, alignment calculates the expected distance between embeddings of the paired instances (assuming representations are already normalized):

$$\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^+) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^+) \right\|^2. \qquad (2)$$

On the other hand, uniformity measures how well the embeddings are uniformly distributed:

$$\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \,\overset{i.i.d.}{\sim}\, p_{\mathrm{data}}} e^{-2 \| f(x) - f(y) \|^2}, \qquad (3)$$

where p_data denotes the data distribution. These two metrics are well aligned with the objective of contrastive learning: positive instances should stay close and embeddings for random instances should scatter on the hypersphere. In the following sections, we will also use the two metrics to justify the inner workings of our approaches.
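As a rough illustration, the two metrics in Eq. 2 and Eq. 3 can be estimated from batches of embeddings as follows. This is a sketch assuming normalized embedding tensors, not the exact evaluation script used in the paper.

```python
import torch
import torch.nn.functional as F

def align_loss(x, x_pos):
    """l_align (Eq. 2): mean squared distance between positive pairs."""
    x, x_pos = F.normalize(x, dim=-1), F.normalize(x_pos, dim=-1)
    return (x - x_pos).norm(p=2, dim=1).pow(2).mean()

def uniform_loss(x, t=2):
    """l_uniform (Eq. 3): log of the mean Gaussian potential over all pairs."""
    x = F.normalize(x, dim=-1)
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

emb = torch.randn(128, 768)
emb_pos = emb + 0.01 * torch.randn_like(emb)  # toy positives
print(align_loss(emb, emb_pos), uniform_loss(emb))
```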

Unsupervised SimCSE
The idea of unsupervised SimCSE is extremely simple: we take a collection of sentences {x_i}_{i=1}^m and use x_i^+ = x_i. The key ingredient to get this to work with identical positive pairs is the use of independently sampled dropout masks for x_i and x_i^+. In standard training of Transformers (Vaswani et al., 2017), there are dropout masks placed on fully-connected layers as well as attention probabilities (default p = 0.1). We denote h_i^z = f_θ(x_i, z) where z is a random mask for dropout. We simply feed the same input to the encoder twice and get two embeddings with different dropout masks z, z', and the training objective of SimCSE becomes:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i}, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i}, h_j^{z_j'})/\tau}} \qquad (4)$$

for a mini-batch of N sentences. Note that z is just the standard dropout mask in Transformers and we do not add any additional dropout.

Table 2: The two columns denote whether we use one encoder or two independent encoders. Next 3 sentences: randomly sample one from the next 3 sentences. Delete one word: delete one word randomly (see Table 1).
Dropout noise as data augmentation. We view it as a minimal form of data augmentation: the positive pair takes exactly the same sentence, and their embeddings only differ in dropout masks.
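In practice, the two dropout masks come for free from running the encoder twice in training mode. A minimal sketch, assuming a HuggingFace-style encoder and the `info_nce_loss` sketch from the Background section; the helper name is a placeholder, not the released SimCSE code.

```python
import torch

def unsup_simcse_embeddings(encoder, batch):
    """Encode the same batch twice; independent dropout masks yield the
    'positive pair' embeddings of Eq. 4."""
    encoder.train()                                    # keep dropout active
    out1 = encoder(**batch).last_hidden_state[:, 0]    # [CLS] of the first pass
    out2 = encoder(**batch).last_hidden_state[:, 0]    # [CLS] of the second pass
    return out1, out2

# Usage sketch (model and tokenizer names are illustrative):
# from transformers import AutoModel, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# enc = AutoModel.from_pretrained("bert-base-uncased")
# batch = tok(["Two dogs are running.", "A man is surfing."],
#             padding=True, return_tensors="pt")
# h, h_pos = unsup_simcse_embeddings(enc, batch)
# loss = info_nce_loss(h, h_pos)   # from the sketch in the Background section
```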
We compare this approach to other training objectives on the STS-B development set (Cer et al., 2017). Table 1 compares our approach to common data augmentation techniques such as crop, word deletion and replacement, which can be viewed as h = f_θ(g(x), z), where g is a (random) discrete operator on x. We note that even deleting one word would hurt performance and none of the discrete augmentations outperforms dropout noise.
We also compare this self-prediction training objective to the next-sentence objective used in Logeswaran and Lee (2018), taking either one encoder or two independent encoders. As shown in Table 2, we find that SimCSE performs much better than the next-sentence objectives (82.5 vs 67.4 on STS-B) and using one encoder instead of two makes a significant difference in our approach.

Why does it work?
To further understand the role of dropout noise in unsupervised SimCSE, we try out different dropout rates in Table 3 and observe that all the variants underperform the default dropout probability p = 0.1 from Transformers. We find two extreme cases particularly interesting: "no dropout" (p = 0) and "fixed 0.1" (using the default dropout p = 0.1 but the same dropout masks for the pair). In both cases, the resulting embeddings for the pair are exactly the same, and it leads to a dramatic performance degradation. We take the checkpoints of these models every 10 steps during training and visualize the alignment and uniformity metrics in Figure 2, along with a simple data augmentation model "delete one word". (We take STS-B pairs with a score higher than 4 as p_pos and all STS-B sentences as p_data.) As clearly shown, starting from pre-trained checkpoints, all models greatly improve uniformity. However, the alignment of the two special variants also degrades drastically, while our unsupervised SimCSE keeps a steady alignment, thanks to the use of dropout noise. It also demonstrates that starting from a pre-trained checkpoint is crucial, for it provides good initial alignment. At last, "delete one word" improves the alignment yet achieves a smaller gain on the uniformity metric, and eventually underperforms unsupervised SimCSE.

Supervised SimCSE
We have demonstrated that adding dropout noise is able to keep a good alignment for positive pairs (x, x^+) ~ p_pos. In this section, we study whether we can leverage supervised datasets to provide better training signals for improving the alignment of our approach. Prior work (Conneau et al., 2017; Reimers and Gurevych, 2019) has demonstrated that supervised natural language inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018) are effective for learning sentence embeddings, by predicting whether the relationship between two sentences is entailment, neutral or contradiction. In our contrastive learning framework, we instead directly take (x_i, x_i^+) pairs from supervised datasets and use them to optimize Eq. 1.
Choices of labeled data. We first explore which supervised datasets are especially suitable for constructing positive pairs (x_i, x_i^+). We experiment with a number of datasets with sentence-pair examples, including 1) QQP: Quora question pairs; 2) Flickr30k (Young et al., 2014): each image is annotated with 5 human-written captions and we consider any two captions of the same image as a positive pair; 3) ParaNMT (Wieting and Gimpel, 2018): a large-scale back-translation paraphrase dataset; and finally 4) NLI datasets: SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018).
We train the contrastive learning model (Eq. 1) with different datasets and compare the results in Table 4. For a fair comparison, we also run experiments with the same number of training pairs. Among all the options, using entailment pairs from the NLI (SNLI + MNLI) datasets performs the best. We think this is reasonable, as the NLI datasets consist of high-quality and crowd-sourced pairs. Also, human annotators are expected to write the hypotheses manually based on the premises, and the two sentences tend to have less lexical overlap. For instance, we find that the lexical overlap (F1 measured between two bags of words) for the entailment pairs (SNLI + MNLI) is 39%, while it is 60% and 55% for QQP and ParaNMT respectively.
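The bag-of-words F1 overlap mentioned above can be computed as follows; this is a plausible sketch of the metric as we read it ("F1 measured between two bags of words"), not necessarily the exact script behind the reported numbers.

```python
from collections import Counter

def bow_f1(sent_a, sent_b):
    """F1 between the bags of words of two sentences."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    overlap = sum((a & b).values())        # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)

print(bow_f1("Two dogs are running.", "There are animals outdoors."))
```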
Contradiction as hard negatives. Finally, we further take advantage of the NLI datasets by using their contradiction pairs as hard negatives. In NLI datasets, given one premise, annotators are required to manually write one sentence that is absolutely true (entailment), one that might be true (neutral), and one that is definitely false (contradiction). Therefore, for each premise and its entailment hypothesis, there is an accompanying contradiction hypothesis (see Figure 1 for an example).
Formally, we extend (x_i, x_i^+) to (x_i, x_i^+, x_i^-), where x_i is the premise, and x_i^+ and x_i^- are entailment and contradiction hypotheses. The training objective ℓ_i is then defined by (N is the mini-batch size):

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)}. \qquad (5)$$
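A minimal sketch of Eq. 5: the contradiction embeddings are simply appended as extra columns of the similarity matrix, so each premise is contrasted against all in-batch entailments and all in-batch contradictions. The function name and temperature are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(h, h_pos, h_neg, temperature=0.05):
    """Eq. 5: h[i] is a premise, h_pos[i] its entailment hypothesis (positive),
    h_neg[i] its contradiction hypothesis (hard negative)."""
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    h_neg = F.normalize(h_neg, dim=-1)
    sim_pos = h @ h_pos.T / temperature            # (N, N)
    sim_neg = h @ h_neg.T / temperature            # (N, N) hard + in-batch negatives
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)  # positives on the diagonal of sim_pos
    return F.cross_entropy(logits, labels)

# Example with random embeddings
h, h_pos, h_neg = (torch.randn(8, 768) for _ in range(3))
print(supervised_simcse_loss(h, h_pos, h_neg))
```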
As shown in Table 4, adding hard negatives can further improve performance (84.9 → 86.2) and this is our final supervised SimCSE. We also tried to add the ANLI dataset (Nie et al., 2020) or combine it with our unsupervised SimCSE approach, but did not find a meaningful improvement. We also considered a dual-encoder framework in supervised SimCSE and it hurt performance (86.2 → 84.2).


Connection to Anisotropy
Gao et al. (2019) demonstrate that language models trained with tied input/output embeddings lead to anisotropic word embeddings, and this is further observed by Ethayarajh (2019) in pre-trained contextual representations. Wang et al. (2020) show that singular values of the word embedding matrix in a language model decay drastically: except for a few dominating singular values, all others are close to zero.
A simple way to alleviate the problem is post-processing, either to eliminate the dominant principal components (Arora et al., 2017; Mu and Viswanath, 2018), or to map embeddings to an isotropic distribution (Li et al., 2020; Su et al., 2021). Another common solution is to add regularization during training (Gao et al., 2019; Wang et al., 2020). In this work, we show, both theoretically and empirically, that the contrastive objective can also alleviate the anisotropy problem.
The anisotropy problem is naturally connected to uniformity (Wang and Isola, 2020), both highlighting that embeddings should be evenly distributed in the space. Intuitively, optimizing the contrastive learning objective can improve uniformity (or ease the anisotropy problem), as the objective pushes negative instances apart. Here, we take a singular spectrum perspective, a common practice in analyzing word embeddings (Mu and Viswanath, 2018; Gao et al., 2019; Wang et al., 2020), and show that the contrastive objective can "flatten" the singular value distribution of sentence embeddings and make the representations more isotropic.
Following Wang and Isola (2020), the asymptotics of the contrastive learning objective (Eq. 1) can be expressed as follows when the number of negative instances approaches infinity (assuming f(x) is normalized):

$$-\frac{1}{\tau} \mathbb{E}_{(x, x^+) \sim p_{\mathrm{pos}}} \left[ f(x)^\top f(x^+) \right] + \mathbb{E}_{x \sim p_{\mathrm{data}}} \left[ \log \mathbb{E}_{x^- \sim p_{\mathrm{data}}} \left[ e^{f(x)^\top f(x^-)/\tau} \right] \right], \qquad (6)$$

where the first term keeps positive instances similar and the second pushes negative pairs apart. When p_data is uniform over finite samples {x_i}_{i=1}^m, with h_i = f(x_i), we can derive the following formula from the second term with Jensen's inequality:

$$\mathbb{E}_{x \sim p_{\mathrm{data}}} \left[ \log \mathbb{E}_{x^- \sim p_{\mathrm{data}}} \left[ e^{f(x)^\top f(x^-)/\tau} \right] \right] = \frac{1}{m} \sum_{i=1}^{m} \log \left( \frac{1}{m} \sum_{j=1}^{m} e^{h_i^\top h_j / \tau} \right) \geq \frac{1}{\tau m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} h_i^\top h_j.$$

Let W be the sentence embedding matrix corresponding to {x_i}_{i=1}^m, i.e., the i-th row of W is h_i. Optimizing the second term in Eq. 6 essentially minimizes an upper bound of the summation of all elements in WW^T, i.e., Sum(WW^T) = Σ_i Σ_j h_i^T h_j. Since we normalize h_i, all elements on the diagonal of WW^T are 1 and then tr(WW^T) (the sum of all eigenvalues) is a constant. According to Merikoski (1984), if all elements in WW^T are positive, which is the case most of the time according to Figure G.1, then Sum(WW^T) is an upper bound for the largest eigenvalue of WW^T. When minimizing the second term in Eq. 6, we reduce the top eigenvalue of WW^T and inherently "flatten" the singular spectrum of the embedding space. Therefore, contrastive learning is expected to alleviate the representation degeneration problem and improve uniformity of sentence embeddings.
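The argument above can be checked numerically: for normalized embeddings with an entrywise-positive Gram matrix WW^T, the sum of all entries upper-bounds the largest eigenvalue, so shrinking that sum caps the dominant eigenvalue. The sketch below only illustrates the inequality on synthetic embeddings; it is not part of the paper's proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random unit-norm embeddings in the positive orthant, so WW^T is entrywise positive
W = np.abs(rng.normal(size=(50, 16)))
W /= np.linalg.norm(W, axis=1, keepdims=True)

gram = W @ W.T
top_eig = np.linalg.eigvalsh(gram).max()   # largest eigenvalue of WW^T
total = gram.sum()                         # Sum(WW^T)

print(f"largest eigenvalue: {top_eig:.2f}")
print(f"Sum(WW^T):          {total:.2f}")
assert total >= top_eig                    # Merikoski (1984) bound for positive matrices
```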
Compared to post-processing methods in Li et al. (2020) and Su et al. (2021), which only aim to encourage isotropic representations, contrastive learning also optimizes for aligning positive pairs via the first term in Eq. 6, which is the key to the success of SimCSE. A quantitative analysis is given in §7.

Evaluation Setup
We conduct our experiments on 7 semantic textual similarity (STS) tasks. Note that all our STS experiments are fully unsupervised and no STS training sets are used. Even for supervised SimCSE, "supervised" simply means that we take extra labeled datasets for training, following previous work (Conneau et al., 2017). We also evaluate on 7 transfer learning tasks and provide detailed results in Appendix E. We share a similar sentiment with Reimers and Gurevych (2019) that the main goal of sentence embeddings is to cluster semantically similar sentences and hence take STS as the main result.
Semantic textual similarity tasks. We evaluate on 7 STS tasks: STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017) and SICK-Relatedness (Marelli et al., 2014). When comparing to previous work, we identify invalid comparison patterns in the evaluation settings of published papers, including (a) whether to use an additional regressor, (b) Spearman's vs Pearson's correlation, and (c) how the results are aggregated (Table B.1). We discuss the detailed differences in Appendix B and choose to follow the setting of Reimers and Gurevych (2019) in our evaluation (no additional regressor, Spearman's correlation, and "all" aggregation). We also report our replicated study of previous work as well as our results evaluated in a different setting in Table B.2 and Table B.3. We call for unifying the setting in evaluating sentence embeddings for future research.
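In the chosen setting, evaluation reduces to computing cosine similarities between sentence embeddings and correlating them with gold scores via Spearman's correlation, with no additional regressor. A minimal sketch, assuming an `encode` function that returns embeddings; this is not the SentEval pipeline used in the paper, and the dummy encoder exists only to make the example runnable.

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(encode, sent_pairs, gold_scores):
    """Spearman's correlation between cosine similarities and gold STS scores."""
    a = encode([s1 for s1, _ in sent_pairs])   # (N, d) embeddings
    b = encode([s2 for _, s2 in sent_pairs])
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)
    corr, _ = spearmanr(cos, gold_scores)
    return corr

# Toy usage with a dummy encoder (length-based features, for illustration only)
dummy = lambda sents: np.array([[len(s), s.count("a"), 1.0] for s in sents], dtype=float)
pairs = [("A man is surfing.", "A man surfs on the sea."), ("Two dogs run.", "A kid is inside.")]
print(sts_spearman(dummy, pairs, [4.5, 0.5]))
```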
Training details. We start from pre-trained checkpoints of BERT (Devlin et al., 2019) (uncased) or RoBERTa (Liu et al., 2019) (cased) and take the [CLS] representation as the sentence embedding (see §6.3 for a comparison between different pooling methods). We train unsupervised SimCSE on 10^6 randomly sampled sentences from English Wikipedia, and train supervised SimCSE on the combination of MNLI and SNLI datasets (314k). More training details can be found in Appendix A.

Table 5: Sentence embedding performance on STS tasks (Spearman's correlation, "all" setting). We highlight the highest numbers among models with the same pre-trained encoder. ♣: results from Reimers and Gurevych (2019); ♥: results from Zhang et al. (2020); all other results are reproduced or re-evaluated by ourselves. For BERT-flow (Li et al., 2020) and whitening (Su et al., 2021), we only report the "NLI" setting (see Table C.1).

Main Results
We compare unsupervised and supervised SimCSE to previous state-of-the-art sentence embedding methods on STS tasks. Unsupervised baselines include average GloVe embeddings (Pennington et al., 2014), average BERT or RoBERTa embeddings (following Su et al. (2021), we take the average of the first and the last layers, which is better than only taking the last), and post-processing methods such as BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021). We also compare to several recent methods using a contrastive objective, including 1) IS-BERT (Zhang et al., 2020), which maximizes the agreement between global and local features; 2) DeCLUTR (Giorgi et al., 2021), which takes different spans from the same document as positive pairs; 3) CT (Carlsson et al., 2021), which aligns embeddings of the same sentence from two different encoders. We do not compare to CLEAR (Wu et al., 2020), because they use their own version of pre-trained models, and the numbers appear to be much lower; also note that CT is a concurrent work to ours. Other supervised methods include InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and SBERT/SRoBERTa (Reimers and Gurevych, 2019) with post-processing methods (BERT-flow, whitening, and CT). We provide more details of these baselines in Appendix C.
Table 5 shows the evaluation results on 7 STS tasks. SimCSE can substantially improve results on all the datasets with or without extra NLI supervision, greatly outperforming the previous state-of-the-art models. Specifically, our unsupervised SimCSE-BERT-base improves the previous best averaged Spearman's correlation from 72.05% to 76.25%, even comparable to supervised baselines. When using NLI datasets, SimCSE-BERT-base further pushes the state-of-the-art results to 81.57%. The gains are more pronounced on RoBERTa encoders, and our supervised SimCSE achieves 83.76% with RoBERTa-large.
In Appendix E, we show that SimCSE also achieves transfer task performance on par with or better than existing work, and that an auxiliary MLM objective can further boost performance.

Ablation Studies
We investigate the impact of different pooling methods and hard negatives. All reported results in this section are based on the STS-B development set. We provide more ablation studies (normalization, temperature, and MLM objectives) in Appendix D.
Pooling methods. Reimers and Gurevych (2019) and Li et al. (2020) show that taking the average embeddings of pre-trained models (especially from both the first and last layers) leads to better performance than [CLS].
Hard negatives. Intuitively, it may be beneficial to differentiate hard negatives (contradiction hypotheses) from other in-batch negatives. We therefore extend the training objective in Eq. 5 to weight the hard negatives by a hyperparameter α:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + \alpha^{\mathbb{1}_i^j} e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)},$$

where 1_i^j ∈ {0, 1} is an indicator that equals 1 if and only if i = j. We train SimCSE with different values of α and evaluate the trained models on the development set of STS-B. We also consider taking neutral hypotheses as hard negatives. As shown in Table 7, α = 1 performs the best, and neutral hypotheses do not bring further gains.
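A sketch of the weighted variant: only the matching contradiction term (j = i) is scaled by α before the normalization. The function name and temperature are illustrative assumptions on top of the supervised loss sketch above.

```python
import torch
import torch.nn.functional as F

def weighted_hard_negative_loss(h, h_pos, h_neg, alpha=1.0, temperature=0.05):
    """Supervised objective where the matching contradiction (j = i) is weighted by alpha."""
    h, h_pos, h_neg = (F.normalize(t, dim=-1) for t in (h, h_pos, h_neg))
    sim_pos = h @ h_pos.T / temperature                    # (N, N)
    sim_neg = h @ h_neg.T / temperature                    # (N, N)
    n = h.size(0)
    weights = torch.ones(n, n)
    weights.fill_diagonal_(float(alpha))                   # alpha^{1(i=j)}
    denom = sim_pos.exp().sum(dim=1) + (weights * sim_neg.exp()).sum(dim=1)
    numer = sim_pos.diag().exp()                           # positive term for each premise
    return -(numer.log() - denom.log()).mean()

h, h_pos, h_neg = (torch.randn(8, 32) for _ in range(3))
print(weighted_hard_negative_loss(h, h_pos, h_neg, alpha=1.0))
```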

Analysis
In this section, we conduct further analyses to understand the inner workings of SimCSE.
Uniformity and alignment. Figure 3 shows uniformity and alignment of different sentence embedding models along with their averaged STS results. In general, models which have both better alignment and uniformity achieve better performance, confirming the findings in Wang and Isola (2020). We also observe that (1) though pre-trained embeddings have good alignment, their uniformity is poor (i.e., the embeddings are highly anisotropic); (2) post-processing methods like BERT-flow and BERT-whitening greatly improve uniformity but also suffer a degeneration in alignment; (3) unsupervised SimCSE effectively improves uniformity of pre-trained embeddings while keeping a good alignment; (4) incorporating supervised data in SimCSE further improves alignment. In Appendix F, we further show that SimCSE can effectively flatten the singular value distribution of pre-trained embeddings. In Appendix G, we demonstrate that SimCSE provides more distinguishable cosine similarities between different sentence pairs.

Qualitative comparison. We conduct a small-scale retrieval experiment using SBERT-base and SimCSE-BERT-base. We use 150k captions from the Flickr30k dataset and take a random sentence as a query to retrieve similar sentences (based on cosine similarity). As several examples in Table 8 show, the sentences retrieved by SimCSE have a higher quality compared to those retrieved by SBERT.
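The retrieval setup is simply nearest-neighbor search under cosine similarity in embedding space. A rough sketch, assuming a matrix of pre-computed caption embeddings; the helper name is ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, corpus_emb, k=3):
    """Return indices of the k corpus sentences most similar to the query."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(corpus_emb, dim=-1)
    scores = c @ q                     # cosine similarities, shape (num_sentences,)
    return scores.topk(k).indices.tolist()

# Toy usage with random embeddings standing in for 150k encoded captions
corpus = torch.randn(1000, 768)
query = torch.randn(768)
print(retrieve_top_k(query, corpus, k=3))
```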

Related Work
Early work in sentence embeddings builds upon the distributional hypothesis by predicting surrounding sentences of a given one (Kiros et al., 2015; Hill et al., 2016; Logeswaran and Lee, 2018). Pagliardini et al. (2018) show that simply augmenting the idea of word2vec (Mikolov et al., 2013) with n-gram embeddings leads to strong results. Several recent (and concurrent) approaches adopt contrastive objectives (Zhang et al., 2020; Giorgi et al., 2021; Wu et al., 2020; Meng et al., 2021; Carlsson et al., 2021; Kim et al., 2021; Yan et al., 2021) by taking different views, from data augmentation or different copies of models, of the same sentence or document. Compared to these works, SimCSE uses the simplest idea by taking different outputs of the same sentence from standard dropout, and performs the best on STS tasks.

Supervised sentence embeddings promise stronger performance compared to unsupervised counterparts. Conneau et al. (2017) propose to fine-tune a Siamese model on NLI datasets, which is further extended to other encoders or pre-trained models (Cer et al., 2018; Reimers and Gurevych, 2019). Furthermore, Wieting and Gimpel (2018) and Wieting et al. (2020) demonstrate that bilingual and back-translation corpora provide useful supervision for learning semantic similarity. Another line of work focuses on regularizing embeddings (Li et al., 2020; Su et al., 2021; Huang et al., 2021) to alleviate the representation degeneration problem (as discussed in §5), and yields substantial improvement over pre-trained language models.

Table 8: Retrieved top-3 examples by SBERT-base and supervised SimCSE-BERT-base from Flickr30k (150k sentences).
Query: A man riding a small boat in a harbor.
SBERT-base: #1 A group of men traveling over the ocean in a small boat. #2 Two men sit on the bow of a colorful boat. #3 A man wearing a life jacket is in a small boat on a lake.
Supervised SimCSE-BERT-base: #1 A man on a moored blue and white boat. #2 A man is riding in a boat on the water. #3 A man in a blue boat on the water.
Query: A dog runs on the green grass near a wooden fence.
SBERT-base: #1 A dog runs on the green grass near a grove of trees. #2 A brown and white dog runs through the green grass. #3 The dogs run in the green field.
Supervised SimCSE-BERT-base: #1 The dog by the fence is running on the grass. #2 Dog running through grass in fenced area. #3 A dog runs on the green grass near a grove of trees.

Conclusion
In this work, we propose SimCSE, a simple contrastive learning framework, which greatly improves state-of-the-art sentence embeddings on semantic textual similarity tasks. We present an unsupervised approach which predicts the input sentence itself with dropout noise and a supervised approach utilizing NLI datasets. We further justify the inner workings of our approach by analyzing alignment and uniformity of SimCSE along with other baseline models. We believe that our contrastive objective, especially the unsupervised one, may have a broader application in NLP. It provides a new perspective on data augmentation with text input, and can be extended to other continuous representations and integrated in language model pre-training.


B STS Evaluation Settings
We follow the "all" aggregation setting of Reimers and Gurevych (2019). Since the "all" setting fuses data from different topics together, it makes the evaluation closer to real-world scenarios, and unless specified, we take the "all" setting.
We list evaluation settings for a number of previous works in Table B.1. Some of the settings are reported by the papers and some of them are inferred by comparing the results and checking their code. As we can see, the evaluation protocols are very incoherent across different papers. We call for unifying the setting in evaluating sentence embeddings for future research. We will also release our evaluation code for better reproducibility. Since previous work uses different evaluation protocols from ours, we further evaluate our models in these settings to make a direct comparison to the published numbers. We evaluate SimCSE with "wmean" and Spearman's correlation to directly compare to Li et al. (2020) and Su et al. (2021) in Table B.3.

C Baseline Models
We elaborate on how we obtain different baselines for comparison in our experiments:
• For average GloVe embeddings (Pennington et al., 2014), InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018), we directly report the results from Reimers and Gurevych (2019), since our evaluation setting is the same as theirs.
• For BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we download the pre-trained model weights from HuggingFace's Transformers, and evaluate the models with our own scripts.
• For SBERT and SRoBERTa (Reimers and Gurevych, 2019), we reuse the results from the original paper. For results not reported by Reimers and Gurevych (2019), such as the performance of SRoBERTa on transfer tasks, we download the model weights from SentenceTransformers and evaluate them.

Figure 1: (a) Unsupervised SimCSE predicts the input sentence itself from in-batch negatives, with different hidden dropout masks applied. (b) Supervised SimCSE leverages the NLI datasets and takes the entailment (premise, hypothesis) pairs as positives, and contradiction pairs as well as other in-batch instances as negatives.

Figure 2: ℓ_align-ℓ_uniform plot for unsupervised SimCSE, "no dropout", "fixed 0.1", and "delete one word". We visualize checkpoints every 10 training steps and the arrows indicate the training direction. For both ℓ_align and ℓ_uniform, lower numbers are better.

Figure 3: ℓ_align-ℓ_uniform plot of models based on BERT-base. Color of points and numbers in brackets represent average STS performance (Spearman's correlation). Next3Sent: "next 3 sentences" from Table 2.

Table 1: Comparison of data augmentations on the STS-B development set (Spearman's correlation). Crop k%: keep 100-k% of the length; word deletion k%: delete k% of words; Synonym replacement: use nlpaug (Ma, 2019) to randomly replace one word with its synonym; MLM k%: use BERT-base to replace k% of words.

Table 3: Effects of different dropout probabilities p on the STS-B development set (Spearman's correlation, BERT-base). Fixed 0.1: default 0.1 dropout rate but apply the same dropout mask on both x_i and x_i^+.

Table 4: Comparisons of different supervised datasets as positive pairs. Results are Spearman's correlations on the STS-B development set using BERT-base (we use the same hyperparameters as the final SimCSE model). Numbers in brackets denote the # of pairs. Sample: subsampling 134k positive pairs for a fair comparison among datasets; full: using the full dataset. In the last block, we use entailment pairs as positives and contradiction pairs as hard negatives (our final model).

Table 6: Ablation studies of different pooling methods in unsupervised and supervised SimCSE. [CLS] w/ MLP (train): using MLP on [CLS] during training but removing it during testing. The results are based on the development set of STS-B using BERT-base.

Table 7: STS-B development results with different hard negative policies. "N/A": no hard negative.