DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
arXiv:2006.03659v4 [cs.CL] 27 May 2021

Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant. In this paper, we present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Inspired by recent advances in deep metric learning (DML), we carefully design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Importantly, our experiments suggest that the quality of the learned embeddings scales with both the number of trainable parameters and the amount of unlabelled training data. Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.


Introduction
Due to the limited amount of labelled training data available for many natural language processing (NLP) tasks, transfer learning has become ubiquitous (Ruder et al., 2019). For some time, transfer learning in NLP was limited to pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014). Recent work has demonstrated strong transfer task performance using pretrained sentence embeddings. These fixed-length vectors, often referred to as "universal" sentence embeddings, are typically learned on large corpora and then transferred to various downstream tasks, such as clustering (e.g. topic modelling) and retrieval (e.g. semantic search). Indeed, sentence embeddings have become an area of focus, and many supervised (Conneau et al., 2017), semi-supervised (Subramanian et al., 2018; Phang et al., 2018; Cer et al., 2018; Reimers and Gurevych, 2019) and unsupervised (Le and Mikolov, 2014; Jernite et al., 2017; Kiros et al., 2015; Hill et al., 2016; Logeswaran and Lee, 2018) approaches have been proposed. However, the highest performing solutions require labelled data, limiting their usefulness to languages and domains where labelled data is abundant. Therefore, closing the performance gap between unsupervised and supervised universal sentence embedding methods is an important goal.
Pretraining transformer-based language models has become the primary method for learning textual representations from unlabelled corpora (Radford et al., 2018; Devlin et al., 2019; Dai et al., 2019; Yang et al., 2019; Liu et al., 2019; Clark et al., 2020). This success has primarily been driven by masked language modelling (MLM). This self-supervised token-level objective requires the model to predict the identity of some randomly masked tokens from the input sequence. In addition to MLM, some of these models have mechanisms for learning sentence-level embeddings via self-supervision. In BERT (Devlin et al., 2019), a special classification token is prepended to every input sequence, and its representation is used in a binary classification task to predict whether one textual segment follows another in the training corpus, denoted Next Sentence Prediction (NSP). However, recent work has called into question the effectiveness of NSP (Conneau and Lample, 2019; You et al., 2019; Joshi et al., 2020). In RoBERTa (Liu et al., 2019), the authors demonstrated that removing NSP during pretraining leads to unchanged or even slightly improved performance on downstream sentence-level tasks (including semantic text similarity and natural language inference). In ALBERT (Lan et al., 2020), the authors hypothesize that NSP conflates topic prediction and coherence prediction, and instead propose a Sentence-Order Prediction objective (SOP), suggesting that it better models inter-sentence coherence. In preliminary evaluations, we found that neither objective produces good universal sentence embeddings (see Appendix A). Thus, we propose a simple but effective self-supervised, sentence-level objective inspired by recent advances in metric learning.
Metric learning is a type of representation learning that aims to learn an embedding space where the vector representations of similar data are mapped close together, and vice versa (Lowe, 1995; Mika et al., 1999; Xing et al., 2002). In computer vision (CV), deep metric learning (DML) has been widely used for learning visual representations (Wohlhart and Lepetit, 2015; Wen et al., 2016; Zhang and Saligrama, 2016; Bucher et al., 2016; Leal-Taixé et al., 2016; Tao et al., 2016; Yuan et al., 2020; He et al., 2018; Grabner et al., 2018; Yelamarthi et al., 2018; Yu et al., 2018). Generally speaking, DML is approached as follows: a "pretext" task (often self-supervised, e.g. colourization or inpainting) is carefully designed and used to train deep neural networks to generate useful feature representations. Here, "useful" means a representation that is easily adaptable to other downstream tasks, unknown at training time. Downstream tasks (e.g. object recognition) are then used to evaluate the quality of the learned features (independent of the model that produced them), often by training a linear classifier on the task using these features as input. The most successful approach to date has been to design a pretext task for learning with a pair-based contrastive loss function. For a given anchor data point, contrastive losses attempt to make the distance between the anchor and some positive data points (those that are similar) smaller than the distance between the anchor and some negative data points (those that are dissimilar) (Hadsell et al., 2006). The highest-performing methods generate anchor-positive pairs by randomly augmenting the same image (e.g. using crops, flips and colour distortions); anchor-negative pairs are randomly chosen, augmented views of different images (Bachman et al., 2019; Tian et al., 2020; He et al., 2020; Chen et al., 2020). In fact, Kong et al., 2020 demonstrate that the MLM and NSP objectives are also instances of contrastive learning.
Inspired by this approach, we propose a self-supervised, contrastive objective that can be used to pretrain a sentence encoder. Our objective learns universal sentence embeddings by training an encoder to minimize the distance between the embeddings of textual segments randomly sampled from nearby in the same document. We demonstrate our objective's effectiveness by using it to extend the pretraining of a transformer-based language model and obtain state-of-the-art results on SentEval (Conneau and Kiela, 2018) - a benchmark of 28 tasks designed to evaluate universal sentence embeddings. Our primary contributions are:
• We propose a self-supervised sentence-level objective that can be used alongside MLM to pretrain transformer-based language models, inducing generalized embeddings for sentence- and paragraph-length text without any labelled data (subsection 5.1).
• We perform extensive ablations to determine which factors are important for learning highquality embeddings (subsection 5.2).
• We demonstrate that the quality of the learned embeddings scales with model and data size. Therefore, performance can likely be improved simply by collecting more unlabelled text or using a larger encoder (subsection 5.3).
• We open-source our solution and provide detailed instructions for training it on new data or embedding unseen text.

Related Work
Previous works on universal sentence embeddings can be broadly grouped by whether or not they use labelled data in their pretraining step(s), which we refer to simply as supervised or semi-supervised and unsupervised, respectively.
Supervised or semi-supervised The highest performing universal sentence encoders are pretrained on the human-labelled natural language inference (NLI) datasets Stanford NLI (SNLI) (Bowman et al., 2015) and MultiNLI (Williams et al., 2018). NLI is the task of classifying a pair of sentences (denoted the "hypothesis" and the "premise") into one of three relationships: entailment, contradiction or neutral. The effectiveness of NLI for training universal sentence encoders was demonstrated by the supervised method InferSent (Conneau et al., 2017). Universal Sentence Encoder (USE) (Cer et al., 2018) is semi-supervised, augmenting an unsupervised, Skip-Thoughts-like task (Kiros et al. 2015, see section 2) with supervised training on the SNLI corpus. The recently published Sentence Transformers (Reimers and Gurevych, 2019) method fine-tunes pretrained, transformer-based language models like BERT (Devlin et al., 2019) using labelled NLI datasets.
Unsupervised Skip-Thoughts (Kiros et al., 2015) and FastSent (Hill et al., 2016) are popular unsupervised techniques that learn sentence embeddings by using an encoding of a sentence to predict words in neighbouring sentences. However, in addition to being computationally expensive, this generative objective forces the model to reconstruct the surface form of a sentence, which may capture information irrelevant to the meaning of a sentence. QuickThoughts (Logeswaran and Lee, 2018) addresses these shortcomings with a simple discriminative objective; given a sentence and its context (adjacent sentences), it learns sentence representations by training a classifier to distinguish context sentences from non-context sentences. The unifying theme of unsupervised approaches is that they exploit the "distributional hypothesis", namely that the meaning of a word (and by extension, a sentence) is characterized by the word context in which it appears.
Our overall approach is most similar to Sentence Transformers - we extend the pretraining of a transformer-based language model to produce useful sentence embeddings - but our proposed objective is self-supervised. Removing the dependence on labelled data allows us to exploit the vast amount of unlabelled text on the web without being restricted to languages or domains where labelled data is plentiful (e.g. English Wikipedia). Our objective most closely resembles QuickThoughts; some distinctions include: we relax our sampling to textual segments of up to paragraph length (rather than natural sentences), we sample one or more positive segments per anchor (rather than strictly one), and we allow these segments to be adjacent, overlapping or subsuming (rather than strictly adjacent; see Figure 1, B).

Figure 1: (A) For simplicity, we illustrate the case where A = P = 1 and denote the anchor-positive span pair as s_i, s_j. Both spans are fed through the same encoder f(·) and pooler g(·) to produce the corresponding embeddings e_i = g(f(s_i)), e_j = g(f(s_j)). The encoder and pooler are trained to minimize the distance between embeddings via a contrastive prediction task, where the other embeddings in a minibatch are treated as negatives (omitted here for simplicity). (B) Positive spans can overlap with, be adjacent to or be subsumed by the sampled anchor span. (C) The lengths of anchors and positives are randomly sampled from beta distributions, skewed toward longer and shorter spans, respectively.

Self-supervised contrastive loss
Our method learns textual representations via a contrastive loss by maximizing agreement between textual segments (referred to as "spans" in the rest of the paper) sampled from nearby in the same document. Illustrated in Figure 1, this approach comprises the following components:
• A data loading step randomly samples paired anchor-positive spans from each document in a minibatch of size N. Let A be the number of anchor spans sampled per document, P be the number of positive spans sampled per anchor and i ∈ {1 . . . AN} be the index of an arbitrary anchor span. We denote an anchor span and its corresponding p ∈ {1 . . . P} positive spans as s_i and s_{i+pAN} respectively. This procedure is designed to maximize the chance of sampling semantically similar anchor-positive pairs (see subsection 3.2).
• An encoder f(·) maps each token in the input spans to an embedding. Although our method places no constraints on the choice of encoder, we chose f(·) to be a transformer-based language model, as this represents the state-of-the-art for text encoders (see subsection 3.3).
• A pooler g(·) maps the encoded spans f(s_i) and f(s_{i+pAN}) to a fixed-length anchor embedding e_i = g(f(s_i)) and its corresponding mean positive embedding e_{i+AN} = (1/P) Σ_{p=1}^{P} g(f(s_{i+pAN})). Similar to Reimers and Gurevych 2019, we found that choosing g(·) to be the mean of the token-level embeddings (referred to as "mean pooling" in the rest of the paper) performs well (see Appendix, Table 4). We pair each anchor embedding with the mean of multiple positive embeddings. This strategy was proposed by Saunshi et al. 2019, who demonstrated theoretical and empirical improvements compared to using a single positive example for each anchor.
• A contrastive loss function defined for a contrastive prediction task. Given a set of embedded spans {e_k} including a positive pair of examples e_i and e_{i+AN}, the contrastive prediction task aims to identify e_{i+AN} among the {e_k} for a given e_i:

ℓ(i, i+AN) = −log [ exp(sim(e_i, e_{i+AN})/τ) / Σ_{k=1}^{2AN} 1_[k≠i] exp(sim(e_i, e_k)/τ) ]

where sim(u, v) = uᵀv / (||u||_2 ||v||_2) denotes the cosine similarity of two vectors u and v, 1_[k≠i] ∈ {0, 1} is an indicator function evaluating to 1 if and only if k ≠ i, and τ > 0 denotes the temperature hyperparameter.
During training, we randomly sample minibatches of N documents from the train set and define the contrastive prediction task on anchor-positive pairs e_i, e_{i+AN} derived from the N documents, resulting in 2AN data points. As proposed in (Sohn, 2016), we treat the other 2(AN − 1) instances within a minibatch as negative examples. The cost function takes the following form:

L = (1 / 2AN) Σ_{i=1}^{AN} [ ℓ(i, i+AN) + ℓ(i+AN, i) ]

This is the InfoNCE loss used in previous works (Sohn, 2016; Wu et al., 2018; Oord et al., 2018), denoted normalized temperature-scaled cross-entropy loss or "NT-Xent" in (Chen et al., 2020).
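To make the loss concrete, the following is a minimal NumPy sketch of NT-Xent for the simplified case of one (or pre-averaged) positive embedding per anchor; the function name and batch layout are our own, not the paper's implementation (which uses the PyTorch Metric Learning library):

```python
import numpy as np

def nt_xent_loss(anchors, positives, temperature=0.05):
    """NT-Xent (InfoNCE) loss over a batch of anchor/positive embedding pairs.

    anchors, positives: arrays of shape (N, D). Row i of `positives` is the
    (mean) positive embedding for anchor i; every other embedding in the
    batch acts as a negative for that anchor.
    """
    # Stack into 2N embeddings and L2-normalize so that dot products
    # are exactly the cosine similarities sim(u, v).
    z = np.concatenate([anchors, positives], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = anchors.shape[0]

    sim = z @ z.T / temperature       # (2N, 2N) scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)    # a span is never contrasted with itself

    # For row i, the positive sits at column i + N (mod 2N),
    # giving the symmetric sum over l(i, i+N) and l(i+N, i).
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Row-wise cross-entropy against the positive's column.
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * n), targets].mean()
```

When anchors sit close to their positives and far from everything else, the loss approaches zero; lowering the temperature sharpens the softmax and penalizes hard negatives more heavily.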
To embed text with a trained model, we simply pass batches of tokenized text through the model, without sampling spans. Therefore, the computational cost of our method at test time is the cost of the encoder, f(·), plus the cost of the pooler, g(·), which is negligible when using mean pooling.
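At inference time, the pooler g(·) reduces to averaging token embeddings while ignoring padding. A small NumPy sketch (the function name is ours), assuming the encoder's token-level outputs and an attention mask are already available:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Masked mean pooling over token embeddings.

    token_embeddings: (batch, seq_len, dim) encoder outputs f(s).
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding.
    Returns (batch, dim) fixed-length span embeddings g(f(s)).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)       # sum real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # avoid divide-by-zero
    return summed / counts
```

The same pooling is applied identically during training and at test time, which is why the test-time overhead beyond the encoder is negligible.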

Span sampling
We start by choosing a minimum and maximum span length; in this paper, min = 32 and max = 512, the maximum input size for many pretrained transformers. Next, a document d is tokenized to produce a sequence of n tokens x_d = (x_1, x_2, . . . , x_n). To sample an anchor span s_i from x_d, we first sample its length ℓ_anchor = ⌊p_anchor × (max − min)⌋ + min and then randomly (uniformly) sample its starting position. We then sample p ∈ {1 . . . P} corresponding positive spans s_{i+pAN} independently following a similar procedure, with length ℓ_positive = ⌊p_positive × (max − min)⌋ + min, where p_anchor ∼ Beta(α = 4, β = 2), which skews anchor sampling towards longer spans, and p_positive ∼ Beta(α = 2, β = 4), which skews positive sampling towards shorter spans (Figure 1, C). In practice, we restrict the sampling of anchor spans from the same document such that they are a minimum of 2 × max tokens apart. In Appendix B, we show examples of text that has been sampled by our method. We note several carefully considered decisions in the design of our sampling procedure:
• Sampling span lengths from a distribution clipped at min = 32 and max = 512 encourages the model to produce good embeddings for text ranging from sentence- to paragraph-length. At test time, we expect our model to be able to embed up to paragraph-length texts.
• We found that sampling longer lengths for the anchor span than for the positive spans improves performance in downstream tasks (we did not find performance to be sensitive to the specific choice of α and β). The rationale for this is twofold. First, it enables the model to learn global-to-local view prediction as in (Hjelm et al., 2019; Bachman et al., 2019; Chen et al., 2020) (referred to as "subsumed view" in Figure 1, B). Second, when P > 1, it encourages diversity among positive spans by lowering the amount of repeated text.
• Sampling positives nearby the anchor exploits the distributional hypothesis and increases the chances of sampling valid (i.e. semantically similar) anchor-positive pairs.
• By sampling multiple anchors per document, each anchor-positive pair is contrasted against both easy negatives (anchors and positives sampled from other documents in a minibatch) and hard negatives (anchors and positives sampled from the same document).
In conclusion, the sampling procedure produces three types of positives: positives that partially overlap with the anchor, positives adjacent to the anchor, and positives subsumed by the anchor (Figure 1, B), and two types of negatives: easy negatives sampled from a different document than the anchor, and hard negatives sampled from the same document as the anchor. Thus, our stochastically generated training set and contrastive loss implicitly define a family of predictive tasks which can be used to train a model, independent of any specific encoder architecture.
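The sampling procedure above can be sketched with the standard library alone. The length distributions and the min/max bounds come straight from the text; the exact window from which a positive's start position is drawn (anywhere from just before to just after the anchor, so that adjacent, overlapping and subsumed views are all possible) is our assumption, and all function names are ours:

```python
import math
import random

MIN_LEN, MAX_LEN = 32, 512  # span-length bounds used in the paper

def sample_span_length(alpha, beta):
    """Span length from a Beta distribution rescaled to [MIN_LEN, MAX_LEN]."""
    p = random.betavariate(alpha, beta)
    return math.floor(p * (MAX_LEN - MIN_LEN)) + MIN_LEN

def sample_anchor_and_positives(tokens, num_positives=2):
    """Sample one anchor span and `num_positives` nearby positive spans.

    Anchor lengths use Beta(4, 2) (skewed long); positive lengths use
    Beta(2, 4) (skewed short). The positive-start window below is an
    assumption about what "nearby in the same document" means.
    """
    n = len(tokens)
    anchor_len = min(sample_span_length(4, 2), n)
    anchor_start = random.randint(0, n - anchor_len)
    anchor = tokens[anchor_start:anchor_start + anchor_len]

    positives = []
    for _ in range(num_positives):
        pos_len = min(sample_span_length(2, 4), n)
        # Positive may start just before the anchor (overlap/adjacent)
        # or anywhere inside it (subsumed), clipped to the document.
        lo = max(0, anchor_start - pos_len)
        hi = min(anchor_start + anchor_len, n - pos_len)
        pos_start = random.randint(lo, hi)
        positives.append(tokens[pos_start:pos_start + pos_len])
    return anchor, positives
```

Sampling `A` anchors per document and keeping them at least 2 × MAX_LEN tokens apart (not shown) would then yield the hard negatives described above.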

Continued MLM pretraining
We use our objective to extend the pretraining of a transformer-based language model (Vaswani et al., 2017), as this represents the state-of-the-art encoder in NLP. We implement the MLM objective as described in (Devlin et al., 2019) on each anchor span in a minibatch and sum the losses from the MLM and contrastive objectives before backpropagating. This is similar to existing pretraining strategies, where an MLM loss is paired with a sentence-level loss such as NSP (Devlin et al., 2019) or SOP (Lan et al., 2020). To make the computational requirements feasible, we do not train from scratch, but rather we continue training a model that has been pretrained with the MLM objective. Specifically, we use both RoBERTa-base (Liu et al., 2019) and DistilRoBERTa (Sanh et al., 2019) (a distilled version of RoBERTa-base) in our experiments. In the rest of the paper, we refer to our method as DeCLUTR-small (when extending DistilRoBERTa pretraining) and DeCLUTR-base (when extending RoBERTa-base pretraining).
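The MLM half of the objective follows the standard 80/10/10 masking recipe of Devlin et al. (2019). A plain-Python sketch of that corruption step; the `-100` ignore-label convention and the argument names follow common implementations rather than this paper specifically:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style masking (Devlin et al., 2019): 15% of positions are
    selected; of those, 80% become [MASK], 10% become a random token, and
    10% are left unchanged. Returns (corrupted inputs, labels), where
    labels are -100 at unselected positions so they are ignored by the
    MLM cross-entropy.
    """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                      # position not selected
        labels[i] = tok                   # predict the original token here
        roll = random.random()
        if roll < 0.8:
            inputs[i] = mask_id           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: random token
        # else: 10% keep the original token unchanged
    return inputs, labels
```

The two objectives are then combined simply as `total_loss = contrastive_loss + mlm_loss` before the backward pass.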
Experimental setup

Dataset, training, and implementation
Dataset We collected all documents with a minimum token length of 2048 from OpenWebText (Gokaslan and Cohen, 2019), an open-access subset of the WebText corpus (Radford et al., 2019), yielding 497,868 documents in total. For reference, Google's USE was trained on 570,000 human-labelled sentence pairs from the SNLI dataset (among other unlabelled datasets). InferSent and Sentence Transformer models were trained on both SNLI and MultiNLI, a total of 1 million human-labelled sentence pairs.
Implementation We implemented our model in PyTorch (Paszke et al., 2017) using AllenNLP (Gardner et al., 2018). We used the NT-Xent loss function implemented by the PyTorch Metric Learning library (Musgrave et al., 2019) and the pretrained transformer architecture and weights from the Transformers library (Wolf et al., 2020). All models were trained on up to four NVIDIA Tesla V100 16 or 32GB GPUs.

Training Unless specified otherwise, we train for one to three epochs over the 497,868 documents with a minibatch size of 16 and a temperature τ = 5 × 10^−2 using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate (LR) of 5 × 10^−5 and a weight decay of 0.1. For every document in a minibatch, we sample two anchor spans (A = 2) and two positive spans per anchor (P = 2). We use the Slanted Triangular LR scheduler (Howard and Ruder, 2018) with a number of train steps equal to the number of training instances and a cut fraction of 0.1. The remaining hyperparameters of the underlying pretrained transformer (i.e. DistilRoBERTa or RoBERTa-base) are left at their defaults. All gradients are scaled to a vector norm of 1.0 before backpropagating. Hyperparameters were tuned on the SentEval validation sets.
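The Slanted Triangular LR schedule referenced above fits in a few lines. A sketch following Howard and Ruder (2018), using the paper's peak LR and cut fraction as defaults; the `ratio` value of 32 is the ULMFiT default and is an assumption here:

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=5e-5, cut_frac=0.1, ratio=32):
    """Slanted Triangular LR (Howard and Ruder, 2018).

    The rate climbs linearly from lr_max / ratio to lr_max over the first
    `cut_frac` of training, then decays linearly back down for the rest.
    """
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut                                   # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

With total_steps equal to the number of training instances and cut_frac = 0.1, the LR peaks at 5e-5 one-tenth of the way through training.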

Evaluation
We evaluate all methods on the SentEval benchmark, a widely-used toolkit for evaluating general-purpose, fixed-length sentence representations. SentEval is divided into 18 downstream tasks - representative NLP tasks such as sentiment analysis, natural language inference, paraphrase detection and image-caption retrieval - and ten probing tasks, which are designed to evaluate what linguistic properties are encoded in a sentence representation. We report scores obtained by our model and the relevant baselines on the downstream and probing tasks using the SentEval toolkit with default parameters (see Appendix C for details). Note that all the supervised approaches we compare to are trained on the SNLI corpus, which is included as a downstream task in SentEval. To avoid train-test contamination, we compute average downstream scores without considering SNLI when comparing to these approaches in Table 2.

Baselines
We compare to the highest performing, most popular sentence embedding methods: InferSent, Google's USE and Sentence Transformers. For InferSent, we compare to the latest model. We use the latest "large" USE model, as it is most similar in terms of architecture and number of parameters to DeCLUTR-base. For Sentence Transformers, we compare to "roberta-base-nli-mean-tokens", which, like DeCLUTR-base, uses the RoBERTa-base architecture and pretrained weights. The only difference is each method's extended pretraining strategy. We include the performance of averaged GloVe and fastText word vectors as weak baselines. Trainable model parameter counts and sentence embedding dimensions are listed in Table 1. Despite our best efforts, we could not evaluate the pretrained QuickThoughts models against the full SentEval benchmark; we cite the scores from the paper directly. Finally, we evaluate the pretrained transformer models' performance before they are subjected to training with our contrastive objective, denoted "Transformer-*". We use mean pooling on the pretrained transformer's token-level output to produce sentence embeddings - the same pooling strategy used in our method.

Results
In subsection 5.1, we compare the performance of our model against the relevant baselines.In the remaining sections, we explore which components contribute to the quality of the learned embeddings.

Comparison to baselines
Downstream task performance Compared to the underlying pretrained models DistilRoBERTa and RoBERTa-base, DeCLUTR-small and DeCLUTR-base obtain large boosts in average downstream performance, +4% and +6% respectively (Table 2). DeCLUTR-base leads to improved or equivalent performance for every downstream task but one (SST5), and DeCLUTR-small for all but three (SST2, SST5 and TREC). Compared to existing methods, DeCLUTR-base matches or even outperforms average performance without using any hand-labelled training data. Surprisingly, we also find that DeCLUTR-small outperforms Sentence Transformers while using ∼34% fewer trainable parameters.
Probing task performance With the exception of InferSent, existing methods perform poorly on the probing tasks of SentEval (Table 3). Sentence Transformers, which begins with a pretrained transformer model and fine-tunes it on NLI datasets, scores approximately 10% lower on the probing tasks than the model it fine-tunes. In contrast, both DeCLUTR-small and DeCLUTR-base perform comparably to the underlying pretrained model in terms of average performance. We note that the purpose of the probing tasks is not the development of ad-hoc models that attain top performance on them (Conneau et al., 2018). However, it is still interesting to note that high downstream task performance can be obtained without sacrificing probing task performance. Furthermore, these results suggest that fine-tuning transformer-based language models on NLI datasets may discard some of the linguistic information captured by the pretrained model's weights. We suspect that the inclusion of MLM in our training objective is responsible for DeCLUTR's relatively high performance on the probing tasks.

Supervised vs. unsupervised downstream tasks
The downstream evaluation of SentEval includes supervised and unsupervised tasks. In the unsupervised tasks, the embeddings of the method to evaluate are used as-is without any further training (see Appendix C for details). Interestingly, we find that USE performs particularly well across the unsupervised evaluations in SentEval (tasks marked with a * in Table 2). Given the similarity of the USE architecture to Sentence Transformers and DeCLUTR, and the similarity of its supervised NLI training objective to InferSent and Sentence Transformers, we suspect the most likely cause is one or more of its additional training objectives. These include a conversational response prediction task (Henderson et al., 2017) and a Skip-Thoughts-like task (Kiros et al., 2015).

Ablation of the sampling procedure
We ablate several components of the sampling procedure, including the number of anchors sampled per document A, the number of positives sampled per anchor P, and the sampling strategy for those positives (Figure 2). We note that when A = 2, the model is trained on twice the number of spans and twice the effective batch size (2AN, where N is the number of documents in a minibatch) as compared to when A = 1. To control for this, all experiments where A = 1 are trained for two epochs (twice the number of epochs as when A = 2) and with twice the minibatch size (2N). Thus, both sets of experiments are trained on the same number of spans and the same effective batch size (4N), and the only difference is the number of anchors sampled per document (A).
We find that sampling multiple anchors per document has a large positive impact on the quality of learned embeddings. We hypothesize this is because the difficulty of the contrastive objective increases when A > 1. Recall that a minibatch is composed of random documents, and each anchor-positive pair sampled from a document is contrasted against all other anchor-positive pairs in the minibatch. When A > 1, anchor-positive pairs will be contrasted against other anchors and positives from the same document, increasing the difficulty of the contrastive objective and thus leading to better representations. We also find that a positive sampling strategy that allows positives to be adjacent to and subsumed by the anchor outperforms a strategy that only allows adjacent or subsuming views, suggesting that the information captured by these views is complementary. Finally, we note that sampling multiple positives per anchor (P > 1) has minimal impact on performance. This is in contrast to (Saunshi et al., 2019), who found both theoretical and empirical improvements when multiple positives are averaged and paired with a given anchor.

Training objective, train set size and model capacity
To determine the importance of the training objectives, train set size, and model capacity, we trained two sizes of the model with 10% to 100% (1 full epoch) of the train set (Figure 3). Pretraining the model with both the MLM and contrastive objectives improves performance over training with either objective alone. Including MLM alongside the contrastive objective leads to monotonic improvement as the train set size is increased. We hypothesize that including the MLM loss acts as a form of regularization, preventing the weights of the pretrained model (which itself was trained with an MLM loss) from diverging too dramatically, a phenomenon known as "catastrophic forgetting" (McCloskey and Cohen, 1989; Ratcliff, 1990). These results suggest that the quality of the embeddings learned by our approach scales with model capacity and train set size; because the training method is completely self-supervised, scaling the train set would simply involve collecting more unlabelled text.

Discussion and conclusion
In this paper, we proposed a self-supervised objective for learning universal sentence embeddings. Our objective does not require labelled training data and is applicable to any text encoder. We demonstrated the effectiveness of our objective by evaluating the learned embeddings on the SentEval benchmark, which contains a total of 28 tasks designed to evaluate the transferability and linguistic properties of sentence representations. When used to extend the pretraining of a transformer-based language model, our self-supervised objective closes the performance gap with existing methods that require human-labelled training data. Our experiments suggest that the learned embeddings' quality can be further improved by increasing the model and train set size. Together, these results demonstrate the effectiveness and feasibility of replacing hand-labelled data with carefully designed self-supervised objectives for learning universal sentence embeddings. We release our model and code publicly in the hope that it will be extended to new domains and non-English languages.
A Pretrained transformers make poor universal sentence encoders

Certain pretrained transformers, such as BERT and ALBERT, have mechanisms for learning sequence-level embeddings via self-supervision. These models prepend every input sequence with a special classification token (e.g. "[CLS]"), and its representation is learned using a simple classification task, such as Next Sentence Prediction (NSP) or Sentence-Order Prediction (SOP) (see Devlin et al. 2019 and Lan et al. 2020 respectively for details on these tasks). However, during preliminary experiments, we noticed that these models are not good universal sentence encoders, as measured by their performance on the SentEval benchmark (Conneau and Kiela, 2018). As a simple experiment, we evaluated three pretrained transformer models on SentEval: one trained with the NSP loss (BERT), one trained with the SOP loss (ALBERT) and one trained with neither, RoBERTa (Liu et al., 2019). We did not find that the CLS embeddings produced by models trained against the NSP or SOP losses outperform those of a model trained without either loss, and they sometimes failed to outperform a bag-of-words (BoW) baseline (Table 4). Furthermore, we find that pooling token embeddings via averaging (referred to as "mean pooling" in our paper) outperforms pooling via the CLS classification token.
Our results are corroborated by Liu et al. 2019, who find that removing the NSP loss leads to the same or better results on downstream tasks, and Reimers and Gurevych 2019, who find that directly using the output of BERT as sentence embeddings leads to poor performance on the semantic similarity tasks of SentEval.

B Examples of sampled spans
In Table 5, we present examples of anchor-positive and anchor-negative pairs generated by our sampling procedure. We show one example for each possible view of a sampled positive, e.g. positives adjacent to, overlapping with, or subsumed by the anchor. For each anchor-positive pair, we show examples of both a hard negative (derived from the same document) and an easy negative (derived from another document). Recall that a minibatch is composed of random documents, and each anchor-positive pair sampled from a document is contrasted against all other anchor-positive pairs in the minibatch. Thus, hard negatives, as we have described them here, are generated only when sampling multiple anchors per document (A > 1).

C SentEval evaluation details
SentEval is a benchmark for evaluating the quality of fixed-length sentence embeddings. It is divided into 18 downstream tasks and 10 probing tasks. Sentence embedding methods are evaluated on these tasks via a simple interface, which standardizes training, evaluation and hyperparameters.
For most tasks, the method to evaluate is used to produce fixed-length sentence embeddings, and a simple logistic regression (LR) or multi-layer perceptron (MLP) model is trained on the task using these embeddings as input. For other tasks (namely several semantic text similarity tasks), the embeddings are used as-is without any further training.

Table 4: Results on the downstream and probing tasks from the validation set of SentEval. We compare models trained with the Next Sentence Prediction (NSP) and Sentence-Order Prediction (SOP) losses to a model trained with neither, using two different pooling strategies: "*-CLS", where the special classification token is used as its sentence representation, and "*-mean", where each sentence is represented by the mean of its token embeddings.
Note that this setup is different from evaluations on the popular GLUE benchmark (Wang et al., 2019), which typically use the task data to fine-tune the parameters of the sentence embedding model. In subsection C.1, we present the individual tasks of the SentEval benchmark. In subsection C.2, we explain our method for computing the average downstream and average probing scores presented in our paper.
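The frozen-encoder protocol described above can be sketched with a toy example. This is an illustrative stand-in for SentEval's actual interface: we train a simple logistic regression probe (via gradient descent in NumPy) on fixed "embeddings", never updating the encoder that produced them.

```python
import numpy as np

def eval_frozen_embeddings(X_train, y_train, X_test, y_test, lr=0.5, steps=500):
    """Train a logistic regression probe on frozen sentence embeddings.

    Mirrors the SentEval protocol: only the linear classifier is trained;
    the encoder that produced the embeddings is never updated.
    """
    w = np.zeros(X_train.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))  # predicted probabilities
        grad = p - y_train                            # dL/dz for the log loss
        w -= lr * X_train.T @ grad / len(y_train)
        b -= lr * grad.mean()
    pred = (X_test @ w + b) > 0
    return float((pred == y_test).mean())

# Toy "sentence embeddings": two linearly separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, (50, 8)), rng.normal(1.0, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
acc = eval_frozen_embeddings(X, y, X, y)
```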

C.1 SentEval tasks
The downstream tasks of SentEval are representative NLP tasks used to evaluate the transferability of fixed-length sentence embeddings. We give a brief overview of the broad categories that divide the tasks below (see Conneau and Kiela 2018 for more details):
• Binary and multi-class classification: These tasks cover various types of sentence classification, including sentiment analysis (MR, Pang and Lee 2005; SST2 and SST5, Socher et al. 2013), question-type (TREC) (Voorhees and Tice, 2000), product reviews (CR) (Hu and Liu, 2004), subjectivity/objectivity (SUBJ) (Pang and Lee, 2004) and opinion polarity (MPQA) (Wiebe et al., 2005).
• Entailment and semantic relatedness: These tasks cover multiple entailment datasets (also known as natural language inference or NLI), including SICK-E (Marelli et al., 2014).
• Paraphrase detection: Evaluated on the Microsoft Research Paraphrase Corpus (MRPC) (Dolan et al., 2004), this binary classification task is comprised of human-labelled sentence pairs, annotated according to whether they capture a paraphrase/semantic equivalence relationship.
• Caption-Image retrieval: This task is comprised of two sub-tasks: ranking a large collection of images by their relevance for some given query text (Image Retrieval) and ranking captions by their relevance for some given query image (Caption Retrieval). Both tasks are evaluated on data from the COCO dataset (Lin et al., 2014). Each image is represented by a pretrained, 2048-dimensional embedding produced by a ResNet-101 (He et al., 2016).
The probing tasks are designed to evaluate what linguistic properties are encoded in a sentence representation. All tasks are binary or multi-class classification. We give a brief overview of each task below (see Conneau et al. 2018 for more details):
• Sentence length (SentLen): A multi-class classification task where a model is trained to predict the length of a given input sentence, which is binned into six possible length ranges.
• Word content (WC): A multi-class classification task where, given 1000 words as targets, the goal is to predict which of the target words appears in a given input sentence. Each sentence contains a single target word, and the word occurs exactly once in the sentence.
• Tree depth (TreeDepth): A multi-class classification task where the goal is to predict the maximum depth (with values ranging from 5 to 12) of a given input sentence's syntactic tree.
• Bigram Shift (BShift): A binary classification task where the goal is to predict whether two consecutive tokens within a given sentence have been inverted.
• Top Constituents (TopConst): A multi-class classification task where the goal is to predict the top constituents (from a choice of 19) immediately below the sentence (S) node of the sentence's syntactic tree.
• Tense: A binary classification task where the goal is to predict the tense (past or present) of the main verb in a sentence.
• Subject number (SubjNum): A binary classification task where the goal is to predict the number (singular or plural) of the subject of the main clause.
• Object number (ObjNum): A binary classification task, analogous to SubjNum, where the goal is to predict the number (singular or plural) of the direct object of the main clause.
• Semantic odd man out (SOMO): A binary classification task where the goal is to predict whether a sentence has had a single randomly picked noun or verb replaced with another word with the same part-of-speech.
• Coordinate inversion (CoordInv): A binary classification task where the goal is to predict whether the order of two coordinate clauses in a sentence has been inverted.
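As an illustration of how a probing example is labelled, the SentLen task can be sketched as follows; the bin boundaries used here are hypothetical, not SentEval's exact ones.

```python
def sentlen_label(sentence,
                  bins=((1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 28))):
    """Map a sentence to one of six length bins, as in the SentLen probe.

    The bin boundaries are illustrative; SentEval's exact bins differ.
    """
    n = len(sentence.split())
    for label, (lo, hi) in enumerate(bins):
        if lo <= n <= hi:
            return label
    return len(bins) - 1  # clamp overlong sentences into the last bin
```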

C.2 Computing an average score
In our paper, we present averaged downstream and probing scores. Computing averaged probing scores was straightforward; each of the ten probing tasks reports a simple accuracy, which we averaged.
To compute an averaged downstream score, we do the following:
• If a task reports Spearman correlation (i.e. SICK-R, STS-B), we use this score when computing the average downstream task score. If the task reports a mean Spearman correlation for multiple subtasks (i.e. STS12, STS13, STS14, STS15, STS16), we use this score.
• If a task reports both an accuracy and an F1 score (i.e. MRPC), we use the average of these two scores.
• For the Caption-Image Retrieval task, we report the average of the Recall@K, where K ∈ {1, 5, 10} for the Image and Caption retrieval tasks (a total of six scores). This is the default behaviour of SentEval.
• Otherwise, we use the reported accuracy.
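The aggregation rules above can be sketched as a small function; the task names and metric keys used here are illustrative, not SentEval's exact output format.

```python
def average_downstream(results):
    """Aggregate per-task results into one averaged downstream score.

    `results` maps a task name to a dict of its reported metrics; the
    metric keys below are hypothetical placeholders.
    """
    per_task = []
    for task, metrics in results.items():
        if "spearman" in metrics:                    # STS*, SICK-R, STS-B
            per_task.append(metrics["spearman"])
        elif "acc" in metrics and "f1" in metrics:   # MRPC
            per_task.append((metrics["acc"] + metrics["f1"]) / 2)
        elif "recalls" in metrics:                   # Caption-Image retrieval
            per_task.append(sum(metrics["recalls"]) / len(metrics["recalls"]))
        else:                                        # all other tasks
            per_task.append(metrics["acc"])
    return sum(per_task) / len(per_task)

score = average_downstream({
    "STS-B": {"spearman": 0.80},
    "MRPC": {"acc": 0.76, "f1": 0.84},
    "SST2": {"acc": 0.90},
})
```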

Figure 1: Overview of the self-supervised contrastive objective. (A) For each document d in a minibatch of size N, we sample A anchor spans per document and P positive spans per anchor. For simplicity, we illustrate the case where A = P = 1 and denote the anchor-positive span pair as s_i, s_j. Both spans are fed through the same encoder f(·) and pooler g(·) to produce the corresponding embeddings e_i = g(f(s_i)), e_j = g(f(s_j)). The encoder and pooler are trained to minimize the distance between embeddings via a contrastive prediction task, where the other embeddings in a minibatch are treated as negatives (omitted here for simplicity). (B) Positive spans can overlap with, be adjacent to or be subsumed by the sampled anchor span. (C) The lengths of anchors and positives are randomly sampled from beta distributions, skewed toward longer and shorter spans, respectively.
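The contrastive prediction task sketched in Figure 1 can be illustrated with an InfoNCE-style loss over a minibatch of embedding pairs. This is a simplification of our objective: here, only the other positives in the batch serve as negatives for each anchor, and the temperature value is hypothetical.

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over a minibatch of anchor/positive embeddings.

    Each anchor is pulled toward its matched positive and pushed away from
    every other positive in the batch (the in-batch negatives).
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # matched pairs on the diagonal
```

When each anchor matches its own positive, the loss is near zero; mismatched pairs drive it up, which is what trains the encoder and pooler to embed spans from the same document nearby.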

Figure 3: Effect of training objective, train set size and model capacity on SentEval performance. DeCLUTR-small has 6 layers and ∼82M parameters. DeCLUTR-base has 12 layers and ∼125M parameters. Averaged downstream task scores are reported from the validation set of SentEval. 100% corresponds to 1 epoch of training with all 497,868 documents from our OpenWebText subset.

Table 1: Trainable model parameter counts and sentence embedding dimensions. DeCLUTR-small and DeCLUTR-base are pretrained DistilRoBERTa and RoBERTa-base models, respectively, after continued pretraining with our method.

Table 5: Examples of text spans generated by our method. During training, we randomly sample one or more anchors from every document in a minibatch. For each anchor, we randomly sample one or more positives adjacent to, overlapping with, or subsumed by the anchor. All anchor-positive pairs are contrasted with every other anchor-positive pair in the minibatch. This leads to easy negatives (anchors and positives sampled from other documents in a minibatch) and hard negatives (anchors and positives sampled from the same document). Here, examples are capped at a maximum length of 64 tokens. During training, we sample spans up to a length of 512 tokens.
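The skewed span lengths (Figure 1C) and the 512-token cap can be sketched as follows; the beta shape parameters here are illustrative, not the exact values used in our method.

```python
import random

def sample_length(min_len, max_len, alpha, beta, rng):
    """Draw a span length from a beta distribution rescaled to [min_len, max_len].

    alpha > beta skews toward longer spans (anchors); alpha < beta skews
    toward shorter ones (positives). Shape values below are hypothetical.
    """
    frac = rng.betavariate(alpha, beta)
    return min_len + int(frac * (max_len - min_len))

rng = random.Random(0)
anchor_lens = [sample_length(32, 512, 4, 2, rng) for _ in range(2000)]    # skewed long
positive_lens = [sample_length(32, 512, 2, 4, rng) for _ in range(2000)]  # skewed short
```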