Pairwise Supervised Contrastive Learning of Sentence Representations

Many recent successes in sentence representation learning have been achieved by simply fine-tuning on the Natural Language Inference (NLI) datasets with a triplet loss or siamese loss. Nevertheless, these approaches share a common weakness: sentences in a contradiction pair are not necessarily from different semantic categories. Therefore, optimizing the semantic entailment and contradiction reasoning objective alone is inadequate to capture the high-level semantic structure. The drawback is compounded by the fact that the vanilla siamese or triplet losses only learn from individual sentence pairs or triplets, which often leads to bad local optima. In this paper, we propose PairSupCon, an instance discrimination based approach aiming to bridge semantic entailment and contradiction understanding with high-level categorical concept encoding. We evaluate PairSupCon on various downstream tasks that involve understanding sentence semantics at different granularities. We outperform the previous state-of-the-art method with 10%–13% averaged improvement on eight clustering tasks, and 5%–6% averaged improvement on seven semantic textual similarity (STS) tasks.


Introduction
Learning high-quality sentence embeddings is a fundamental task in Natural Language Processing. The goal is to map semantically similar sentences close together and dissimilar sentences farther apart in the representation space. Many recent successes have been achieved by training on the NLI datasets (Bowman et al., 2015; Williams et al., 2017; Wang et al., 2018), where the task is often to classify each sentence pair into one of three categories: entailment, contradiction, or neutral. Despite promising results, prior work (Conneau et al., 2017; Cer et al., 2018; Reimers and Gurevych, 2019a) shares a common weakness: the sentences forming a contradiction pair may not necessarily belong to different semantic categories. Consequently, optimizing the model for semantic entailment and contradiction understanding alone is inadequate to encode the high-level categorical concepts into the representations. Moreover, the vanilla siamese (triplet) loss only learns from individual sentence pairs (triplets), which often requires substantial training examples to achieve competitive performance (Oh Song et al., 2016; Thakur et al., 2020). As shown in Section 4.1, the siamese loss can sometimes drive a model to bad local optima where the performance of high-level semantic concept encoding is degraded when compared with its baseline counterpart.
In this paper, we take inspiration from self-supervised contrastive learning (Bachman et al., 2019; He et al., 2020; Chen et al., 2020) and propose jointly optimizing the pairwise semantic reasoning objective with an instance discrimination loss. We name our approach Pairwise Supervised Contrastive Learning (PairSupCon). As noticed by recent work (Wu et al., 2018), instance discrimination learning can implicitly group similar instances together in the representation space without any explicit force directing it to do so. PairSupCon leverages this implicit grouping effect to bring together representations from the same semantic category while simultaneously enhancing the semantic entailment and contradiction reasoning capability of the model. Although prior work mainly focuses on pairwise semantic similarity related evaluations, we argue in this paper that the capability of encoding the high-level categorical semantic concept into the representations is an equally important aspect for evaluation. As shown in Section 4, the previous state-of-the-art model that performs best on the semantic textual similarity (STS) tasks produces degenerated embeddings of the categorical semantic structure. On the other hand, better capturing the high-level semantic concepts can in turn promote better performance on the low-level semantic entailment and contradiction reasoning. This assumption is consistent with how humans categorize objects in a top-down reasoning manner. We further validate our assumption in Section 4, where PairSupCon achieves an averaged improvement of 10%–13% over the prior work when evaluated on eight short text clustering tasks, and yields 5%–6% averaged improvement on seven STS tasks.

Related Work
Sentence Representation Learning with NLI The suitability of leveraging NLI to promote better sentence representation learning was first observed by InferSent (Conneau et al., 2017), where a siamese BiLSTM network is optimized in a supervised manner with the semantic entailment and contradiction classification objective. Universal Sentence Encoder (Cer et al., 2018) later augments an unsupervised learning objective with supervised learning on NLI, and shows better transfer performance on various downstream tasks.
More recently, SBERT (Reimers and Gurevych, 2019b) fine-tunes a siamese BERT (Devlin et al., 2018) model on NLI and sets new state-of-the-art results. However, SBERT, as well as the above work, adopts the vanilla siamese or triplet loss, which often suffers from slow convergence and bad local optima (Oh Song et al., 2016; Thakur et al., 2020).

Self-Supervised Instance Discrimination
Another relevant line of work is self-supervised contrastive learning, which essentially solves an instance discrimination task that aims to discriminate each positive pair from all negative pairs within each batch of data (Oord et al., 2018; Bachman et al., 2019; He et al., 2020; Chen et al., 2020). Owing to its notable successes, self-supervised instance discrimination has become a prominent pre-training strategy for providing effective representations for a wide range of downstream tasks.
While recent successes are primarily driven by the computer vision domain, there is increasing interest in leveraging variants of the instance discrimination task to support pretrained language models (PLMs) (Meng et al., 2021; Giorgi et al., 2020; Rethmeier and Augenstein, 2021). Our proposal can be seen as complementary to this stream of work, considering that instance discrimination based PLMs provide a good foundation for PairSupCon to further enhance sentence representation quality through further learning on NLI. As demonstrated in Section 4, by training a pre-trained BERT-base model for less than an hour, PairSupCon attains substantial improvement on various downstream tasks that involve sentence semantics understanding at different granularities.
Deep Metric Learning Inspired by the pioneering work of (Hadsell et al., 2006;Weinberger and Saul, 2009), many recent works have shown significant benefit in learning deep representations using either siamese loss or triplet loss. However, both losses learn from individual pairs or triplets, which often require substantial training data to achieve competitive performance. Two different streams of work have been proposed to tackle this issue, with the shared focus on nontrivial pairs or triplets optimization. Wang et al. (2014);Schroff et al. (2015); Wu et al. (2017); Harwood et al. (2017) propose hard negative or hard positive mining that often requires expensive sampling. Oh Song et al. (2016) extends the vanilla triplet loss by contrasting each positive example against multiple negatives.
Our work leverages the strengths of both lines, with the key difference that the above work requires categorical-level supervision for selecting hard negatives. To be more specific, negative samples that have different categorical labels from the anchor but are currently mapped close to the anchor in the representation space are likely to be more useful and hence are sampled. However, no categorical labels are available in NLI. We thereby contrast each positive pair against multiple negatives collected using an unsupervised importance sampling strategy, under the hypothesis that hard negatives are more likely to be located close to the anchor. The effectiveness of this assumption is investigated in Section 4.3.

Model
Following SBERT (Reimers and Gurevych, 2019a), we adopt SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2017) as our training data, and refer to the combined data as NLI for convenience. The NLI data consists of labeled sentence pairs, each of which can be presented in the form (premise, hypothesis, label). The premise sentences are selected from existing text sources, and each premise sentence is paired with various hypothesis sentences composed by human annotators. Each label indicates the hypothesis type and categorizes the semantic relation of the associated premise and hypothesis pair into one of three categories: entailment, contradiction, or neutral. Prior work solely optimizes either a siamese loss or a triplet loss on NLI. We instead aim to leverage the implicit grouping effect of instance discrimination learning to better capture the high-level categorical semantic structure of the data while simultaneously promoting better convergence of the low-level semantic textual entailment and contradiction reasoning objective.
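The (premise, hypothesis, label) format and the label handling described above can be illustrated with a minimal sketch; the example sentences below are hypothetical, and the real SNLI/MNLI corpora contain roughly a million such pairs:

```python
# Hypothetical NLI examples in (premise, hypothesis, label) form.
nli_examples = [
    ("A man is playing a guitar.", "A person is making music.", "entailment"),
    ("A man is playing a guitar.", "The man is sleeping.", "contradiction"),
    ("A man is playing a guitar.", "The man is on a stage.", "neutral"),
]

def to_pairsupcon_batch(examples):
    """Keep only entailment/contradiction pairs (neutral pairs are dropped,
    as described in Section 3.3) and map labels to y = +1 / -1."""
    label_map = {"entailment": 1, "contradiction": -1}
    return [(p, h, label_map[l]) for p, h, l in examples if l in label_map]

batch = to_pairsupcon_batch(nli_examples)
```

Only the entailment pairs (y = +1) later serve as positives for the instance discrimination objective.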

Instance Discrimination
We leverage the positive (entailment) pairs of NLI to optimize an instance discrimination objective that tries to separate each positive pair from all other sentences. Let $\mathcal{D} = \{(x_j, x'_j), y_j\}_{j=1}^{M}$ denote a randomly sampled minibatch, with $y_j = \pm 1$ indicating an entailment or contradiction pair. Then for a premise sentence $x_i$ within a positive pair $(x_i, x'_i)$, we aim to separate its hypothesis sentence $x'_i$ from all other $2M-2$ sentences within the same batch $\mathcal{D}$. To be more specific, let $\mathcal{A}(i)$ denote the indices of all sentences in $\mathcal{D}$ other than $x_i$; we then minimize the following for $x_i$:

$$\ell_i^{ID} = -\log \frac{\exp\big(s(z_i, z'_i)/\tau\big)}{\sum_{j \in \mathcal{A}(i)} \exp\big(s(z_i, z_j)/\tau\big)} \qquad (1)$$

In the above equation, $z_j = h(\psi(x_j))$ denotes the output of the instance discrimination head in Figure 2, $\tau$ denotes the temperature parameter, and $s(\cdot, \cdot)$ is chosen as the cosine similarity, i.e., $s(z_i, z'_i) = z_i^\top z'_i / (\|z_i\| \|z'_i\|)$. Notice that Equation (1) can be interpreted as a $(2M-1)$-way softmax based classification loss of classifying $z_i$ as $z'_i$.
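The $(2M-1)$-way softmax loss above can be sketched in plain Python. This is a minimal illustration only: the actual model computes $z$ with a transformer encoder plus an MLP head, and the temperature value here is a hypothetical choice:

```python
import math

def cosine(a, b):
    """Cosine similarity s(.,.) between two embedding vectors (lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def instance_disc_loss(z_anchor, z_pos, z_batch, tau=0.05):
    """(2M-1)-way softmax reading of Eq. (1): classify the anchor's paired
    sentence z_pos against every other sentence in the batch. z_batch holds
    all 2M embeddings; the anchor itself is excluded from the denominator
    (via an identity check)."""
    denom = sum(math.exp(cosine(z_anchor, z) / tau)
                for z in z_batch if z is not z_anchor)
    pos = math.exp(cosine(z_anchor, z_pos) / tau)
    return -math.log(pos / denom)
```

A symmetric call with the roles of premise and hypothesis exchanged gives the second term of the averaged loss described next.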
Similarly, for the hypothesis sentence $x'_i$ we also try to discriminate its premise $x_i$ from all other sentences in $\mathcal{D}$. We denote the corresponding loss as $\ell_i'^{ID}$, defined by exchanging the roles of $x_i$ and $x'_i$ in Equation (1). In summary, the final instance discrimination loss is averaged over all positive pairs in $\mathcal{D}$:

$$\mathcal{L}_{ID} = \frac{1}{2P} \sum_{j=1}^{M} \mathbb{1}(y_j = 1)\big(\ell_j^{ID} + \ell_j'^{ID}\big) \qquad (2)$$

Here, $\mathbb{1}(\cdot)$ denotes the indicator function, and $P \le M$ is the number of positive pairs in $\mathcal{D}$. As demonstrated in Section 4, optimizing the above loss not only helps implicitly encode categorical semantic structure into the representations, but also promotes better pairwise semantic reasoning capability, even though no pairwise supervision other than the true entailment labels is presented to the model.

Learning from Hard Negatives
Notice that Eq. (1) can be rewritten as

$$\ell_i^{ID} = -\,s(z_i, z'_i)/\tau + \log \sum_{j \in \mathcal{A}(i)} \exp\big(s(z_i, z_j)/\tau\big)$$

It can be interpreted as extending the vanilla triplet loss by treating the other $2M-2$ samples within the same minibatch as negatives. However, those negatives are uniformly sampled from the training data, regardless of how informative they are. Ideally, we want to focus on hard negatives that come from different semantic groups but are mapped close to the anchor, i.e., $z_i$, in the representation space. Although categorical-level supervision is not available in NLI, we approximate the importance of each negative sample $z_j$ via

$$\alpha_j = \frac{\exp\big(s(z_i, z_j)/\tau\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(s(z_i, z_k)/\tau\big)}$$

where $\mathcal{N}(i)$ denotes the indices of the $2M-2$ negatives of anchor $z_i$; $\alpha_j$ can be interpreted as the relative importance of $z_j$ among all $2M-2$ negatives. Rescaling each negative's term in Eq. (1) by $(2M-2)\,\alpha_j$ yields the weighted loss

$$\ell_i^{wID} = -\log \frac{\exp\big(s(z_i, z'_i)/\tau\big)}{\exp\big(s(z_i, z'_i)/\tau\big) + \sum_{j \in \mathcal{N}(i)} (2M-2)\,\alpha_j \exp\big(s(z_i, z_j)/\tau\big)} \qquad (3)$$

The assumption is that hard negatives are more likely to be located close to the anchor in the representation space. Although there might exist false negatives, i.e., samples located close to the anchor $z_i$ but from the same category, the probability is low as long as the number of underlying categories in the training data is not too small and each minibatch $\mathcal{D}$ is uniformly sampled.
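One reading of the reweighting above can be sketched as follows. This is a sketch under stated assumptions (the exact normalization may differ from the released implementation, and the temperature is a hypothetical value); note that a uniform batch, where every $\alpha_j = 1/(2M-2)$, recovers the unweighted loss:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def weighted_instance_disc_loss(z_anchor, z_pos, z_negs, tau=0.05):
    """Importance-weighted instance discrimination loss: each negative's
    softmax term is rescaled by (2M-2) * alpha_j, so negatives close to
    the anchor (presumed hard) dominate the denominator."""
    exp_negs = [math.exp(cosine(z_anchor, z) / tau) for z in z_negs]
    total = sum(exp_negs)
    alphas = [e / total for e in exp_negs]  # relative importance alpha_j
    n = len(z_negs)                         # n = 2M - 2 negatives
    pos = math.exp(cosine(z_anchor, z_pos) / tau)
    denom = pos + sum(n * a * e for a, e in zip(alphas, exp_negs))
    return -math.log(pos / denom)
```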

Entailment and Contradiction Reasoning
The instance discrimination loss mainly focuses on separating each positive pair from the others, whereas there is no explicit force discriminating contradiction from entailment. To this end, we jointly optimize a pairwise entailment and contradiction reasoning objective. We adopt the softmax-based cross-entropy to form the pairwise classification objective. Let $u_i = \psi(x_i)$ denote the representation of sentence $x_i$; then for each labeled sentence pair $(x_i, x'_i, y_i)$ we minimize the following:

$$\ell_i^{C} = CE\Big(f\big([u_i;\, u'_i;\, |u_i - u'_i|]\big),\; y_i\Big) \qquad (4)$$

Here $f$ denotes the linear classification head in Figure 1, and $CE$ is the cross-entropy loss. Different from Reimers and Gurevych (2019b), we exclude the neutral pairs from the original training set and focus on the binary classification of semantic entailment and contradiction only. Our motivation is that the neutral semantic relation can already be well captured by the instance discrimination loss. We therefore drop the neutral pairs from the training data, which reduces the functional redundancy of the two losses in PairSupCon and improves learning efficiency as well.
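The classification input and the cross-entropy term can be sketched minimally. The SBERT-style concatenation $(u; u'; |u - u'|)$ is our assumption about the head's input, following Reimers and Gurevych (2019b), and the linear head itself is omitted:

```python
import math

def pair_features(u, v):
    """SBERT-style input to the classification head f: the concatenation
    (u; v; |u - v|), here for plain Python lists of equal length."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]

def cross_entropy(logits, label):
    """Softmax cross-entropy of a single example, computed with the
    max-shift trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```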
Overall loss In summary, our overall loss is

$$\mathcal{L} = \frac{1}{M}\sum_{i=1}^{M} \ell_i^{C} \;+\; \frac{\beta}{2P}\sum_{i=1}^{M} \mathbb{1}(y_i = 1)\big(\ell_i^{wID} + \ell_i'^{wID}\big) \qquad (5)$$

where $\ell_i^{C}$ and $\ell_i^{wID}, \ell_i'^{wID}$ are defined in Equations (4) and (3), respectively. In the above equation, $\beta$ balances the capability of pairwise semantic entailment and contradiction reasoning against the capability of high-level categorical semantic structure encoding. We dive deep into the trade-off between these two aspects by evaluating PairSupCon with different $\beta$ values in Section 4.3, and show how different $\beta$ values can benefit different downstream tasks. Unless otherwise specified, we set $\beta = 1$ in this paper, with the aim of providing effective representations for various downstream tasks rather than tailoring to any specific one.
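The way the two objectives combine can be sketched as follows, assuming the per-pair losses have already been computed; the averaging details are our reconstruction:

```python
def pairsupcon_loss(cls_losses, id_losses, labels, beta=1.0):
    """Combine the pairwise classification loss (all pairs) with the
    instance discrimination loss (positive pairs only), weighted by beta.
    `id_losses` holds the summed bidirectional ID loss per pair;
    `labels` holds y_i in {+1, -1}."""
    m = len(cls_losses)
    l_cls = sum(cls_losses) / m
    pos = [l for l, y in zip(id_losses, labels) if y == 1]
    l_id = sum(pos) / max(len(pos), 1)  # average over positive pairs
    return l_cls + beta * l_id
```

Setting `beta=0` recovers a pure classification objective, which, as shown in Section 4.3, hurts the encoding of high-level semantic structure.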

Experiments
Baselines In this section, we mainly investigate effective strategies for leveraging the labeled NLI data to enhance the sentence representations of pre-trained language models (PLMs). We compare PairSupCon against the vanilla BERT (Devlin et al., 2018; Sanh et al., 2019) models and the previous state-of-the-art approach, SBERT (Reimers and Gurevych, 2019a). We also notice a concurrent work, SimCSE (Gao et al., 2021), and include it in our comparison.

Clustering
Motivation Existing work mainly focuses on the semantic similarity (STS) related tasks. We argue that an equally important aspect of sentence representation evaluation, the capability of encoding the high-level categorical structure into the representations, has so far been neglected. Desirably, a model should map the instances from the same category close together in the representation space while mapping those from different categories farther apart. This expectation aligns well with the underlying assumption of clustering and is consistent with how humans categorize data. We evaluate the capability of categorical concept encoding using K-Means (MacQueen et al., 1967; Lloyd, 1982), given its simplicity and the fact that the algorithm itself manifests the above expectation. We consider eight benchmark datasets for short text clustering. As indicated in Table 2, the datasets present the desired diversity in both the size of each cluster and the number of clusters per dataset. Furthermore, each text instance consists of 6 to 28 words when averaged within each dataset, which well covers the spectrum of NLI, where each instance has 12 words on average. Therefore, we believe the proposed datasets can provide an informative evaluation of whether an embedding model is capable of capturing the high-level categorical concept.

Evaluation Results
The evaluation results are summarized in Table 1. We run K-Means with the scikit-learn package (Pedregosa et al., 2011) on the representations provided by each model and report the clustering accuracy averaged over 10 independent runs.

Table 3: Spearman rank correlation between the cosine similarity of sentence representations and the ground truth labels on seven Semantic Textual Similarity (STS) tasks. ♦ and ♠: results evaluated on the checkpoints provided by Reimers and Gurevych (2019a) and Gao et al. (2021), respectively.

As Table 1 shows, in comparison with the vanilla BERT models, SBERT results in degenerated embeddings of the categorical semantic structure by simply optimizing the pairwise siamese loss. One possible reason is that SBERT uses a large learning rate (2e-05) to optimize the transformer, which can cause catastrophic forgetting of the knowledge acquired by the original BERT models. We find that using a smaller learning rate for the backbone consistently improves the performance of SBERT (see the performance of BERT-base with "Classification" in Table 4).
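The clustering accuracy reported above requires matching predicted cluster ids to ground-truth classes. A minimal sketch follows, using brute-force matching over permutations, which is adequate for a handful of classes; at the scale of Table 1 the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`) would be the practical choice:

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_clusters):
    """Fraction of instances correctly assigned under the best one-to-one
    mapping from cluster ids to class labels (assumes the number of
    clusters does not exceed the number of classes)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_clusters))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, p in zip(true_labels, pred_clusters)
                   if mapping[p] == t)
        best = max(best, hits)
    return best / len(true_labels)
```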
Nevertheless, PairSupCon leads to an averaged improvement of 10.8% to 15.2% over SBERT, which validates our motivation of leveraging the implicit grouping effect of instance discrimination learning to better encode the high-level semantic concepts into representations. Moreover, PairSupCon also attains better performance than SimCSE, and we suspect this is because PairSupCon better leverages the training data. Specifically, PairSupCon aims to discriminate a positive (entailment) sentence pair from all other sentences, regardless of whether they are premises or hypotheses. In contrast, SimCSE only separates a premise from the other premises through their entailment and contradiction hypotheses, while there is no explicit instance discrimination force within the premises or the hypotheses alone. Considering the statistical differences (Williams et al., 2017) between premises and hypotheses, PairSupCon can potentially better capture categorical semantic concepts by leveraging additional intrinsic semantic properties of the premises or the hypotheses that remain undiscovered by SimCSE.
The evaluation results are reported in Table 3. PairSupCon substantially outperforms both the vanilla BERT and SBERT models. This validates our assumption that, by implicitly encoding the high-level categorical structure into the representations, PairSupCon promotes better convergence of the low-level semantic entailment reasoning objective. This assumption is consistent with the top-down categorization behavior of humans. Although SimCSE leverages the STS-Benchmark as its development set while PairSupCon is fully blind to the downstream tasks, we hypothesize that the performance gain of SimCSE on STS is mainly contributed by explicitly merging the entailment and contradiction separation into the instance discrimination loss. On the other hand, as we discussed in Section 4.1, PairSupCon achieves a more obvious performance gain on the clustering tasks through a bidirectional instance discrimination loss. Therefore, developing a better instance discrimination based sentence representation learning objective that incorporates the strengths of both SimCSE and PairSupCon could be a promising direction.

For notational convenience, we name the pairwise semantic relation classification objective in PairSupCon as Classification, and the instance discrimination objective as InstanceDisc.
PairSupCon versus Its Components In Figure 3, we compare PairSupCon against its two components, namely Classification and InstanceDisc. As it shows, InstanceDisc by itself outperforms Classification on both STS and categorical concept encoding. The result matches our expectation that contrasting each positive pair against multiple negatives, despite these being obtained through unsupervised sampling, yields better performance than simply learning from each individual pair. By jointly optimizing both objectives, PairSupCon leverages the implicit grouping effect of InstanceDisc to encode the high-level categorical structure into representations while simultaneously complementing InstanceDisc with more fine-grained semantic concept reasoning capability via Classification.

Table 6: Few-shot learning evaluation on SentEval. For each task, we randomly sample 16 labeled instances per class and report the mean (standard deviation) performance over 5 different training sets. ♦ and ♠: results evaluated on the checkpoints provided by Reimers and Gurevych (2019a) and Gao et al. (2021), respectively.

Table 4 indicates a trade-off between the high-level semantic structure encoding and the low-level pairwise entailment and contradiction reasoning capability of PairSupCon. Focusing more on the pairwise classification objective, i.e., using smaller β values, can hurt the embeddings of the high-level semantic structure. This result is not surprising, especially considering that sentences forming a contradiction pair do not necessarily belong to different semantic groups. On the other hand, InstanceDisc only focuses on separating each positive pair from all other samples within the same minibatch, and an explicit force that discriminates semantic entailment from contradiction is necessary for PairSupCon to achieve competitive performance on the more fine-grained pairwise similarity reasoning of STS.
As indicated in Table 4, we can tune the β values to attain effective representations for specific downstream tasks according to their semantic granularity focuses. We set β = 1 for all our experiments with the goal to provide effective universal sentence representations to different downstream tasks.
Hard Negative Sampling Helps In Table 5, we compare both PairSupCon and InstanceDisc against their counterparts where the negatives in the instance discrimination loss are uniformly sampled from data. As it shows, the hard negative sampling approach proposed in Section 3.2 leads to improved performance on both STS and clustering tasks. We associate this performance boost with our assumption that hard negatives are likely located close to the anchor. A properly designed distance-based sampling approach can drive the model to better focus on hard negative separation and hence lead to better performance.
On the other hand, hard negative sampling without any supervision is a very challenging problem, especially considering that samples within the local region of an anchor are also likely to come from the same semantic group as the anchor. As a consequence, a solely distance-based sampling approach can induce false negatives and hurt performance. To tackle this issue, leveraging proper structural assumptions or domain-specific knowledge could be a promising direction, which we leave as future work.

Transfer Learning
In order to provide a fair and comprehensive comparison with the existing work, we also evaluate PairSupCon on the following seven transfer tasks: MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), SUBJ (Pang and Lee, 2004), MPQA (Wiebe et al., 2005), SST (Socher et al., 2013), TREC (Li and Roth, 2002), and MRPC (Dolan et al., 2004). We follow the widely used evaluation protocol, where a logistic regression classifier is trained on top of the frozen representations, and the testing accuracy is used as a measure of the representation quality. We adopt the default configurations of the SentEval (Conneau and Kiela, 2018) toolkit and report the evaluation results in Table 7 in Appendix D. As we can see, the performance gap between different methods is relatively small.
We suspect this is because the transfer learning tasks do not present enough complexity to fully uncover the performance gap between different approaches, especially considering that most tasks are binary classification with a large amount of labeled training examples. To further examine our hypothesis, we extend the evaluation to the few-shot learning setting, where we uniformly sample 16 labeled instances per class for each task. We report the mean and standard deviation of the evaluation performance over 5 different sample sets in Table 6. Although we observe a more pronounced performance gap on each specific task, there is no consistent performance gap between different approaches when evaluated across tasks. Therefore, more complex and diverse datasets are required to better evaluate the transfer learning performance of sentence representations.
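The few-shot sampling protocol above can be sketched as follows. This is a simplified illustration; the actual evaluation trains a logistic regression classifier on each sampled subset via SentEval:

```python
import random
from collections import defaultdict

def sample_few_shot(examples, k=16, seed=0):
    """Uniformly sample k labeled instances per class, as in the few-shot
    protocol; repeating with different seeds yields the sample sets over
    which mean/std performance is reported. `examples` is a list of
    (input, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append((x, y))
    subset = []
    for y in sorted(by_class):
        subset.extend(rng.sample(by_class[y], k))
    return subset
```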

Discussion and Conclusion
In this paper, we present a simple framework for universal sentence representation learning. We leverage the implicit grouping effect of instance discrimination learning to better encode the high-level semantic structure of the data into representations while simultaneously promoting better convergence of the lower-level semantic entailment and contradiction reasoning objective. We substantially advance the previous state-of-the-art results when evaluated on various downstream tasks that involve understanding semantic concepts at different granularities.
We carefully study the key components of our model and pinpoint the performance gain contributed by each of them. We observe encouraging performance improvement from the proposed hard negative sampling strategy. On the other hand, hard negative sampling without any supervision is a crucial, yet significantly challenging problem that should motivate further exploration. Possible directions include making proper structural assumptions or leveraging domain-specific knowledge. The substantial performance gain attained by our model also suggests that developing explicit grouping objectives could be another direction worth investigating.

B Short Text Clustering Datasets
SearchSnippets is extracted from web search snippets and contains 12340 snippets associated with 8 groups (Phan et al., 2008).
StackOverflow is a subset of the challenge data published by Kaggle 6 , where 20000 question titles associated with 20 different categories are selected by Xu et al. (2017).
Biomedical is a subset of PubMed data distributed by BioASQ 7 , where 20000 paper titles from 20 groups are randomly selected by Xu et al. (2017).
AgNews is a subset of news titles (Zhang and LeCun, 2015), which contains 4 topics selected by Rakib et al. (2020).
Tweet consists of 89 categories with 2472 tweets in total (Yin and Wang, 2016).

C Leveraging STS-Benchmark as the Development Set
To better understand the underlying causes of the performance gap between PairSupCon and SimCSE on STS, we also train PairSupCon using the STS-Benchmark as the development set. We summarize the corresponding evaluation results in Table 8, which indicates that PairSupCon does not benefit from leveraging the STS-Benchmark. We thereby hypothesize that the performance gain of SimCSE is mainly attributed to merging the entailment and contradiction discrimination into the instance-wise contrastive learning objective.
On the other hand, as discussed in Section 4.1, SimCSE can be roughly interpreted as unidirectional instance-wise contrastive learning. In contrast, PairSupCon utilizes the training data more efficiently through a bidirectional instance discrimination loss, and hence achieves a more obvious performance gain on the clustering tasks. Therefore, developing a better instance discrimination based sentence representation learning objective that incorporates the strengths of both SimCSE and PairSupCon could be a promising direction.

D Transfer Learning
To provide a fair and more comprehensive comparison with the existing work, we also evaluate PairSupCon on the seven transfer tasks using the SentEval toolkit (Conneau and Kiela, 2018). We follow the widely used evaluation protocol, where a logistic regression classifier is trained on top of the frozen representations, and the testing accuracy is used as a measure of the representation quality. We report the evaluation results in Table 7. As we can see, the performance gaps between different models are small, yet still not consistent across different tasks. As discussed in Section 4.4, one possible explanation is that the transfer learning tasks do not present enough complexity to discriminate between different approaches, since most tasks are binary classification with a large amount of labeled training examples. Although we observe a more pronounced performance gap when extending the evaluation to the few-shot setting (Table 6), no specific model investigated in this paper attains a consistent performance gain across different tasks. Moving forward, more complex and diverse datasets for evaluating the transfer learning performance could better direct the development of universal sentence representation learning.

Table 8: No obvious performance gain is attained by PairSupCon when using STS-B as the development set, in comparison with SimCSE (Gao et al., 2021).