Contrastive Learning with Prompt-derived Virtual Semantic Prototypes for Unsupervised Sentence Embedding

Contrastive learning has become a new paradigm for unsupervised sentence embeddings. Previous studies focus on instance-wise contrastive learning, attempting to construct positive pairs with textual data augmentation. In this paper, we propose a novel Contrastive learning method with Prompt-derived Virtual semantic Prototypes (ConPVP). Specifically, with the help of prompts, we construct virtual semantic prototypes for each instance, and derive negative prototypes by using the negative form of the prompts. Using a prototypical contrastive loss, we enforce the anchor sentence embedding to be close to its corresponding semantic prototypes, and far apart from the negative prototypes as well as the prototypes of other sentences. Extensive experimental results on semantic textual similarity, transfer, and clustering tasks demonstrate the effectiveness of our proposed model compared to strong baselines. Code is available at https://github.com/lemon0830/promptCSE.


Introduction
High-quality sentence embeddings can boost the performance of pre-trained language models (PLMs) on many downstream tasks (Kiros et al., 2015; Logeswaran and Lee, 2018; Reimers and Gurevych, 2019a). Recent research focuses on learning sentence embeddings in an unsupervised manner due to the lack of large-scale labeled data (Hill et al., 2016; Pagliardini et al., 2018; Wang et al., 2021b). Among these methods, contrastive learning has been extensively explored and has achieved remarkable success (Gao et al., 2021b; Wu et al., 2021; Yan et al., 2021; Giorgi et al., 2021). Specifically, most of them construct a positive pair by applying various textual data augmentation methods, while regarding two independent sentences sampled uniformly from the training data as a negative pair. In spite of its effectiveness in easing the anisotropy problem, such instance-wise optimization leads to a locally smooth embedding space and, to some extent, ignores semantic relevance (Li et al., 2020b). Moreover, due to the discrete nature of language, data augmentation can change sentence semantics significantly, so a positive sample can inadvertently turn into a negative one (Wang et al., 2021a).
To alleviate these issues, we introduce the idea of prototypical contrastive learning, which has proven effective for learning structured visual embedding spaces (Li et al., 2020b; Caron et al., 2020), to unsupervised sentence embedding learning. The motivation is that when sentences are encoded into the embedding space, sentences with similar semantics should cluster together around a corresponding prototype. Nevertheless, directly applying the clustering algorithms used by Li et al. (2020b) and Caron et al. (2020) makes the acquisition of prototypes inefficient, as they require an extra forward pass over the training set or correlate only weakly with semantics (Caron et al., 2020). This makes us wonder: is there a dedicated method of mining semantic prototypes for sentence embeddings, especially one suited to PLMs?
To answer this question, we think from the perspective of prompt learning (Brown et al., 2020). Intuitively, on sentence-level NLP tasks such as classification, neural models encode and map each input to a corresponding semantic prototype in the embedding space. For example, sentiment analysis models divide instances into semantic prototypes related to sentiment polarity. In addition, sentence-level tasks can be solved by providing task-specific prompts to PLMs as a condition, even without any fine-tuning (Brown et al., 2020; Sanh et al., 2021; Wei et al., 2021). As illustrated in Figure 1, PLMs can directly generate reasonable label words (e.g., "positive") for a sentence <S> by answering the query of the "[MASK]" token when fed a prompt-wrapped sequence (e.g., "<S>, is this review positive? [MASK]"). Thus, we argue that the representations of the "[MASK]" token derived from task-specific templates can be viewed as virtual semantic prototypes, which can be obtained without label information (Lan et al., 2021) or clustering algorithms (Li et al., 2020b; Caron et al., 2020). Besides the commonly-used templates, we manually convert each template to its negation and use the negated templates to induce negative prototypes. Returning to Figure 1, with the template "<S> is not a [MASK] one", the word "bad" can be derived from PLMs.
In this paper, we propose ConPVP (Contrastive learning with Prompt-derived Virtual semantic Prototypes) for unsupervised sentence representation learning. Specifically, given an input sentence, we generate the positive and negative prototypical embeddings by using a task-specific template and its negative counterpart, respectively. We use a contrastive loss to enforce the sentence embedding to be close to its positive prototype, and far apart from the negative prototype as well as the prototypes of other sentences. As illustrated in Figure 2, the issue of local smoothness can be alleviated by exploiting the semantic regularization induced by task-specific prompts, bringing sentences with similar semantics closer together. We empirically evaluate our proposed ConPVP on a range of semantic textual similarity tasks, and the experimental results show substantial improvements over strong baselines. Further, extensive analysis and applications to transfer and clustering tasks confirm the effectiveness and robustness of ConPVP.
Related Work

Prototypical Contrastive Learning
Recently, prototypical contrastive learning has shown its power in computer vision (Li et al., 2020b; Caron et al., 2020; Sharma et al., 2020) and NLP tasks (Wei et al., 2022; Ding et al., 2021); these methods discover the underlying semantic structure by clustering the learned embeddings. In contrast, we propose a more efficient and dedicated method to find prototypes for sentence embeddings, without using clustering algorithms or label information. To the best of our knowledge, we are the first to explore prototypical contrastive learning in unsupervised sentence representation learning.

Prompt-based Learning
Prompt-based learning has become a new paradigm in NLP, bridging the gap between pre-training tasks and downstream tasks (Brown et al., 2020; Schick and Schütze, 2021a; Sanh et al., 2021). It reformulates various NLP tasks as cloze-style questions, and by doing so, the knowledge stored in PLMs can be fully exploited, enabling PLMs to achieve impressive performance in few-shot and zero-shot settings. Along this research line, various types of prompts have been explored, including discrete and continuous prompts (Gao et al., 2021a; Shin et al., 2020; Hu et al., 2021; Liu et al., 2021; Cui et al., 2021; Si et al., 2021; Li and Liang, 2021; Schick and Schütze, 2021b). In this work, we exploit prompts from different downstream tasks to assign various virtual semantic prototypes to each instance.

Unsupervised Sentence Embedding
Unsupervised learning has been used to improve sentence embedding learning (Reimers and Gurevych, 2019b; Li et al., 2020a; Su et al., 2021; Zhang et al., 2020), and contrastive learning has attracted extensive attention due to its promising performance (Gao et al., 2021b; Giorgi et al., 2021; Wu et al., 2020; Yan et al., 2021; Meng et al., 2021; Carlsson et al., 2021). Wu et al. (2021) augment positive pairs with word repetition and introduce a momentum encoder for negative pairs. Wang et al. (2022) use soft negative samples that are textually similar to the input sentence but opposite in meaning. Jiang et al. (2022) use a discrete template to obtain sentence embeddings. Unlike these studies, we introduce prototypical contrastive learning and implicitly encode the semantic structure induced by task-specific prompts into the embedding space, enhancing PLMs' ability to model semantic similarity. Furthermore, our prototypical contrastive loss is orthogonal to the instance-wise one, and performance can be further improved by combining ConPVP with the above methods.

Method
In this section, we elaborate on the proposed ConPVP, a novel contrastive learning approach that implicitly encodes semantic structure into the embedding space. As illustrated in Figure 3, ConPVP is based on the popular SimCSE framework (Gao et al., 2021b) and further leverages the concept of semantic prototypes.

Prompt-derived Virtual Semantic Prototypes
A semantic prototype is defined as a representative embedding for a group of semantically similar instances (Li et al., 2020b). Given that PLMs perform well on various NLP tasks when provided with suitable task-specific templates (Brown et al., 2020; Sanh et al., 2021; Wei et al., 2021), we can induce the semantic prototypes of sentences from PLMs with the help of prompts. In this work, we construct a template set T+ using four NLP tasks (i.e., classification, summarization, natural language inference, and sentence embedding), with two templates per task. Please note that we select the templates without deliberation, and we leave other choices as future work. Furthermore, we construct another template set T−, in which the templates are the negative form of those in T+ and possibly induce semantically opposite responses from PLMs. All the templates are illustrated in Table 1.

Basic Templates
After obtaining the template sets, we convert an input sentence x_i into a prompt-wrapped sequence (e.g., "<S>, is this review positive? [MASK]") with a template T+_i sampled from T+. We feed the wrapped sequence to a PLM and take the hidden state of the "[MASK]" token, h_[MASK], as the positive prototypical embedding c+_i. In the same way, we generate the negative prototypical embedding c−_i using a sampled template T−_i ∈ T−. Notably, unlike conventional prototypes (Li et al., 2020b; Caron et al., 2020; Sharma et al., 2020), our method of obtaining prototypes may not seem intuitive, since there is no explicit partitioning of the embedding space. To distinguish our method from previous studies, we call the prototypes in this work virtual prototypes.
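As a concrete sketch of this wrapping step, the following Python snippet samples a positive template and its negated counterpart for a sentence. The template strings here are illustrative stand-ins (the actual templates are listed in Table 1 of the paper), and `wrap` is a hypothetical helper name:

```python
import random

# Illustrative positive templates and their manual negations; the exact
# template strings used in the paper are listed in its Table 1, so these
# are stand-ins, not the real entries.
POS_TEMPLATES = [
    '{sent} , is this review positive ? [MASK] .',
    'This sentence : " {sent} " means [MASK] .',
]
NEG_TEMPLATES = [
    '{sent} , is this review not positive ? [MASK] .',
    'This sentence : " {sent} " does not mean [MASK] .',
]

def wrap(sentence):
    """Sample one positive and one negative template and wrap the sentence.

    The PLM's hidden state at [MASK] in each wrapped sequence would then
    serve as the positive / negative virtual prototype embedding."""
    pos = random.choice(POS_TEMPLATES).format(sent=sentence)
    neg = random.choice(NEG_TEMPLATES).format(sent=sentence)
    return pos, neg

pos, neg = wrap('the movie was great')
```

Each wrapped string would then be encoded by the PLM, with the "[MASK]" position read out as the prototype embedding.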

Prototypical Contrastive Learning
To obtain the embedding of an anchor sentence x_i, we feed [x_i; t_1; ...; t_l; [MASK]] to a PLM, where t_1, ..., t_l is a continuous prompt, and take the representation of the "[MASK]" token as the sentence embedding v_i. Given the embedding of the anchor sentence and the corresponding positive and negative prototypical embeddings, we integrate them into an InfoNCE-based contrastive loss:

\ell_i = -\log \frac{\exp(\mathrm{sim}(v_i, c_i^{+})/\tau)}{\sum_{j=1}^{N}\left(\exp(\mathrm{sim}(v_i, c_j^{+})/\tau) + \exp(\mathrm{sim}(v_i, c_j^{-})/\tau)\right)}

where sim(·,·) denotes cosine similarity, τ is a temperature, and N is the number of sentences in a mini-batch. With this loss function, we pull the embedding of the anchor sentence v_i close to its positive prototypical embedding c+_i, and push v_i apart from the negative prototypical embedding as well as the prototypes of other sentences.
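A minimal PyTorch sketch of such a prototypical contrastive loss, assuming cosine similarity with in-batch prototypes as negatives (the function name and the temperature value 0.05 are our own choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def conpvp_loss(v, c_pos, c_neg, temperature=0.05):
    """Prototypical contrastive loss over a mini-batch.

    v     : (N, d) anchor sentence embeddings
    c_pos : (N, d) positive virtual prototype embeddings
    c_neg : (N, d) negative virtual prototype embeddings
    """
    v = F.normalize(v, dim=-1)
    c_pos = F.normalize(c_pos, dim=-1)
    c_neg = F.normalize(c_neg, dim=-1)
    # Cosine similarity of each anchor to every prototype in the batch.
    sim_pos = v @ c_pos.T / temperature            # (N, N)
    sim_neg = v @ c_neg.T / temperature            # (N, N)
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (N, 2N)
    # The i-th anchor's target is its own positive prototype (column i);
    # all other columns, including every negative prototype, are repelled.
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)

loss = conpvp_loss(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32))
```

Framing the loss as cross-entropy over concatenated similarity logits is a standard way to implement InfoNCE; the negative-prototype columns act as extra hard negatives for every anchor.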

Experiments
To verify the effectiveness of our proposed method, we conduct experiments and empirical analysis on Semantic Textual Similarity (STS) tasks under the unsupervised setting.

Main Results
We compare our ConPVP to recent related methods based on instance-wise contrastive learning, including 1) ConSERT (Yan et al., 2021), which exploits four data augmentation strategies to construct positive samples; 2) SimCSE (Gao et al., 2021b), which directly uses dropout to generate positive pairs; 3) ESimCSE (Wu et al., 2021), which introduces word-repetition-augmented positive pairs and momentum negative pairs; and 4) PromptBERT (Jiang et al., 2022), which reformulates the sentence embedding task in a prompt-based learning paradigm.
For fair comparison, we report the best performance from 4 runs in Table 3. Compared with SimCSE, ConPVP brings significant improvements across the board. Specifically, ConPVP achieves average improvements of 2.60, 1.60, 2.61, and 1.28 points over BERT-base, BERT-large, RoBERTa-base and RoBERTa-large, respectively, showing the superiority of our prototypical contrastive method.

Besides, ConPVP surpasses ConSERT and ESimCSE, which carefully design positive samples with various textual data augmentations. This demonstrates that although textual data augmentation can provide different views of the anchor, the methods based on it still suffer from the local smoothness problem. In contrast, our model shows that textual data augmentation is possibly unnecessary, and that improvements can be achieved by encoding more structural information into the embedding space, e.g., by finding semantic prototypes. Finally, our ConPVP achieves consistently better performance than PromptBERT, demonstrating the effectiveness of the proposed prototypical contrastive loss.

Ablation Study
To analyze the impact of different components of ConPVP, we investigate the following three variants: 1) ConPVP w/ manual, where we obtain the anchor sentence embeddings with the searched discrete templates from Jiang et al. (2022); 2) ConPVP w/o c−, where we remove the negative prototypes; and 3) ConPVP w/o c+ & c−, where we remove both types of prototypes and fall back to instance-wise contrastive learning over the prompt-based sentence embeddings. We also include PromptBERT (Jiang et al., 2022) with its discrete templates replaced by continuous ones. We take RoBERTa-large as the backbone. The results on STS tasks are listed in Table 4, and the conclusions are as follows: 1) ConPVP obtains better results than ConPVP w/ manual. This may be because a single manually-designed prompt cannot fit different PLMs and training strategies at the same time, whereas continuous prompts are more flexible and effective. Besides, the improvement of ConPVP w/ manual over PromptBERT validates the advantage of the prototypical contrastive loss. 2) Removing the negative prototypes (i.e., ConPVP w/o c−) leads to a performance degradation of 1.41 points compared to ConPVP. The underlying reason is that the negative prototypes serve as a type of hard negative: their semantics are essentially different from those of the positive prototypes, yet the prompts used to induce them are textually similar. 3) ConPVP w/o c+ & c− does not improve over SimCSE. These observations show that the gain of our method comes entirely from the cooperation between the prompt-derived virtual prototypes and the prototypical contrastive loss, rather than from the use of prompt-based sentence embeddings alone.

Distribution of Cosine Similarity
In this section, we investigate the similarity distributions learned by different methods. As shown in Figure 4, the native sentence representations of RoBERTa-large suffer from the collapse issue (Chen and He, 2021), and therefore yield high similarity scores for all sentence pairs. By contrast, both ConPVP and SimCSE alleviate the collapse issue, and the cosine similarity scores ConPVP predicts for positive pairs are more certain. For example, for the positive pairs whose gold similarity scores range from 4 to 5, the scores predicted by ConPVP (0.6 to 1.0) are more concentrated than those predicted by SimCSE (0.4 to 1.0).

Analysis of Embedding Space
Previous studies indicate that the collapse issue is mainly due to the anisotropy of the learned embedding space, which is sensitive to token frequency (Yan et al., 2021; Jiang et al., 2022). We follow Yan et al. (2021) to remove the embeddings of the K most frequent tokens and explore the relation between the number of removed tokens and the average Spearman correlation on STS tasks.
From Figure 5, we can observe that the performance of native BERT-base and SimCSE improves when the most frequent tokens are removed. By contrast, ConPVP achieves its best performance without removing any tokens, showing that our approach reshapes BERT's original embedding space and reduces the influence of common tokens on sentence representations. In addition, the performance of both SimCSE and ConPVP drops as the number of removed tokens increases, but ConPVP performs significantly better, demonstrating its robustness to incomplete input.
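The probing procedure above can be sketched as follows for an average-pooled sentence representation. This is our own illustrative helper (name and fallback behavior are assumptions), showing only the mechanics of dropping the K most frequent token types before pooling:

```python
from collections import Counter
import torch

def mean_pool_drop_top_k(token_ids, token_vecs, freq, k):
    """Average token vectors, skipping the k most frequent token types.

    token_ids : list of token ids for one sentence
    token_vecs: (len(token_ids), d) tensor of contextual token vectors
    freq      : Counter of token frequencies over the evaluation corpus
    """
    top_k = {tok for tok, _ in freq.most_common(k)}
    keep = torch.tensor([t not in top_k for t in token_ids], dtype=torch.bool)
    if not keep.any():
        # Fall back to the plain average if every token was removed.
        return token_vecs.mean(dim=0)
    return token_vecs[keep].mean(dim=0)
```

Sweeping k and re-evaluating the Spearman correlation then traces out curves like those in Figure 5.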

Visualization of Learned Embeddings
We visualize several variants of RoBERTa-large sentence embeddings to gain an intuition about the effectiveness of our method. Specifically, we sample 3 groups of samples from the STS-B test set, where the similarity score of each group is 0 (orange), 3 (blue), and 5 (green), respectively. Each group has 10 sentence pairs. We visualize their embeddings generated by different models using t-SNE (van der Maaten and Hinton, 2008) in Figure 6.
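The projection step might look like the sketch below. The random matrix stands in for real encoder outputs, and the perplexity value is our own choice for this toy size, not a setting reported in the paper:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the encoder outputs of 30 sentence pairs (60 sentences);
# real embeddings would come from the model's [MASK] representation.
emb = rng.normal(size=(60, 128))

# Project to 2-D for plotting; perplexity must stay below the sample count.
points = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
```

The resulting 2-D points can then be scattered with one color per similarity group, as in Figure 6.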
Due to the collapse issue, the sentence embeddings obtained from RoBERTa-large [CLS] cluster together whether they are similar or not. For SimCSE, the sentence embeddings of the positive pairs are well-clustered. However, the sentence pairs with similarity scores of 3 or 5 are very close in the embedding space. In contrast, the embeddings learned by our ConPVP are more discriminative, forming more separated clusters (e.g., the sentence pairs in green are more tightly clustered than those in blue, while the pairs in orange are more dispersed).

Application to Transfer Learning Tasks
We evaluate the quality of the sentence embeddings learned by ConPVP on transfer learning tasks, including MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), SUBJ (Pang and Lee, 2004), MPQA (Wiebe et al., 2005), SST-2 (Socher et al., 2013), TREC (Li and Roth, 2002) and MRPC (Dolan et al., 2004). A logistic regression classifier is trained on frozen sentence embeddings produced by different methods, following the default configurations of SentEval (Conneau and Kiela, 2018). In addition, based on the principle that good representations transfer well with limited supervision and fine-tuning, we extend the evaluation to a few-shot setting and follow Zhang et al. (2021) to uniformly sample 16 labeled instances per class for each task.
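A SentEval-style linear probe of this kind can be sketched as follows, with synthetic features standing in for the frozen sentence embeddings (the data, dimensions, and split are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for frozen sentence embeddings with binary labels
# (e.g., MR sentiment polarity); real features come from the encoder.
X = np.concatenate([rng.normal(0.0, 1.0, (200, 32)),
                    rng.normal(2.0, 1.0, (200, 32))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The encoder stays frozen: only this linear classifier is trained.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Because only the linear layer is trained, the probe's accuracy directly reflects how linearly separable the task classes are in the embedding space.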
Table 5 presents the results under the full-data setting. As we can see, the performance gap between ConPVP and SimCSE is significant and consistent. Furthermore, we observe an even more obvious gap under the few-shot setting (Figure 7). The results reveal the robustness and effectiveness of our approach in data-scarce scenarios, which is important for real-world applications.

Application to Clustering Tasks
We follow Zhang et al. (2021) to consider 8 benchmark datasets for short text clustering, including SearchSnippets (SS) (Phan et al., 2008), StackOverflow (SO) (Xu et al., 2017), Biomedical (Bio) (Xu et al., 2017), AgNews (AG) (Zhang and LeCun, 2015), Tweet (Yin and Wang, 2016) and GoogleNews (G-T, G-S, G-TS) (Yin and Wang, 2016). We follow the default settings of Zhang et al. (2021) and use BERT-base and BERT-large as the backbones. We run K-Means (Pedregosa et al., 2011) on the sentence embeddings and report the clustering accuracy averaged over 10 independent runs. As illustrated in Table 6, in comparison with SimCSE, ConPVP obtains an average improvement of 1.21 and 2.18 points, respectively, which validates our motivation of leveraging the implicit grouping effect of the prompt-derived semantic prototypes to encode more semantic structure into representations.
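Clustering accuracy is commonly computed by finding the best one-to-one mapping between predicted cluster ids and gold labels via the Hungarian algorithm; the sketch below (our own illustrative implementation, with synthetic embeddings standing in for encoder outputs) shows the full pipeline:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to labels."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    # Hungarian algorithm maximizes the total matched counts.
    row, col = linear_sum_assignment(count.max() - count)
    return count[row, col].sum() / len(y_true)

rng = np.random.default_rng(0)
# Stand-in embeddings for three well-separated topics.
X = np.concatenate([rng.normal(c, 0.1, (50, 16)) for c in (0.0, 3.0, 6.0)])
y_true = np.repeat([0, 1, 2], 50)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
acc = clustering_accuracy(y_true, y_pred)
```

The label-permutation step matters because K-Means cluster ids are arbitrary: a perfect clustering with relabeled ids should still score 1.0.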

Conclusion
In this work, we take the first step toward exploring prototypical contrastive learning for unsupervised sentence embedding learning, considering more semantic views of each instance than recent instance-wise contrastive methods. In particular, we make use of prompting in PLMs to generate positive and negative prototypical embeddings with task-specific templates. The experiments and extensive analysis validate the effectiveness and robustness of our ConPVP.

Limitations
We only tried 16 task-specific prompts in this paper, which is possibly sub-optimal for inducing semantic prototypes. Besides, the usage of prompts reduces the maximum effective input length that the pre-trained language models can process.

Figure 2 :
Figure 2: Illustration of contrastive learning with prompt-derived virtual semantic prototypes.

Figure 3 :
Figure 3: The overall framework of our proposed ConPVP.

Figure 4 :
Figure 4: Distribution of predicted cosine similarity. The correlation diagram between the gold similarity scores (x-axis) and model-predicted cosine similarity scores (y-axis) on the STS-B dataset. We scale the predicted scores to the range 0 to 1.

Figure 5 :
Figure 5: Analysis of the embedding space. The average Spearman correlation on STS tasks w.r.t. the number of removed top-K frequent tokens. The frequency of each token is calculated on the test split of the STS Benchmark dataset.

Figure 6 :
Figure 6: Visualization of learned embeddings. We visualize 10 sentence pairs whose similarity scores are 0 in orange (ids 0 to 9), 10 pairs whose similarity scores are 3 in blue (ids 10 to 19), and 10 pairs whose similarity scores are 5 in green (ids 20 to 29). The sentences are sampled from the STS-B test set.

Figure 7 :
Figure 7: Few-shot learning evaluation on transfer tasks with RoBERTa-base and RoBERTa-large as the backbones. For each task, we randomly sample 16 labeled instances per class and draw violin plots of the performance of 10 runs with different random seeds.

Table 2 :
Training Settings for STS.

Table 3 :
Experimental results on unsupervised STS tasks. Methods with † denote scores reported directly from the corresponding paper; the others are from our implementation. We run 4 times with different random seeds and report the best Avg. for fair comparison.

Table 4 :
Ablation study. We run each experiment 4 times with different random seeds and report the mean and standard deviation.

Table 6 :
Clustering accuracy reported on short text clustering datasets with BERT-base and BERT-large as the backbones. ♣: results evaluated on the checkpoints provided by Gao et al. (2021b). We report the clustering accuracy averaged over 10 independent runs.