Contrastive Learning of Sentence Embeddings from Scratch

Contrastive learning has been the dominant approach to training state-of-the-art sentence embeddings. Previous studies have typically learned sentence embeddings either from human-annotated natural language inference (NLI) data or from large-scale unlabeled sentences in an unsupervised manner. However, even unlabeled sentences can be difficult to collect in certain domains, for reasons ranging from copyright restrictions to messy data formats. To address these issues, we present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data. Specifically, we explore utilizing large language models to synthesize the data samples required for contrastive learning, including (1) producing positive and negative annotations given unlabeled sentences (SynCSE-partial), and (2) generating sentences along with their corresponding annotations from scratch (SynCSE-scratch). Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, and SynCSE-partial even achieves performance comparable to supervised models in most settings.


Introduction
The objective of sentence representation learning is to derive sentence embeddings that can benefit a wide range of downstream tasks, including reranking (Lee et al., 2021; Barker et al., 2021), natural language understanding (Cer et al., 2018), and retrieval (Misra et al., 2016; Thakur et al., 2021; Wang et al., 2022a). Methods built on contrastive learning, such as SimCSE (Gao et al., 2021) and PromCSE (Jiang et al., 2022b), have dominated the field due to their competitive performance (Zeng et al., 2022; Limkonchotiwat et al., 2022; Wu et al., 2022a; Wang et al., 2022c; Jiang et al., 2022b).

Figure 1: An overview of the data synthesis process of SynCSE-scratch. We specify a desired domain and genre, and our framework will generate diverse unlabeled data for that domain along with their positive and negative annotations.
Contrastive learning trains sentence representations by distinguishing positive samples from negative ones. In this framework, the quality of the positive and negative annotations plays a critical role. Supervised approaches typically gather these annotations from labeled natural language inference (NLI) datasets (Jiang et al., 2022a; Limkonchotiwat et al., 2022); however, such sources are unavailable in most settings, and manually creating them is cost-prohibitive. As a result, unsupervised methods that rely solely on unlabeled sentences have attracted significantly more attention recently (Gao et al., 2021; Zhou et al., 2022; Wu et al., 2022a); they mostly develop techniques to automatically obtain positive and negative samples to facilitate contrastive learning. A representative example is SimCSE (Gao et al., 2021), which leverages dropout-perturbed hidden states as positive samples and in-batch sentences as negatives. To differentiate them from in-batch negatives, annotated negatives are often termed "hard negatives", which have proven to be significantly advantageous in enhancing sentence embeddings (Wang et al., 2022b,c).
Despite considerable advances in recent years, the performance of these unsupervised methods still falls short of their supervised counterparts. Moreover, the unavailability of large-scale unlabeled data for the targeted domain often poses additional limitations to these approaches. To overcome these challenges, we introduce SynCSE, an unsupervised contrastive framework that trains sentence embeddings with synthesized data. Concretely, we propose to prompt large language models (LLMs) such as ChatGPT (OpenAI, 2022) to synthesize the samples needed for contrastive learning. This is inspired by recent successes in prompting LLMs to perform various tasks (Chung et al., 2022; Ouyang et al., 2022; OpenAI, 2023), especially their superior performance over crowd-workers on text annotation (Gilardi et al., 2023). We investigate two variants of SynCSE that correspond to two practical scenarios: (1) SynCSE-partial, where large-scale unlabeled sentences are available and LLMs are prompted to produce positive and hard negative annotations, and (2) SynCSE-scratch, where large-scale unlabeled sentences are not available and LLMs are prompted to generate sentences and their corresponding annotations from scratch. The latter represents a particularly challenging yet practical scenario in which we aim to learn sentence embeddings without any existing data samples.
We conduct comprehensive experiments on the standard Semantic Textual Similarity (STS) benchmark, along with four reranking tasks and four domain adaptation tasks. Our results demonstrate that both SynCSE-partial and SynCSE-scratch substantially outperform the unsupervised baselines in all cases; for example, SynCSE-partial and SynCSE-scratch exceed the unsupervised SimCSE baseline by 5.37 and 4.18 absolute points respectively on STS. Notably, SynCSE-partial often equals its supervised counterpart on STS, marking the first instance of an unsupervised method matching supervised results on this benchmark. We release our synthesized datasets to facilitate further research on learning better sentence embeddings.

Background
We base our approach on the formulation of SimCSE (Gao et al., 2021), which is one of the most common and effective contrastive learning frameworks for learning sentence embeddings. Formally, we denote an unlabeled sentence as $x_i$ and its positive sample as $x_i^+$. Let $h_i$ and $h_i^+$ denote the representations of $x_i$ and $x_i^+$ respectively; the unsupervised SimCSE loss is then defined as:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{M} e^{\mathrm{sim}(h_i, h_j^+)/\tau}}, \qquad (1)$$

where $M$ denotes the mini-batch size, $\tau$ is a temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ stands for a similarity function. Unsupervised SimCSE passes the same $x_i$ twice through the encoder to form the $(h_i, h_i^+)$ pair via random dropout, and other sentences within the same mini-batch are treated as negative samples, as shown in Eq. 1. Supervised SimCSE further extends $(x_i, x_i^+)$ with a hard negative sample $x_i^-$ to constitute triplet datasets and defines the supervised loss:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{M} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)}. \qquad (2)$$

In supervised SimCSE, the $(x_i, x_i^+, x_i^-)$ triplets typically come from annotated NLI datasets, where $x_i$ is the premise and $x_i^+$ and $x_i^-$ are the entailment and contradiction hypotheses. Supervised SimCSE significantly outperforms the unsupervised variant due to the enhanced quality of its positive and hard negative samples. However, such annotated data are unavailable in most settings, and manually annotating $(x_i, x_i^+, x_i^-)$ triplets can be resource-intensive, rendering unsupervised approaches the most promising choice in practice. In this work, we focus on the supervised loss in Eq. 2, but synthesize $(x_i^+, x_i^-)$ given $x_i$, or even generate $(x_i, x_i^+, x_i^-)$ triplets from scratch, aiming to approach the performance of supervised models with an unsupervised method. We describe our data synthesis process next.
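For concreteness, the following is a minimal PyTorch sketch of the supervised objective in Eq. 2; it is an illustration under our notation (function and tensor names are ours) rather than the reference SimCSE implementation.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(h, h_pos, h_neg, tau=0.05):
    """Triplet InfoNCE loss of Eq. 2 for a mini-batch of size M.

    h, h_pos, h_neg: (M, d) encoder representations of x_i, x_i^+, x_i^-.
    Each anchor is contrasted against all in-batch positives and hard negatives.
    """
    # Cosine similarities between every anchor and every positive / hard negative,
    # scaled by the temperature tau.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (M, M)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / tau  # (M, M)

    # Column j holds sim(h_i, h_j^+); column M + j holds sim(h_i, h_j^-).
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (M, 2M)

    # The correct "class" for anchor i is its own positive (column i), so the
    # cross-entropy recovers the negative log-softmax term of Eq. 2.
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)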

Hard negative prompt pool

Prompt 1: Revise the provided sentence by swapping, changing, or contradicting some details in order to express a different meaning, while maintaining the general context and structure.
Prompt 2: Generate a slightly modified version of the provided sentence to express an opposing or alternate meaning by changing one or two specific elements, while maintaining the overall context and sentence structure.
Prompt 3: Transform the input sentence by adjusting, altering, or contradicting its original meaning to create a logical and sensible output sentence with a different meaning from the input sentence.
Prompt 4: Generate a sentence that conveys an altering, contrasting, or opposite idea to the given input sentence, while ensuring the new sentence is logical, realistic, and grounded in common sense.

The input sentence is: One of our number will carry out your instructions minutely.
What is your generated sentence? One person from our group will execute your instructions with great attention to detail.

Data Synthesis from ChatGPT
We propose to prompt ChatGPT (OpenAI, 2022) to synthesize the data required for contrastive learning, inspired by recent successes of prompting LLMs to fulfill multiple tasks (Chung et al., 2022; OpenAI, 2023). Concretely, we introduce two variants of SynCSE: (1) SynCSE-partial, which synthesizes $(x_i^+, x_i^-)$ given $x_i$, and (2) SynCSE-scratch, which synthesizes $(x_i, x_i^+, x_i^-)$ from scratch. SynCSE-scratch is practically useful since large-scale unlabeled data are not always available in the domain of interest due to copyright restrictions, data distribution issues, or messy formats. We describe these two variants below.

SynCSE-partial
Synthesizing positive and hard negative examples: We prompt ChatGPT in a few-shot setting to annotate positive and hard negative samples for a given sentence $x_i$; an illustrative example is shown in Figure 2. The procedures for generating positive and hard negative examples share the same structure; the only difference lies in the prompts used. In our implementation with the ChatGPT model, we design the few-shot prompt in a multi-turn chat format.
Example and prompt pools: A significant challenge in creating synthetic datasets lies in ensuring the dataset's diversity. Ye et al. (2022) suggested that merely increasing the size of a synthetic dataset might not lead to better performance, one reason being the lack of diversity. Datasets labeled by groups of annotators naturally mitigate this problem due to the variance in how different annotators understand and interpret the prompts; this variance results in diverse outputs, even for the same input. For example, Williams et al. (2018) employed 387 annotators to create the MultiNLI dataset: even with the same prompt, these annotators provided varied outputs owing to their individual understanding of the prompt and their unique world knowledge, leading to a more diverse dataset. To mimic this variation among annotators, we employ example pools and prompt pools. Specifically, we design four types of positive/hard negative prompts (examples of hard negative prompts are shown in Table 1) and 18 few-shot exemplars (generated using GPT-4). During each data generation step, we randomly sample one prompt and five exemplars to construct a distinct input prompt. Details of these pools can be found in Appendix A.
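As a rough illustration of this sampling scheme, the sketch below assembles a multi-turn chat request from a prompt pool and an exemplar pool. The pool contents shown here are illustrative stand-ins (the real pools hold four prompts and 18 GPT-4-written exemplars), and the OpenAI client usage is an assumption about one way to query ChatGPT rather than our exact pipeline.

import random
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Illustrative stand-ins for the pools described above (Appendix A holds the full lists).
HARD_NEGATIVE_PROMPTS = [
    "Revise the provided sentence by swapping, changing, or contradicting some details "
    "in order to express a different meaning, while maintaining the general context and structure.",
]
EXEMPLARS = [
    # (input sentence, hard negative annotation); a hypothetical pair for illustration.
    ("The chef carefully plated the dessert before serving it.",
     "The chef carelessly dropped the dessert before serving it."),
]

def annotate_hard_negative(sentence: str, n_shots: int = 5) -> str:
    """Sample one instruction and a few exemplars, then query the chat model."""
    instruction = random.choice(HARD_NEGATIVE_PROMPTS)
    shots = random.sample(EXEMPLARS, k=min(n_shots, len(EXEMPLARS)))  # 5-shot in practice

    # Few-shot prompt in a multi-turn chat format: each exemplar becomes a
    # (user, assistant) turn, followed by the sentence to annotate.
    messages = [{"role": "system", "content": instruction}]
    for source, target in shots:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": target})
    messages.append({"role": "user", "content": sentence})

    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content.strip()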

SynCSE-scratch
Creating a synthetic dataset from scratch, where the unlabeled sentences needed for annotation are absent, presents a substantial challenge. We address this problem in two stages: first, we generate unlabeled sentences, and then we apply the procedure discussed in §2.3 to annotate positive and hard negative samples for these sentences.
To ensure data diversity when generating unlabeled sentences, we employ a strategy that specifies genres and topics during generation, combined with the example and prompt pools. This strategy is intended to minimize repetition and redundancy between newly generated data and the data generated so far. More specifically, as illustrated in Figure 1, given a text genre, we randomly select six topics from a pre-defined list to include in the prompt (the lists of genres and topics used in this paper can be found in Appendix B). The term "etc." in the prompt ensures that the generated sentences are not strictly limited to these six topics. We adopt one-shot prompting to generate several sentences at once. As long as new batches are generated with genres or topics that differ from the existing data, the added data will likely have low redundancy with it, thereby enhancing the overall diversity of the dataset. Importantly, the exemplars used for generating raw sentences were produced by GPT-4. Hence, throughout the entire generation process, we only need to specify genres and topics, without requiring any additional data.
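To make this concrete, a minimal sketch of how such a generation prompt could be assembled is given below; the genre string and topic list are illustrative placeholders for the pools in Appendix B, and the wording loosely follows the prompt shown in Figure 1.

import random

# Illustrative stand-ins for the genre and topic pools of Appendix B.
GENRE = "pieces of content shared on social media platforms"
TOPICS = ["education", "food", "technology", "history", "architecture",
          "war", "sports", "travel", "music", "science"]

def build_generation_prompt(n_sentences: int = 10, n_topics: int = 6) -> str:
    """Compose an instruction for generating diverse raw (unlabeled) sentences."""
    topics = random.sample(TOPICS, k=n_topics)
    return (
        f"Devise {n_sentences} distinct and diverse sentences that may appear in "
        f"the {GENRE}, covering a range of subjects ({', '.join(topics)}, etc.). "
        "These sentences should present a mix of complexity levels, from elementary "
        'structures akin to "Birds fly in the sky." to more sophisticated ones. '
        "Aim for a low degree of lexical overlap between the sentences."
    )

# Each call samples a different topic combination, keeping successive batches diverse.
print(build_generation_prompt())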

Training
We evaluate three different settings in the experiments: SynCSE-partial, SynCSE-scratch, and a combination of SynCSE-scratch with existing annotated datasets in a supervised setting. While SynCSE-partial and SynCSE-scratch both represent unsupervised settings, in the combination setting we augment previously annotated datasets with the synthesized data produced by SynCSE-scratch, to examine whether SynCSE-scratch can also help in a supervised scenario.
We refer to the NLI dataset (MNLI+SNLI) used by SimCSE as SimCSE_NLI. In the creation of the SynCSE-partial dataset, for a fair comparison, we utilized the unlabeled sentences x from SimCSE_NLI and generated positive/hard negative examples for them using the algorithm detailed in §2.3. For SynCSE-scratch, we generate the same number of examples as in the SynCSE-partial case, as detailed in §2.4. While our method can easily scale up the dataset, for a fair comparison we ensure the data volume used for SynCSE-scratch and SynCSE-partial is equivalent to that of SimCSE_NLI. For the combination of the SynCSE-scratch and SimCSE_NLI datasets, we simply merge the two datasets to evaluate whether our generated dataset can aid the manually annotated one.
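For illustration, merging the two sources can be as simple as concatenating triplet files in a (sentence, positive, hard negative) format of the kind consumed by SimCSE-style training; the file names and column names below are assumptions, not the released data layout.

import csv

def merge_triplet_files(paths, out_path):
    """Concatenate triplet datasets stored as CSV rows of (sent0, sent1, hard_neg)."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["sent0", "sent1", "hard_neg"])
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical file names: the human-labeled NLI triplets and our synthetic triplets.
merge_triplet_files(["simcse_nli.csv", "syncse_scratch.csv"], "combined_train.csv")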
Given that SimCSE serves as a general contrastive learning method, we consistently use SimCSE as the backbone method for SynCSE. We note that SynCSE is general and could be combined with more advanced algorithms as well, such as PromCSE (Jiang et al., 2022b) and CARDS (Wang et al., 2022c). We emphasize that, after training the models on the NLI dataset, we freeze the models and directly evaluate our embeddings on all the different tasks and settings below; we do not train sentence embeddings on each setting separately. For the STS and transfer learning tasks, we use the same hyperparameters as SimCSE. Since SimCSE did not conduct reranking experiments, we directly use the default parameters of MTEB (Muennighoff et al., 2023) to evaluate embeddings on the reranking tasks.
Following SimCSE, the development set of STS-B is used to select the best models. The other hyperparameters are kept consistent with those used in SimCSE.
Reranking tasks: We further evaluate the synthetic dataset on four reranking tasks: AskUbuntuDupQuestions (Lei et al., 2016), MindSmallReranking (Wu et al., 2020), SciDocsRR (Cohan et al., 2020), and StackOverflowDupQuestions (Liu et al., 2018). We directly evaluate the model, frozen after training on the NLI dataset, on these tasks, without using their training sets. The resulting ranking is scored for each query and averaged across all queries. In line with the methodology of MTEB (Muennighoff et al., 2023), we use Mean Average Precision (MAP) as the primary metric.
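As a rough sketch of this evaluation protocol, the frozen encoder could be scored on the four reranking tasks through the MTEB toolkit as follows; the checkpoint path is hypothetical, and we assume the embedding model is exposed through a SentenceTransformer-compatible interface.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Hypothetical path to a frozen SynCSE checkpoint exported in SentenceTransformer format;
# no reranking training data is used at any point.
model = SentenceTransformer("path/to/syncse-roberta-base")

evaluation = MTEB(tasks=[
    "AskUbuntuDupQuestions",
    "MindSmallReranking",
    "SciDocsRR",
    "StackOverflowDupQuestions",
])
# MTEB reports Mean Average Precision (MAP) for reranking tasks.
results = evaluation.run(model, output_folder="results/syncse")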
Baselines: We compare our approach with state-of-the-art sentence embedding learning methods: RankCSE (Liu et al.), L2P-CSR (Zhou et al.), PCL (Wu et al., 2022a), CARDS (Wang et al., 2022c), ConPVP (Zeng et al., 2022), and PromptRoBERTa (Jiang et al., 2022a). While we base our approach on SimCSE, we emphasize that our approach is orthogonal to the baseline algorithms, and our synthesized datasets may be combined with them to further boost performance. We directly report the results from their respective papers.

Table 3: Performance comparison of RoBERTa-base trained on various datasets, using the STS benchmark for evaluation. The reported metric is Spearman's correlation. The "†" symbol indicates results reported in DINO. For SimCSE, we adopted the MNLI+SNLI dataset used in (Gao et al., 2021). "‡": GenSE released an NLI synthetic dataset comprising over 60 million samples; for a fair comparison, we randomly sampled from it the same number of samples used in the SimCSE dataset.

Semantic Textual Similarity
Main results: Our main STS results are shown in Table 2. Although the STS evaluation data span diverse domains, these domains were not explicitly known while generating the SynCSE-scratch dataset. Interestingly, SynCSE-partial does not always beat SynCSE-scratch, as demonstrated in the RoBERTa-large case, which implies the potential of SynCSE-scratch as a promising approach for learning sentence embeddings without using any real data samples. By augmenting annotated NLI data with the SynCSE-scratch synthetic dataset, our approach outperforms sup-SimCSE significantly, reaching 84.37% with RoBERTa-large, suggesting that our synthetic data is complementary to human-labeled NLI datasets. PromCSE+EH (Jiang et al., 2022b) achieves competitive performance in the supervised setups.
As an orthogonal contribution, however, SynCSE may be combined with the loss function they proposed to further advance the results.
Comparison with other synthetic datasets: In addition to comparing with the MNLI+SNLI dataset used in SimCSE, we also compare our method with two other baselines that leverage synthetic NLI data. (1) GenSE (Chen et al., 2022) aims to automatically annotate positive and hard negative examples with an LLM trained on an existing labeled NLI dataset, and they released an open synthetic NLI dataset. However, GenSE is not an unsupervised framework: it employs annotated NLI data to train the generation model. To ensure a fair comparison, we sampled the same number of examples from their dataset as used in SynCSE. (2) DINO (Schick and Schütze, 2021) also aims to generate synthetic data for sentence embeddings. In DINO's most effective configuration, positive or hard negative samples are generated and assigned a similarity score based on the prompts used, and the model is then trained directly on the STS data with a regression-based method. As they have not released an NLI-style dataset, we directly report results from their paper. We compare the sentences generated by our method with theirs in Table 8; from the table, we find that our method generates more diverse annotations. As shown in Table 3, both SynCSE-scratch and SynCSE-partial surpass DINO and GenSE on the STS task.

Reranking
Table 4 shows the results of the reranking tasks.
Compared to the STS task, the domain of the reranking data diverges more from that of the NLI data used for training; as a result, SynCSE-scratch actually outperforms SynCSE-partial significantly, which implies the advantage of SynCSE-scratch when in-domain unlabeled sentences are unavailable. SynCSE-scratch also surpasses the other unsupervised baselines, while SynCSE-partial underperforms them. Moreover, the combination of SynCSE-scratch with manually annotated datasets still facilitates further performance enhancement, substantiating that our method can help augment existing datasets.

Analysis
In this subsection, we provide an in-depth analysis of SynCSE.All results presented here are based on the RoBERTa-base model.

Synthetic data amount:
We also analyzed how performance changes as increasing amounts of generated data are added to the manually curated dataset, as shown in Table 6. Since the domain coverage of SynCSE-scratch is fixed once its generation is complete, performance ceases to increase after a certain amount of SynCSE-scratch data has been added to SimCSE_NLI. This may be because the added data are randomly sampled and therefore likely already cover the domains of SynCSE-scratch.

Related Work
Prior approaches for sentence embedding fall into two main categories: (1) supervised learning with labeled sentences, and (2) unsupervised sentence embedding with unlabeled sentences. Among these, works based on contrastive learning have proven to be the most effective. For unsupervised methods, SimCSE uses dropout masks to construct positive pairs for learning, with in-batch examples serving as negatives. Instead of the random masks used by SimCSE, L2P-CSR (Zhou et al.) employs learnable masks. Some works apply data augmentation techniques to input sentences, such as word repetition (Wu et al., 2022b), case flipping (Wang et al., 2022c), or a combination of multiple augmentation strategies to offset the bias caused by mono-augmentation (Wu et al., 2022a).
PromptBERT (Jiang et al., 2022a) uses prompts instead of the [CLS] token to extract embeddings. However, these unsupervised methods significantly lag behind their supervised counterparts. Supervised approaches usually derive positive and hard negative samples from public NLI datasets (Wang and Lu, 2022; Gao et al., 2021; Jiang et al., 2022a,b; Wu and Zhao, 2022), but these datasets are limited in quantity and domain. Additionally, annotating a new NLI dataset is costly, especially in fields that require trained annotators, making an efficient NLI dataset generation framework greatly needed. There have also been attempts to generate NLI-labeled data. For instance, Chen et al. (2022) trained a generation model capable of producing positive and hard negative samples, while Ye et al. (2022) implemented a continuously updated model to modify prompts for generation. However, the performance of these algorithms is still constrained by the performance of the generators, which need labeled NLI data for training. Differing from these methods, which necessitate training an additional model, Wang et al. (2022b) proposed a rule-based algorithm capable of generating hard negative annotations; while this approach led to performance gains, its diversity is limited by the prescribed rules. Gilardi et al. (2023) used ChatGPT for dataset annotation, but their exploration was limited to tasks with explicit answer labels such as "RELEVANT" or "IRRELEVANT"; they did not attempt to annotate datasets that require diverse responses. Schick and Schütze (2021) also propose to generate both annotations and unlabeled sentences, but they do not focus on the contrastive learning framework.

Table 2 :
The labels "unsup-" and "sup-" correspond to unsupervised and supervised settings, respectively. "†": results from (Gao et al., 2021); "♠": results from (Zhou et al.); "‡‡": results from (Wu et al., 2022a); "††": results from (Jiang et al., 2022a); "•": results from (Zeng et al., 2022). The term "SynCSE-scratch + SimCSE_NLI" represents our synthetic data combined with the human-labeled NLI dataset used in SimCSE.

Discussion
In this work, we propose SynCSE, a novel contrastive learning framework for learning sentence embeddings with synthetic data. We prompt LLMs to synthesize unlabeled sentences and their positive and negative examples. Furthermore, by utilizing example and prompt pools and by specifying the genre and topic of the generated sentences, we enhance the diversity and quality of the synthetic dataset. Experiments on both sentence similarity and reranking tasks demonstrate the effectiveness of SynCSE.
The performance of SynCSE in this study strongly suggests the potential of synthetic datasets generated by the increasingly advanced LLMs of today.We envision that, through the effective use of prompting strategies with LLMs, synthetic datasets produced by these models could potentially serve as promising alternatives to real-world data across a wide range of tasks.
Figure 2 :
Few-shot examples of generating positive examples for the input sentence. We adopt 5-shot generation, with the positive prompt: "Please paraphrase the input sentence, providing an alternative expression with the same meaning."

Table 1 :
Hard negative prompt pool. During the generation of hard negative samples, a hard negative prompt is randomly sampled each time.

Table 4 :
Results on the reranking benchmark. Mean Average Precision (MAP) is reported.

Table 5 :
Transfer task results of different sentence embedding models (measured as accuracy).

Table 6 :
Performance of SimCSE_NLI when combined with varying amounts of our synthetic SynCSE-scratch dataset. We report the average STS performance on the test set.