Improving Contrastive Learning of Sentence Embeddings from AI Feedback

Contrastive learning has become a popular approach in natural language processing, particularly for learning sentence embeddings. However, the discrete nature of natural language makes it difficult to ensure the quality of positive and negative sample pairs generated through data augmentation methods. Although supervised contrastive learning can produce more accurate sample pairs with human feedback labels, it still lacks fine-grained training signals. In this paper, we propose to improve \textbf{C}ontrastive \textbf{L}earning of sentence embeddings from \textbf{AI} \textbf{F}eedback \textbf{(CLAIF)}. Our method utilizes AI feedback from large pre-trained language models (LLMs) to construct sample pairs with fine-grained sample similarity scores to improve contrastive learning. In addition, we combine human feedback and AI feedback to provide better supervision signals for supervised contrastive learning of sentence embeddings. Experimental results show that our method achieves state-of-the-art performance on several semantic textual similarity (STS) and transfer learning tasks compared to other unsupervised and supervised contrastive learning methods.


Introduction
Learning sentence embeddings with rich semantics is very important for many natural language processing tasks, such as semantic matching and information retrieval. Recently, pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Qiu et al., 2020) have provided a convenient way to obtain sentence embeddings. However, sentence embeddings directly generated by pre-trained language models show poor performance on semantic textual similarity (STS) tasks due to the representation degeneration problem (Gao et al., 2019). Therefore, finding ways to further improve pre-trained models to produce better sentence embeddings becomes a crucial and fundamental challenge in natural language processing.
Given the shortage of labeled data for sentence embedding learning, recent studies mainly focus on unsupervised methods, such as contrastive learning (Yan et al., 2021; Gao et al., 2021; Chuang et al., 2022). Contrastive learning can be classified into two categories (Khosla et al., 2020): supervised contrastive learning and unsupervised contrastive learning, depending on whether additional label information is utilized to construct positive and negative sample pairs. However, the quality of positive and negative sample pairs in unsupervised contrastive learning can be difficult to ensure. Recent studies also show that data augmentation strategies in unsupervised contrastive learning may introduce bias, such as length information (Wu et al., 2022) and improper negatives (Zhou et al., 2022a). While supervised contrastive learning methods can produce more accurate sample pairs by utilizing label information, such as supervised datasets from natural language inference (Gao et al., 2021), they can only provide coarse-grained labels and lack fine-grained supervision signals. We argue that these limitations of current contrastive learning methods restrict further performance enhancement of sentence embeddings.
With the emergence of large pre-trained language models (LLMs) (Brown et al., 2020; Sun et al., 2021; Ouyang et al., 2022; Zhang et al., 2022), researchers hope that powerful LLMs can help humans train other AI models (Bai et al., 2022). One approach is to use LLMs to generate datasets for zero-shot learning (Schick and Schütze, 2021; Ye et al., 2022; Meng et al., 2022). These methods all use predefined labels and task descriptions to generate training inputs, instead of utilizing AI feedback as supervision signals. Therefore, these methods are not suitable for tasks whose labels are continuous values, and they may lead to a lack of diversity in training samples. Inspired by these studies, we hope to exploit the capability of LLMs to address shortcomings in contrastive learning of sentence embeddings.
We propose to improve Contrastive Learning of sentence embeddings from AI Feedback (CLAIF). Specifically, we design a two-step sample pair generation method to produce high quality sentence pairs and fine-grained semantic similarity scores using AI feedback from GPT-3, as shown in Figure 1. In the first step, we mask some words in a sentence with different mask rates and then use GPT-3 to generate new sentences based on the remaining information in the masked sentence. We then combine the generated sentences and the original sentence to construct sentence pairs. In this way, we can use the mask rate to control the amount of information shared between the two sentences in a pair, which produces sentence pairs with different semantic similarities. In the second step, we utilize GPT-3 to generate a semantic similarity score for each sentence pair. These scores are the AI feedback on sample similarity. Since the semantic change caused by reconstructing a masked sentence is difficult to measure, we leverage the linguistic knowledge of LLMs to generate the semantic similarity score. The diversity of AI feedback similarity scores is ensured by the sentence pair generation process in the first step. Finally, we use the generated sample pairs and similarity scores to train the model for sentence embeddings.
In addition to using AI feedback alone, we also combine human feedback and AI feedback by introducing AI feedback into supervised contrastive learning of sentence embeddings, which needs human feedback labels to generate positive sample pairs. We use the AI feedback similarity score of the positive sample pair as a soft label to replace the one-hot label in the InfoNCE loss (He et al., 2020). We term this loss Soft InfoNCE. This process can be referred to as contrastive learning of sentence embeddings from human and AI feedback (CLHAIF).
We conduct extensive experiments to show the effectiveness of our method. Sentence embeddings learned with CLAIF and CLHAIF achieve state-of-the-art performance on standard semantic textual similarity tasks and outperform strong baselines on transfer learning tasks. We also find that CLAIF brings significant improvements to the cross-encoder architecture on the sentence-pair modeling task.
Table 1: The details of contrastive learning from different feedback, including feedback types, sample pair construction methods and representative loss functions. X is the full set containing all samples and x_i is the i-th sample of X, such as a sentence or an image. x'_i is an augmented sample obtained by applying some data augmentation strategy to x_i. x^+_i and x^-_i are the positive and negative samples of x_i picked by human feedback information, such as class label information. y_i is the AI feedback sample similarity score for the i-th sample pair. *: CLAIF does not explicitly construct positive and negative pairs; sample pairs with high similarity scores can be seen as positive pairs and those with low scores can be seen as negative pairs.
• Zero Feedback (CLZF): sample pairs (x_i, x'_i) built by data augmentation; representative losses: InfoNCE (van den Oord et al., 2018; He et al., 2020; Gao et al., 2021), NT-Xent (Chen et al., 2020).
• Human Feedback (CLHF): sample pairs (x_i, x^+_i) and (x_i, x^-_i) picked with human label information; representative losses: SupCon (Khosla et al., 2020), InfoNCE (Gao et al., 2021), KNN-Contrastive (Zhou et al., 2022b).
• AI Feedback (CLAIF)*: sample pairs labeled with AI feedback similarity scores y_i.
• Human and AI Feedback (CLHAIF): human-labeled pairs refined with AI feedback scores y_i.
Our main contributions are as follows:
• We propose to improve contrastive learning of sentence embeddings from AI feedback (CLAIF) and achieve state-of-the-art performance on several semantic textual similarity tasks and transfer learning tasks.
• We construct a semantic textual similarity dataset with high quality sentence pairs and fine-grained AI feedback similarity scores using large pre-trained language models.
• We propose a method to incorporate human feedback and AI feedback to provide better supervision for contrastive learning of sentence embeddings.
• Experimental results show the scalability of CLAIF; collecting data from AI feedback is cheaper and more efficient than collecting it from human feedback.

Understanding Contrastive Learning from Different Feedback
In this section, we categorize contrastive learning methods into four categories according to their feedback sources. We summarize the details of contrastive learning from different feedback in Table 1, including their feedback types, sample pair construction methods and representative loss functions.

Contrastive Learning from Zero Feedback
Traditional contrastive learning is used for self-supervised representation learning (Hadsell et al., 2006; He et al., 2020). These methods construct positive and negative sample pairs using data augmentation strategies without any human feedback. For example, in natural language processing, Gao et al. (2021) construct positive sample pairs by applying the dropout operation twice to the same sentence and negative pairs by pairing with other sentences. We refer to these methods as Contrastive Learning from Zero Feedback (CLZF). The most common loss function for CLZF is InfoNCE (van den Oord et al., 2018). Chen et al. (2020) propose the NT-Xent loss, which can be seen as a variant of InfoNCE. However, due to the discrete nature of natural language, it is hard to find effective and unbiased data augmentation strategies to construct high quality sample pairs.
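As a concrete reference, here is a minimal plain-Python sketch of the InfoNCE objective over a batch of anchor/positive embedding pairs. The function and variable names are ours, for illustration only; real implementations operate on framework tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def infonce(anchors, positives, tau=0.05):
    """InfoNCE over a batch: each anchor h_i is pulled toward its own
    augmented view h'_i and pushed away from the other in-batch views."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        numer = math.exp(cosine(anchors[i], positives[i]) / tau)
        denom = sum(math.exp(cosine(anchors[i], positives[j]) / tau)
                    for j in range(n))
        loss += -math.log(numer / denom)
    return loss / n
```

With a batch of one pair the loss is exactly zero, since the positive is the only candidate; larger batches supply the in-batch negatives that make the objective meaningful.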

Contrastive Learning from Human Feedback
Recently, Khosla et al. (2020) propose to use label information to construct positive sample pairs. For sentence embeddings, Gao et al. (2021) use premise-hypothesis pairs with the entailment relationship from natural language inference (NLI) datasets as positive sample pairs and still use InfoNCE for training. Since these methods leverage label information from humans, we refer to them as Contrastive Learning from Human Feedback (CLHF).
With the help of label information, some new losses can be used in CLHF, like SupCon (Khosla et al., 2020) and KNN-Contrastive (Zhou et al., 2022b).
Although CLHF can construct more accurate sample pairs, it still lacks fine-grained supervision signals. For example, in InfoNCE, all positive pairs have a label of 1, but there are also differences in similarity between different positive sample pairs.

Contrastive Learning from AI Feedback
Measuring the similarity of sample pairs in contrastive learning is a laborious task.

Methodology
In this section, we first introduce our method to generate sample pairs and the training process of CLAIF. To obtain high quality sentence pairs with diverse and fine-grained similarity scores, we propose a two-step sample pair generation method: Sentence Pair Generation and Semantic Similarity Labeling. The generation process is shown in Figure 1. We use these sample pairs to train language models like BERT and RoBERTa. We then introduce CLHAIF, which combines human and AI feedback in contrastive learning of sentence embeddings.

Sentence Pair Generation
We use unpaired sentences from the training set of STS Benchmark (Cer et al., 2017) as our original sentences. As shown in Figure 1, we first mask some words of the original sentence "a man is playing a flute." with different mask rates using the <mask> token, in order to delete some information from the original sentence. The more words that are masked, the less information is left. We use the depth of color in Figure 1 to indicate the degree of information sharing between two sentences. Then we write a task description prompt to steer GPT-3 to generate new sentences based on the masked sentences. We provide our task descriptions in Appendix B. To increase the diversity of generated sentences, we merge adjacent <mask> tokens into one <mask> token in 50% of the masked sentences. Finally, we combine the original sentence with each generated sentence to construct sentence pairs.
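The masking step can be sketched as follows. `mask_sentence` and its parameters are hypothetical names of ours; the real pipeline feeds the masked sentences to GPT-3 for reconstruction rather than filling them locally.

```python
import random

MASK = "<mask>"

def mask_sentence(sentence, mask_rate, merge_prob=0.5):
    """Mask a fraction of the words in a sentence, optionally merging
    adjacent <mask> tokens into one (as done for 50% of masked sentences
    to diversify the generated outputs)."""
    words = sentence.split()
    n_mask = round(len(words) * mask_rate)
    positions = set(random.sample(range(len(words)), n_mask))
    masked = [MASK if i in positions else w for i, w in enumerate(words)]
    if random.random() < merge_prob:
        merged = []
        for w in masked:
            # collapse runs of adjacent <mask> tokens into a single token
            if w == MASK and merged and merged[-1] == MASK:
                continue
            merged.append(w)
        masked = merged
    return " ".join(masked)
```

A higher mask rate leaves less information for the LLM to condition on, so the reconstructed sentence tends to share less meaning with the original, which is what produces pairs with varied semantic similarity.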

Semantic Similarity Labeling
In this step, we label the semantic similarity score for each sentence pair using AI feedback from GPT-3. The similarity score ranges from 0 to 1, where a score of 1 means that the semantics of the two sentences are exactly the same, and a score of 0 means that the semantics of the two sentences are completely different. We write a task description prompt to steer GPT-3 to generate a similarity score between 0 and 1 for each sample pair generated in step 1. The first step ensures the diversity of the semantic similarity scores. As illustrated in Figure 2, the generated scores are diverse and distributed across the value range from 0 to 1.
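A sketch of this labeling step, assuming a hypothetical prompt template (the actual task description is given in Appendix B) and a small parser for completions such as "The similarity score is 0.89 .":

```python
import re

# Hypothetical prompt template -- an illustration only, not the exact
# task description used with GPT-3 in the paper.
LABEL_PROMPT = (
    "The similarity score for two sentences ranges from 0.0 to 1.0, where "
    "1.0 means the two sentences have exactly the same semantics and 0.0 "
    "means they are completely different.\n"
    'Sentence 1: "{s1}"\nSentence 2: "{s2}"\nThe similarity score is'
)

def parse_score(completion):
    """Extract the first number from a model completion and clamp it
    into the valid score range [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", completion)
    if match is None:
        raise ValueError("no score found in: %r" % completion)
    return min(max(float(match.group()), 0.0), 1.0)
```

In a real pipeline `LABEL_PROMPT.format(s1=..., s2=...)` would be sent to the completion API and `parse_score` applied to the returned text; the clamp guards against occasional out-of-range completions.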

Training on Generated Pairs
With the generated sample pairs, we train a language model as the sentence encoder to get better sentence embeddings. Given diverse sentence pairs with fine-grained similarity scores, we do not need to explicitly construct positive and negative sample pairs. Therefore, we directly use the mean squared error (MSE) loss to fit the cosine similarity of each sentence pair to its AI feedback similarity score:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\big(\cos(h_i, h'_i) - y_i\big)^2,$$

where N is the batch size, h_i and h'_i are the two sentence embeddings of the i-th sentence pair (x_i, x'_i) encoded by the model, y_i is the corresponding similarity score and cos denotes cosine similarity. During inference, we use the cosine similarity of two sentence embeddings as their semantic similarity score.
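The MSE objective described above can be sketched in plain Python as follows (names are ours; in practice h_i and h'_i are encoder outputs and the loss is computed with framework tensors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mse_loss(pairs, scores):
    """Fit cos(h_i, h'_i) to the AI feedback score y_i over a batch.
    pairs: list of (h_i, h'_i) embedding tuples; scores: list of y_i."""
    return sum((cosine(h1, h2) - y) ** 2
               for (h1, h2), y in zip(pairs, scores)) / len(pairs)
```

The loss is zero exactly when every pair's cosine similarity matches its AI feedback score, which is why no explicit positive/negative partition is needed.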

Combining Human Feedback and AI Feedback
In this section, we mainly study the cooperation of humans and AI models to provide better training signals for contrastive learning, which we call CLHAIF. Reimers and Gurevych (2019) use supervised NLI datasets to learn sentence embeddings. Gao et al. (2021) construct positive and hard negative sample pairs for contrastive learning by leveraging the label information of NLI datasets, achieving significant improvements. However, as we mentioned in Section 2.2, CLHF does not distinguish between different positive sample pairs and assigns a label of 1 to all positive pairs. In this way, all positive sample pairs are pulled together to the same extent in contrastive learning, ignoring differences in similarity between different positive pairs. Therefore, we use AI feedback to refine these coarse-grained supervision signals.
First, we use the semantic similarity labeling step from Section 3.2 to generate AI feedback similarity scores for sentence pairs constructed from the supervised NLI datasets SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). Following Gao et al. (2021), we construct sample pairs using the label information. For the i-th sample of the NLI dataset, we can obtain two sentence pairs (x_i, x^+_i) and (x_i, x^-_i), where x_i is the premise, and x^+_i and x^-_i are the entailment and contradiction hypotheses. (x_i, x^+_i) is the positive pair and (x_i, x^-_i) is the hard negative pair.
To incorporate AI feedback, we propose the Soft InfoNCE loss, replacing the one-hot label with the AI feedback score as a soft label:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log \frac{e^{\cos(h_i, h^+_i)/\tau}}{\sum_{j=1}^{N}\left(e^{\cos(h_i, h^+_j)/\tau} + e^{\cos(h_i, h^-_j)/\tau}\right)},$$

where N is the batch size, h_i, h^+_i and h^-_i are the sentence embeddings of x_i, x^+_i and x^-_i, y_i is the AI feedback similarity score for the positive pair (x_i, x^+_i) and τ is the temperature parameter.
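A plain-Python sketch of the Soft InfoNCE loss as described above, where each positive-pair term of InfoNCE is weighted by the AI feedback score y_i instead of a one-hot label of 1 (a simplified illustration under our reading, not the authors' reference implementation):

```python
import math

def soft_infonce(anchors, positives, negatives, scores, tau=0.05):
    """Soft InfoNCE sketch: anchors/positives/negatives are batches of
    premise, entailment and contradiction embeddings; scores are the AI
    feedback similarity scores y_i for the positive pairs."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        numer = math.exp(cos(anchors[i], positives[i]) / tau)
        # in-batch positives and hard negatives both appear in the denominator
        denom = sum(math.exp(cos(anchors[i], positives[j]) / tau)
                    + math.exp(cos(anchors[i], negatives[j]) / tau)
                    for j in range(n))
        loss += -scores[i] * math.log(numer / denom)
    return loss / n
```

A pair with y_i near 1 is pulled together strongly, while a pair the LLM scores lower contributes proportionally less, which is the fine-grained signal missing from the one-hot version.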

Baselines
We compare our method with strong baselines from three types of sentence embedding methods. Post-processing methods: these methods adopt post-processing operations to enhance sentence embeddings and do not need to further train the backbone model; we use BERT-whitening (Su et al., 2021), BERT-flow (Li et al., 2020) and prompt based BERT (Jiang et al., 2022) as baselines. Contrastive learning based methods: we use SimCSE (Gao et al., 2021), DiffCSE (Chuang et al., 2022) and PromptBERT (Jiang et al., 2022) as baselines.
Dataset-generation based methods: some studies generate datasets from LLMs for sentence embedding learning. We use Dino (Schick and Schütze, 2021) as our baseline. Dino generates sentence pairs based on three discrete similarity labels using GPT2-XL. For a fair comparison, we re-implement Dino using GPT-3 in our experiments.

Implementation Details
Choice of large pre-trained language models: in our experiments, we get all AI feedback from text-davinci-003, the latest version of GPT-3, which we access through the OpenAI API.

Sample pair generation:
We use nine mask rates for each original sentence in sentence pair generation: 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. For CLAIF, we use unpaired sentences from the training set of STS-B as original sentences to construct sentence pairs from scratch, and we randomly sample two other sentences for each original sentence to construct two sentence pairs with a similarity score of 0. We follow previous studies (Gao et al., 2021; Jiang et al., 2022) in the pooling strategy used to get sentence embeddings for BERT and RoBERTa. For CLHAIF, we take the same pooling strategy as the corresponding baseline. Other implementation details are in Appendix A.

Semantic Textual Similarity

Sentence embeddings learned with CLAIF outperform CLHF methods like supervised SimCSE on six STS datasets, with STS12 being the exception.

Transfer Tasks

In addition to STS tasks, we also evaluate on several transfer learning tasks from SentEval. Experimental results show that sentence embeddings learned with CLAIF and CLHAIF also achieve better or comparable performance compared to baselines. We present the average results for the seven transfer tasks in Table 3 and detailed results in Appendix C.

Scalability of CLAIF
In this section we discuss the scalability of CLAIF. The results of CLAIF-scaled in Table 5 show that using more data to scale CLAIF brings significant improvements. CLAIF-scaled outperforms CLAIF by 2.74 points on BERT-base (79.63 → 82.37) and even outperforms or performs on par with CLHF and CLHAIF methods. We believe that using more data can further improve the performance of CLAIF. Since collecting data from AI feedback is much cheaper than collecting it from human feedback, we argue that CLAIF has great potential in practical applications.

Sentence-Pair Modeling
In this section, we evaluate CLAIF on the sentence-pair modeling task. Cross-encoders usually outperform bi-encoders in information retrieval. However, we observe in Liu et al. (2022) that the cross-encoder does not show its superiority on sentence-pair modeling. We attribute this to the lack of fine-grained training signals. We train a cross-encoder with CLAIF. Experimental results in Table 11 show that, with the help of AI feedback, the CLAIF cross-encoder brings significant improvements on the sentence-pair modeling task compared to the previous model Trans-Encoder (Liu et al., 2022).
More training details are in Appendix D.

Human Evaluation
In this section, we conduct a human evaluation to measure the quality of the generated sentences and similarity scores. We measure whether the generated sentences are fluent and whether the similarity scores are consistent with the real semantic similarities. To help the evaluators judge consistency, we generate a natural language explanation for each generated similarity score using GPT-3. We invite 4 experts to participate in our human evaluation. We then randomly pick 100 samples from the dataset used in CLAIF and assign 25 samples to each expert. In the evaluation, 92 percent of the generated sentences are considered fluent and 90 percent of the generated scores are considered consistent by the experts, which suggests our method can generate high quality sentence pairs and similarity scores.

Related Work
Recent studies on sentence embeddings mainly focus on using additional data to further train pre-trained language models. Yan et al. (2021) and Gao et al. (2021) propose different data augmentation strategies for contrastive learning and achieve significant improvements using unlabeled data. Chuang et al. (2022) further improve contrastive learning of sentence embeddings with an equivariant contrastive learning framework.

Impressed by the powerful capabilities of LLMs (Brown et al., 2020; Ouyang et al., 2022), researchers pay more attention to using AI feedback from LLMs for zero-shot and few-shot learning. Li et al. (2023) and Li and Qiu (2023) use AI feedback from language models to enhance in-context learning and chain-of-thought reasoning. Ye et al. (2022) and Meng et al. (2022) generate datasets by taking labels and prompts as the input of LLMs and then letting the LLMs generate training samples. Schick and Schütze (2021) design a dataset generation method for STS tasks. They construct three natural language instructions based on three discrete similarity scores and then use these instructions to steer LLMs to construct sentence pairs. However, it is hard to use natural language to describe various similarity scores, since the similarity score is a continuous variable with values ranging from 0 to 1.

Conclusion
In this paper, we first formalize four types of contrastive learning: contrastive learning from zero feedback (CLZF), contrastive learning from human feedback (CLHF), contrastive learning from AI feedback (CLAIF) and contrastive learning from human and AI feedback (CLHAIF). We then improve contrastive learning of sentence embeddings with AI feedback and combine human feedback with AI feedback to produce better supervision signals. Experimental results show that CLAIF and CLHAIF bring substantial improvements for sentence embedding learning. We hope that learning from AI feedback can shed new light on representation learning and contrastive learning.

Limitations
To inspire future work, we conclude some limitations of our work as follows:
• While our method achieves promising performance on sentence embedding related tasks like STS, its performance on other natural language processing tasks still needs to be investigated.
• The AI feedback in our experiments comes from GPT-3, which requires a fee to use.
• We do not explore the effect of different task description prompts on the quality of generated sample pairs, which may influence the performance of CLAIF.
• In CLHAIF, we only use the AI feedback for positive sample pairs.How to utilize AI feedback for negative sample pairs remains to be studied.

D Sentence-Pair Modeling
In the sentence-pair modeling task, cross-encoders can be used to directly encode the concatenated sequence of two sentences and then predict a similarity score. Previous studies (Thakur et al., 2021; Liu et al., 2022; Lu et al., 2022) show that cross-encoders usually outperform bi-encoders. We find that CLAIF can significantly improve the performance of cross-encoders on the sentence-pair modeling task, with the help of fine-grained AI feedback scores.
We use the binary cross-entropy (BCE) loss to train cross-encoders initialized from BERT and RoBERTa:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\big(y_i \log \sigma(\hat{y}_i) + (1 - y_i)\log(1 - \sigma(\hat{y}_i))\big),$$

where N is the batch size, ŷ_i is the predicted score of the i-th sentence pair, y_i is the AI feedback similarity score and σ is the sigmoid function.
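The BCE objective can be sketched as follows (names are ours; `logits` stands for the cross-encoder's raw predicted scores ŷ_i before the sigmoid):

```python
import math

def sigmoid(x):
    """Logistic sigmoid, mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logits, scores):
    """Binary cross-entropy between the cross-encoder's prediction
    sigma(y_hat_i) and the AI feedback similarity score y_i."""
    n = len(logits)
    return -sum(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z))
                for z, y in zip(logits, scores)) / n
```

Because the targets y_i are continuous scores rather than 0/1 labels, this is BCE with soft targets: the loss is minimized when sigma(ŷ_i) equals y_i for every pair.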

E Cost for Data Generation
According to our billing records, we spent about $100 to generate the data for CLAIF and about $720 for the scaled dataset.

F Generated Examples
We present some generated sample pairs used in CLAIF in Table 9 and some generated similarity scores for sample pairs constructed from NLI in Table 10.

G Comparison with text-embedding-ada-002

Recently, OpenAI released a powerful embedding model named text-embedding-ada-002; we compare its performance with CLAIF on STS tasks here. The results show that CLAIF-scaled achieves better performance on STS tasks than text-embedding-ada-002.


Figure 1: Illustration of the sample pair generation process. The darker the color, the more information the sentence shares with the original sentence.

Figure 2: The score distribution of our generated sample pairs. The x-axis is the similarity score and the y-axis is the percentage of the score.
However, thanks to the emergence of LLMs, we can use LLMs to measure the similarity of sample pairs and use the AI feedback as our training signal. We refer to this approach as Contrastive Learning from AI Feedback (CLAIF). CLAIF does not need to explicitly construct positive and negative sample pairs because each sample pair has a fine-grained label. We use the mean squared error (MSE) loss for the training of CLAIF in this work.
Contrastive Learning from Human and AI Feedback

Besides contrastive learning from AI feedback, we propose to combine human and AI feedback to produce better supervision signals when both are available. We call this category contrastive learning from human and AI feedback (CLHAIF) and propose a Soft InfoNCE loss for its training. We hope to use fine-grained AI feedback to refine the coarse-grained signals in current CLHF methods.

Table 3: The performance comparison of CLAIF and CLHAIF on transfer learning tasks. SentEval Avg is the average accuracy on seven transfer learning datasets from SentEval.

Table 5: The performance comparison of CLHAIF on STS tasks. †: results from Jiang et al. (2022). Other results are from our experiments. *: The results of PromptBERT and PromptRoBERTa are obtained by running the official code of Jiang et al. (2022) with recommended hyperparameters.

Table 6: The performance comparison of CLAIF based on the cross-encoder architecture.

Table 8: The performance comparison of CLHAIF on transfer learning tasks. †: results from Jiang et al. (2022). *: The results of PromptBERT and PromptRoBERTa are obtained by running the official code of Jiang et al. (2022) with recommended hyperparameters.

Table 9: Generated examples of sample pairs used in CLAIF.