Improving Contrastive Learning of Sentence Embeddings with Focal-InfoNCE

The recent success of SimCSE has greatly advanced state-of-the-art sentence representations. However, the original formulation of SimCSE does not fully exploit the potential of hard negative samples in contrastive learning. This study introduces an unsupervised contrastive learning framework that combines SimCSE with hard negative mining, aiming to enhance the quality of sentence embeddings. The proposed Focal-InfoNCE function introduces self-paced modulation terms in the contrastive objective, down-weighting the loss associated with easy negatives and encouraging the model to focus on hard negatives. Experiments on various STS benchmarks show that our method improves sentence embeddings in terms of Spearman's correlation and representation alignment and uniformity.


Introduction
Unsupervised learning of sentence embeddings has been extensively explored in natural language processing (NLP) (Cer et al., 2018; Giorgi et al., 2020; Yan et al., 2021), aiming to generate meaningful representations of sentences without the need for labeled data. Among various approaches, SimCSE (Gao et al., 2021) achieves state-of-the-art performance in learning high-quality sentence embeddings through contrastive learning. Due to its simplicity and effectiveness, various efforts have been made to improve the contrastive learning of sentence embeddings from different aspects, including alleviating false negative pairs (Wu et al., 2021a; Zhou et al., 2022) and incorporating more informative data augmentations (Wu et al., 2021b; Chuang et al., 2022).
Leveraging hard negative samples in contrastive learning is of significance (Schroff et al., 2015; Oh Song et al., 2016; Robinson et al., 2020). Nevertheless, unsupervised contrastive learning approaches often face challenges in hard sample mining. Specifically, the original training paradigm of unsupervised SimCSE proposes to use contradiction sentences as "negatives". But such an implementation only guarantees that the contradiction sentences are "true negatives", not that they are hard. With a large number of easy negative samples, the contribution of hard negatives is thus prone to being overwhelmed. To address this issue, we propose a novel loss function, namely Focal-InfoNCE, within the paradigm of unsupervised SimCSE for sentence embedding. Inspired by the focal loss (Lin et al., 2017), the proposed Focal-InfoNCE loss assigns higher weights to harder negative samples in model training and reduces the influence of easy negatives accordingly. By doing so, Focal-InfoNCE encourages the model to focus more on challenging pairs, forcing it to learn more discriminative sentence representations. In addition, to adapt the dropout strategy for positive pair construction in SimCSE, we further incorporate a positive modulation term in the contrastive objective, which re-weights the positive pairs in model optimization. We conduct extensive experiments on various STS benchmark datasets (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017; Marelli et al., 2014) to evaluate the effectiveness of Focal-InfoNCE. Our results demonstrate that Focal-InfoNCE significantly improves the quality of sentence embeddings and outperforms unsupervised SimCSE by an average of 1.64%, 0.82%, 1.51%, and 0.75% Spearman's correlation on BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large, respectively.

Unsupervised SimCSE
SimCSE (Gao et al., 2021) provides an unsupervised contrastive learning solution with SOTA performance in sentence embedding. Following previous work (Chen et al., 2020), it optimizes a pre-trained model with the cross-entropy objective using in-batch negatives. Formally, given a mini-batch of N sentences $\{x_i\}_{i=1}^N$, let $h_i$ be the sentence representation of $x_i$ from a pre-trained language model such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019). SimCSE's training objective, InfoNCE, can be formulated as

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}}, \qquad (1)$$

where $\tau$ is a temperature hyperparameter and $\mathrm{sim}(h_i, h_j)$ represents the cosine similarity between the representations of sentence pairs $(x_i, x_j)$. Note that $h_i^+$ is the representation of an augmented version of $x_i$, which constitutes the positive pair of $x_i$. For notational simplicity, we will use $s_p^i$ and $s_n^{i,j}$ to denote the similarities of positive pairs and negative pairs, respectively, in this paper.
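As a concrete reference, the InfoNCE objective above can be sketched in a few lines of NumPy; the function name `info_nce` and the toy batch shapes are illustrative, not part of the original SimCSE code.

```python
import numpy as np

def info_nce(h, h_pos, tau=0.05):
    """InfoNCE over a mini-batch (sketch).

    h, h_pos: (N, d) arrays holding two encodings of the same N
    sentences; h[i] and h_pos[i] form the positive pair, while all
    other in-batch sentences act as negatives.
    """
    # Cosine similarities via row-normalized dot products.
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_pos = h_pos / np.linalg.norm(h_pos, axis=1, keepdims=True)
    logits = (h @ h_pos.T) / tau            # (N, N); diagonal = positives
    # Cross-entropy with the diagonal as the target class.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

With a small temperature, well-aligned positive pairs drive the loss toward zero, while mismatched pairings drive it up.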
Unsupervised SimCSE uses the model's built-in dropout as minimal "data augmentation": it passes the same sentence through the same encoder twice to obtain two sentence embeddings that form a positive pair. Any two different sentences within a mini-batch form a negative pair. It should be noted that in contrastive learning, model optimization with hard negative samples helps learn better representations, but SimCSE does not distinguish hard negatives from easy ones. We show in this work that incorporating hard negative sample mining in SimCSE boosts the quality of sentence embeddings.
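The dropout-as-augmentation idea can be illustrated with a toy encoder; the single linear layer and the dropout rate p = 0.1 stand in for BERT's built-in dropout, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, p=0.1):
    """Toy encoder: a linear map followed by inverted dropout.

    A fresh dropout mask is drawn on every call, so encoding the
    same batch twice yields two slightly different "views".
    """
    h = x @ W
    mask = rng.random(h.shape) >= p     # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)         # inverted-dropout scaling

x = rng.standard_normal((4, 16))        # stand-in for a batch of 4 sentences
W = rng.standard_normal((16, 16))
h, h_pos = encode(x, W), encode(x, W)   # same input, two dropout masks
```

Each (h[i], h_pos[i]) then serves as a positive pair, while (h[i], h_pos[j]) for j ≠ i serve as in-batch negatives.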

Sample Re-weighting in Machine Learning
Re-weighting is a simple yet effective strategy for addressing biases in machine learning. It down-weights the loss from majority classes and obtains a balanced learning solution for minority groups.
Re-weighting is also a common technique for hard example mining in deep metric learning (Schroff et al., 2015) and contrastive learning (Chen et al., 2020; Khosla et al., 2020). Recently, self-paced re-weighting has been explored in various tasks, such as object detection (Lin et al., 2017), person re-identification (Sun et al., 2020), and adversarial training (Hou et al., 2023). It re-weights the loss of each sample adaptively according to the model's optimization status and encourages the model to focus on learning hard cases. To the best of our knowledge, this study constitutes the first attempt to incorporate a self-paced re-weighting strategy into unsupervised sentence embedding.

Focal-InfoNCE for Sentence Embedding
This study follows the unsupervised SimCSE framework for sentence embedding. Instead of taking the InfoNCE loss in Eq. (1), we introduce a self-paced re-weighting objective function, Focal-InfoNCE, to up-weight hard negative samples in contrastive learning. Specifically, for each sentence $x_i$, Focal-InfoNCE is formulated as

$$\ell_i = -\log \frac{e^{(s_p^i)^2/\tau}}{e^{(s_p^i)^2/\tau} + \sum_{j \neq i} e^{s_n^{i,j}(s_n^{i,j}+m)/\tau}}, \qquad (2)$$

where $m$ is a hardness-aware hyperparameter that offers flexibility in adjusting the re-weighting strategy. Within a mini-batch of N sentences, the final loss function can be derived as

$$L = \sum_{i=1}^{N} \ell_i. \qquad (3)$$

Analysis of Focal-InfoNCE: Compared with InfoNCE in Eq. (1), Focal-InfoNCE introduces self-paced modulation terms on $s_p$ and $s_n$, proportional to the similarity quantification. Let us first focus on the modulation term $s_n^{i,j} + m$ on negative pairs. Prior art has shown that pre-trained language models usually suffer from anisotropy in sentence embedding (Wang and Isola, 2020). Fine-tuning the pre-trained models with contrastive learning on negative samples, especially hard negative samples, improves the uniformity of representations, mitigating the anisotropy issue. In SimCSE, $s_n^{i,j}$ quantifies the similarity between negatives $x_i$ and $x_j$. If $s_n^{i,j}$ is large, $x_i$ and $x_j$ are hard negatives for the current model. Improving the model with such hard negative pairs encourages the representation's uniformity. To this end, we propose to up-weight the corresponding term $s_n^{i,j}/\tau$ by a modulation factor $s_n^{i,j} + m$. The partial derivative of Focal-InfoNCE with respect to $s_n^{i,j}$ is

$$\frac{\partial \ell_i}{\partial s_n^{i,j}} = \frac{2 s_n^{i,j} + m}{\tau} \cdot \frac{e^{s_n^{i,j}(s_n^{i,j}+m)/\tau}}{Z_i}, \qquad (4)$$

where $Z_i = \sum_{j \neq i} e^{s_n^{i,j}(s_n^{i,j}+m)/\tau} + e^{(s_p^i)^2/\tau}$. According to Eq. (4), compared with easy negatives, hard negative samples associated with higher similarity scores $s_n^{i,j}$ contribute more to the loss function. This implies that a model optimized with the proposed Focal-InfoNCE focuses more on hard negative samples. Our experiments also show that Focal-InfoNCE improves uniformity in sentence embeddings.
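A minimal NumPy sketch of the Focal-InfoNCE objective described above, assuming cosine similarities and in-batch negatives as in SimCSE; the function name and default hyperparameter values are illustrative.

```python
import numpy as np

def focal_info_nce(h, h_pos, tau=0.05, m=0.2):
    """Focal-InfoNCE over a mini-batch (sketch).

    Relative to InfoNCE, the positive exponent s_p/tau is modulated
    to (s_p**2)/tau and each negative exponent s_n/tau to
    s_n*(s_n + m)/tau, so hard (high-similarity) negatives are
    up-weighted.
    """
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_pos = h_pos / np.linalg.norm(h_pos, axis=1, keepdims=True)
    sim = h @ h_pos.T                       # (N, N) cosine similarities
    s_p = np.diag(sim)                      # positive-pair similarities
    pos = np.exp(s_p ** 2 / tau)            # modulated positive terms
    neg = np.exp(sim * (sim + m) / tau)     # modulated negative terms
    np.fill_diagonal(neg, 0.0)              # exclude the positive pair
    return float(np.mean(-np.log(pos / (pos + neg.sum(axis=1)))))
```

Swapping this function in for InfoNCE leaves the rest of the SimCSE training loop unchanged.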
To uncover the insight behind the modulation term $s_p^i$ on positive cases, let us revisit SimCSE. In SimCSE, the positive pair is formed by dropout with random masking. Thus a low similarity score $s_p$ indicates semantic information loss introduced by dropout. Since such a low similarity is not attributable to the model's representation capability, we should mitigate its effect on model optimization. Hence, Focal-InfoNCE assigns a small weight to the dissimilar positive pair. The partial derivative with respect to $s_p^i$ is

$$\frac{\partial \ell_i}{\partial s_p^i} = -\frac{2 s_p^i}{\tau}\left(1 - \frac{e^{(s_p^i)^2/\tau}}{Z_i}\right), \qquad (5)$$

which suggests that positive pairs with lower similarity scores in SimCSE contribute less to model optimization. We show in the experiments that Focal-InfoNCE improves the alignment of sentence embeddings as well.
Due to the modulation terms on both positive and negative samples, Focal-InfoNCE reduces the chances of the model getting stuck in sub-optimal solutions dominated by easy pairs. We show in our experiments that the proposed Focal-InfoNCE can easily fit into most contrastive training frameworks for sentence embeddings.

Comparison to Prior Arts
Table 2 shows the performance of the different models with and without the Focal-InfoNCE loss. In general, we observe improvements in Spearman's correlation scores when incorporating the proposed Focal-InfoNCE. For example, with SimCSE-BERT-base, the average score increases from 75.68 to 77.32 when using Focal-InfoNCE.

Alignment and Uniformity
Alignment and uniformity are two key properties for measuring the quality of contrastive representations (Gao et al., 2021). By specifically focusing on challenging negative samples, Focal-InfoNCE encourages the model to pay closer attention to negative instances that are difficult to distinguish from positive pairs. In Table 3, we incorporate the proposed Focal-InfoNCE into different contrastive learning frameworks for sentence embeddings and show improvements in both the alignment and uniformity of the resulting representations.
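For reference, the two metrics can be computed following Wang and Isola (2020) with α = 2 and t = 2; the helper names are ours, and the inputs are assumed to be L2-normalized embeddings.

```python
import numpy as np

def alignment(h, h_pos, alpha=2):
    """Expected distance between positive-pair embeddings (lower is better)."""
    return float(np.mean(np.linalg.norm(h - h_pos, axis=1) ** alpha))

def uniformity(h, t=2):
    """Log of the mean Gaussian potential over all distinct pairs (lower is better)."""
    sq_dists = np.sum((h[:, None, :] - h[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(h), k=1)      # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq_dists[i, j]))))
```

Perfectly aligned positives give alignment 0, and embeddings spread evenly on the hypersphere push uniformity down.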

Ablation Studies on Hyperparameters
We conducted ablation studies to analyze two key factors in Focal-InfoNCE: temperature τ and the hardness hyperparameter m.
Temperature τ is a hyperparameter in the InfoNCE loss function that scales the logits before computing the probabilities. Wang and Liu (2021) show that the temperature plays a key role in controlling the strength of penalties on hard negative samples. In this ablation, we set the temperature to 0.03, 0.05, 0.07, and 0.1 to explore the effect of different temperature values on the model's performance, and report the results in Table 4.
The hardness hyperparameter m controls the re-scaling of the negative samples in the contrastive loss function. The effectiveness of Focal-InfoNCE depends on the quality of pre-trained models: when a pre-trained model yields poor representations, the similarity scores may mislead model fine-tuning. In addition, the positive re-weighting strategy in this study is quite simple. We believe that more sophisticated mechanisms to address semantic information loss in positive pairs would further improve the performance.
Figure 1 visualizes the re-scaling effects of m. Specifically, our Focal-InfoNCE loss regards negative pairs with cosine similarity larger than (1 - m) as hard negative examples and the remainder as easy ones. The loss is then up-weighted or down-weighted proportionally.
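The (1 - m) threshold can be checked numerically: relative to the InfoNCE exponent s/τ, the Focal-InfoNCE exponent s(s + m)/τ is larger exactly when s > 1 - m (for positive s). The similarity values below are illustrative.

```python
import numpy as np

m = 0.25
s = np.array([0.5, 0.75, 0.9])   # example cosine similarities of negatives
rescaled = s * (s + m)           # Focal-InfoNCE exponent (up to 1/tau)
up_weighted = rescaled > s       # True only where s > 1 - m = 0.75
```

With m = 0.25, only the similarity above 0.75 is up-weighted, matching the split shown in Figure 1.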

Figure 1: Plot of re-scaled negative pairs vs. their original values for different choices of m. Orange: up-weighted negative examples; green: down-weighted negative examples.

Table 1: Experimental settings for our main results.
Table 5 demonstrates that our method is not sensitive to m, and the optimal setting can usually be found between 0.2 and 0.3.

Table 3: Alignment (ALGN), uniformity (UNIF), and Spearman's correlation (SpCorr) with BERT-base on the STS Benchmark (Cer et al., 2017).

By combining SimCSE with self-paced hard negative re-weighting, model optimization benefits from hard negatives. Extensive experiments show the effectiveness of the proposed method on various STS benchmarks.