Ranking-Enhanced Unsupervised Sentence Representation Learning

Unsupervised sentence representation learning has progressed through contrastive learning and data augmentation methods such as dropout masking. Despite this progress, sentence encoders are still limited to using only an input sentence when predicting its semantic vector. In this work, we show that the semantic meaning of a sentence is also determined by nearest-neighbor sentences that are similar to the input sentence. Based on this finding, we propose a novel unsupervised sentence encoder, RankEncoder. RankEncoder predicts the semantic vector of an input sentence by leveraging its relationship with other sentences in an external corpus, as well as the input sentence itself. We evaluate RankEncoder on semantic textual similarity benchmark datasets. From the experimental results, we verify that 1) RankEncoder achieves 80.07% Spearman's correlation, a 1.1% absolute improvement compared to the previous state-of-the-art performance, 2) RankEncoder is universally applicable to existing unsupervised sentence embedding methods, and 3) RankEncoder is specifically effective for predicting the similarity scores of similar sentence pairs.


INTRODUCTION
Unsupervised sentence encoders aim to overcome limited labeled data using a contrastive learning framework with data augmentation methods (Gao et al., 2021; Yan et al., 2021; Liu et al., 2021; Wu et al., 2021; Izacard et al., 2021; Kim et al., 2021). These approaches minimize the distance between the vector representations of similar sentences, called positive pairs, while maximizing the distance between those of dissimilar sentences, called negative pairs. Many studies have focused on constructing better positive and negative pairs. Data augmentation methods such as dropout masking (Gao et al., 2021), token shuffling (Yan et al., 2021), and sentence negation have led to significant improvements and achieved semantic textual similarity performance comparable to supervised sentence encoders trained on NLI datasets (Cer et al., 2018; Reimers & Gurevych, 2019). However, existing data augmentation methods sometimes generate positive and negative pairs with unintended semantic changes (Chuang et al., 2022). As a result, sentence encoders have difficulty learning fine-grained semantic differences. Figure 1 shows example sentences and their distances computed by the unsupervised sentence encoder PromptBERT. In this figure, sentences a, b, and c have slightly different meanings that differ by the phrases "standing in" and "a body of water near a waterfall", and the sentence encoder fails to capture the fine-grained semantic differences of these sentences. This results in a large performance difference between unsupervised and supervised sentence encoders on similar sentence pairs, as demonstrated in Figure 2, whereas the difference is negligible or reversed on dissimilar pairs; supervised sentence encoders are robust on similar sentence pairs.
Figure 2: The semantic textual similarity task performance of unsupervised sentence encoders and supervised sentence encoders on different sentence pair groups. We divide the STS-B dataset by the similarity score of each sentence pair (0-1 scale).
In this paper, we found that neighbor sentences approximate the fine-grained semantic differences of similar sentences. In Figure 1, identifying a closer neighbor sentence results in predicting the correct similarity scores even when the embedding vectors capture incorrect semantic meaning; a and c are closer to sentence α than β, and b is closer to sentence β than α. We extend this approach to a larger context than its neighbors and compare the input sentence with all the sentences in the given corpus. For a given unsupervised sentence encoder E, our approach, named RankEncoder, computes another vector representation of the input sentence based on the similarity scores between the input sentence and the sentences in the given corpus. The similarity scores are computed by the given sentence encoder, E. RankEncoder captures fine-grained semantic differences better than the original sentence vector without further training. Since we get more accurate semantic similarity with RankEncoder on similar sentence pairs, we use RankEncoder's similarity scores for training and achieve a better sentence encoder.
From experiments on seven STS benchmark datasets, we verify that 1) rank vectors are effective for capturing the fine-grained semantic differences between sentences, 2) RankEncoder brings improvements to any unsupervised sentence encoder, and 3) this improvement leads to state-of-the-art semantic textual similarity performance. First, we measure the performance difference between RankEncoder and the base encoder, E, on sentence pairs in different similarity ranges. The experimental results show that RankEncoder is effective on similar sentence pairs. Second, we apply RankEncoder to three base encoders, SimCSE (Gao et al., 2021), PromptBERT, and SNCSE (Wang et al., 2022), and verify that our approach applies to all of them. Third, we apply RankEncoder to the state-of-the-art unsupervised sentence encoder and achieve a 1.1% improvement; the performances of the previous state-of-the-art method and our method are 78.97 and 80.07, respectively.
The contributions of this paper are threefold. First, we demonstrate that previous contrastive learning approaches based on data augmentation methods are limited in capturing fine-grained semantic differences, leading to a large performance difference between unsupervised and supervised methods on similar sentence pairs. Second, we propose RankEncoder, which leverages neighbor sentences to capture fine-grained semantic differences. Third, we achieve state-of-the-art STS performance and reduce the gap between supervised and unsupervised sentence encoders; the performances of our method and the state-of-the-art supervised sentence encoder are 80.07 and 81.97, respectively.

RELATED WORKS
Unsupervised sentence encoders are trained by contrastive learning with positive and negative sentence pairs. Recent approaches construct positive and negative sentence pairs via data augmentation methods. SimCSE computes different sentence vectors from an input sentence by applying different dropout masks (Gao et al., 2021), and ConSERT uses token shuffling and adversarial attacks (Yan et al., 2021). These data augmentation methods often change the meaning of the input sentence and generate dissimilar positive pairs. To alleviate this problem, DiffCSE proposes a masked language modeling based word replacement method (Chuang et al., 2022). Some studies adopt momentum contrastive learning inspired by unsupervised visual representation learning (Zhang et al., 2021; Wu et al., 2021). Prompting is another direction that is capable of generating positive pairs. With these data augmentation methods, sentence encoders are mostly trained on dissimilar sentence pairs, which are less important than positive pairs (Thakur et al., 2021; Zhou et al., 2022); we get n positive pairs and n^2 - n negative pairs from a batch of n sentences. SNCSE's soft-negative sampling method alleviates this problem. SNCSE takes the negation of an input sentence and uses this soft-negative sample in a contrastive learning framework; since the negated sentence is still semantically similar to the input, the sentence encoder is trained more on similar sentence pairs. Compared to previous approaches, which focus on generating better positive and negative pairs, our work uses neighbor sentences to obtain more accurate supervision signals for similar sentence pairs, a direction that previous approaches have not covered. Related to but different from our work, Trans-Encoder proposes a self-distillation method that obtains supervision signals from itself (Liu et al., 2022). However, Trans-Encoder solves a slightly different problem from ours.
Trans-Encoder targets unsupervised sentence pair modeling, not unsupervised sentence embedding; although it does not require human-annotated similarity scores for training, it needs the sentence pairs of the STS datasets, which unsupervised sentence embedding studies do not allow for training.

METHOD
Neighbor sentences provide an overview of the semantics of the local vector space they belong to. By comparing the input sentence with its neighbor sentences, we identify which semantic meanings the input sentence contains. This is effective for capturing fine-grained semantic differences between sentences. For instance, given two input sentences sharing their neighbors, the semantic difference is identified by referring to neighbor sentences whose relations to the two input sentences are reversed; we provide an example in Figure 1. We extend this method to all sentences in the corpus. Our unsupervised sentence encoder, RankEncoder, computes the ranks of all sentences in the corpus by their similarity scores to the input. When two input sentences have similar ranks, RankEncoder predicts that they are similar. We found that RankEncoder captures more accurate semantic similarity on similar sentence pairs, so we leverage RankEncoder's predictions for training. We provide an overall illustration of RankEncoder in Figure 3.

CONTRASTIVE LEARNING FOR BASE ENCODER
The first step of our framework is to learn a base sentence encoder E_1 via the standard contrastive learning approach (Chen et al., 2020). Given an input sentence x_i, we first create a positive example x_i^+ which is semantically related to x_i (Gao et al., 2021; Chuang et al., 2022). By passing the two sentences through the encoder, we obtain their representations h_i and h_i^+ and compute the per-sentence loss over a batch of size m:

l_i = -log( exp(cos(h_i, h_i^+) / τ) / Σ_{j=1}^{m} exp(cos(h_i, h_j^+) / τ) )

where cos(·) is the cosine similarity function and τ is the temperature hyperparameter. We then get the overall contrastive loss for the whole batch by summing over all the sentences: l_cl = Σ_{i=1}^{m} l_i. Note that the training objective l_cl can be further enhanced by adding other relevant losses (Chuang et al., 2022), transforming the input sentences (Gao et al., 2021), or modifying the standard contrastive loss (Zhou et al., 2022). For simplicity, we use l_cl to represent all the variants of contrastive learning loss in this paper. By optimizing l_cl, we first obtain a coarse-grained sentence encoder E_1 for the following steps.
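The in-batch contrastive objective above can be sketched as follows. This is an illustrative NumPy implementation of the standard InfoNCE loss, not the authors' code; the batch size, embedding dimensionality, and the perturbation used to mimic dropout-based augmentation are arbitrary choices of this sketch.

```python
import numpy as np

def contrastive_loss(h, h_pos, tau=0.05):
    """Batch InfoNCE loss: for each sentence i, h_pos[i] is the positive view,
    and the positives of the other sentences act as in-batch negatives."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_pos = h_pos / np.linalg.norm(h_pos, axis=1, keepdims=True)
    sim = (h @ h_pos.T) / tau                       # (m, m) scaled cosine similarities
    # l_i = -log softmax over row i, evaluated at the diagonal (positive) entry
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.diag(log_softmax).mean())      # l_cl averaged over the batch

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
# positives are slightly perturbed copies, mimicking dropout-style augmentation
loss = contrastive_loss(h, h + 0.01 * rng.normal(size=(8, 16)))
```

With near-identical positives the diagonal dominates each softmax row, so the loss is close to zero; with unrelated "positives" it approaches log m.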

RANKENCODER
RankEncoder computes the ranks of sentences in the corpus by their similarity scores to the input sentence. For a given corpus with n sentences, C = [x_1, ..., x_n], and a given base encoder, E_1, RankEncoder computes the vector representation of each sentence in the corpus, V = [v_1, ..., v_n], with E_1. It then computes the rank vector of an input sentence, x, from the ranks as follows:

r_x = g([r_1, ..., r_n])

where r_i is the rank of sentence x_i. We compute r_i with the cosine similarity scores between V and the vector representation of x computed by E_1. The function g is a normalization function defined as follows:

g(r) = (r - r̄ · 1) / (√n · σ)

where r̄ is the mean of the input values, σ is the standard deviation of the input values, and 1 is a vector of ones of size n. By applying this function to rank vectors, the inner product of two normalized rank vectors becomes equivalent to Spearman's rank correlation, and the similarity is scaled between -1 and 1. We describe the connection between the normalization function g and Spearman's rank correlation in Appendix A.1.
Figure 4: Sentence vectors in the same region have the same rank vector; the rank vector of sentence x is the same as r_α as x lies in the α region.
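A minimal sketch of the rank-vector computation, assuming plain NumPy arrays as sentence embeddings and a toy random corpus in place of the Wikipedia corpus; `rank_vector` and `g` mirror the ranking and normalization steps described above.

```python
import numpy as np

def rank_vector(x_vec, corpus_vecs):
    """Rank every corpus sentence by cosine similarity to the input (rank 1 = closest)."""
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = C @ (x_vec / np.linalg.norm(x_vec))
    order = np.argsort(-sims)                   # corpus indices, most to least similar
    ranks = np.empty(len(sims))
    ranks[order] = np.arange(1, len(sims) + 1)
    return ranks

def g(r):
    """Normalize so the inner product of two normalized rank vectors
    equals Spearman's rank correlation (scaled between -1 and 1)."""
    return (r - r.mean()) / (np.sqrt(len(r)) * r.std())

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 32))             # stand-in for the external corpus C
u = g(rank_vector(rng.normal(size=32), corpus))
v = g(rank_vector(rng.normal(size=32), corpus))
```

Note that a normalized rank vector has unit self-inner-product, since Spearman's correlation of a ranking with itself is 1.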

SEMANTIC VECTOR SPACE OF RANKENCODER
Every sentence in the given corpus is a factor in determining the similarity between two rank vectors, as rank vectors encode all relations between the input sentence and the corpus. However, for similar sentences, the similarity is mostly affected by their neighbor sentences. Figure 4 shows an example of RankEncoder's semantic space when the corpus has three sentences. Each solid line represents a boundary at which two ranks are swapped. For instance, the yellow line is the boundary that swaps the ranks of sentences b and c; all the vectors located on the left side of the yellow line are closer to sentence b than to c. Since we have three sentences in this corpus, we get six rank vectors, and all vectors in each region have the same rank vector; for n sentences, we get at most n! (factorial) rank vectors. In this figure, all boundaries cross in the central area of the three sentences. As a result, the six regions are concentrated in the central area. For a given sentence, if its representation lies in the central area, i.e., the red area, then its corresponding rank vector can be easily changed by a small modification of the sentence representation. For vectors at a larger distance from these sentences, e.g., the vectors in the gray area, the corresponding rank vectors are much less sensitive to modifications of the sentence representation. This pattern holds for a larger corpus as well; we demonstrate this in Section 5.5. As a result, RankEncoder relies more heavily on neighbor sentences even though we use all sentences in the corpus for computing the rank vectors.
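The counting argument above (at most n! distinct rank vectors for a corpus of n sentences) can be checked empirically. A toy sketch with a hypothetical three-sentence corpus of random 2-D embeddings:

```python
import numpy as np

def rank_tuple(x, corpus):
    """Ranks of the corpus sentences by cosine similarity to x, as a tuple."""
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = C @ (x / np.linalg.norm(x))
    order = np.argsort(-sims)
    ranks = np.empty(len(sims), dtype=int)
    ranks[order] = np.arange(1, len(sims) + 1)
    return tuple(ranks)

rng = np.random.default_rng(0)
corpus = rng.normal(size=(3, 2))                # n = 3 corpus sentences in 2-D
observed = {rank_tuple(rng.normal(size=2), corpus) for _ in range(2000)}
# every observed rank vector is a permutation of (1, 2, 3), so |observed| <= 3! = 6
```

Sampling many random input vectors never produces more than 3! = 6 distinct rank vectors, matching the six regions in Figure 4.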

MODEL TRAINING
We then use rank vectors for training. For a given unsupervised sentence encoder E_1 and corpus C, we compute similarity scores of all possible sentence pairs in a batch with the rank vectors computed by E_1. The similarity scores are computed by the inner product of these rank vectors. Then, we define the loss function as the mean squared error between RankEncoder's similarity scores and the similarity scores of the encoder in training:

l_r = (1 / m²) Σ_{i=1}^{m} Σ_{j=1}^{m} ( u_i · u_j - cos(E_2(x_i), E_2(x_j)) )²

where {x_i}_{i=1}^{m} are the sentences in the batch, E_2 is the sentence encoder in training, u_i is the vector representation of x_i computed by RankEncoder_{E_1}, and cos(·) is the cosine similarity function. Then, we combine the RankEncoder loss, l_r, with the contrastive loss function of E_1, l_cl, as follows:

l_total = l_cl + λ_train · l_r

where λ_train is a weight parameter for the RankEncoder loss.
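A sketch of the re-training objective, assuming a simple weighted combination of the two losses; `U` holds the (frozen) normalized rank vectors from RankEncoder_{E_1} acting as the teacher, and `H` holds the embeddings of the encoder E_2 being trained. This illustrates the shape of the loss, not the authors' implementation.

```python
import numpy as np

def rank_encoder_loss(U, H):
    """l_r: mean squared error between rank-vector similarities (u_i . u_j)
    and the cosine similarities of the embeddings in training."""
    target = U @ U.T                                    # RankEncoder similarity scores
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    pred = Hn @ Hn.T                                    # cos(E2(x_i), E2(x_j))
    m = len(U)
    return float(((target - pred) ** 2).sum() / (m * m))

def total_loss(l_cl, l_r, lam_train=0.05):
    """l_total: contrastive loss plus the weighted RankEncoder loss."""
    return l_cl + lam_train * l_r

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 10)); U /= np.linalg.norm(U, axis=1, keepdims=True)
H = rng.normal(size=(4, 32))
l_r = rank_encoder_loss(U, H)
```

If E_2's cosine similarities exactly match the teacher's rank-vector similarities, l_r vanishes; otherwise it pulls E_2's similarity structure toward the teacher's.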

SENTENCE PAIR FILTERING
Previous unsupervised sentence encoders randomly sample sentences to construct a batch, and randomly sampled sentence pairs are mostly dissimilar pairs. This causes sentence encoders to learn mostly from dissimilar pairs, which are less important than similar sentence pairs. To alleviate this problem, we filter out dissimilar sentence pairs with a similarity under a certain threshold. Also, it is unlikely that randomly sampled sentence pairs have exactly the same semantic meaning, so we regard sentence pairs with very high similarity as noisy samples and filter these pairs with another threshold. The final RankEncoder loss function with sentence pair filtering is as follows:

l_r = (1 / m²) Σ_{i=1}^{m} Σ_{j=1}^{m} 1[τ_l ≤ u_i · u_j ≤ τ_u] · ( u_i · u_j - cos(E_2(x_i), E_2(x_j)) )²

where τ_l and τ_u are the thresholding parameters, and 1[·] is the indicator function that returns 1 when the condition is true and returns 0 otherwise.
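The thresholded loss can be sketched by masking pair similarities outside [τ_l, τ_u]. A hedged NumPy illustration with the paper's default thresholds; the averaging over kept pairs is an assumption of this sketch:

```python
import numpy as np

def filtered_rank_loss(U, H, tau_l=0.5, tau_u=0.8):
    """RankEncoder loss restricted to pairs whose teacher similarity
    u_i . u_j lies inside [tau_l, tau_u]; all other pairs are filtered out."""
    target = U @ U.T
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    pred = Hn @ Hn.T
    keep = (target >= tau_l) & (target <= tau_u)        # indicator 1[tau_l <= sim <= tau_u]
    if not keep.any():
        return 0.0                                      # every pair filtered out
    return float((((target - pred) ** 2) * keep).sum() / keep.sum())

U = np.eye(3)                                           # teacher sims: 1 on the diagonal, 0 elsewhere
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
loss = filtered_rank_loss(U, H)
```

In this toy example every teacher similarity is either 0 (below τ_l) or 1 (above τ_u), so all pairs are filtered and the loss is zero; widening the thresholds brings pairs back into the objective.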

INFERENCE
We can further utilize RankEncoder at the inference stage. Given a sentence pair (x_i, x_j), we compute the similarity between the two sentences as follows:

sim(x_i, x_j) = cos(E_2(x_i), E_2(x_j)) + λ_inf · (u_i · u_j)

where E_2 is a sentence encoder trained by the RankEncoder loss function, λ_inf is a weight parameter for the similarity computed by RankEncoder, and u_i and u_j are the sentence vectors of x_i and x_j computed by RankEncoder_{E_2}.
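A sketch of the inference-time score, assuming an additive combination of the encoder's cosine similarity and the weighted rank-vector similarity (the exact form of the combination in Eq. 7 is an assumption of this sketch):

```python
import numpy as np

def combined_similarity(h_i, h_j, u_i, u_j, lam_inf=0.1):
    """Inference-time score: cosine similarity from the trained encoder E2
    plus the rank-vector similarity, weighted by lam_inf."""
    cos = float(h_i @ h_j / (np.linalg.norm(h_i) * np.linalg.norm(h_j)))
    return cos + lam_inf * float(u_i @ u_j)

h = np.array([1.0, 2.0, 3.0])                           # toy embedding from E2
u = np.array([0.6, 0.8])                                # toy unit-norm rank vector
score = combined_similarity(h, h, u, u)                 # ~= 1.0 + 0.1 * 1.0
```

For an identical pair both terms reach their maxima, so the score is 1 + λ_inf; in practice the rank-vector term refines the ordering of similar pairs rather than shifting all scores uniformly.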

BASE ENCODER E 1 & CORPUS C
RankEncoder computes rank vectors with the unsupervised sentence encoder E_1 and corpus C. We use 100,000 sentences sampled from Wikipedia as the corpus C. We demonstrate the efficacy of RankEncoder with three different unsupervised sentence encoders: SimCSE (Gao et al., 2021), PromptBERT, and SNCSE. SimCSE is effective for showing the properties of RankEncoder since this model uses the standard contrastive learning loss with a simple data augmentation method. We use PromptBERT and SNCSE, the state-of-the-art unsupervised sentence encoders, to verify whether RankEncoder is effective on more complex models.

TRAINING DETAILS & HYPER-PARAMETER SETTINGS
We train RankEncoder on 1 million sentences from Wikipedia, following existing unsupervised sentence embedding studies. We set λ train = 0.05, λ inf = 0.1, τ l = 0.5, and τ u = 0.8. We provide more analysis for the hyper-parameter, λ train , in Appendix A.2. We use the development sets of the STS-B and SICK-R datasets for parameter tuning. For other hyper-parameters, we follow the base encoder's setting provided by the authors of each base encoder, E 1 .

RESULTS AND DISCUSSIONS
In this section, we demonstrate that 1) RankEncoder is effective for capturing the semantic similarity scores of similar sentences, 2) RankEncoder applies to existing unsupervised sentence encoders, and 3) RankEncoder achieves state-of-the-art semantic textual similarity (STS) performance. We describe the detailed experimental results in the following sections.
Figure 5: STS performance of three unsupervised sentence encoders and RankEncoder. We report the mean performance and standard deviation of three separate trials with different random seeds. RankEncoder brings improvements on all base encoders. This result implies that our approach generally applies to other unsupervised sentence embedding approaches.

SEMANTIC TEXTUAL SIMILARITY PERFORMANCE
We apply RankEncoder to an existing unsupervised sentence encoder and achieve state-of-the-art STS performance. We use SNCSE (Wang et al., 2022) fine-tuned on BERT-base (Devlin et al., 2019) as the base encoder, E 1 , of RankEncoder. Table 1 shows the STS performance of RankEncoder and unsupervised sentence encoders on seven STS datasets and their average performance (AVG). RankEncoder increases the AVG performance of SNCSE by 1.1 and achieves the state-of-the-art STS performance.
RankEncoder brings a significant performance gain on STS12, STS13, STS14, and SICK-R, but a comparably small improvement on STS16 and STS-B. We conjecture that this results from the effectiveness of RankEncoder on similar sentence pairs. In Appendix A.3, we show the similarity distribution of each dataset. From the similarity distributions, we see that STS12, STS13, STS14, and SICK-R contain more similar sentence pairs than dissimilar pairs. This pattern is aligned with the performance gain on each STS dataset in Table 1.

UNIVERSALITY OF RANKENCODER
RankEncoder applies to any unsupervised sentence encoder. We apply RankEncoder to SimCSE (Gao et al., 2021), PromptBERT, and SNCSE (Wang et al., 2022). SimCSE represents the vanilla contrastive learning based sentence encoder, and PromptBERT and SNCSE represent the state-of-the-art unsupervised sentence encoders. We evaluate each encoder's average performance (AVG) on seven STS datasets. We train each encoder in three separate trials and report the mean and the standard deviation of the three AVG performances in Figure 5; the error bar shows the standard deviation. In this figure, RankEncoder increases the average STS performance of all three unsupervised sentence encoders; the improvements on SimCSE, PromptBERT, and SNCSE are 2.1, 0.9, and 0.9, respectively. We report detailed experimental results in Appendix A.4. This result implies that RankEncoder is a universal method that applies to any unsupervised sentence encoder.

OVERLAPPING NEIGHBOR SENTENCES
In Section 3.3, we conjecture that RankEncoder is especially effective for similar sentence pairs since they have more overlapping neighbor sentences. To support this supposition, we show the relation between the performance gain brought by RankEncoder and the number of overlapping neighbor sentences of the input sentences. We group sentence pairs in the STS-B dataset by their cosine similarity scores, then compare the STS performance of SimCSE and RankEncoder (Eq. 2 without re-training) on each group; we use SimCSE as the base encoder, E_1, of RankEncoder. We also report the average number of overlapping neighbor sentences of the sentence pairs in each group. We select the nearest 100 neighbor sentences for each sentence in a given sentence pair and count the number of sentences in the intersection. Figure 6 shows a result consistent with our supposition; the performance gain correlates with the number of overlapping neighbor sentences.
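The overlap statistic used in this analysis can be sketched as follows, with a toy random corpus standing in for the real one (the paper uses the nearest 100 neighbors; a smaller k is used here for the example):

```python
import numpy as np

def overlapping_neighbors(v1, v2, corpus_vecs, k=100):
    """Number of corpus sentences appearing in the top-k nearest neighbors
    (by cosine similarity) of both input vectors."""
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    top1 = np.argsort(-(C @ (v1 / np.linalg.norm(v1))))[:k]
    top2 = np.argsort(-(C @ (v2 / np.linalg.norm(v2))))[:k]
    return len(set(top1) & set(top2))

rng = np.random.default_rng(0)
corpus = rng.normal(size=(500, 16))
a = rng.normal(size=16)
# a slightly perturbed copy of `a` shares most of its neighborhood
n_shared = overlapping_neighbors(a, a + 0.05 * rng.normal(size=16), corpus, k=50)
```

Vectors that are close in the embedding space share most of their top-k neighbors, which is exactly the condition under which the rank vectors carry fine-grained signal.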

PERFORMANCE ON SIMILAR SENTENCE PAIRS
In Section 5.3, we show that RankEncoder is effective for sentence pairs closely located in the semantic vector space. In this section, we demonstrate that the same pattern holds for human-annotated similar sentence pairs. We divide sentence pairs in the STS-B dataset into three groups by their human-annotated similarity scores and use the group with the highest similarity. The similarity range of each group is 0.0-0.33 for the dissimilar group, 0.33-0.67 for the intermediate group, and 0.67-1.0 for the similar group; we normalize the scores to a 0.0-1.0 scale. Figure 7 shows the performance of three unsupervised sentence encoders and the performance gain brought by each component of RankEncoder. RankEncoder_E is the model with Eq. 2 that uses E as the base encoder. RankEncoder_E-retrain is the model with re-training (Eq. 5). RankEncoder_E-retrain-inf is the model with re-training and weighted inference (Eq. 7). From the comparison between E and RankEncoder_E, we verify that rank vectors effectively increase the base encoder's performance on similar sentence pairs. This improvement is even more significant when using rank vectors for re-training and inference. We report the detailed results in Appendix A.5.
Figure 8: Vector representations of 1,000 sentences randomly sampled from the STS-B dataset (grey dots). We use the same PromptBERT encoder as the base encoder of RankEncoder. We use the following equation to compute the distances between vectors: dist(v_i, v_j) = 1 - cos(v_i, v_j); the higher the similarity, the closer the vectors.

THE VECTOR SPACE OF RANKENCODER
In Section 3.3, we show that RankEncoder increases the distance between similar sentences. In this section, we demonstrate that this pattern holds for a larger corpus as well. In Figure 8, we show the vector spaces of PromptBERT and RankEncoder; we use PromptBERT as the base encoder of RankEncoder. We visualize the vector representations of 1,000 randomly sampled sentences from the STS-B dataset. In this figure, the sub-spaces where vectors are densely located are expanded by RankEncoder, and the overall vector space becomes more uniform. We report a detailed comparison of the uniformity (Gao et al., 2021) of unsupervised sentence encoders and RankEncoder in Appendix A.6.

CONCLUSION
In this study, we found that previous unsupervised sentence encoders based on data augmentation methods have a certain limitation in capturing fine-grained semantic meanings of sentences. We proposed RankEncoder, which captures the semantic meanings of sentences by leveraging their neighbor sentences. We verified that using the relations between a sentence and its neighbors increases the STS performance without further training. We also showed that our approach is especially effective for capturing the semantic similarity scores of similar sentences. For further improvement, we used the similarity scores computed by RankEncoder for training unsupervised sentence encoders and achieved state-of-the-art STS performance. We also demonstrated that RankEncoder is generally applicable to any unsupervised sentence encoder.

A.1 NORMALIZATION FUNCTION g AND SPEARMAN'S RANK CORRELATION

The Spearman's rank correlation of two lists of variables, u = <u_1, ..., u_n> and v = <v_1, ..., v_n>, is the Pearson correlation coefficient, ρ, of their ranks, r_u and r_v, as follows:

ρ = Σ_{i=1}^{n} (r_{u,i} - r̄_u)(r_{v,i} - r̄_v) / (n · σ(r_u) · σ(r_v))

where r̄_u and r̄_v are the means of the rank variables, and σ(r_u) and σ(r_v) are the standard deviations of the ranks. Then, this can be re-written as follows:

ρ = ((r_u - r̄_u · 1) / (√n · σ(r_u))) · ((r_v - r̄_v · 1) / (√n · σ(r_v))) = g(r_u) · g(r_v)

Thus, the inner product of the two rank vectors after normalization with g is equivalent to the Spearman's rank correlation of the rank variables.
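A quick numerical sanity check of this equivalence, using NumPy's `corrcoef` for the Pearson correlation of the ranks (which is the definition of Spearman's ρ for untied data); the random inputs are arbitrary:

```python
import numpy as np

def g(r):
    """Normalization from the method section: g(r) = (r - mean) / (sqrt(n) * std)."""
    return (r - r.mean()) / (np.sqrt(len(r)) * r.std())

def to_ranks(scores):
    """Ranks with 1 = largest score (no ties for continuous random draws)."""
    order = np.argsort(-scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

rng = np.random.default_rng(1)
a, b = rng.normal(size=50), rng.normal(size=50)
r_a, r_b = to_ranks(a), to_ranks(b)
inner = float(g(r_a) @ g(r_b))                 # inner product of normalized rank vectors
rho = float(np.corrcoef(r_a, r_b)[0, 1])       # Pearson correlation of the ranks
```

The two quantities agree to floating-point precision, confirming that the g-normalized inner product computes Spearman's rank correlation.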

A.2 λ TRAIN ANALYSIS
The RankEncoder loss, l_r, has a large effect on RankEncoder's re-training process even when the weight parameter, λ_train, is set to a small value. In this section, we show that the two losses, l_cl and l_r, contribute similarly to the total loss, l_total in Eq. 5, when λ_train = 0.05, which is the default setting we use for all experiments in this paper. Figure 9 shows the training loss curves of RankEncoder and SimCSE-unsup with the same random seed; we show the two losses, l_cl and l_r, of RankEncoder separately. SimCSE-unsup's loss rapidly decreases at the beginning and converges to a value less than 0.001. We see a similar pattern in the contrastive loss of RankEncoder, which is the same loss function as SimCSE-unsup's. In contrast, λ_train × l_r starts from a much lower value than l_cl; even without the weight parameter, l_r is still much lower than l_cl. After a few training steps, λ_train × l_r converges close to the value of l_cl. Given that λ_train determines the relative scale of the two losses in Eq. 5, we expect that increasing λ_train would make RankEncoder's loss curve converge higher than SimCSE's loss. This result shows that λ_train = 0.05 is a value that keeps RankEncoder's loss curve similar to the base encoder's loss curve while balancing the weights of the two losses, l_cl and l_r.
Figure 9: The training loss curves of SimCSE and RankEncoder. The x-axis represents the training step, and the y-axis is the scaled loss. After a few training steps, the three losses converge to similar values. Setting λ_train to a small value, 0.05, results in similar weights on the two loss functions of RankEncoder while maintaining the loss curve of the base encoder.
The loss curve of a supervised sentence encoder provides a reference point for comparing the loss curves of unsupervised sentence encoders. In Figure 9, all unsupervised sentence encoders' loss curves show a rapidly decreasing pattern, which may imply overfitting in training. To verify whether this pattern comes from unsupervised training, we show the loss curve of the supervised sentence encoder, SimCSE-sup, in Figure 9. In this experiment, we measure the same contrastive loss used in unsupervised sentence encoders, but in SimCSE-sup's fully supervised training process. We see that the same pattern also holds for SimCSE-sup and verify that the rapidly decreasing pattern is not a problem that only occurs in unsupervised training.

A.3 SIMILARITY DISTRIBUTION OF STS BENCHMARK DATASETS
Semantic textual similarity datasets have different similarity distributions. Since RankEncoder is especially effective for similar sentence pairs, we expect that RankEncoder brings a larger performance increase on datasets with more similar sentence pairs. We show the similarity distribution of each STS dataset in Figure 10. In this figure, we normalize the similarity scores between 0 and 1. The result shows that the similarity distributions of STS12, STS14, and SICK-R are skewed toward high similarity scores, and STS13's similarity distribution has a distinct peak at a high similarity score. From the results in Table 1, we see that RankEncoder is more effective on STS12, STS13, STS14, and SICK-R, which shows the relation between the performance increase and the similarity distribution of each dataset.

A.4 UNIVERSALITY OF RANKENCODER
In this section, we report the detailed experimental results of Figure 5. Table 2 shows the results.

A.5 THE PERFORMANCE OF RANKENCODER ON SIMILAR SENTENCE PAIRS
We report the detailed results of Figure 7 in Table 3.
Table 2: Semantic textual similarity performance of RankEncoder and the base encoders: SimCSE, PromptBERT, and SNCSE. We measure the Spearman's rank correlation between the human-annotated scores and each model's predictions. We report the mean performance and standard deviation of three separate trials with different random seeds.