Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS

Existing semantic textual similarity benchmark datasets only use a single number to summarize how closely a sentence encoder's similarity judgments match humans'. However, it is unclear what kind of sentence pairs a sentence encoder (SE) would consider similar. Moreover, existing SE benchmarks mainly consider sentence pairs with low lexical overlap, so it is unclear how SEs behave when two sentences have high lexical overlap. We introduce a high-quality SE diagnostic dataset, HEROS. HEROS is constructed by transforming an original sentence into a new sentence based on certain rules to form a minimal pair, and the minimal pair has high lexical overlap. The rules include replacing a word with a synonym, an antonym, a typo, or a random word, and converting the original sentence into its negation. Different rules yield different subsets of HEROS. By systematically comparing the performance of over 60 supervised and unsupervised SEs on HEROS, we reveal that most unsupervised sentence encoders are insensitive to negation. We find the datasets used to train an SE are the main determinants of what kind of sentence pairs the SE considers similar. We also show that even if two SEs have similar performance on STS benchmarks, they can behave very differently on HEROS. Our results reveal the blind spot of traditional STS benchmarks when evaluating SEs.


Introduction
Sentence encoders (SEs) are typically evaluated on semantic textual similarity (STS) benchmarks, which consist of sentence pairs with human-labeled similarity scores. The performance of an SE is summarized using Spearman's correlation coefficient between the human-labeled similarity and the cosine similarity obtained from the SE. While STS benchmarks are widely adopted, they have two problems. First, performance on an STS dataset does not reveal much about what kind of sentence pairs the SE would deem similar. Spearman's correlation coefficient only tells us how correlated the sentence embedding cosine similarity and the ground-truth similarity are. However, the idea of what is similar can vary among different people and depends on the task at hand. Therefore, even if the sentence embedding cosine similarity is strongly correlated with the ground-truth similarity, this does not tell us much about the specific type of similarity the SE captures. Prior works mostly resort to a few hand-picked examples to illustrate what kind of sentence pairs an SE would consider similar or dissimilar (Gao et al., 2021; Chuang et al., 2022). But it is hard to fully understand the traits of an SE from only a few hand-picked samples.
The second issue is that sentence pairs in STS-related benchmarks often have low lexical overlap, as shown in Table 1, making it unclear how SEs perform on sentence pairs with high lexical overlap, which arise in real-world applications such as adversarial attacks in NLP. Adversarial samples in NLP are constructed by replacing some words in an original sentence with other words (Alzantot et al., 2018), so the original sentence and the adversarial sample have high lexical overlap. SEs are often adopted to check the semantic similarity between the original sentence and the adversarial sample (Garg and Ramakrishnan, 2020; Li et al., 2020b). If we do not know how SEs perform on high-lexical-overlap sentences, using them to check semantic similarity is meaningless.
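The Spearman-based evaluation described above can be sketched in a few lines. This is a minimal, illustrative implementation (with average ranks for ties), not the evaluation code used in the paper:

```python
def _ranks(xs):
    """1-based ranks; tied values receive the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human_scores, cosine_scores):
    """Spearman's correlation: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(human_scores), _ranks(cosine_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Note how a single coefficient like this collapses all sentence pairs into one number, which is exactly the blind spot this paper targets.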
To address the above issues, we construct and release a new dataset, HEROS: High-lexical overlap diagnostic dataset for sentence encoders, for evaluating SEs. HEROS is composed of six subsets, and each subset includes 1000 sentence pairs with very high lexical overlaps. For the two sentences in a sentence pair, one of them is created by modifying the other sentence based on certain rules, and each subset adopts a different rule. These rules are (1) replacing a word with a synonym, (2) replacing a word with an antonym, (3) replacing a word with a random word, (4) replacing a word with its typo, and (5,6) negating the sentence. By comparing the sentence embedding cosine similarity of sentence pairs in different subsets, we can understand what kind of sentence pairs, when they have high lexical overlaps, would be considered similar by an SE. We evaluate 60 sentence embedding models on HEROS and reveal many intriguing and unreported observations on these SEs.
While some prior works also crafted sentence pairs to understand the performance of SEs, they either do not make their datasets publicly available (Zhu et al., 2018; Zhu and de Melo, 2020) or consider far fewer SEs than our paper does (Barancikova and Bojar, 2020), especially unsupervised SEs. Our contribution is relevant and significant as it provides a detailed understanding of SEs using a new dataset. The contributions and findings of this paper are summarized as follows:
• We release HEROS, a high-quality dataset consisting of 6000 sentence pairs with high lexical overlap. HEROS allows researchers to systematically evaluate what sentence pairs are considered similar by SEs when the lexical overlap is high.
• We evaluate 60 SEs on HEROS and reveal several facts that were never reported before or only studied using a few hand-picked examples.
• We show that supervised SEs trained for different downstream tasks behave differently on different subsets of HEROS, indicating that the SEs for different tasks encode different concepts of similarity.
• We find that all unsupervised SEs are largely insensitive to negation, and that further fine-tuning on NLI datasets makes them acquire the concept of negation.
• We observe that SEs can have very different performances on different subsets of HEROS even if their average STS benchmark performance difference is less than 0.2 points.

HEROS
HEROS consists of six subsets, and each subset consists of 1000 sentence pairs. The six subsets are Synonym, Antonym, Typo, Random MLM, and two types of Negation subsets. In all subsets, each pair of sentences has high lexical overlap, and the two sentences differ in at most one content word; we call these paired sentences minimal pairs. The dataset is constructed from the GoEmotions dataset (Demszky et al., 2020), a dataset for emotion classification collected from Reddit comments. We select one thousand sentences from GoEmotions and replace one word in each original sentence with its synonym, its antonym, a typo of the replaced word, and a random word obtained from BERT (Devlin et al., 2019) mask-infilling. Last, we convert the original sentence into its negation using two different rules. After this process, we obtain six sentences for each of the 1000 selected sentences. We pair the original sentence with a converted sentence to form a minimal pair, which has high lexical overlap. We explain the above process in detail in Section 2.2. Samples from HEROS are shown in Table 2.

Motivation and Intended Usage
Unlike traditional STS benchmark datasets that ask humans to assign ground-truth similarity scores, HEROS does not provide a ground-truth similarity score for sentence pairs. This is because it is difficult to define how similar two sentences should be for them to receive a certain similarity score. Moreover, the concept of similarity differs across downstream tasks. For example, for paraphrase tasks, the two sentences of a Negation minimal pair are semantically different, but for a retrieval task, we might consider them similar.
Thus, instead of providing a ground-truth label for each sentence pair and letting future researchers pursue state-of-the-art results on HEROS, we hope this dataset is used for diagnosing the characteristics of an SE. Specifically, one can compare the average sentence embedding cosine similarity of sentence pairs in different subsets to understand what kind of similarity is captured by the sentence embedding model. Different subsets in HEROS capture different aspects of semantics. Comparing the average cosine similarity of minimal pairs in Synonym and Antonym allows one to see whether replacing a word with an antonym departs further from the original semantics than replacing it with a synonym. The average cosine similarity of minimal pairs in the Negation subsets tells us how negation affects sentence embedding similarity. Typos are common in everyday text; while humans can infer the original word from a typo and recover the original meaning of the sentence, it is interesting to see how typos affect a sentence's similarity with the original sentence. The Random MLM subset tells us how similar sentence embeddings can be when two sentences are semantically different but have high lexical overlap. By comparing the performance of different SEs on different subsets of HEROS, we can further understand the traits of different SEs.
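The intended diagnostic usage can be sketched as follows. Here `encode` stands for any sentence encoder that maps a sentence to a vector (a hypothetical placeholder, not an API from the paper); the function simply averages embedding cosine similarity over the minimal pairs of one subset:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def subset_similarity(encode, pairs):
    """Average embedding cosine similarity over one subset's minimal pairs.

    `encode` is any function mapping a sentence to a vector; `pairs` is a
    list of (original, converted) sentence tuples from one HEROS subset.
    """
    sims = [cosine(encode(s1), encode(s2)) for s1, s2 in pairs]
    return sum(sims) / len(sims)
```

Comparing, e.g., `subset_similarity(encode, synonym_pairs)` against `subset_similarity(encode, negation_pairs)` is the kind of diagnosis the subsets are designed for.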

Dataset construction
2.2.1 Raw dataset preprocessing
HEROS is constructed from GoEmotions. From the sentences in GoEmotions, we only select sentences whose lengths are more than 8 words and less than 25 words. We filter out sentences with cursing, and we use language-tool to filter out sentences it finds ungrammatical. We manually remove the sentences that we find offensive or harmful. The selected sentences are called original sentences in our paper. More details on preprocessing are presented in Appendix A.1.
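A minimal sketch of the length and profanity filters described above. The curse-word list here is a placeholder, and the language-tool grammar check is only indicated in a comment, since it requires an external service:

```python
import re

# Placeholder list: the actual curse-word list used by the authors
# is not specified in the paper.
CURSE_WORDS = {"damn", "hell"}

def keep_sentence(sentence):
    """Keep sentences with more than 8 and fewer than 25 words, no cursing."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not (8 < len(words) < 25):
        return False
    if any(w.lower() in CURSE_WORDS for w in words):
        return False
    # In the paper's pipeline, language-tool would additionally reject
    # sentences it flags as ungrammatical (omitted in this sketch).
    return True
```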

Selecting which word to replace
The next step is to determine which word to replace in the original sentences obtained from preprocessing. The selected word must be (1) semantically significant to the original sentence, so that replacing it with a non-synonym word yields a sentence with a vastly different meaning that would be considered contradictory in an NLI task, and (2) a word that has both synonyms and antonyms, since it will be replaced with each. We only select verbs and adjectives for replacement because changing them greatly alters the semantics of a sentence. Sentences that do not contain a word satisfying both criteria are dropped.

Synonym and Antonym Subsets
The first subset in HEROS includes the minimal pairs formed by replacing a word in the original sentence with its synonym; the second subset includes the minimal pairs formed by replacing a word with its antonym. After selecting the word to be replaced, we determine what synonym and antonym should be used for replacement. There are three principles for replacement: (1) the replacement should fit the context, (2) the synonym should match the word sense of the original word in the sentence, and (3) the collocating words (e.g., prepositions, definite articles) may also need to be modified. These three guiding principles make the process require high proficiency in English, and it cannot be reliably automated. Thus, this process is performed by the authors themselves. We use our proficiency in English and four online dictionaries to select the replacement words: thesaurus.com, the thesaurus of Merriam-Webster, the Online Oxford Collocation Dictionary, and the Cambridge Dictionary. This step took 72 hours.

Random MLM
The third subset in HEROS is obtained by replacing the selected word with a random word predicted by a masked language model. We mask the word with the [MASK] token and use bert-large-uncased to fill in the masked position. We filter out the synonym, the antonym, and their derivational forms from the masked predictions using WordNet and LemmInflect. Additionally, we filter out punctuation and subword tokens that are not complete words. Moreover, we manually filter out mask predictions that are very similar in meaning to the original word when used in the same context. This is because even if a word is not a synonym of the word to be replaced, it may still express the same meaning when used in the same context. For example, "great" is not a synonym of "good" according to WordNet, but their meanings are very similar. The resulting sentences can be ungrammatical in very few cases, but we leave them as is.

Table 2: Samples from HEROS.

Subset | Example (adjective) | Example (verb)
Original | And that is why it is (or was) illegal. | You do not know how much that boosted my self-esteem right now.
Synonym | And that is why it is (or was) illegitimate. | You do not know how much that increased my self-esteem right now.
Antonym | And that is why it is (or was) legal. | You do not know how much that lowered my self-esteem right now.
Random MLM | And that is why it is (or was) here. | You do not know how much that affects my self-esteem right now.
Typo | And that is why it is (or was) illiegal. | You do not know how much that booste my self-esteem right now.
Negation (Main) | And that is not why it is (or was) illegal. | You do know how much that boosted my self-esteem right now.
Negation (Antonym) | And that is why it is (or was) not illegal. | You do not know how much that did not boost my self-esteem right now.
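The candidate-filtering step for the Random MLM subset can be sketched as below. In practice, candidates would come from bert-large-uncased mask-infilling, and the banned lemmas from WordNet and LemmInflect; here both are passed in as plain inputs, so the concrete sets are hypothetical:

```python
def filter_mlm_candidates(candidates, banned_lemmas):
    """Filter masked-LM predictions as in the Random MLM construction.

    `candidates`: predicted tokens for the [MASK] position.
    `banned_lemmas`: lowercase synonyms, antonyms, and derivational forms
    of the replaced word (in the paper, from WordNet and LemmInflect).
    Subword pieces (e.g. '##ly') and punctuation are dropped.
    """
    kept = []
    for tok in candidates:
        if tok.startswith("##"):          # incomplete subword token
            continue
        if not tok.isalpha():             # punctuation / non-words
            continue
        if tok.lower() in banned_lemmas:  # synonym/antonym/derived form
            continue
        kept.append(tok)
    return kept
```

The remaining manual filter (predictions too close in meaning in context, like "great" for "good") has no automatic counterpart in this sketch.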

Typos
The fourth subset in HEROS is constructed by swapping the word to be replaced with its typo. Typos are spelling or typing errors that occur in real life.
If the selected word appears in the Wikipedia list of common misspellings, we replace it with its typo from the list. Otherwise, we create a typo by randomly deleting or replacing one character or swapping two different characters in the word (He et al., 2021).
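The fallback typo rule (random deletion, replacement, or character swap) can be sketched as follows. This is an illustrative version, not the authors' script; a swap of two identical characters may leave the word unchanged, which a production version would retry:

```python
import random
import string

def make_typo(word, rng=None):
    """Create a typo by randomly deleting or replacing one character,
    or swapping two characters, mirroring the fallback rule for words
    absent from the common-misspellings list."""
    rng = rng or random.Random()
    chars = list(word)
    op = rng.choice(["delete", "replace", "swap"])
    if op == "delete" and len(chars) > 1:
        del chars[rng.randrange(len(chars))]
    elif op == "replace":
        i = rng.randrange(len(chars))
        alternatives = [c for c in string.ascii_lowercase if c != chars[i]]
        chars[i] = rng.choice(alternatives)
    elif len(chars) > 1:  # swap two positions
        i, j = rng.sample(range(len(chars)), 2)
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```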

Negation
The last subset in HEROS is constructed by negating the original sentence. Negation can happen at different levels in a sentence, and we create two different types of negation datasets based on where the negation happens. The first type negates the main verb (the action performed by the subject) of the sentence. If the main verb is not negated, we negate it by adding the appropriate auxiliary verb and the word "not". If the main verb is already negated, we directly remove the word "not" but keep the auxiliary verb. We call this type of negation dataset Negation (Main).
The other type of negation dataset is related to the Antonym subset. A minimal pair in the Antonym subset is formed by replacing a word in a sentence with its antonym, which implicitly negates the meaning of the original sentence. Here, we construct another subset called Negation (Antonym), which explicitly negates the word that is replaced with an antonym in the Antonym subset. Given a sentence pair in the Antonym dataset, there is a verb or adjective in the original sentence that is replaced by its antonym in the converted sentence. We directly negate that word in the original sentence by adding "not" in front of an adjective, or adding "not" and an auxiliary verb for a verb. If the word is already negated, we remove the "not". These sentences might sound a bit strange, but they are still understandable. This type of negation dataset is called Negation (Antonym) because the negation is in the same place as the antonym replacement in the Antonym subset.
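A toy version of the adjective case of this rule (insert "not" before the adjective, or remove an existing "not") can be written directly over token lists; verbs, which also require an auxiliary, are not handled in this sketch:

```python
def negate_adjective(tokens, adj_index):
    """Negate an adjective by inserting 'not' before it, or remove an
    existing 'not' if the adjective is already negated. A simplified
    illustration of the Negation (Antonym) rule; verbs need auxiliary
    handling and are omitted here."""
    tokens = list(tokens)
    if adj_index > 0 and tokens[adj_index - 1] == "not":
        del tokens[adj_index - 1]
    else:
        tokens.insert(adj_index, "not")
    return tokens
```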

Comparing 60 Sentence Embedding Models
In this section, we compare the behavior of 60 SEs on HEROS. Detailed information about the SEs, including training data and model size, is listed in Appendix C. We calculate the cosine similarity between each minimal pair in HEROS and normalize it by a baseline cosine similarity to remove the effect of anisotropic embedding space (Ethayarajh, 2019; Li et al., 2020a). The baseline cosine similarity is calculated by averaging the similarity between 250K random sentence pairs (details in Appendix D). We report the average normalized similarity of different subsets in HEROS for each SE. For simplicity, we will use "similarity" to refer to the normalized cosine similarity.

Supervised SEs
We use 30 supervised transformer-based SEs from the SentenceTransformers toolkit (Reimers and Gurevych, 2019, 2020). These SEs are trained with supervision on different datasets for specific downstream tasks. The results are presented in Figure 1, where we group the SEs based on the training dataset they used, denoted in parentheses. Many interesting observations can be drawn from Figure 1; we highlight some of them below.
SEs fine-tuned only on QA datasets are insensitive to negation: The first and second blocks in Figure 1 include SEs obtained by fine-tuning on QA datasets with contrastive learning (Hadsell et al., 2006). In the fine-tuning stage, a positive pair for contrastive learning is a question and its answer. The high similarity on the two Negation subsets can be explained by the type of dataset used for fine-tuning: whether or not the answer is negated, it may still be considered a valid answer to the question. Hence, it is reasonable that a sentence and its negation have high similarity. We also find that replacing a word with a typo causes the resulting sentence to have lower similarity with the original sentence than replacing the word with a synonym. While humans can understand the intended meaning behind a typo, this is not the case for these SEs.
SEs fine-tuned from T5 are less sensitive to typos when the model size scales up: The GTR models (Ni et al., 2021) in the second block of Figure 1 and the ST5 models (Ni et al., 2022) in the fourth block are SEs fine-tuned from T5 (Raffel et al., 2020). Although the two types of models are trained on different datasets, their performance on the Typo subset shares an interesting trend as the model size scales from the smallest base-size model to the largest xxl-size model: the similarity on the Typo subset grows as the model gets larger and can be as high as or higher than the similarity on the Synonym subset, while the similarity on the Synonym subset remains almost unchanged. This shows that larger models better mitigate the negative impact of typos on sentence embeddings.
SEs fine-tuned on paraphrase datasets are extremely sensitive to negations and antonyms: The third block in Figure 1 includes the results of SEs fine-tuned on paraphrase datasets using contrastive learning. The paraphrase data combine several datasets, including premise-hypothesis pairs from NLI datasets and duplicate question pairs. Contrary to the previous paragraph, which shows that fine-tuning only on question-answer pairs makes the model insensitive to negation, we see a completely different result in the third block of Figure 1. We infer that this is mainly due to the NLI datasets used for fine-tuning: negating the original sentence yields a sentence that semantically contradicts the original one and will be treated as a hard negative in contrastive learning. Hence, SEs fine-tuned on NLI data are very sensitive to negation. For the same reason, these SEs are also sensitive to replacing words with antonyms. The only exception is the MiniLM L3 (para) model, which has very high similarity on the Negation subsets, even higher than on the Synonym subset. We hypothesize that this is because the number of parameters and the sentence embedding dimension of this model are too small, limiting the expressiveness of its sentence embeddings.
SEs fine-tuned on all available sentence pair datasets are again insensitive to negations: The models in the last block of Figure 1 are fine-tuned on all available sentence-pair training data, denoted as (all). The training data consist of 32 datasets with a total of 1.17B sentence pairs, including question-answer pairs from QA datasets, premise-hypothesis pairs from NLI datasets, and context-passage pairs from retrieval datasets. In this block, the similarity between sentence pairs from the two Negation subsets is very high, even higher than the similarity on the Synonym subset for most models. This means that when using these models for retrieval, given a source sentence, it is more likely to retrieve the negation of the source sentence than another sentence that differs from the source sentence only by a synonym. While these models are also fine-tuned on NLI datasets, the NLI datasets compose only 0.24% of the whole training data. This makes the models in this block much less sensitive to negation than models fine-tuned mainly on NLI datasets (e.g., ST5) or on paraphrase datasets.
HEROS reveals different characteristics of different SEs: Overall, we see that even though the sentences in HEROS all have high lexical overlap, the similarity scores of the same SE can still differ greatly across HEROS subsets. HEROS also shows that how an SE encodes the concept of similarity is highly related to what it is trained on. This further allows us to understand what kind of similarity is required by the task associated with the training dataset. For example, NLI tasks consider negation pairs dissimilar, while question-answer retrieval tasks consider them similar. Such observations are not revealed by any prior SE benchmark, making HEROS very valuable. It would also be interesting to see whether there is any correlation between an SE's performance on different subsets of HEROS and on different downstream tasks in SentEval (Conneau and Kiela, 2018); we leave this for future work.

Unsupervised SEs
Next, we turn our attention to unsupervised SEs. Unlike supervised SEs that are fine-tuned on labeled pairs of sentences, unsupervised SEs are trained using specially designed methods that do not use labeled sentence pairs. Most of these unsupervised SEs can be further fine-tuned on NLI datasets to further improve the performance on the STS benchmarks (Gao et al., 2021;Jiang et al., 2022;Jian et al., 2022). We show the result on HEROS of 7 different types of unsupervised SEs and their derived supervised SEs in Figure 2.
For completeness, we also report the performance of sentence embeddings obtained by averaging the GloVe embeddings (Pennington et al., 2014) of the words in the sentence. The result is presented in the first row of Figure 2. We observe that a sentence and its negation have very high similarity, much higher than when one word is replaced with its synonym or antonym. This shows that negation words contribute very little to the sentence embedding obtained from averaging GloVe embeddings.
Unsupervised SEs are insensitive to negation: Unsupervised SEs, denoted with unsup in Figure 2, have high similarity on the Negation subsets, sometimes even higher than on the Synonym subset. SNCSE models are an exception: their Negation subsets may have lower similarity. SNCSE uses the dependency tree of a sentence to convert it into its negation as a "soft negative" in contrastive learning, but it needs a dependency parser, making it not truly unsupervised. Hence, we use unsup* to denote SNCSE models in Figure 2. The lower similarity on the Negation subsets is not consistent across SNCSE models, possibly because the negation procedure in the SNCSE implementation does not consider negative contractions, resulting in low-quality augmented data.
Further supervised fine-tuning on NLI datasets significantly changes the model's behavior on HEROS: Fine-tuning unsupervised SEs on NLI datasets (denoted with sup in Figure 2) leads to a significant drop in similarity on the Negation and Antonym subsets and an increase in similarity on the Synonym subset. This shows that supervised fine-tuning greatly changes how SEs encode similarity. An interesting trend is that after fine-tuning, similarity on the Typo subset increases for most models, likely because the SE better captures semantic similarity and pays less attention to superficial lexical form.
Almost all SEs rate Negation (Main) as less similar than Negation (Antonym): Recall that the Negation (Main) subset is created by negating the main verb, while the Negation (Antonym) subset does not always negate the main verb. The lower similarity on the Negation (Main) subset shows that SEs consider negating the main verb to produce a sentence less similar to the original than negating other positions in the sentence. This implies that SEs can capture the level of the verb in the dependency tree of the sentence and consider negating the main verb more influential to sentence embeddings.
SEs with close performance on STS benchmarks can behave very differently on HEROS: We find that two SEs achieving similar average performance on STS benchmarks (STS 12-17, STSb, and SICK-R) can perform very differently on HEROS. For example, RoBERTa large (all) and DistilRoBERTa base (para) in Figure 1 have similar average STS scores (81.07 and 81.12, respectively), but the former has very high similarity on the Negation subsets while the latter does not. The same holds for SNCSE roberta-large in Figure 2 and mpnet base (para) in Figure 1, whose average STS scores are 81.77 and 81.57, respectively. This shows that HEROS can reveal traits of SEs that traditional STS benchmarks cannot identify.
Conclusion
We introduce HEROS, a new dataset of 6000 human-constructed sentence pairs with high lexical overlap. It is composed of six subsets that capture different linguistic phenomena. Evaluating an SE on HEROS reveals what kinds of sentence pairs the SE considers similar. HEROS fills a void in current SE evaluation methods, which use only correlation coefficients with human ratings or performance on downstream tasks to summarize an SE, and mainly use sentence pairs with low lexical overlap. We use HEROS to evaluate 60 models and reveal numerous new observations. We believe that HEROS can aid in interpreting SE behavior and in comparing the performance of different SEs.

Limitations
The SEs in this paper are mainly transformer-based, and we are not sure whether the observations hold for other SEs. However, considering that transformer-based SEs dominate the current NLP community, we consider it reasonable to evaluate only 59 transformer-based SEs. Another limitation is that the sentences in HEROS are drawn from Reddit, an online forum whose texts tend to be casual and informal, so the sentence pairs in HEROS are also more informal. Users should note this characteristic of HEROS and be aware that results obtained on it may differ from results obtained on more formal texts. An additional limitation is that there are many possible rules for creating sentence pairs beyond the six subsets included in HEROS, and our paper cannot include them all. Finally, during the construction of HEROS, we remove sentences that language-tool flags as ungrammatical, so our results may not generalize to ungrammatical sentences.

Ethics Statement
The main ethical concern in this paper is our dataset, HEROS. HEROS is constructed from an existing dataset, GoEmotions. As listed in the model card of GoEmotions, it contains biases present on Reddit and some offensive content. As stated in Section 2.2.1, we have tried our best to remove all content that we find possibly offensive to users. We cannot guarantee that our standard of what is unbiased and harmless fits everyone.
Thus, we remind future users of HEROS to be aware of such possible harms. To ensure the accessibility of our paper, we have used an online resource to carefully check that the figures in the paper are interpretable for readers of different backgrounds.

A.1 Dataset Preprocess
The sentences in GoEmotions are already anonymized, where the names of people are replaced with a special [NAME] token, so we do not need to further perform anonymization. We filter out all sentences that have more than one [NAME] token and replace all [NAME] with a gender-neutral name "Jackie".

A.2 Dataset License
HEROS is constructed based on the GoEmotions dataset (Demszky et al., 2020). GoEmotions is released under the Apache 2.0 license, so our modification and redistribution of GoEmotions are permitted by the dataset license. Our dataset, HEROS, is also released under the Apache 2.0 license.

B Comparing the Lexical Overlaps of Different Datasets
In Table 1, we show basic statistics of three different datasets. We use ROUGE F1 and Levenshtein distance to quantify the lexical overlap between the sentence pairs of a dataset. The statistics of HEROS are averaged over its subsets, and those of STS-b and SICK-R are calculated on the test sets. R1, R2, and RL are the ROUGE F1 scores between the sentence pairs (R1 and R2: unigram and bigram overlap; RL: longest common subsequence). We use the python rouge 1.0.1 implementation to calculate the ROUGE scores.
Lev is the average normalized token-level Levenshtein distance over the sentence pairs, where the normalized Levenshtein distance is the Levenshtein distance between two sentences divided by the length of the longer sequence in the pair. We first tokenize the sentences with the tokenizer of bert-base-uncased and calculate the Levenshtein distance between the token ids of the sentence pairs. The normalization makes the distance fall in the range [0, 1].
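The Lev statistic can be reproduced with a standard dynamic-programming edit distance. This sketch operates on any token sequences; the paper uses bert-base-uncased token ids:

```python
def levenshtein(a, b):
    """Token-level Levenshtein (edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(tokens_a, tokens_b):
    """Distance divided by the longer sequence length, so it lies in [0, 1]."""
    longer = max(len(tokens_a), len(tokens_b))
    return levenshtein(tokens_a, tokens_b) / longer if longer else 0.0
```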
The average sentence length is the average number of tokens per sentence, and the tokens are obtained by using the tokenizer of bert-base-uncased.

C Supplementary Materials for Sentence Encoders
C.1 Supervised Sentence Encoders
Table 3 shows the number of parameters and the sentence embedding dimension of the SEs used in this paper.

C.1.1 Datasets Used to Train Supervised SEs
The datasets indicated in Figure 1 are listed as follows: community QA websites.

C.2 Unsupervised Sentence Encoders
The full list of unsupervised SEs and their supervised derivatives that we compare is: SimCSE (Gao et al., 2021), DiffCSE (Chuang et al., 2022), PromptBERT (Jiang et al., 2022), SNCSE, RankEncoder (Seonwoo et al., 2022), and AudioCSE and VisualCSE (Jian et al., 2022). For all unsupervised SEs shown in Figure 2, base-size models have roughly 110M parameters and large-size models roughly 335M. The bert models shown in Figure 2 are all uncased models.

D Normalization
For each SE, we first calculate the cosine similarity between each minimal pair in HEROS. However, if the embedding space is highly anisotropic (Ethayarajh, 2019; Li et al., 2020a), the cosine similarity between two random sentences is expected to be rather high. To remove the effect of an anisotropic embedding space and better interpret the results, we normalize the cosine similarity by a baseline cosine similarity. The baseline cosine similarity is calculated by the following procedure: we split the 1000 original sentences into the first 500 and the last 500 sentences and calculate the average cosine similarity between the sentence embeddings of these 500 × 500 random sentence pairs. This average cosine similarity, cos_avg, gives us an idea of how similar the sentence embeddings of two randomly selected sentences can be. Last, we normalize the cosine similarity of the minimal pairs to lessen the effect of anisotropy by the following formula:

cos_normalized = (cos_orig − cos_avg) / (1 − cos_avg),

where cos_orig is the original cosine similarity of a sentence pair and cos_normalized is the similarity after normalization.
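A minimal sketch of this procedure, assuming the normalization maps the random-pair baseline cos_avg to 0 and perfect similarity to 1 (our reading of the description above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def baseline_similarity(embeddings):
    """cos_avg: average cosine similarity over first-half x second-half
    pairs of the original sentences (500 x 500 pairs in the paper)."""
    half = len(embeddings) // 2
    first, second = embeddings[:half], embeddings[half:]
    total = sum(cosine(u, v) for u in first for v in second)
    return total / (len(first) * len(second))

def normalize(cos_orig, cos_avg):
    """Map the baseline similarity to 0 and perfect similarity to 1.
    Assumes cos_avg < 1."""
    return (cos_orig - cos_avg) / (1.0 - cos_avg)
```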

E Runtime and Computation Resource
The experiments in Section 3, except for T5 xl and xxl, are conducted on an NVIDIA 1080 Ti, and it takes less than one hour to run all of them. The T5 xl and xxl models cannot be loaded on a 1080 Ti, so we use a V100 for the SEs whose base models are T5 xl and xxl; this takes less than 15 minutes.