Contrastive Learning-based Sentence Encoders Implicitly Weight Informative Words

The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning using contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastive-based sentence encoders implicitly weight words based on information-theoretic quantities; that is, more informative words receive greater weight, while others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of word embedding reflects the information gain associated with the distribution of surrounding words. We also conduct comprehensive experiments using various models, multiple datasets, two methods to measure the implicit weighting of models (Integrated Gradients and SHAP), and two information-theoretic quantities (information gain and self-information). The results provide empirical evidence that contrastive fine-tuning emphasizes informative words.


Introduction
Embedding a sentence into a point in a high-dimensional continuous space plays a foundational role in natural language processing (NLP) (Arora et al., 2017; Reimers and Gurevych, 2019; Chuang et al., 2022, etc.). Such sentence embedding methods can also embed text of various types and lengths, such as queries, passages, and paragraphs; therefore, they are widely used in diverse applications such as information retrieval (Karpukhin et al., 2020; Muennighoff, 2022), question answering (Nguyen et al., 2022), and retrieval-augmented generation (Chase, 2023).
One of the earliest successful sentence embedding methods is additive composition (Mitchell and Lapata, 2010; Mikolov et al., 2013), which embeds a sentence (i.e., a sequence of words) by summing its static word embeddings (SWEs; Mikolov et al., 2013; Pennington et al., 2014). Moreover, weighting each word by the inverse of its frequency considerably improved the quality of the sentence embeddings, as exemplified by TF-IDF (Arroyo-Fernández et al., 2019) and smoothed inverse frequency (SIF; Arora et al., 2017).
Recent sentence embeddings are built on masked language models (MLMs; Devlin et al., 2019; Liu et al., 2019; Song et al., 2020). Although sentence embeddings obtained by additive composition of MLMs' word embeddings are inferior to those of SWEs (Reimers and Gurevych, 2019), fine-tuning MLMs with contrastive learning objectives has elevated the quality of sentence embeddings (Reimers and Gurevych, 2019; Gao et al., 2021; Chuang et al., 2022, etc.) and is now the de facto standard. Interestingly, these contrastive-based sentence encoders do not employ explicit word weighting, which is the key to the SWE-based methods.
In this paper, we demonstrate that one reason for the success of contrastive-based sentence encoders is implicit word weighting. Specifically, by combining explainable AI (XAI) techniques and information theory, we demonstrate that contrastive-based sentence encoders implicitly weight each word according to two information-theoretic quantities (Figure 1). To measure the contribution (i.e., implicit weighting) of each input word to the output sentence embedding within the encoders, we used two XAI techniques: Integrated Gradients (IG; Sundararajan et al., 2017) and Shapley additive explanations (SHAP; Lundberg and Lee, 2017) (Section 3.1). To measure the information-theoretic quantities of each word, we used the two simplest quantities, the information gain KL(w) and the self-information − log P(w) (Section 3.2). To demonstrate our hypothesis, we first provide a theoretical connection between contrastive learning and information gain (Section 4). We then conducted comprehensive experiments with a total of 12 models and 4 datasets, which found a strong empirical correlation between the encoders' implicit word weighting and the information-theoretic quantities (Section 5). The results of our study provide a bridge between SWE-era explicit word weighting techniques and the implicit word weighting of recent contrastive-based sentence encoders.

Contrastive-Based Sentence Encoders
This section provides an overview of contrastive-based sentence encoders such as SBERT (Reimers and Gurevych, 2019) and SimCSE (Gao et al., 2021). These models are built by fine-tuning MLMs with contrastive learning objectives.
Input and Output: As shown in Figure 1, a contrastive-based sentence encoder m : R^{n×d} → R^d calculates a sentence embedding s ∈ R^d from a sequence of input word embeddings W = [w_1, ..., w_n] ∈ R^{n×d} corresponding to the input words and special tokens. For example, for the sentence "I like playing the piano," the input to the encoder is the sequence of word embeddings for its tokens together with the special tokens, and the sentence embedding s is calculated as s = m(W) = (m_1(W), ..., m_d(W)), where m_i : R^{n×d} → R denotes the computation of the i-th element of s.

Architecture:
The encoder architecture consists of (i) Transformer layers (the same as the MLM architecture) and (ii) a pooling layer. (i) The Transformer layers update the input word embeddings into contextualized word embeddings w_i → e_i. (ii) The pooling layer then pools the n representations [e_1, ..., e_n] into a single sentence embedding s ∈ R^d. There are two major pooling methods, MEAN and CLS: MEAN averages the contextualized word embeddings, and CLS uses only the embedding of the [CLS] token after applying an MLP on top of it.
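The two pooling methods can be sketched in a few lines of NumPy; the contextualized embeddings and the pooler weights below are random stand-ins for actual Transformer outputs, and the single tanh layer is a hypothetical BERT-pooler-style head, used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
E = rng.normal(size=(n, d))  # contextualized embeddings [e_1, ..., e_n]

# MEAN pooling: average the contextualized word embeddings.
s_mean = E.mean(axis=0)

# CLS pooling: take the [CLS] embedding (position 0) and apply an MLP
# on top of it (here a hypothetical single tanh layer).
W_pool = rng.normal(size=(d, d))
s_cls = np.tanh(E[0] @ W_pool)

# Both strategies yield a single d-dimensional sentence embedding.
assert s_mean.shape == (d,) and s_cls.shape == (d,)
```
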
Contrastive fine-tuning: The contrastive fine-tuning of MLMs is briefly described below. In contrastive learning, positive pairs (s, s_pos), i.e., semantically similar pairs of sentence embeddings, are brought closer, while negative pairs (s, s_neg) are pushed apart in the embedding space. Positive examples s_pos and negative examples s_neg for a sentence embedding s are created in different ways depending on the method. For instance, in unsupervised SimCSE (Gao et al., 2021), a positive example s_pos is created by embedding the same sentence as s with a different dropout mask, and a negative example s_neg is created by embedding a sentence randomly sampled from the training corpus.
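As a sketch, the in-batch contrastive objective used by SimCSE-style methods (InfoNCE) can be written directly in NumPy. The embeddings, batch size, temperature, and the small noise mimicking dropout-based positives are all hypothetical stand-ins for actual encoder outputs, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchors, positives, temperature=0.05):
    """In-batch InfoNCE loss, sketched in NumPy.

    anchors, positives: (batch, d) sentence embeddings; the i-th positive
    belongs to the i-th anchor, and the other in-batch positives act as
    negatives for it.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / temperature                  # (batch, batch) similarities
    # Row-wise log-softmax; the diagonal holds each anchor's positive pair.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Dropout-style positives: the same "embedding" plus tiny noise, standing in
# for encoding the same sentence twice with different dropout masks.
s = rng.normal(size=(8, 16))
s_pos = s + 0.01 * rng.normal(size=s.shape)
s_rand = rng.normal(size=s.shape)                 # unrelated embeddings

# True positives yield a much lower loss than random pairings.
assert info_nce(s, s_pos) < info_nce(s, s_rand)
```
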

Analysis Method
We compare the implicit word weighting within contrastive-based sentence encoders with information-theoretic quantities of words. Here we introduce (i) the quantification of the implicit word weighting within the encoders using two XAI techniques (Section 3.1) and (ii) two information-theoretic quantities of words (Section 3.2).

Implicit Word Weighting within Encoder
Contrastive-based sentence encoders are not given explicit word weighting externally but are expected to implicitly weight words through their complicated internal networks. We quantify the implicit word weighting using two widely used feature attribution methods (Molnar, 2022): Integrated Gradients (Sundararajan et al., 2017) and Shapley additive explanations (Lundberg and Lee, 2017).

Integrated Gradients (IG)
Integrated Gradients (IG) is a widely used XAI technique for calculating the contribution of each input feature to the output of a neural model. IG has two major advantages: (i) it is based on gradient calculations and thus can be applied to arbitrary neural models; and (ii) it satisfies several desirable properties; for example, the sum of the contributions of the input features matches the output value (Completeness; Sundararajan et al., 2017). It has also been actively applied to the analysis of MLM-based models (Hao et al., 2021; Prasad et al., 2021; Bastings et al., 2022; Kobayashi et al., 2023).
The formal definition of IG is as follows: Let f : R^{n×d} → R be a model (e.g., one element of the sentence encoder) and X′ ∈ R^{n×d} be a certain input (e.g., word vectors). IG calculates a contribution score IG_{i,j} for the (i, j) element X′[i, j] of the input (e.g., each element of each input word vector) to the output f(X′):

IG_{i,j} := (X′[i, j] − B[i, j]) ∫_0^1 ∂f(B + α(X′ − B)) / ∂X[i, j] dα,  (1)

where B denotes a baseline input; often an uninformative or neutral input is employed. Notably, IG decomposes the output value into the sum of the contribution scores of each input element:

f(X′) − f(B) = Σ_{i,j} IG_{i,j}.  (2)
Application to the sentence encoder: We aim to measure the contribution of each input word to the output sentence embedding. However, when applying IG to the sentence encoder m (whose k-th element is m_k) and the input W = [w_1, ..., w_n], it can only compute the fine-grained contribution score IG^{(k)}_{i,j} of the j-th element of each input word vector w_i to the k-th element of the sentence vector s = m(W). Thus, we aggregate the contribution scores across all the (j, k) pairs by the Frobenius norm:

c_i := ∥ (IG^{(k)}_{i,j})_{j,k} ∥_F,  (3)

where we used the sequence of input word vectors for an uninformed sentence " [SEP]" as the baseline input B. In addition, the word contribution c_i is normalized with respect to the sentence length n so that contributions can be compared equally among words in sentences of different lengths:

c̃_i := n · c_i / Σ_{i′=1}^{n} c_{i′}.  (4)
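This procedure can be sketched end to end on a toy model. The linear "encoder" stands in for the Transformer layers, the zero baseline stands in for the uninformative baseline input, and gradients are estimated by finite differences rather than autodiff; all of these are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # 4 "words", 8-dimensional embeddings

# Toy "encoder": mean pooling after a fixed linear map (a hypothetical
# stand-in for the Transformer layers).
A = rng.normal(size=(d, d))
def encode(W):
    """Map word embeddings W (n, d) to a sentence embedding s (d,)."""
    return (W @ A).mean(axis=0)

W = rng.normal(size=(n, d))   # input word embeddings
B = np.zeros_like(W)          # baseline ("uninformative" input)

def integrated_gradients(f, X, B, steps=64, eps=1e-5):
    """IG via a midpoint Riemann sum over the path from B to X,
    with finite-difference gradients. Returns shape (n, d, d_out)."""
    IG = np.zeros(X.shape + f(X).shape)
    for alpha in (np.arange(steps) + 0.5) / steps:
        P = B + alpha * (X - B)
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                Pp = P.copy()
                Pp[i, j] += eps
                IG[i, j] += (f(Pp) - f(P)) / eps
    return IG * (X - B)[..., None] / steps

IG = integrated_gradients(encode, W, B)

# Completeness: contributions sum to f(W) - f(B) for every output dim.
assert np.allclose(IG.sum(axis=(0, 1)), encode(W) - encode(B), atol=1e-3)

# Aggregate over the (j, k) pairs with the Frobenius norm, then normalize
# so the word contributions sum to the sentence length n (Eqs. 3-4).
c = np.linalg.norm(IG, axis=(1, 2))
c = c * n / c.sum()
print(c)
```

For the linear toy model the Riemann sum is exact; for a real encoder the gradients would come from autodiff and more integration steps.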

SHapley Additive exPlanations (SHAP)
Shapley additive explanations (SHAP) is a feature attribution method based on the Shapley value (Shapley, 1953). Similar to IG, SHAP satisfies a desirable property: it linearly decomposes the model output into the contributions of each input (Lundberg and Lee, 2017). Its formal definition and its application to the word weighting calculation of contrastive-based sentence encoders are shown in Appendix D.
Though we can apply SHAP to analyze sentence encoders, SHAP is often claimed to be unreliable (Prasad et al., 2021). Thus, we discuss the experimental results using IG in the main text and show the results using SHAP in Appendix E.

Information-Theoretic Quantities
Here, we introduce two information-theoretic quantities that represent the amount of information a word conveys.

Information Gain KL(w)
The first quantity is the information gain, which measures how much a probability distribution (e.g., the unigram distribution over some sentences) changes after observing a certain event (e.g., a certain word). The information gain of observing a word w in a sentence is defined as KL(w) := KL(P_sent(·|w) ∥ P(·)), where P_sent(·|w) is the word frequency distribution in sentences containing the word w, and P(·) is the unconditional distribution. Intuitively, KL(w) represents the extent to which the topic of a sentence is determined by observing the word w in the sentence. For example, if w is "the", KL("the") is small because the information that a sentence contains "the" does not greatly change the word frequency distribution in the sentence from the unconditional distribution (P_sent(·|"the") ≈ P(·)). On the other hand, if w is "NLP", KL("NLP") is much larger than KL("the") because the information that a sentence contains "NLP" is expected to significantly change the word frequency distribution (P_sent(·|"NLP") ̸= P(·)). Recently, Oyama et al. (2023) showed that KL(w) is encoded in the norm of SWEs. Also, the χ²-measure, a quantity similar to KL(w), is useful for keyword extraction (Matsuo and Ishizuka, 2004). We provide a theoretical connection between KL(w) and contrastive learning in Section 4.
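The definition of KL(w) can be computed directly from counts. The following sketch uses a tiny hypothetical four-sentence corpus (illustrative only) and reproduces the intuition above: a topical word pins down the sentence distribution far more than a function word.

```python
from collections import Counter
import math

# Hypothetical toy corpus, for illustration only.
sentences = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "nlp models embed the sentence".split(),
    "contrastive nlp models weight words".split(),
]

# Unconditional unigram distribution P(.) over the whole corpus.
all_counts = Counter(w for s in sentences for w in s)
N = sum(all_counts.values())
P = {w: c / N for w, c in all_counts.items()}

def kl_w(word):
    """KL(w) = KL(P_sent(.|word) || P(.)).

    P_sent(.|word) is estimated from word frequencies restricted to the
    sentences that contain `word`.
    """
    cond = Counter(w for s in sentences if word in s for w in s)
    M = sum(cond.values())
    return sum((c / M) * math.log((c / M) / P[w]) for w, c in cond.items())

# A topical word changes the distribution more than a function word.
assert kl_w("nlp") > kl_w("the")
print(kl_w("nlp"), kl_w("the"))
```
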

Self-Information − log P (w)
The second quantity, which naturally represents the information of a word, is the self-information − log P(w). − log P(w) is based on the inverse of word frequency and is in fact very similar to the word weighting techniques used in SWE-based sentence encoding methods such as TF-IDF (Arroyo-Fernández et al., 2019) and SIF weighting (Arora et al., 2017). Note that the information gain KL(w) introduced in Section 3.2.1 is also correlated with the inverse of word frequency (Oyama et al., 2023), so both quantities introduced in this section are close to these word weighting techniques. A detailed comparison between the two quantities and the word weighting techniques is shown in Appendix B. If the contrastive-based sentence encoders' word weighting is close to KL(w) and − log P(w), the post-hoc word weighting used in SWE-based methods is implicitly learned via contrastive learning.
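For comparison, self-information is a one-liner over unigram counts. The corpus below is a toy example (illustrative only); the point is that rare words carry more self-information than frequent ones, mirroring inverse-frequency weightings such as TF-IDF or SIF.

```python
from collections import Counter
import math

# Hypothetical toy corpus; -log P(w) from unigram frequencies.
corpus = "the cat sat on the mat the dog chased the cat".split()
counts = Counter(corpus)
N = len(corpus)

def self_information(w):
    """Self-information -log P(w) under the unigram distribution."""
    return -math.log(counts[w] / N)

# Rare words carry more self-information than frequent ones.
assert self_information("dog") > self_information("the")
print({w: round(self_information(w), 3) for w in counts})
```
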

Theoretical Analysis
This section provides a brief explanation of the theoretical relationship between KL(w) and contrastive learning. Given a pair of sentences (s, s′), contrastive learning can be regarded as the problem of discriminating whether a sentence s′ is a positive (semantically similar) or negative (not similar) example for another sentence s. We reframe this discrimination problem in terms of word frequency distributions. After observing w in s, the positive example is likely to contain words that co-occur with w; i.e., the word distribution of the positive example likely follows P_sent(·|w). Conversely, the negative example is likely to contain random words from the corpus regardless of the observation of w; i.e., the word distribution of the negative example likely follows P(·). Hence, KL(P_sent(·|w) ∥ P(·)) = KL(w) approximately represents the objective of the discrimination problem (i.e., contrastive learning). See Appendix C for a more formal explanation.

Experiments
In this section, we investigate the empirical relationship between the implicit word weighting within contrastive-based sentence encoders (quantified by IG or SHAP) and the information-theoretic quantities (KL(w) or − log P(w)).

Experimental Setup
Models: We used the following 9 sentence encoders: SBERT (Reimers and Gurevych, 2019), Unsupervised/Supervised SimCSE (Gao et al., 2021), and DiffCSE (Chuang et al., 2022), each in both BERT- and RoBERTa-based versions, as well as all-mpnet-base-v2. As baselines, we also used the corresponding pre-trained MLMs.

Table 1: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the information gain KL(w) for the STS-B dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.
Dataset: We used STS-Benchmark (Cer et al., 2017), a widely used dataset for evaluating sentence representations, as the source of input sentences to the encoders and for calculating − log P(w) and KL(w). We used the validation set, which includes 3,000 sentences. For generalizability, we also conducted experiments using the Wikipedia, STS12 (Agirre et al., 2012), and NLI datasets (Bowman et al., 2015; Williams et al., 2018), which are shown in Appendix F.2.
Experimental procedure: First, we fed all sentences to the models and calculated the word weightings by IG or SHAP (Section 3.1). Then we applied OLS regression of the calculated word weightings on − log P(w) or KL(w) (Section 3.2) for each model. Although we experimented with all four possible combinations of the two XAI methods and the two quantities for each model, we report here only the results for the combination of IG and KL(w). The other results are shown in Appendix E.
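The regression step can be sketched as follows. The per-word IG weightings and KL(w) values below are synthetic stand-ins (a linear signal plus noise, purely for illustration); in the actual experiments these come from the encoders and the corpus statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: per-word KL(w) values and IG weightings.
kl = rng.gamma(2.0, 1.0, size=500)
weights = 0.8 * kl + rng.normal(scale=0.3, size=500)

# OLS of word weightings on KL(w): slope beta and R^2, as reported in Table 1.
X = np.column_stack([np.ones_like(kl), kl])     # intercept + KL(w)
coef, *_ = np.linalg.lstsq(X, weights, rcond=None)
pred = X @ coef
ss_res = ((weights - pred) ** 2).sum()
ss_tot = ((weights - weights.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
beta = coef[1]
print(f"R^2 = {r2:.3f}, beta = {beta:.3f}")
```

A high R^2 with a large positive beta is the pattern the paper reports for contrastive fine-tuned encoders.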

Quantitative Analysis
Table 1 lists the coefficient of determination (R^2) and the regression coefficient (β) of the linear regression of IG on KL(w). Figure 2 shows the plots of the word weightings and their regression lines for BERT and Unsupervised SimCSE. Table 1 shows that R^2 and β are much higher for contrastive-based encoders than for the pre-trained models.3 Similar trends were obtained with the other XAI method, the other information-theoretic quantity, and the other datasets (Appendices E, F.2). These results indicate that contrastive learning for sentence encoders induces word weighting according to the information-theoretic quantities (Figure 2). In other words, contrastive-based encoders acquired an inner mechanism to highlight more informative words. Furthermore, given that KL(w) and − log P(w) are close to the weighting techniques employed in SWE-based sentence embeddings (Section 3.2), these results suggest that contrastive-based sentence encoders learn implicit weightings similar to the explicit ones of the SWE-based methods.

Qualitative Analysis
Figure 3 shows the word weighting within BERT ( ) and Unsupervised SimCSE (▲) and KL(w) (■) for three sentences: "a man is playing guitar.", "a man with a hard hat is dancing.", and "a young child is riding a horse.". The contrastive-based encoder (Unsup. SimCSE; ▲) has word weighting more similar to KL(w) (■) than the MLM (BERT; ), which is consistent with the R^2 in Table 1. Also, the contrastive-based encoder (▲) tends to weight input words more extremely than the MLM ( ), which is consistent with the β in Table 1. For example, weights for nouns such as "guitar", "hat", "child", and "horse" are enhanced, and weights for non-informative words such as "is" and "." are discounted by contrastive learning ( → ▲). On the other hand, weights for words such as "a" and "with", whose KL(w) is very small, do not change much. Investigating the effect of POS on word weighting, which is not captured by KL(w) or − log P(w), is an interesting future direction.

3 Exceptionally, R^2 does not change much from BERT (MEAN) to SBERT, while β indeed increases.

Figure 3: Sentence-level examples of the word weighting of BERT (CLS) and Unsupervised SimCSE-BERT using IG. As with IG, KL(w) is normalized so that the sum of KL(w) equals the sentence length for visibility. For words like "man", "child", "guitar", "is", "hat", "horse", and "." the word weightings of the contrastively fine-tuned model (▲) are closer to KL(w) (■) than those of the pre-trained model ( ), which means that contrastive fine-tuning induces word weighting by KL(w).

Conclusion
We showed that contrastive learning-based sentence encoders implicitly weight informative words based on information-theoretic quantities. This indicates that recent sentence encoders learn implicit weightings similar to the explicit ones used in SWE-based methods. We also provided a theoretical proof that contrastive learning induces models to weight each word by KL(w). These findings provide insight into why contrastive-based sentence encoders succeed in a wide range of tasks, such as information retrieval (Muennighoff, 2022) and question answering (Nguyen et al., 2022), where emphasizing informative words is effective. Beyond sentence encoders, investigating the word weighting of retrieval models is an interesting future direction.

Limitations
There are three limitations in this study. The first is the limited choice of baseline inputs for IG: we only tested the baseline input with PAD tokens explained in Section 3.1.1. Although there is no consensus on the appropriate baseline inputs of IG for contrastive-based sentence encoders, a comparison of results with different baseline inputs is left for future work. Second, our findings do not cover contrastive text embeddings from retrieval models. Analyzing contrastive-based retrieval models such as DPR (Karpukhin et al., 2020) and Contriever (Izacard et al., 2022), and investigating the effect of text length, would be an interesting future direction. Third are the assumptions made in the sketch of the proof in Appendix C. In our proof, we assume that the similarity of sentence embeddings is calculated via the inner product; in practice, however, cosine similarity is often used instead. Also, we do not consider the contextualization effect of Transformer models on the word embeddings in the proof. Theoretical analysis using cosine similarity or considering the contextualization effect is left for future work.

Ethics Statement
Our study showed that sentence encoders implicitly weight input words by frequency-based information-theoretic quantities. This suggests that sentence encoders can be affected by biases in their training datasets, and our finding is a step toward developing trustworthy sentence embeddings free of social or gender biases.

C The Theoretical Relationship between Contrastive Learning and Information Gain
Formally, the following theorem holds: Theorem 1. Let S be the set of sentences and {((s, s′), C)} be the dataset constructed from S, where s ∼ S, s′ = s (when C = 1), and s′ ∼ S (when C = 0). Suppose that the sentence encoder parametrized by θ takes a sentence s = (w_1, ..., w_|s|) as input and returns word embeddings (w_1, ..., w_|s|) and their mean-pooled sentence embedding s = (1/|s|) Σ_{i=1}^{|s|} w_i as output, and that the contrastive fine-tuning maximizes L_contrastive(θ), the log-likelihood of P(C|(s, s′); θ) = σ(⟨s, s′⟩). Then, at the lower bound of the optimal value of L_contrastive(θ), (1/2)∥w∥^2 ≈ KL(w). In other words, KL(w) is encoded in the norm of the word embeddings, which construct the sentence embedding. The proof is shown in Section C.2.

C.1 Assumptions
Dataset: Let S be the training corpus for contrastive learning. When training a sentence encoder with contrastive fine-tuning, the set of sentence pairs {(s, s′)} is used as the training data, and the sentence encoder is trained with an objective that discriminates whether a sentence pair is a positive example (a pair of semantically similar sentences) or a negative example (a pair of semantically dissimilar sentences). For the theoretical analysis, we make the following three reasonable assumptions:

1. An anchor sentence s is sampled from S.

2. Positive example: For a sentence s′ semantically similar to s, s itself is used (s′ = s).

3. Negative example: For a sentence s′ semantically dissimilar to s, a sentence randomly sampled from S is used (s′ ∼ S).
Assumption 1 is a commonly used setting for the training data of contrastive fine-tuning of sentence encoders (Gao et al., 2021; Chuang et al., 2022, etc.). Assumption 2 covers data augmentation techniques based on perturbations such as token shuffling (Yan et al., 2021) or dropout augmentation (Gao et al., 2021). Noting that these data augmentations do not affect the word frequency distributions, this simplification has little effect on the theory. Assumption 3 is the simplest way to create a dissimilar sentence s′ for the anchor sentence s, especially in unsupervised settings (Yan et al., 2021; Gao et al., 2021; Chuang et al., 2022, etc.). In typical supervised settings, hard negatives are often created with NLI supervision, and incorporating these settings into the theory is important future work. Also, for simplicity, we assume the sentence length is fixed to n in the proof (|s| = |s′| = n).
Model: We make the following three assumptions on contrastive-based sentence encoders (Section 2):

1. The sentence embedding is constructed by mean pooling (Section 2).

2. The inner product is used to calculate the similarity of the sentence embeddings s and s′.

3. The sentence embedding is not normalized.

For Assumption 2, one can indeed use the inner product instead of cosine similarity, as discussed in Gao et al. (2021), for example. Using the inner product for sentence embedding similarity is an important assumption for the proof, and extending our theory to cosine similarity is a challenging and interesting future direction. Assumption 3 makes the conclusion of our theory, which concerns the norm of word embeddings, meaningful. Typical contrastive sentence encoders compute sentence embeddings without normalization (Reimers and Gurevych, 2019; Gao et al., 2021, etc.).

C.2 Proof
First, P(C = 1|(s, s′); θ) has the following lower bound:

P(C = 1|(s, s′); θ) = σ(⟨s, s′⟩) = σ( (1/n²) Σ_{w∈s} Σ_{w′∈s′} ⟨w, w′⟩ )  (6)
 ≥ Π_{w∈s} Π_{w′∈s′} σ(⟨w, w′⟩)^{1/n²}.  (7)

Here, we used the mean pooling assumption and the bilinearity of the inner product for Equation 6, and Theorem 3.2 in Nantomah (2019) for Equation 7 (the bound also follows from the concavity of log σ via Jensen's inequality).
Then the objective function L_contrastive(θ) of the probabilistic model P has the following approximated lower bound:

L_contrastive(θ) = Σ_{(s,s′)∼pos} log P(C = 1|(s, s′); θ) + Σ_{(s,s′)∼neg} log P(C = 0|(s, s′); θ)
 ≥ (1/n²) [ Σ_{(s,s′)∼pos} Σ_{w∈s} Σ_{w′∈s′} log σ(⟨w, w′⟩) + Σ_{(s,s′)∼neg} Σ_{w∈s} Σ_{w′∈s′} log σ(−⟨w, w′⟩) ],  (13)

where the inequality follows from Equations 6 to 8, and |s| = |s′| = n is used. Hence, the optimal value of L_contrastive is bounded below by the optimal value of the right-hand side. Let Equation 14 denote the right-hand side without the constant factor 1/n²; noting that arg max_θ (13) = arg max_θ (14), Equation 14 corresponds to the objective function of the skip-gram with negative sampling (SGNS) model (Mikolov et al., 2013), taking the context window to be a sentence. In other words, the optimal value of the lower bound (Equation 13) corresponds to the optimal value of the SGNS model (Equation 14).
(i) If (s, s′) is a positive example, the words w′ in s′ can be considered to be sampled from the distribution of words co-occurring with w in a sentence: w′ ∼ P_pos(·|w) = P_sent(·|w). (ii) If s′ is a negative example, the words in it can be considered to be sampled from the unigram distribution P(·), which follows from s′ ∼ S and w′ ∼ s′. By using the property of the trained SGNS model shown in Oyama et al. (2023), we have (1/2)∥w∥^2 ≈ KL(P_sent(·|w) ∥ P(·)) = KL(w), which completes the proof.

C.3 Discussion
Here, we discuss two implications of Theorem 1. First, Equation 16 represents the intuition explained in Section 4: the difference between the word frequency distribution of the similar sentence s′ and that of the dissimilar sentence s′ after observing the word w in the anchor sentence s is implicitly encoded in ∥w∥. In other words, by observing w ∈ s, the information gain on the word frequency distribution of the similar sentence s′ is encoded in ∥w∥.
Secondly, the conclusion of our proof, that the information gain KL(w) is encoded in the norm of w, justifies to a certain extent quantifying the implicit word weighting of the model using Integrated Gradients (IG) or SHAP. When a sentence embedding is constructed by additive composition (MEAN pooling), the contribution of each word is approximately determined by the norm of its word embedding (Yokoi et al., 2020). IG and SHAP also additively decompose the contribution of each input feature (input word embedding) to the model output (sentence embedding) (Sundararajan et al., 2017; Lundberg and Lee, 2017). From the perspective of these additive properties, it is natural that IG and SHAP can capture the contributions implicitly encoded in the norm. To preserve the additive properties of IG and SHAP more faithfully, further refining the aggregation of contributions (Equations 4, 20) is interesting future work.

D Quantifying word weighting within sentence encoders with SHAP
In this section, we briefly describe Shapley additive explanations (SHAP; Lundberg and Lee, 2017), the feature attribution method introduced in Section 3.1.2, and then describe how to apply SHAP to the quantification of word weighting within sentence encoders.

D.1 Shapley value
SHAP is an XAI extension of the Shapley value (Shapley, 1953), a classic method proposed in the context of cooperative game theory. We first explain the Shapley value. Consider a cooperative game in which a set of players forms coalitions and gains payoffs. The Shapley value is a method to fairly distribute the payoff gained by cooperation (forming a coalition) among the players.
Let N := {1, 2, ..., n} be the set of players in the game and v : 2^N → R be the function that determines the payoff gained by a subset (coalition) of players. Note that the empty coalition gains no payoff: v(∅) = 0. To compute the payoff (contribution) ϕ_i distributed to the i-th player, the Shapley value calculates the expected difference in payoff before and after the i-th player joins each of the possible coalitions (over all orderings of players). Formally, ϕ_i is calculated as follows:

ϕ_i(v) = Σ_{S ⊆ N∖{i}} [ |S|! (n − |S| − 1)! / n! ] (v(S ∪ {i}) − v(S)).  (17)

The Shapley value satisfies several ideal properties; for example, the calculated contributions sum to the payoff of the grand coalition, i.e., v(N) = Σ_{i∈N} ϕ_i(v).
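For small games the definition can be computed exactly. The sketch below averages marginal contributions over all player orderings, which is equivalent to Equation 17; the quadratic payoff function is a toy example chosen only to make the result easy to check.

```python
from itertools import permutations
import math

def shapley_values(n, v):
    """Exact Shapley values for an n-player game with payoff function v,
    computed by averaging marginal contributions over all orderings."""
    phi = [0.0] * n
    for order in permutations(range(n)):
        coalition = set()
        for player in order:
            before = v(frozenset(coalition))
            coalition.add(player)
            phi[player] += v(frozenset(coalition)) - before
    total = math.factorial(n)
    return [p / total for p in phi]

# Toy 3-player game: the payoff is the squared coalition size (v(empty) = 0).
v = lambda S: len(S) ** 2
phi = shapley_values(3, v)

# Efficiency: the contributions sum to v(N).
assert abs(sum(phi) - v(frozenset({0, 1, 2}))) < 1e-9
print(phi)  # symmetric players -> equal shares: [3.0, 3.0, 3.0]
```

The O(n!) enumeration is only feasible for tiny n, which is exactly why approximations such as PartitionSHAP (Section D.3) are used in practice.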

D.2 SHAP
SHAP is an extension of the Shapley value for interpreting machine learning models. Let f : R^n → R be a model and X ∈ R^n be a certain input. SHAP maps machine learning models to cooperative games: the input X corresponds to the set of players N, and the model f corresponds to the set function v in Equation 17. However, in cooperative games the input (a subset of players) is discrete, while in machine learning models the input (a vector) is continuous. SHAP bridges this discrepancy as follows: the situation "the i-th player is not included in the coalition" is mapped to "the i-th feature of the input vector is replaced with its expected value." See the original SHAP paper (Lundberg and Lee, 2017) for details.

D.3 Approximate calculation of SHAP
The exact calculation of the Shapley value is computationally expensive (time complexity O(2^N)) because it considers all the permutations of the input players (Equation 17). Its extension, SHAP, is therefore generally calculated through an approximation.
One typical approximate calculation of SHAP is PartitionSHAP, implemented in the shap library, which hierarchically clusters input features (or words) to decide coalitions, reducing the total number of permutations in Equation 17.

D.4 Application to sentence encoder
When applying SHAP to NLP models, instead of replacing a feature with its expected value, an input word is often replaced with an uninformative token (e.g., [MASK] for BERT-based models). For sentence encoders, SHAP calculates the contribution ϕ_{i,j} of each input word w_i to the output m_j(W) in the form of a decomposition of m_j(W) into a sum:

m_j(W) = m_j(M) + Σ_{i=1}^{n} ϕ_{i,j},

where M denotes the fully masked input. Then, we can calculate the contribution c_i of the i-th word to s by aggregating the scores across the output dimensions with the norm, in the same way as for IG (see Section 3.1.1):

c_i := ∥ (ϕ_{i,1}, ..., ϕ_{i,d}) ∥.  (20)

In addition, the word contribution c_i is normalized with respect to the sentence length n, the same as for IG (see Section 3.1.1).

E Omitted results in Section 5
Tables 2 to 4 show the results of the (IG, − log P(w)), (SHAP, KL(w)), and (SHAP, − log P(w)) experiments, respectively, and Figure 6 shows the linear regression plots between − log P(w) and IG.

Figure 6: Linear regression plots between − log P(w) and the word weighting of BERT (CLS) (left) and Unsupervised (U.) SimCSE-BERT (right) using IG. We plotted 3,000 tokens subsampled from the 99.5% of tokens with the smallest KL values for visibility.
F Experiments with Wikipedia, NLI, and STS12 datasets

Here, we conduct experiments with the Wikipedia, NLI, and STS12 datasets, in addition to the STS-B dataset used in Section 5, to test generalizability. We followed the same settings for the models and the word weighting calculation as in Section 5, and experimented with all the combinations of ({IG, SHAP}, {KL(w), − log P(w)}).

F.2 Results
Tables 5 to 16 show the results for the three datasets. Except for SBERT in the IG experiments and U. SimCSE-BERT/RoBERTa and DiffCSE-BERT/RoBERTa in the SHAP experiments, the coefficient of determination R^2 and the regression coefficient β of contrastive-based sentence encoders are higher than those of the pre-trained models across all three datasets, consistent with the STS-B experiments in Section 5 and Appendix E. The results suggest that contrastive-based sentence encoders learn to weight each word according to the information-theoretic quantities.

Figure 1: Overview of our study. We quantify the word weighting within contrastive-based sentence encoders using XAI techniques: Integrated Gradients (IG) or Shapley additive explanations (SHAP). We found that the quantified weightings are close to information-theoretic quantities: the information gain KL(w) and the self-information − log P(w).

Figure 2: Linear regression plots between KL(w) and the word weighting of BERT (CLS) (left) and Unsupervised (U.) SimCSE-BERT (right) using IG. We plotted 3,000 tokens subsampled from the 99.5% of tokens with the smallest KL values for visibility. The plots for the − log P(w) experiments are shown in Figure 6 in Appendix E.

Figure 4: Scatter plot of KL(w), IDF, and SIF weighting. All weightings are normalized within the same weighting method.

Figure 5:

Table 3: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by SHAP on the information gain KL(w) for the STS-B dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 4: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by SHAP on the self-information − log P(w) for the STS-B dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 5: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the information gain KL(w) for the Wikipedia dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 6: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the self-information − log P(w) for the Wikipedia dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 7: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by SHAP on the information gain KL(w) for the Wikipedia dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 8: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by SHAP on the self-information − log P(w) for the Wikipedia dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 9: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the information gain KL(w) for the NLI dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 10: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the self-information − log P(w) for the NLI dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 11: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by SHAP on the information gain KL(w) for the NLI dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 12: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by SHAP on the self-information − log P(w) for the NLI dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 13: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the information gain KL(w) for the STS12 dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.

Table 14: Coefficient of determination (R^2) and regression coefficient (β) of the linear regression of the words' weightings calculated by IG on the self-information − log P(w) for the STS12 dataset. R^2 and β are reported as R^2 × 100 and β × 100. The values inside the brackets represent the gain from the pre-trained models.