An Attribution Method for Siamese Encoders

Despite the success of Siamese encoder models such as sentence transformers (ST), little is known about the aspects of inputs they pay attention to. A barrier is that their predictions cannot be attributed to individual features, as they compare two inputs rather than processing a single one. This paper derives a local attribution method for Siamese encoders by generalizing the principle of integrated gradients to models with multiple inputs. The output takes the form of feature-pair attributions, which in the case of STs can be reduced to a token-token matrix. Our method involves the introduction of integrated Jacobians and inherits the advantageous formal properties of integrated gradients: it accounts for the model's full computation graph and is guaranteed to converge to the actual prediction. A pilot study shows that in the case of STs a few token pairs can dominate predictions and that STs preferentially focus on nouns and verbs. For accurate predictions, however, they need to attend to the majority of tokens and parts of speech.


Introduction
Siamese encoder models (SE) process two inputs concurrently and map them onto a single scalar output. One realization is the sentence transformer (ST), which learns to predict a similarity judgment between two texts. They have led to remarkable improvements in many areas including sentence classification and semantic similarity (Reimers and Gurevych, 2019), information retrieval (IR) (Thakur et al., 2021), and automated grading (Bexte et al., 2022). However, little is known about which aspects of the inputs these models base their decisions on, which limits our understanding of their capabilities and limitations. Nikolaev and Padó (2023) analyze STs with sentences of pre-defined lexical and syntactic structure and use regression analysis to determine the relative importance of different text properties. MacAvaney et al. (2022) analyze IR models with samples consisting of queries and contrastive documents that differ in certain aspects. Opitz and Frank (2022) train an ST to explicitly encode metrics on abstract meaning representations in sub-embeddings.
More is known about the behavior of standard transformer models (see Rogers et al. (2020) for an overview): Hidden representations have been probed for syntactic and semantic information (Tenney et al., 2019; Conia and Navigli, 2022; Jawahar et al., 2019). Attention weights have been analyzed with regard to the linguistic patterns they capture (Clark et al., 2019; Voita et al., 2019) and have been linked to individual predictions (Abnar and Zuidema, 2020; Vig, 2019). However, attention weights alone cannot serve as explanations for predictions (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). To obtain local explanations for individual predictions (Li et al., 2016), Bastings and Filippova (2020) suggest the use of feature attribution methods (Danilevsky et al., 2020). Among them, integrated gradients are arguably the best choice due to their strong theoretical foundation (Sundararajan et al., 2017; Atanasova et al., 2020) (see also Appendix A). However, such methods are not directly applicable to Siamese models, since these compare two inputs instead of processing a single one.
In this work, we derive attributions for an SE's predictions to its inputs. The result takes the form of pair-wise attributions to features from the two inputs. For the case of STs, it can be reduced to a token-token matrix (Fig. 1). Our method takes into account the model's full computational graph and only requires it to be differentiable. The combined prediction of all attributions is theoretically guaranteed to converge to the actual prediction. To the best of our knowledge, we propose the first method that can accurately attribute predictions of a Siamese model to input features. Upon publication we will make our code available.

Method

Let f be a Siamese model with an encoder e which maps two inputs a and b to a scalar score s:

$$f(a, b) = e^T\!(a)\, e(b) = s \qquad (1)$$

Additionally, let r be a reference input that always results in a score of zero for any other input c: f(r, c) = 0. By extending the principle that Sundararajan et al. (2017) introduced for single-input models, we can derive attributions of a Siamese model's predictions to the features of its inputs and receive

$$f(a, b) = \sum_{ij} (a - r)_i \left[ J_a^T J_b \right]_{ij} (b - r)_j \qquad (2)$$

where i and j index the input dimensions of a and b, respectively; a detailed derivation is given in Appendix B. We define the matrices J as

$$\left(J_a\right)_{ki} = \int_0^1 \frac{\partial e_k\big(x(\alpha)\big)}{\partial x_i}\, d\alpha \qquad (3)$$

The expression inside the integral, ∂e_k/∂x_i, is the Jacobian of the encoder, i.e. the matrix of partial derivatives of all embedding components k w.r.t. all input components i. We therefore call J an integrated Jacobian. The integral proceeds along positions α on an integration path formed by a linear interpolation between the reference r and the input a: x(α) = r + α(a − r). Intuitively, Eq. 3 embeds all inputs between r and a along the path x(α) and computes their sensitivities w.r.t. the input dimensions. It then collects all results on the path and combines them into the matrix J. To calculate the integral efficiently, we approximate it by a sum over discrete points α_n on the integration path.
Equation 2 combines the sensitivities of both inputs and computes pairwise attributions between all feature combinations of a and b. The individual summands can be expressed in a matrix, which we will refer to as A, with entries A_ij = (a − r)_i [J_a^T J_b]_ij (b − r)_j.
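To make the computation concrete, the following is a minimal, self-contained PyTorch sketch of Eqs. 1-3. The small MLP encoder, the dimensions, and the step count N are illustrative stand-ins, not the paper's actual sentence-transformer setup; the embedding shift anticipates the adaptation described in Sec. 2.2, so that e(r) = 0 holds.

```python
# Toy sketch of integrated Jacobians (Eq. 3) and pairwise attributions (Eq. 2).
import torch

torch.manual_seed(0)
D_in, D_emb, N = 6, 4, 200  # input dim, embedding dim, approximation steps

base = torch.nn.Sequential(          # stand-in for the base encoder e'
    torch.nn.Linear(D_in, D_emb), torch.nn.Tanh(),
    torch.nn.Linear(D_emb, D_emb),
)
r = torch.zeros(D_in)                # reference input

def encoder(x):
    # shifted encoder (Sec. 2.2): e(x) = e'(x) - e'(r), so e(r) = 0
    return base(x) - base(r)

def integrated_jacobian(x, n_steps=N):
    """Approximate Eq. 3: average the encoder Jacobian along the
    straight line x(alpha) = r + alpha * (x - r)."""
    J = torch.zeros(D_emb, D_in)
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        J += torch.autograd.functional.jacobian(encoder, r + alpha * (x - r))
    return J / n_steps

a, b = torch.randn(D_in), torch.randn(D_in)
J_a, J_b = integrated_jacobian(a), integrated_jacobian(b)

# Eq. 2: matrix of pairwise feature attributions A_ij
A = (a - r)[:, None] * (J_a.T @ J_b) * (b - r)[None, :]
f_ab = encoder(a) @ encoder(b)       # Eq. 1
print(A.sum().item(), f_ab.item())   # nearly equal for large enough N
```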
In a transformer model, text representations are typically of shape S × D, where S is the sequence length and D is the embedding dimensionality. Therefore, A quickly becomes intractably large. However, the sum in Eq. 2 allows us to combine individual attributions: summing over the embedding dimension D yields a matrix of shape S_a × S_b, where S_a and S_b are the lengths of the two input sequences. Figure 1 shows an example. Since Eq. 2 is an equality, the attribution provided by A is provably correct: the entries in A sum to the model prediction f(a, b). Only for efficient calculation do we approximate the integral by a sum of N steps along the integration path (Eq. 3).
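The reduction to a token-token matrix is a simple reshape-and-sum; a sketch, where the shapes and the random stand-in for A are assumptions for illustration:

```python
# Reduce the (S_a*D) x (S_b*D) attribution matrix to a token-token matrix
# by summing over both embedding axes.
import numpy as np

S_a, S_b, D = 7, 9, 16
A = np.random.randn(S_a * D, S_b * D)   # stand-in for the attribution matrix

token_matrix = A.reshape(S_a, D, S_b, D).sum(axis=(1, 3))
assert token_matrix.shape == (S_a, S_b)
assert np.isclose(A.sum(), token_matrix.sum())  # total attribution preserved
```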

Adapting Existing Models
A model needs to fulfill two requirements for attributions to be meaningful:

Reference Input. It is crucial that f consistently yields a score of zero for inputs involving the reference input r. One solution would be to set r to an input that the encoder maps onto the zero vector, so that f(x, r) = e^T(x) e(r) = e^T(x) 0 = 0. However, it is not trivial to find such an input. We avoid this issue by choosing an arbitrary reference and shifting all embeddings by e'(r) in the encoder space, e(x) = e'(x) − e'(r), where e' is the base encoder, so that e(r) = 0. For simplicity, we use a padding-token sequence as r.
Objective. Sentence transformers typically use cosine distance to compare embeddings, which normalizes them to unit length. Unfortunately, the normalization of the zero vector, onto which we map the reference, is undefined. Therefore, we replace the cosine distance with the (unnormalized) dot product when computing scores, as shown in Eq. 1.
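Both adaptations amount to a thin wrapper around the encoder; a hedged sketch, where `base_encode` is a hypothetical placeholder for any sentence encoder rather than a real library call:

```python
# Shift embeddings by the reference embedding and score with an
# unnormalized dot product instead of cosine.
import numpy as np

def base_encode(text: str) -> np.ndarray:      # placeholder encoder e'
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(8)

REFERENCE = "<pad> <pad> <pad>"                # padding-token sequence as r
r_emb = base_encode(REFERENCE)

def encode(text: str) -> np.ndarray:
    return base_encode(text) - r_emb           # e(x) = e'(x) - e'(r)

def score(x: str, y: str) -> float:
    return float(encode(x) @ encode(y))        # dot product, Eq. 1

assert score(REFERENCE, "any sentence") == 0.0  # f(r, c) = 0 by construction
```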

Intermediate Representations
Due to the sequence-to-sequence architecture and the language-modeling pre-training, deep representations of transformers still correspond to individual tokens of the input. Therefore, it makes sense to also consider attributions to intermediate and even output representations. In this case, f becomes the function mapping the given intermediate representation to the output.

Experiments and Results
We begin by evaluating how much the shift of embeddings and the change of objective affect STs. To this end, we train STs on top of different pre-trained transformers on the STS-benchmark (Cer et al., 2017). We include vanilla transformers, which have only been pre-trained on language-modeling tasks, as well as sentence transformers, which have already been trained in a Siamese setting. We report Spearman correlations between predictions and labels for both cosine distance and dot product in Table 1. The best models, pre-trained sentence transformers, adjust rather well and sacrifice at most 1.7 points of their original performance. Results for vanilla transformers are more varied: while MPNet adjusts well, the performance of Roberta drops by around 10 points.

Attribution Accuracy
As described in Sec. 2, we approximate the integral in Eq. 3 by a sum over N discrete steps. Figure 2 shows the resulting attribution errors for different intermediate representations of S-MPNet as a function of N: attributions to deeper layers can be approximated with fewer steps. This is reasonable, because the deeper the sub-model between the chosen representation and the output, the more complex the transformations it may learn, which will be harder to approximate. Attributions to, e.g., layer 9 are only off by (5±5)×10⁻³ with as few as N = 50 approximation steps. Layer 7 requires N = 1000 steps to reach an error of (2±3)×10⁻³, and attributions to shallower layers have not yet started converging for as many as N = 2500 steps. Our current implementation and resources limit us to N ≤ 2500; however, the sum in Eq. 3 will converge to the integral for sufficiently large N. Both the worse-performing Roberta model and the shallower distilled Roberta model already converge over their full depth with smaller N (Appendix D). We emphasize that attribution accuracy is independent of a model's predictive performance.
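A convergence check in the spirit of Fig. 2 can be run on the toy setup from the sketch in the Method section; this fragment assumes `encoder`, `integrated_jacobian`, `a`, `b`, `r`, and `f_ab` from that sketch are in scope.

```python
# Attribution error as a function of the number of approximation steps N.
for n in (10, 50, 250, 1000):
    J_a = integrated_jacobian(a, n_steps=n)
    J_b = integrated_jacobian(b, n_steps=n)
    A = (a - r)[:, None] * (J_a.T @ J_b) * (b - r)[None, :]
    print(n, abs(A.sum().item() - f_ab.item()))  # error shrinks as n grows
```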

Distribution of Attributions
For an overview of the range of attributions that our best-performing model S-MPNet assigns to pairs of tokens, Figure 3 shows a histogram of attributions to different (intermediate) representations across 1000 test examples. A large fraction of all attributions to intermediate representations is negative (38% for layer 11). Thus, the model can balance matches and mismatches. This becomes apparent in the example in Appendix C: the word poorly negates the meaning of the sentence and contributes negatively to the prediction. Interestingly, attributions to the output representation do not capture this characteristic, as they are almost exclusively positive (95%). We observe similar behaviour for other models (Appendix E). It further interests us how the model distributes its predictions among token pairs. We sort attributions (to layer 11) by their absolute value and add them up cumulatively. Averaging over 1000 test instances results in Fig. 4. The top 5% of attributions can explain (77±133)% of the model prediction. However, the large standard deviation shows that this estimate is not yet trustworthy. For it to become reliable, with a standard deviation below 5% (2%), we require 78% (92%) of all attributions.
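The cumulative analysis itself is straightforward; a sketch, where the random matrix is a stand-in for the token-token attributions of one example:

```python
# Sort attributions by absolute value and accumulate (cf. Fig. 4).
import numpy as np

rng = np.random.default_rng(0)
attr = rng.standard_normal((20, 20)).flatten()  # stand-in attributions
order = np.argsort(-np.abs(attr))               # largest |attribution| first
cumulative = np.cumsum(attr[order]) / attr.sum()
# cumulative[k]: fraction of the prediction explained by the top k+1
# attributions; in practice the denominator is the prediction f(a, b)
```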

POS Relations
To demonstrate the contribution of our method to a better understanding of model predictions, we evaluate which combinations of parts of speech (POS) the model relies on to compute similarities between sentences. For this purpose, we combine token-level attributions into word-level attributions by averaging, as sketched below. We then tag words with a POS classifier. Fig. 5 shows the shares of the ten most frequent POS relations among the highest 10%, 25%, and 50% of attributions on the STS test set. Within the top 10%, noun-noun attributions clearly dominate with a share of almost 25%, followed by verb-verb and noun-verb attributions. Among the top 25% this trend is mitigated, and the top half splits more evenly.
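A sketch of the token-to-word averaging, where the attribution matrix and the subword-to-word alignment lists are made-up examples:

```python
# Combine token-token attributions into word-word attributions by
# averaging over each word's subword tokens.
import numpy as np

rng = np.random.default_rng(0)
attr = rng.standard_normal((5, 4))   # token-token attributions (stand-in)
words_a = [[0, 1], [2], [3, 4]]      # token indices per word in sentence a
words_b = [[0], [1, 2], [3]]         # token indices per word in sentence b

word_attr = np.array([[attr[np.ix_(ta, tb)].mean() for tb in words_b]
                      for ta in words_a])   # shape: (3 words, 3 words)
```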
When we compute predictions exclusively from attributions to specific POS relations, nouns and verbs together explain (53±90)% of the model prediction, and the top ten POS relations (cf. Fig. 5) account for (66±98)%. The 90% most important relations achieve (95±29)%. Thus, the model largely relies on nouns (and verbs) for its predictions. These findings extend the insights of Nikolaev and Padó (2023) to naturalistic data. However, the high variances again show that the model does look beyond them and relies on all POS to make accurate predictions.

Conclusion
Our method can provably and accurately attribute Siamese model predictions to input and intermediate feature pairs. While in STs output attributions are not very expressive and attributing to inputs can be computationally expensive, attributions to deeper intermediate representations are efficient to compute and provide rich insights.
Referring to the terminology introduced by Doshi-Velez and Kim (2017), our feature-pair attributions are single cognitive chunks that combine additively to the model prediction. Importantly, they can explain which feature pairs are relevant to individual predictions, but not why (Lipton, 2018).
Improvements may be achieved by incorporating Sanyal and Ren (2021)'s discretization method, and care must be taken regarding the possibility of adversarially misleading gradients (Wang et al., 2020). In the future, we believe our method can serve as a diagnostic tool to better analyze the predictions of Siamese models.

Limitations
The most important limitation of our method is that the original model needs to be adjusted and fine-tuned in order to adapt to the shift of embeddings and the change of objective introduced in Sec. 2.2. This step is required because the dot product (and cosine similarity) of shifted embeddings does not equal that of the original ones. Therefore, we cannot directly analyze off-the-shelf models.
Second, when a dot product is used to compare two embeddings instead of a cosine distance, self-similarity is not preserved: without normalization, the dot product of an embedding vector with itself is not necessarily one. We note that, although theoretically not well-defined, in practice it is possible to use cosine distance together with shifted embeddings. We leave this evaluation for future work.
Third, our evaluation of predictive performance is limited to the task of semantic similarity and the STS-benchmark (which includes multiple datasets). There are two reasons for this: we focus on the derivation of an attribution method for Siamese models and on the evaluation of the resulting attributions. The preservation of embedding quality for downstream tasks in non-Siamese settings is out of the scope of this short paper.

Ethics Statement
Our work involves neither sensitive data nor sensitive applications. Both the pre-trained models and the datasets we use are publicly available. The computational costs for the required fine-tuning are relatively low. We believe our method can make Siamese models more transparent and help identify potential errors and biases in their predictions.

A Integrated Gradients
Our method builds on the principle that was introduced by Sundararajan et al. (2017) for models with a single input.Here we derive the core concept of their integrated gradients.
Let f be a differentiable model taking a single vector-valued input x and producing a scalar output s ∈ [0, 1]: f(x) = s. In addition, let r be a reference input yielding a neutral output: f(r) = 0. We can then start from the difference between the model outputs for the two inputs and reformulate it as an integral (regarding f as an anti-derivative):

$$f(a) - f(r) = \int_r^a \frac{\partial f(x)}{\partial x_i}\, dx_i \qquad (4)$$

This is a path integral from the point r to a in the input space. We use component-wise notation, and double indices are summed over. To solve the integral, we parameterize the path from r to a by the straight line x(α) = r + α(a − r) and substitute it:

$$f(a) - f(r) = \int_0^1 \frac{\partial f\big(x(\alpha)\big)}{\partial x_i}\, \frac{d x_i(\alpha)}{d\alpha}\, d\alpha \qquad (5)$$

The first term inside the above integral is the gradient of f at the position x(α). The second term is the derivative of the straight line and reduces to dx(α)/dα = (a − r), which is independent of α and can be pulled out of the integral:

$$f(a) - f(r) = (a - r)_i \int_0^1 \frac{\partial f\big(x(\alpha)\big)}{\partial x_i}\, d\alpha \qquad (6)$$

The i-th summand in this last expression is the contribution of the i-th input feature to the difference in Eq. 4. If f(r) = 0, then the sum over all contributions equals the model prediction f(a) = s. Note that the equality between Eq. 4 and Eq. 6 holds strictly. Therefore, Eq. 6 is an exact reformulation of the model prediction.
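The equality of Eq. 4 and Eq. 6 can also be checked numerically; a minimal sketch, where the toy model and inputs are illustrative assumptions:

```python
# Numerical check that Eq. 6 reproduces f(a) - f(r).
import torch

w = torch.tensor([0.5, -1.0, 2.0])
def f(x):                                  # toy differentiable scalar model
    return torch.tanh(x) @ w

a = torch.tensor([1.0, 0.2, -0.3])
r = torch.zeros(3)                         # reference input
N = 1000

grad_integral = torch.zeros(3)
for alpha in torch.linspace(0.0, 1.0, N):
    x = (r + alpha * (a - r)).requires_grad_(True)
    f(x).backward()                        # gradient of f at x(alpha)
    grad_integral += x.grad
contributions = (a - r) * grad_integral / N  # Eq. 6, per input feature

print(contributions.sum().item(), (f(a) - f(r)).item())  # nearly equal
```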

B Detailed Derivation
For the case of a model receiving two inputs, we extend the Ansatz from Eq. 4 to:

$$f(a,b) - f(r,b) - f(a,r) + f(r,r) = \int_r^a \int_r^b \frac{\partial^2 f(x,y)}{\partial x_i\, \partial y_j}\, dx_i\, dy_j \qquad (7)$$

We plug in the definition of the Siamese model (Eq. 1), using element-wise notation for the output embedding dimensions k, and again omit sums over double indices:

$$f(a,b) - f(r,b) - f(a,r) + f(r,r) = \int_r^a \int_r^b \frac{\partial^2}{\partial x_i\, \partial y_j}\, e_k(x)\, e_k(y)\, dx_i\, dy_j \qquad (8)$$

Neither encoding depends on the other integration variable, so we can separate derivatives and integrals:

$$= \left[\int_r^a \frac{\partial e_k(x)}{\partial x_i}\, dx_i\right] \left[\int_r^b \frac{\partial e_k(y)}{\partial y_j}\, dy_j\right]$$

Different from above, the encoder e is a vector-valued function. Therefore, ∂e_k(x)/∂x_i is a Jacobian, not a gradient. We integrate along straight lines from r to a and from r to b, parameterized by α and β, respectively, and receive:

$$= (a - r)_i \left[\int_0^1 \frac{\partial e_k\big(x(\alpha)\big)}{\partial x_i}\, d\alpha\right] \left[\int_0^1 \frac{\partial e_k\big(y(\beta)\big)}{\partial y_j}\, d\beta\right] (b - r)_j$$

With the definition of integrated Jacobians from Eq. 3, we can use vector notation and write the sum over the output dimension k in square brackets as a matrix product: J_a^T J_b. If r consistently yields a prediction of zero, the last three terms on the left-hand side of Eq. 7 vanish, and we arrive at our result in Eq. 2, where we denote the sum over the input dimensions i and j explicitly.

C Intermediate Attributions
Fig. 6 shows attributions for one example to different representations in the S-MPNet model. Attributions to layers eleven and seven capture the negative contribution of poorly, which is completely absent in the output-layer attributions. As Fig. 3 shows, output attributions are less pronounced and almost exclusively positive.

D Attribution Accuracy
In Fig. 7 we include attribution-accuracy plots for the Roberta and the S-distillRoberta models. In both models, attributions to all layers converge for N < 2500.

F Different Models
Attributions of different models can have different characteristics even if their agreement on the overall score is good. Fig. 9 shows two examples.

G Prediction Failures
Fig. 10 shows examples in which the S-MPNet prediction is far off from the label.In the future, a systematic analysis of such cases could provide insights into where the model fails.

H Training Details
We fine-tune all models in a Siamese setting on the STS-benchmark train split. Models either use shifted embeddings combined with a dot-product objective or normal embeddings together with a cosine objective. All trainings run for five epochs, with a batch size of 16, a learning rate of 2 × 10⁻⁵, and a weight decay of 0.1, using the AdamW optimizer. 10% of the training data is used for linear warm-up.
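For concreteness, a hedged sketch of this fine-tuning setup with the sentence-transformers v2 API. The training pairs are made up, and the paper's shifted dot-product objective would require a custom loss in place of the cosine loss shown here.

```python
# Siamese fine-tuning sketch with the hyperparameters from this appendix.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
train_examples = [
    InputExample(texts=["A man is eating.", "A person eats."], label=0.9),
    InputExample(texts=["A dog runs.", "The stock market fell."], label=0.1),
]
loader = DataLoader(train_examples, batch_size=16, shuffle=True)

model.fit(
    train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
    epochs=5,
    warmup_steps=int(0.1 * 5 * len(loader)),  # 10% linear warm-up
    optimizer_params={"lr": 2e-5},
    weight_decay=0.1,                          # AdamW is the default optimizer
)
```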


Figure 1: An example of token attributions. The model correctly relates not ... good to bad and matches coffee despite the different positions. The similarity score is 0.82; the attribution error is 10⁻³ for N = 500.

Figure 2: Errors of attributions to different intermediate representations of S-MPNet as a function of approximation steps N (see Sec. 3.1). For layer 9 we additionally show the standard deviation.

Figure 3: Distribution of individual token-token attributions to different intermediate representations of the S-MPNet model.

Figure 4: Cumulative prediction of token-token attributions sorted by their absolute value.

Figure 6: Attributions of the same example to different representations in the S-MPNet model.

Figure 7: Layer-wise attribution errors for a Roberta-based model (top) and its distilled version (bottom).

E Distribution of Attributions

Fig. 8 shows distribution plots for attributions to different intermediate representations for the Roberta and the S-distillRoberta models. In both cases we also observe the same behaviour as for S-MPNet.

Figure 9: Attributions for identical sentences by different models. Models and scores are given in the titles.

Figure 10: Examples in which the S-MPNet prediction is far off from the label (see Appendix G).

Table 1: Spearman correlations between labels and scores predicted by cosine distance and dot product. Models are adjusted according to Sec. 2.2. Numbers in parentheses show improvements over unmodified models with identical training.