Neighboring Words Affect Human Interpretation of Saliency Explanations

Word-level saliency explanations ("heat maps over words") are often used to communicate feature attribution in text-based models. Recent studies found that superficial factors such as word length can distort human interpretation of the communicated saliency scores. We conduct a user study to investigate how the marking of a word's neighboring words affects the explainee's perception of the word's importance in the context of a saliency explanation. We find that neighboring words have significant effects on the word's importance rating. Concretely, we identify that the influence changes based on neighboring direction (left vs. right) and a-priori linguistic and computational measures of phrases and collocations (vs. unrelated neighboring words). Our results question whether text-based saliency explanations should continue to be communicated at the word level, and inform future research on alternative saliency explanation methods.


Introduction
In the context of explainability methods that assign importance scores to individual words, we are interested in characterizing the effect of phrase-level features on the perceived importance of a particular word: text is naturally constructed and comprehended at various levels of granularity that go beyond the word level (Chomsky, 1957; Xia, 2018). For example (Figure 1), the role of the word "York" is contextualized by the phrase "New York" that contains it. Given an explanation that attributes importance to "New" and "York" separately, what is the effect of the importance score of "New" on the explainee's understanding of the importance of "York"? Our study investigates this question.
Figure 1: Illustration of the user study, on the example sentence "The company has been headquartered in New York since its IPO in the year 2013." We ask laypeople to rate the perceived importance of words following a word-importance explanation (grey). Then we analyze the effect of the importance of neighboring words on this interpretation, conditioned on the relationship between the words across various measures (orange).

Current feature-attribution explanations in NLP mostly operate at the word or subword level (Madsen et al., 2023; Arras et al., 2017; Ribeiro et al., 2016; Carvalho et al., 2019). Previous work investigated the effect of word- and sentence-level features on subjective interpretations of saliency explanations on text (Schuff et al., 2022), finding that features such as word length and frequency bias users' perception of explanations (e.g., users may assign higher importance to longer words).
It is not trivial for an explanation of an AI system to successfully communicate the intended information to the explainee (Miller, 2019; Dinu et al., 2020; Fel et al., 2021; Arora et al., 2021). In the case of feature-attribution explanations (Burkart and Huber, 2021; Tjoa and Guan, 2021), which commonly appear in NLP as explanations based on word importance (Madsen et al., 2023; Danilevsky et al., 2020), we must understand how the explainee interprets the role of the attributed inputs on model outputs (Nguyen et al., 2021; Zhou et al., 2022). Research shows that it is often an error to assume that explainees will interpret explanations "as intended" (Gonzalez et al., 2021; Ehsan et al., 2021).
The study involves two phases (Figure 1). First, we collect subjective self-reported ratings of importance by laypeople, in a setting of color-coded word-importance explanations of a fact-checking NLP model (Section 2, Figure 2). Then, we fit a statistical model to map the importance of neighboring words to the word's rating, conditioned on various a-priori measures of bigram constructs, such as the words' syntactic relation or the degree to which they collocate in a corpus (Kolesnikova, 2016).
We observe significant effects (Section 4) for: (1) left-adjacency vs. right-adjacency; (2) the difference in importance between the two words; and (3) the phrase relationship between the words (common phrase vs. no relation). We then deduce likely causes for these effects from relevant literature (Section 5). We are also able to reproduce results by Schuff et al. (2022) in a different English-language domain (Section 3). We release the collected data and analysis code. We conclude that laypeople's interpretation of word-importance explanations in English can be biased by neighboring words' importance, likely moderated by reading direction and phrase-level units of language. Future work on feature attribution should investigate more effective methods of communicating information (Mosca et al., 2022; Ju et al., 2022), and implementations of such explanations should take care not to assume that human users interpret word-level importance objectively.

Study Specification
Our analysis has two phases: collecting subjective interpretations of word importances from laypeople, and testing for significant influences of various properties on the collected ratings, in particular properties of words adjacent to the rated word.

Collecting Perceived Importance
We ask laypeople to rate the importance of a word within a feature-importance explanation (Figure 2). The setting is based on Schuff et al. (2022), with the main difference being the text domain. We use the Amazon Mechanical Turk crowd-sourcing platform to recruit a total of 100 participants.

Table 1: Illustrative subset of our phrase measures.

Measure | Examples | Description
First-order constituent | highly developed, more than, such as | Smallest multi-word constituent sub-trees in the constituency tree.
Noun phrase | tokyo marathon, ski racer, the UK | Multi-word noun phrase in the constituency tree.
Frequency | the United, the family, a species | Raw, unnormalized bigram frequency.

Explanations. We use a color-coded visualization of word-importance explanations, the most common format in the literature (e.g., Arras et al., 2017; Wang et al., 2020; Tenney et al., 2020; Arora et al., 2021). We use importance values from two sources: randomized values, and SHAP values (Lundberg and Lee, 2017) for facebook/bart-large-mnli (Yin et al., 2019; Lewis et al., 2020) as a fact-checking model.
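To make the explanation format concrete, below is a minimal sketch of such a color-coded rendering; the markup, the orange shading, and the function name are our own illustration, not the study's actual interface (Figure 5 shows the real one).

```python
def render_saliency_html(words, scores):
    """Shade each word by its saliency in [0, 1] and return an HTML snippet."""
    spans = []
    for word, score in zip(words, scores):
        alpha = max(0.0, min(1.0, score))  # clamp saliency to a valid opacity
        spans.append(
            f'<span style="background: rgba(255, 165, 0, {alpha:.2f})">{word}</span>'
        )
    return " ".join(spans)

# Example: "New" and "York" receive high, similar saliency.
print(render_saliency_html(
    ["The", "company", "is", "headquartered", "in", "New", "York"],
    [0.10, 0.30, 0.05, 0.40, 0.20, 0.80, 0.90]))
```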
Task. We communicate to the participants that the model performs a plausible task: deciding whether the given sentence is fact or non-fact (Lazarski et al., 2021). The source texts are a sample of 150 sentences from the Wikipedia Sentences corpus, chosen because this domain has a high natural rate of multi-word chunks.
Procedure. We ask the explainee: "How important (1-7) do you think the word [...] was to the model?" and receive a point-scale answer with an optional comment field. This repeats for one randomly sampled word in each of the 150 sentences.

Measuring Neighbor Effects
Ideally, the importance ratings of a word would be explained entirely by its saliency strength. However, previous work showed that this is not the case. Here, we are interested in whether and how much the participants' answers can be explained by properties of neighboring words, beyond what can be explained by the rated word's saliency alone.

Modeling. We analyze the collected ratings using an ordinal generalized additive mixed model (GAMM). Its key properties are that it models the ordinal response variable (i.e., the importance ratings in our setting) on a continuous latent scale as a sum of smooth functions of covariates, while also accounting for random effects.

Precedent model terms. We include all covariates tested by Schuff et al. (2022), including the rated word's saliency, word length, and so on, in order to control for them when testing our new phrase-level variables. We follow Schuff et al.'s controls for all precedent main and random effects.

Novel neighbor terms. The following variables dictate our added model terms as the basis for the analysis: left or right adjacency; the rated word's saliency (color intensity); the saliency difference between the two words; and whether the words hold a weak or strong relationship. We include four new bivariate smooth terms (Figure 3) based on interactions of the above variables.
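In schematic form (our notation, assuming a logistic link for the ordered-categorical response; the authors' exact specification may differ), these terms combine on the latent scale as

$$\Pr(y_i \le r) = \operatorname{logit}^{-1}(\theta_r - \eta_i), \qquad \eta_i = \sum_k f_k(x_{ik}) + \sum_{d \in \{L,R\}} \sum_{c \in \{\text{chunk},\, \text{no chunk}\}} \operatorname{te}_{d,c}(s_i, \Delta s_{i,d}) + Z_i b,$$

where $y_i$ is the $i$-th importance rating, $\theta_r$ are estimated category thresholds, $f_k$ are smooths of the precedent covariates, $s_i$ is the rated word's saliency, $\Delta s_{i,d}$ is the saliency difference to the neighbor on side $d$, $\operatorname{te}_{d,c}$ are the four novel tensor-product smooths (one per side $d$ and chunk status $c$), and $Z_i b$ collects the random effects.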
We refer to a bigram with a strong relationship as a chunk. To arrive at a reliable measure for chunks, we methodically test various measures of bigram relationships in two categories (Table 1): syntactic, via constituency parsing, and statistical, via word collocation in a corpus. Following Frantzi et al. (2000), we use both syntactic and statistical measures together: a bigram counts as a chunk if it is a first-order constituent and scores above the 87.5th percentile of χ² collocation scores (our observations are robust to the choice of statistical measure and percentile; see Appendix C).
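As a minimal sketch of this combined decision rule (the function and argument names are ours, not the authors' released code; Appendix B.4 describes how both inputs are derived):

```python
def is_chunk(first_order_constituent: bool,
             chi_sq_score: float,
             chi_sq_threshold: float) -> bool:
    """A bigram is a chunk iff it is a first-order constituent AND its
    chi-squared collocation score reaches the 87.5th-percentile cutoff."""
    return first_order_constituent and chi_sq_score >= chi_sq_threshold
```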

Reproducing Prior Results
Our study is similar to the experiments of Schuff et al. (2022), who investigate the effects of word-level and sentence-level features on importance perception. Thus, it is well-positioned to attempt a reproduction of prior observations, to confirm whether they persist in a different language domain: medium-form Wikipedia texts vs. short-form restaurant reviews in Schuff et al., and SHAP values vs. Integrated Gradients (Sundararajan et al., 2017).
The result is positive: We reproduce the previously reported significant effects of word length, display index (i.e., the position of the rated instance within the 150 sentences), capitalization, and dependency relation for randomized explanations as well as SHAP-value explanations (details in Appendix A). This result reinforces prior observations that human users are at significant risk of biased perception of saliency explanations despite an objective visualization interface.

Neighbor Effects Analysis
In the following, we present our results for our two experiments using (a) random saliency values and (b) SHAP values.

Randomized Explanations
Regarding our newly introduced neighbor terms, Figure 3 shows the estimates for the four described functions (left/right × chunk/no chunk). Table 2 lists all smooth and parametric terms along with Wald test results (Wood, 2013a,b). Appendix A includes additional results.

Asymmetric influence. Comparing Figure 3a vs. Figure 3b and Figure 3c vs. Figure 3d reveals qualitative differences between the influences of left and right neighbors. We quantitatively confirm these differences by calculating areas of significant differences (Fasiolo et al., 2020; Marra and Wood, 2012). Figures 4a and 4b show the respective plots of (significant) differences and probabilities for the chunk case.
Overall, we conclude that the influence from left and right word neighbors is significantly different.
Chunk influence. We investigate the difference between neighbors that form a chunk with the rated word vs. those that do not. We find qualitative differences in Figure 3 as well as statistically significant differences (Figures 4c and 4d).
Saliency moderates neighbor difference. Figure 3 shows that the effect of a neighbor's saliency difference (x-axis) is moderated by the rated word's saliency (y-axis). We confirm this observation statistically (Figure 4e) by comparing the functions at rated word saliencies of 0.25 and 0.75, using unidimensional difference plots (Van Rij et al., 2015).
Combined effects. We identify two general opposing effects, assimilation and contrast (we borrow these terms from psychology; see Section 5). We refer to assimilation as situations where a word is perceived as more (or less) important when its neighbor has a higher (or lower) saliency. We find assimilation effects from left neighbors that form a chunk with a moderately salient (0.25-0.75) rated word.
We refer to contrast as situations where a word is perceived as less (or more) important when its neighbor has a higher (or lower) saliency. We find contrast effects from left and right neighbors that do not form a chunk with the rated word.
Variant results. Notably, our SHAP-value results differ from our randomized-saliency results with respect to the effects of left/right direction. For the randomized-saliency experiment, we observe assimilation effects from left neighbors within a chunk (Figure 3c) and contrast effects from left and right neighbors outside a chunk (Figures 3a and 3b). For our SHAP-value experiment, we observe assimilation (at low rated-word saliencies) and contrast effects (at medium normalized rated-word saliencies) from right neighbors within a chunk (Figure 10d). We hypothesize that this difference can be attributed to the inter-dependencies of SHAP values, as indicated in Figure 12 in Appendix B.

Takeaways
Overall, we find that (a) left/right influences are not the same, (b) strong bigram relationships can invert contrasts into assimilation for left neighbors, (c) extreme saliencies can inhibit assimilation, and (d) biasing effects can be observed for randomized explanations as well as SHAP-value explanations.

Theoretical Grounds in Psychology
The assimilation effect is, of course, intuitive: it simply means that importance "leaks" from the neighbor to the rated word for strong bigram relationships. But is there precedent for the observed assimilation and contrast effects in the literature? How do they relate to each other?
Psychology investigates how a prime (e.g., being exposed to a specific word) influences human judgment, along two categories: assimilation (the rating is "pulled" towards the prime) and contrast (the rating is "pushed" away from the prime) effects (i.a., Bless and Burger, 2016). Förster et al. (2008) demonstrate how global processing (e.g., looking at the overall structure) vs. local processing (e.g., looking at the details of a structure) leads to assimilation vs. contrast. We argue that some of our observations can be explained by their model: multi-word phrase neighbors may induce global processing that leads to assimilation (for example, left neighbors in the randomized explanation experiments), while other neighbors (right neighbors and unrelated left neighbors in the randomized explanation experiments) induce local processing that leads to contrast. Future work may investigate the properties that induce global processing in specific contexts.

Conclusions
We conduct a user study in a setting where laypeople observe common word-importance explanations, shown as color-coded importance, in the English Wikipedia domain. In this setting, we find that as the explainee interprets the attributed importance of a word, the importance of other words can influence their understanding in unintended ways.
Common wisdom posits that when the importance of a component is communicated in a feature-attribution explanation, the explainee will understand this importance as it is shown. We find that this is not the case: the explainee's contextualized understanding of the input portion (for us, a word as part of a phrase) may influence their understanding of the explanation.

Limitations
In principle, the effects observed in this work apply only to the setting of our user study (English text, English-speaking crowd workers, color-coded word-level saliency, and so on, as described in the paper). This study therefore serves as a proof of existence, in a reasonably plausible and common NLP research setting, that laypeople can be influenced by context outside the attributed part of the input when comprehending a feature-attribution explanation. Anyone designing or implementing explanation technology for NLP systems in another setting, or for other systems of a similar nature, should either investigate whether these effects generalize to their setting in practice (towards which we release our full reproduction code), or act conservatively in anticipation that the effects will generalize, without ruling out the possibility that they will not.

A User Study Details
This section provides details on our user study setup.

A.1 Interface
Figure 5 shows a screenshot of our rating interface.
Figure 6 shows a screenshot of an attention check.

A.2 Attention Checks
We include three attention checks per participant, which we randomly place within the last two thirds of the study, following Schuff et al. (2022).

A.3 Participants
In total, we recruit 76 crowd workers from English-speaking countries via Amazon Mechanical Turk for our randomized explanation study and 36 crowd workers for our SHAP-value explanation study. We require workers to have at least 5,000 approved HITs and a 95% approval rate. Raters are screened with three hidden attention checks that they must answer correctly to be included (but are paid in full regardless). Of the 76 workers, 64 passed the screening, i.e., we excluded 15.8% of responses at the participant level. Of the 36 workers, all passed the screening. On average, participants were compensated at an hourly wage of US$8.95. We do not collect any personally identifiable data from participants.

B Statistical Model Details
In this section, we give a brief general introduction to the statistical model we used (i.e., the GAMM) and provide additional results of our analysis.

B.1 Introduction to GAMM Models
We refer to the brief introduction to GAMMs in Schuff et al. (2022). In short, an ordinal GAMM can be described as a generalized additive model that additionally accounts for random effects and models ordinal ratings via a continuous latent variable, which is separated into the ordinal categories via estimated threshold values.
For further details, Divjak and Baayen (2017) provide a practical introduction to ordinal GAMs in a linguistic context and Wood (2017) offers a detailed textbook on GAM(M)s including implementation and analysis details.

B.2 Model Details in Our Analysis
We control for all main effects (word length, sentence length, etc.) as well as all random effects used by Schuff et al. (2022). We exclude the pairwise interactions due to model instability when they are included. We additionally include four novel bivariate smooth terms. Each of these terms models a tensor product of saliency (i.e., the rated word's color intensity) and the neighboring (left or right) word's saliency difference to the rated word. For each side (left and right), we model the smooths separately for neighbors that (i) form a lexical chunk with the rated word and (ii) do not. Figure 3 shows the four estimated (bivariate) functions.

B.3 Data Preprocessing
Following Schuff et al. (2022), we exclude ratings with a completion time of less than a minute (implausibly fast completion) and exclude words with a length over 20 characters. This effectively excludes 1.8% of ratings.
In order to analyze left as well as right neighbors, we additionally have to ensure that we only include ratings for which both left and right neighbors exist. Therefore, we additionally exclude ratings for which the leftmost or rightmost word in the sentence was rated. This excludes 11.7% of ratings. In total, we thus use 9,489 ratings to fit our model.
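A minimal pandas sketch of these exclusion rules (the column names are our assumption; the released analysis code may differ):

```python
import pandas as pd

def preprocess(ratings: pd.DataFrame) -> pd.DataFrame:
    """Apply the exclusion rules of Appendix B.3 to a table of ratings."""
    # Implausibly fast completions (under one minute) and over-long words.
    ratings = ratings[ratings["completion_seconds"] >= 60]
    ratings = ratings[ratings["word"].str.len() <= 20]
    # Keep only rated words that have both a left and a right neighbor.
    has_left = ratings["word_index"] > 0
    has_right = ratings["word_index"] < ratings["sentence_length"] - 1
    return ratings[has_left & has_right]
```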

B.4 Chunk Measures
We explore and combine two approaches to identifying multi-word phrases (or "chunks").

Syntactic measures (constituents). We first apply binary chunk measures based on the sentences' parse trees. We use Stanza (Qi et al., 2020) (version 1.4.2) to generate a constituency parse tree for each sentence. We assess whether the rated word and its neighbor (left/right) share a constituent at the lowest possible level. Concretely, we (a) start at the rated word and move up one level in the parse tree, and (b) start at the neighboring word and move up one level in the parse tree. If we arrive at the same node in the parse tree, the rated word and its neighbor share a first-order constituent; if we arrive at different nodes, they do not. Restricting the type of first-level shared constituents to noun phrases yields a further category. We provide examples of shared first-level constituents and the respective noun phrase constituents extracted from our data in Table 4 (upper part).
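Below is a sketch of this check using Stanza's constituency parser; our reading of "move up one level" is to compare the nodes directly above the two preterminals (POS tags), which may differ from the authors' exact traversal.

```python
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,constituency")

def leaf_paths(tree, path=(), paths=None):
    """Collect the root-to-leaf ancestor path of every token, left to right."""
    if paths is None:
        paths = []
    if not tree.children:  # a leaf node holds the word itself
        paths.append(list(path))
    for child in tree.children:
        leaf_paths(child, path + (tree,), paths)
    return paths

def share_first_order_constituent(tree, i, j):
    """True iff tokens i and j sit directly under the same phrasal node."""
    paths = leaf_paths(tree)
    # path[-1] is the preterminal (POS tag); path[-2] is one level above it.
    return paths[i][-2] is paths[j][-2]

doc = nlp("The company has been headquartered in New York since 2013.")
tree = doc.sentences[0].constituency
print(share_first_order_constituent(tree, 6, 7))  # "New", "York": likely True
```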
Statistical measures (cooccurrence scores). We additionally explore numeric association measures and calculate all bigram collocation measures available in NLTK's BigramAssocMeasures module. The calculation is based on the 7 million Wikipedia-2018 sentences in the Wikipedia Sentences corpus. A description of each metric as well as top-scored examples from our data is provided in Table 4 (lower part). We separate the examples into those that form a constituent vs. those that do not, to highlight the necessity of applying a constituent filter in order to get a meaningful categorization into chunks vs. no chunks.
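A minimal sketch of this scoring step with NLTK (the toy corpus and the percentile computation are our own illustration):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus standing in for the 7 million Wikipedia Sentences.
tokens = ("the company has been headquartered in new york since its ipo . "
          "new york is a city . the company is large .").split()

finder = BigramCollocationFinder.from_words(tokens)
# chi_sq is the measure the analysis settles on; jaccard, mi_like, and
# poisson_stirling are among the alternatives tested in Appendix C.
scored = finder.score_ngrams(BigramAssocMeasures.chi_sq)
for bigram, score in scored[:3]:
    print(bigram, round(score, 2))

# The 87.5th-percentile cutoff used for the final chunk criterion.
scores = sorted(score for _, score in scored)
threshold = scores[int(0.875 * len(scores))]
```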

B.5 Detailed Results
As described in Section 4, we observe different influences of left/right neighbors, chunk/no chunk neighbors as well as rated word saliency levels in our randomized explanation experiment.
Left vs. right neighbors. Figure 7 shows difference plots (and respective p values) between left and right neighbors, for chunk neighbors (Figures 7a and 7b) and no-chunk neighbors (Figures 7c and 7d).
Chunk vs. no chunk. Respectively, Figure 8 shows difference plots (and respective p values) between chunk and no-chunk neighbors, for left neighbors (Figures 8a and 8b) and right neighbors (Figures 8c and 8d).
Differences across saliency levels. Figure 9 shows that the effects of saliency difference are significantly different between different levels of the rated word's saliency (0.25 and 0.75) for left neighbors (Figure 9a) as well as right neighbors (Figure 9b).
We report the detailed Wald test statistics for our randomized explanation experiment in Table 5.

B.6 SHAP-value Results
We additionally report details of our SHAP-value experiment results. Figure 11 displays left/right, chunk/no-chunk, and rated word saliency level difference plots. We report the detailed Wald test statistics for our SHAP-value explanation experiment in Table 6. Figure 12 illustrates how the distribution of saliency scores is uniformly random for our randomized explanations, in contrast to the distributions of SHAP values.

Reproduction of Schuff et al. (2022). We confirm previous results from Schuff et al. (2022) and find significant effects of word length, display index, capitalization, and dependency relation. We report detailed statistics of our randomized saliency experiment in Table 5 and of our SHAP experiment in Table 6.

C Robustness to Evaluation Parameters

To ensure our results are not an artifact of the particular combination of threshold and cooccurrence measure, we investigate how our results change if we (i) vary the threshold within {0.5, 0.75, 0.875} and (ii) vary the cooccurrence measure within {Jaccard, MI-like, χ², Poisson-Stirling}. We find significant interactions and observe similar interaction patterns as well as areas of significant differences (left/right, chunk/no chunk, as well as saliency levels) across all settings. We provide a representative selection of plots in Figures 13 to 18. Additionally, Tables 7 and 8 demonstrate that changing the threshold or cooccurrence measure leads to model statistics that are largely consistent with the results reported in Table 5. We choose the χ² measure and an 87.5% threshold because no other model reaches a higher deviance explained, and because a comparison of randomly sampled chunk/no-chunk examples across measures and thresholds yields the best results for this setting.
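A sketch of how such a sweep can be organized (the names are ours, and refitting the ordinal GAMM for each setting is only indicated, not implemented):

```python
from nltk.collocations import BigramAssocMeasures as M

MEASURES = {"jaccard": M.jaccard, "mi_like": M.mi_like,
            "chi_sq": M.chi_sq, "poisson_stirling": M.poisson_stirling}
THRESHOLDS = [0.5, 0.75, 0.875]

def chunk_bigrams(finder, measure, quantile):
    """Bigrams scoring at or above the given quantile under the measure."""
    scored = finder.score_ngrams(measure)
    scores = sorted(score for _, score in scored)
    cutoff = scores[int(quantile * len(scores))]
    return {bigram for bigram, score in scored if score >= cutoff}

# for name, measure in MEASURES.items():
#     for t in THRESHOLDS:
#         chunks = chunk_bigrams(finder, measure, t)
#         # ... rebuild chunk labels, refit the GAMM, compare term statistics
```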

Table 4: The list of phrase measures we tested. Examples for the numeric measures are chosen based on highest cooccurrence scores, whereas the (boolean) noun phrase and constituent examples are chosen arbitrarily. For the numeric measures, we provide examples that (a) form a constituent with their neighbor and (b) do not; these examples underline the necessity of combining numeric scores with a constituent filter.

Measure | Constituent Examples | No-Constituent Examples | Description
First-order constituent | highly developed, more than, such as, DVD combo, 4 million | - | Smallest multi-word constituent subtrees in the constituency tree.
Noun phrase | Tokyo Marathon, ski racer, the UK, a retired, the city | - | Multi-word first-order noun phrase in the constituency tree.
Mutual information | as well, more than, ice hockey, United Kingdom, a species | is a, the, in the, is an, it was | Bigram mutual information variant (per NLTK implementation).
Frequency | the United, the family, a species, an American, such as | of the, in the, is a, to the, on the | Raw, unnormalized frequency.
Poisson-Stirling | an American, such as, a species, as well, the family | is a, of the, in the, is an, it was, has been | Poisson-Stirling measure (per NLTK implementation).

Figure 17: Difference plots between the influence of left saliency differences at exemplary high (0.75) and low (0.25) rated word saliency levels, across different choices of thresholds, for our randomized explanation experiment. We find similar patterns across all settings. The χ² measure is used across all plots.

Figure 18: Difference plots between the influence of left saliency differences at exemplary high (0.75) and low (0.25) rated word saliency levels, across different choices of cooccurrence measures, for our randomized explanation experiment. We find similar patterns across all settings. t = 87.5 is used for all plots.

∗ Both authors contributed equally to this research.

Figure 2: Screenshot of the rating interface.

Figure 4: Difference plots. Contour refers to the contour line. Panel (d): chunk/no chunk difference (left) p values (contour marks 0.05). Panel (e): difference between rated saliency and left neighbor saliency (axes: left neighbor saliency difference vs. estimated difference in importance rating, latent scale). Red x-axis in (e) marks significant differences.

Figure 5: Screenshot of the rating interface.

Figure 6: Screenshot of the rating interface for an attention check.

Figure 7: Differences and p values between left and right neighbors, for lexical chunk and no-chunk neighbors, for our randomized explanation experiment. In difference panels, the contour line marks zero; in p-value panels, it marks 0.05.

Figure 8: Differences and p values between chunk and no-chunk neighbors, for left and right neighbors, for our randomized explanation experiment. In difference panels, the contour line marks zero; in p-value panels, it marks 0.05.

Figure 9: Difference plots between the influence of saliency differences at exemplary high (0.75) and low (0.25) rated word saliency levels. Red x-axis areas indicate significant differences.

Figure 10: Estimated tensor product interactions for our SHAP-value experiment.

Figure 11: Difference plots of our SHAP-value experiment results. Contour refers to the contour line. Red x-axis in (e) marks significant differences.

Figure 12: Comparison of the distributions of rated word saliency and right neighbor saliency across our randomized explanations (left) and our SHAP-value experiments (right).

Figure 13: Tensor product interactions for left saliency difference in the outside-chunk setting across different choices of cooccurrence measures for our randomized explanation experiment. We find similar patterns across all settings. t = 87.5 is used for all plots.

Figure 14: Tensor product interactions for left saliency difference in the within-chunk setting across different choices of cooccurrence measures for our randomized explanation experiment. We find similar patterns across all settings. t = 87.5 is used for all plots.

Figure 15: p values for differences between right and left neighbors, for no lexical chunk neighbors, across different choices of cooccurrence measures for our randomized explanation experiment. We find similar patterns across all settings. t = 87.5 is used for all plots.

Figure 16: p values for differences between right and left neighbors, for no lexical chunk neighbors, across different choices of thresholds for our randomized explanation experiment. We find similar patterns across all settings. The χ² measure is used across all plots.

Table 3: Examples of Wikipedia sentences used in our study.

Table 5: Random saliency experiment results details: (effective) degrees of freedom, reference degrees of freedom, and Wald test statistics for the univariate smooth terms (top), random effects terms (middle), and parametric fixed terms (bottom), using t = 87.5% and the χ² measure.

Table 6: SHAP experiment results details: (effective) degrees of freedom, reference degrees of freedom, and Wald test statistics for the univariate smooth terms (top), random effects terms (middle), and parametric fixed terms (bottom), using t = 87.5% and the χ² measure.

Table 7: (Effective) degrees of freedom, reference degrees of freedom, and Wald test statistics for the univariate smooth terms (top), random effects terms (middle), and parametric fixed terms (bottom), using t = 25% and the χ² measure, for our randomized explanation experiment.

Table 8: (Effective) degrees of freedom, reference degrees of freedom, and Wald test statistics for the univariate smooth terms (top), random effects terms (middle), and parametric fixed terms (bottom), using t = 87.5% and the MI-like measure, for our randomized explanation experiment.