Connecting Attributions and QA Model Behavior on Realistic Counterfactuals

When a model attribution technique highlights a particular part of the input, a user might understand this highlight as making a statement about counterfactuals (Miller, 2019): if that part of the input were to change, the model’s prediction might change as well. This paper investigates how well different attribution techniques align with this assumption on realistic counterfactuals in the case of reading comprehension (RC). RC is a particularly challenging test case, as token-level attributions that have been extensively studied in other NLP tasks such as sentiment analysis are less suitable to represent the reasoning that RC models perform. We construct counterfactual sets for three different RC settings, and through heuristics that can connect attribution methods’ outputs to high-level model behavior, we can evaluate how useful different attribution methods and even different formats are for understanding counterfactuals. We find that pairwise attributions are better suited to RC than token-level attributions across these different RC settings, with our best performance coming from a modification that we propose to an existing pairwise attribution method.


Introduction
Recent research in interpretability of neural models (Lipton, 2018) has yielded numerous post-hoc explanation methods, including token attribution techniques (Ribeiro et al., 2016; Sundararajan et al., 2017; Guan et al., 2019; De Cao et al., 2020). Attributions are a flexible explanation format and can be applied to many domains, including sentiment analysis (Guan et al., 2019; De Cao et al., 2020), visual recognition (Simonyan et al., 2013), and natural language inference (Camburu et al., 2018; Thorne et al., 2019). However, it is hard to evaluate whether these explanations are faithful to the computation of the original model (Wu and Mooney, 2019; Hase and Bansal, 2020; Wiegreffe et al., 2021; Jacovi and Goldberg, 2020), and as a result, they can potentially mislead users (Rudin, 2019). Furthermore, attributions do not have a consistent and meaningful social attribution (Miller, 2019; Jacovi and Goldberg, 2021): that is, when a user of the system looks at an explanation, they do not necessarily draw a valid conclusion from it, making it hard to use for downstream tasks.

Figure 1: Our methodology. Given a base example, we can formulate a hypothesis about the model's behavior, like a theory about how the model is using certain tokens. Next, we collect counterfactual examples that modify these tokens and profile the actual model behavior. Finally, we assess whether feature attributions suggest behavior consistent with what we observe, verifying whether our attributions actually enable meaningful statements about behavior on counterfactuals.
How can we evaluate whether these attributions make faithful and meaningful statements about model behavior? In this work, we show how to use counterfactual examples to evaluate attributions' ability to reveal the high-level behavior of models. That is, rather than a vague statement like "this word was important," we want attributions to give concrete, testable conclusions like "the model compared these two words to reach its decision;" this statement can be evaluated for faithfulness and it helps a user make important inferences about how the system behaves. We approach this evaluation from a perspective of simulatability (Hase and Bansal, 2020): can we predict how the system will behave on new or modified examples? Doing so is particularly challenging for the RC models we focus on in this work due to the complex nature of the task, which fundamentally involves a correspondence between a question and a supporting text context. Figure 1 shows our methodology. Our approach requires annotating small sets of realistic counterfactuals, which are perturbations of original data points. These resemble several prior "stress tests" used to evaluate models, including counterfactual sets (Kaushik et al., 2020), contrast sets (Gardner et al., 2020), and checklists (Ribeiro et al., 2020). We first semi-automatically curate these sets to answer questions like: if different facts were shown in the context, how would the model behave? If different amounts of text or other incorrect paragraphs were retrieved by an upstream retrieval system, would the model still get the right answer?
We run the model on counterfactuals to assess the ground truth behavior. Then, given attributions from various techniques, can we predict how the model would behave based purely on these explanations? Our approach to do this is specific to each dataset and attribution method, but generally involves assessing how strongly the attribution method highlights tokens that are counterfactually altered, which would indicate that those tokens should impact the prediction if changed.
To showcase the kind of evaluation this method can enable, we investigate two paradigms of explanation techniques: token attribution-based (Simonyan et al., 2013;Ribeiro et al., 2016;De Cao et al., 2020) and feature interaction-based (Tsang et al., 2020;Hao et al., 2021), which attribute decisions to sets of tokens or pairwise token interactions. For both techniques, we devise methods to connect these explanations to our high-level hypotheses about behavior on counterfactual examples. On two types of questions from HOTPOTQA (Yang et al., 2018) and questions from adversarial SQUAD (Rajpurkar et al., 2016), we show that token-level attribution is not sufficient for analyzing RC models, which naturally involves more complex reasoning over multiple clues. We further propose a modification to an existing interaction technique from Hao et al. (2021) and show improved performance on our datasets.
Our main contributions are: (1) We propose a new goal for attributions, namely automatically simulating model behavior on realistic counterfactuals. (2) We describe a technique for connecting low-level attributions (token-level or higher-order) with high-level model hypotheses. (3) We improve an attention-based pairwise attribution technique with a simple but effective fix, leading to strong empirical results. (4) We analyze a set of QA tasks and show that our approach can derive meaningful conclusions about counterfactuals on each. Overall, we establish a methodology for analyzing explanations that we believe can be adapted to studying attribution methods on a wide range of other tasks with appropriate counterfactuals.

Motivation
We start with an example of how model attributions can be used to understand model behavior and consequently how to use our methodology to compare different attribution techniques. Figure 2 shows an example of a multi-hop yes/no question from HotpotQA. The QA model correctly answers yes in this case. Given the original example, the explanations produced using INTGRAD (Sundararajan et al., 2017) and DIFFMASK (De Cao et al., 2020) (explained in Section 4) both assign high attribution scores to the two documentary tokens appearing in the context: a user of the system is likely to impute that the model is comparing these two values, as it's natural to assume this model is using the highlighted information correctly. By contrast, the pairwise attribution approach we propose in this work (Section 4.3) attributes the prediction to interactions with the question, suggesting the interactions related to documentary do not matter.
We manually curate a set of contrastive examples to test this hypothesis. If the model truly recognizes that both movies are documentaries, then replacing either or both of the documentary tokens with romance should change the prediction. To verify that, we perturb the original example to obtain another three examples (left side of Figure 2). These four examples together form a local neighborhood (Ribeiro et al., 2016; Kaushik et al., 2020; Gardner et al., 2020).

Figure 2: A motivating example and attributions generated by three methods. We profile the model's behavior with its predictions on realistic counterfactual inputs, which suggest the model does not truly base its prediction on the two movies being documentaries. We can evaluate attributions by heuristically assessing whether the attribution mass yields the same conclusion about model behavior.
Unlike what is suggested by the token attribution-based techniques, the model always predicts "yes" for every example in the neighborhood, casting doubt on whether the model is following the right reasoning process. Although the pairwise attribution seemed at first glance much less plausible than those generated by the other techniques, it was actually better from the perspective of faithfully simulating the model's behavior on these examples. Our main assumption in this work can be stated as follows: an explanation should describe model behavior with respect to realistic counterfactuals. Past work has evaluated along plausibility criteria (Lei et al., 2016; Strout et al., 2019; Thorne et al., 2019), but as we see from this example, faithful explanations (Subramanian et al., 2020; Jacovi and Goldberg, 2020, 2021) are better aligned with our goal of simulatability. We argue that a good explanation is one that aligns with the model's high-level behavior, from which we can understand how the model generalizes to new data. How to interpret behavior from explanations is still an open question, but we take initial steps in this work with techniques based on assessing the attribution "mass" on perturbed tokens.
Discussion: Realistic Counterfactuals Many counterfactual modifications are possible: past work has looked at injecting non-meaningful triggers, deleting chunks of content (Ribeiro et al., 2016), or evaluating interpolated input points as in INTGRAD, all of which violate assumptions about the input distribution. In RC, masking part of the question often makes it nonsensical, and we may not have strong expectations about our model's behavior in this case. (The "true" set of realistic counterfactuals is highly domain-specific, but nevertheless, a good explanation technique should work well on a range of counterfactuals like those considered here.) Focusing on realistic counterfactuals, by contrast, illuminates fundamental problems with our RC models' reasoning capabilities (Jia and Liang, 2017; Chen and Durrett, 2019; Min et al., 2019; Jiang and Bansal, 2019). This is the same motivation as that behind contrast sets (Gardner et al., 2020), but our work focuses on benchmarking explanations, not models themselves.

Behavior on Counterfactuals
We seek to formalize the reasoning we undertook in Figure 2. Using the model's explanation on a base data point, can we predict the model's behavior on the perturbed instances of that point?
Definitions Given an original example D_0 (e.g., the top example in Figure 2), we construct a set of perturbations {D_1, ..., D_k} (e.g., the three counterfactual examples in Figure 2), which together with D_0 form a local neighborhood D. These perturbations are realistic inputs derived from existing datasets or constructed by us.
We formulate a hypothesis H about the neighborhood. In Figure 2, H is "the model is comparing the target properties" (documentary in this case). Based on the model's behavior on the set D, we can derive a high-level behavioral label z corresponding to the truth of H. We form our local neighborhood to check the answer empirically and compute a ground truth for z. Since the model always predicts "yes" in this neighborhood, we label the set D with z = 0 (the model is not comparing the properties). We label D with z = 1 when the model does predict "no" for some perturbations.
Procedure Our approach is as follows:
1. Formulate a hypothesis H about the model.
2. Collect realistic counterfactuals D to test H empirically for some base examples.
3. Use the explanation of each base example to predict z. That is, learn the mapping D_0 → z based on the explanation of D_0, so we can simulate the model on D without observing the perturbations.
Note that this third step only uses the explanation of the base data point: explanations should let us make conclusions about new counterfactuals without having to do inference on them.
Simulation from attributions In our experiments on HOTPOTQA and SQUAD, we compute a scalar factor f for each attribution representing the importance of a specific part of the inputs (e.g., the documentary tokens in Figure 2), which we believe should correlate with model predictions on the counterfactuals. If an attribution assigns higher importance to this information, it suggests that the model will actually change its behavior on these new examples.
Given this factor, we construct a simple classifier that predicts z = 1 if the factor f is above a threshold. We expect that factors extracted using better attribution methods should better indicate the model's behavior. Hence, we evaluate an explanation using the best simulation accuracy it can achieve and the AUC score (S-ACC and S-AUC). Our evaluation resembles the human evaluation in Hase and Bansal (2020), which asks human raters to predict a model's decision given an example together with its explanations, addressing simulatability from a user-focused angle. Our method differs in that (1) we automatically extract a factor to predict model behavior instead of asking humans to do so, and (2) we predict the behavior on unseen counterfactuals given the explanation of a single base data point.
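As a concrete sketch of this evaluation (function and variable names are ours, not from any released code), the best-threshold simulation accuracy and a rank-based AUC over a set of (factor, z) pairs might be computed as:

```python
def simulation_metrics(factors, labels):
    """S-ACC: accuracy of the best single-threshold classifier predicting
    z = 1 when the factor exceeds the threshold; S-AUC: probability that a
    z = 1 neighborhood receives a higher factor than a z = 0 one."""
    pairs = list(zip(factors, labels))
    n = len(pairs)
    # Sweep thresholds at each observed factor (plus one below the minimum).
    candidates = [min(factors) - 1.0] + sorted(factors)
    best_acc = max(
        sum((f > t) == bool(z) for f, z in pairs) / n for t in candidates
    )
    # Rank-based AUC via the Mann-Whitney U statistic (ties count half).
    pos = [f for f, z in pairs if z == 1]
    neg = [f for f, z in pairs if z == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return best_acc, wins / (len(pos) * len(neg))
```

With a perfectly separating factor, both scores reach 1.0; a factor carrying no information about z yields an AUC near 0.5.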

Explanation Techniques
Compared to classification tasks like sentiment analysis, RC more fundamentally involves interaction between input features, especially between a question and a context. This work directly compares feature interaction explanations with the token attribution techniques that are more common for other tasks. For RC, each instance is a tuple D = (q, c, a) containing a question, context, and answer respectively. In the techniques we consider, q and c are concatenated and fed into a pre-trained transformer model, so our attribution techniques explain predictions using both of these.

Token Attribution-Based
These techniques all return a score s_i for each token i in both the question and context. LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017) both compute attribution values for individual input features by using a linear model to locally approximate the model's predictions on a set of perturbed instances around the base data point. The attribution value for an individual input feature is the corresponding weight of the linear model. LIME and SHAP differ in how they specify the instance weights used to train the linear model: LIME computes the weights heuristically, whereas SHAP uses a procedure based on Shapley values.
Integrated Gradients (INTGRAD) (Sundararajan et al., 2017) computes an attribution for each token by integrating the gradients of the prediction with respect to the token embeddings along the path from a baseline input (typically mask or pad tokens) to the actual input. Although it is a common technique, recent work has raised concerns about the effectiveness of INTGRAD for NLP tasks, as interpolated word embeddings do not correspond to real input values (Harbecke and Alt, 2020; Sanyal and Ren, 2021).
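A minimal numerical sketch of the INTGRAD computation (a toy differentiable function with an analytic gradient stands in for the QA model; all names are illustrative): the path integral is approximated with a midpoint Riemann sum, and the completeness axiom (attributions summing to f(input) − f(baseline)) serves as a sanity check.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Approximate IG_i(x) = (x_i - b_i) * integral over alpha in [0,1] of
    the gradient of f at b + alpha*(x - b), via a midpoint Riemann sum."""
    diff = x - baseline
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_f(baseline + a * diff) for a in alphas], axis=0)
    return diff * grads

# Toy "model": f(x) = (w.x)^2 over a 3-dim "embedding"; gradient 2(w.x)w.
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float(np.dot(w, x)) ** 2
grad_f = lambda x: 2.0 * np.dot(w, x) * w

x, b = np.array([1.0, 0.5, 2.0]), np.zeros(3)
attr = integrated_gradients(grad_f, x, b)
# Completeness: attributions sum to f(x) - f(baseline).
assert abs(attr.sum() - (f(x) - f(b))) < 1e-6
```

In the real setting, x and b are token-embedding matrices (with pad/mask embeddings as the baseline) and the gradient comes from backpropagation; summing IG over the embedding dimension yields the per-token score s_i.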
Differentiable Mask (DIFFMASK) (De Cao et al., 2020) learns to mask out a subset of the input tokens for a given example while keeping the distribution over answers as close to the original distribution as possible. The mask is learned in a differentiable fashion, with a shallow neural model (a linear layer) trained to recognize which tokens to discard.

Feature Interaction-Based
These techniques all return a score s_ij for each pair of tokens (i, j) in the question and context that are fed into the QA system.
Archipelago (Tsang et al., 2020) measures nonadditive feature interaction. Similar to DIFFMASK, ARCHIP is also implicitly based on unrealistic counterfactuals which remove tokens. Given a subset of tokens, ARCHIP defines the contribution of the interaction by the prediction obtained from masking out all the other tokens, only leaving a very small fraction of the input. Applying this definition to a complex task like QA can result in a nonsensical input.
Attention Attribution (ATATTR) (Hao et al., 2021) uses attention specifically to derive pairwise explanations. However, it avoids the pitfalls of directly inspecting attention (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019) by running an integrated gradients-like procedure over all the attention links within the transformer, yielding an attribution score for each link. These scores directly reflect the contribution of particular attention links, making the method able to describe pairwise interactions.
Concretely, define the attention matrices over an input D with n tokens as A = [A_1, ..., A_l], where A_i ∈ R^{h×n×n} holds the h-head attention scores for layer i. The attribution score for each entry of the attention matrix A is

ATTR(A) = A ⊙ ∫_0^1 ∂F(D, αA)/∂A dα,

where F(D, αA) is the transformer model that takes as input the tokens and a matrix specifying the attention scores for each layer. We then sum the attention attributions across all heads and layers to obtain the pairwise interaction between tokens (i, j): s_ij = Σ_m Σ_n ATTR(A)_mnij.
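The following sketch makes the computation concrete at toy scale (a scalar function of the attention tensor with an analytic gradient replaces the transformer F and autograd; all sizes and names are illustrative): the integral becomes a midpoint Riemann sum, and pooling over layers and heads yields the pairwise scores s_ij.

```python
import numpy as np

rng = np.random.default_rng(0)
L_layers, H, N = 2, 2, 4                 # layers, heads, tokens (toy sizes)
A = rng.random((L_layers, H, N, N))      # stand-in attention scores
W = rng.random((L_layers, H, N, N))      # toy "model": F(A) = (sum(W * A))^2

def grad_F(A_scaled):
    """Analytic dF/dA for the toy model; a transformer would use autograd."""
    return 2.0 * np.sum(W * A_scaled) * W

steps = 50
alphas = (np.arange(steps) + 0.5) / steps
# ATTR(A) = A * integral of dF(D, alpha*A)/dA, via a midpoint Riemann sum.
attr = A * np.mean([grad_F(a * A) for a in alphas], axis=0)
# Pool across layers (axis 0) and heads (axis 1) to get s_ij.
s = attr.sum(axis=(0, 1))
```

For this particular toy F, the attribution also satisfies a completeness-style identity: the entries of ATTR(A) sum to F(A) − F(0).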

Layer-wise Attention Attribution
We propose a new technique, LATATTR, to improve upon ATATTR in the RC setting. The ATATTR approach simultaneously scales all attention scores when computing the attribution, which can be problematic: since the attention scores of higher layers are determined by the attention scores of lower layers, forcibly setting all the attention scores and computing gradients at the same time may distort the gradients for the lower-level links and produce inaccurate attributions. When applying the INTGRAD approach in other contexts, we typically assume the independence of input features (e.g., pixels of an image or tokens of an utterance), an assumption that does not hold here.

Figure 3: Steps of our Layer-wise Attention Attribution approach, where we modify a single layer's attention at a time. For example, to compute the attribution of attentions at layer 2, we only intervene on the attention matrix at that layer, and leave the other attentions computed as usual.
To address this issue, we propose a simple fix: apply the INTGRAD method layer by layer. As shown in Figure 3, to compute the attribution for the attention links of layer i, we only change the attention scores at layer i:

ATTR(A_i) = A_i ⊙ ∫_0^1 ∂F_/i(D, αA_i)/∂A_i dα,

where F_/i(D, αA_i) denotes that we only intervene on the attention masks at layer i while leaving the other attention masks to be computed naturally by the model. As before, we pool to obtain the final attribution for the pairwise interaction: s_ij = Σ_m Σ_n ATTR(A)_mnij.
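To illustrate the difference from ATATTR, here is a deliberately tiny sketch (scalar "attentions" and a hand-built two-layer chain stand in for a transformer; the gradient is estimated by finite differences, where a real implementation would intervene on attention matrices and use autograd):

```python
def forward(a1_override=None, a2_override=None):
    """Toy 2-layer chain: a1 comes from the input, a2 is computed from a1,
    and the prediction reads out a2. Overriding only a1 lets a2 respond
    naturally, mirroring F_/1; overriding only a2 mirrors F_/2."""
    a1 = 0.5 if a1_override is None else a1_override
    a2 = a1 ** 2 if a2_override is None else a2_override
    return 3.0 * a2

def layerwise_attr(layer, steps=200, eps=1e-5):
    """IG restricted to one layer: scale that layer's natural attention by
    alpha, recompute downstream layers, and average finite-difference
    gradients along the path (midpoint Riemann sum)."""
    natural = {1: 0.5, 2: 0.5 ** 2}[layer]  # the layer's unintervened value
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        at = {f"a{layer}_override": alpha * natural}
        at_eps = {f"a{layer}_override": alpha * natural + eps}
        total += (forward(**at_eps) - forward(**at)) / eps
    return natural * total / steps
```

Both layers here recover (approximately) the full toy prediction of 0.75; intervening on one layer at a time avoids the lower-layer gradient distortion described above.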

Experiments
We assess whether attributions can achieve our proposed goal following the setup in Section 3 on the HOTPOTQA dataset (Yang et al., 2018), and the SQUAD dataset (Rajpurkar et al., 2016), specifically leveraging examples from adversarial SQUAD (Jia and Liang, 2017).

Implementation Details
For experiments on HOTPOTQA, we base our analysis on a ROBERTA (Liu et al., 2019) QA model in the distractor setting. We implement our model using the Huggingface library (Wolf et al., 2020) and train the model for 4 epochs with a learning rate of 3e-5, a batch size of 32, and a warm-up ratio of 6%. Our model achieves 77.2 F1 on the development set in the distractor setting, comparable to other strong ROBERTA-based models (Tu et al., 2020; Groeneveld et al., 2020).
In the SQUAD-ADV setting, we also use a ROBERTA QA model which achieves 92.2 F1 on the SQUAD dev set and 68.0 F1 on SQUAD-ADV. Our model is trained on SQUAD v1.0 for 4 epochs using a learning rate of 1.5e-5, a batch size of 32 and a warm-up ratio of 6%.

Hotpot Yes-No Questions
We first study a subset of yes/no comparison questions, which are challenging despite the binary answer space (Clark et al., 2019). Typically, a yes-no comparison question requires comparing the properties of two entities (Figure 2).

Hypothesis & Counterfactuals
The hypothesis H we investigate is as in Section 2: the model compares the entities' properties as indicated by the question. Most Hotpot Yes-No questions follow one of two templates: Are A and B both __? (Figure 4a), and Are A and B of the same __? (Figure 4b). We define the property tokens associated with each question as the tokens in the context that match the blank in the template; that is, the values of the property that A and B are being compared on. For example, in Figure 4a, French and German are the property tokens, as the property of interest is the national origin.
To construct a neighborhood for a base data point, we take the following steps: (1) manually extract the property tokens in the context; (2) replace each property token with a substitute, forming a set of four counterfactuals exhibiting non-identical ground truths. When the properties associated with the two entities differ from each other, we directly use the extracted properties as the substitutes (Figure 4a); otherwise we add a new property candidate of the same class (Figure 4b).
We set z = 0 (the hypothesis does not hold) if for each perturbed example D_i ∈ D, the model predicts the same answer as for the original example, indicating a failure to compare the properties. We set z = 1 if the model's prediction does change. We choose a binary scheme to label the model behavior because we observed that, on the small perturbation sets, the model's performance was bimodal: either the tokens mattered for the prediction (reflected by the model changing its prediction at least once) or they did not. The authors annotated perturbations for 50 randomly selected (D, z) pairs in total, forming a total of 200 counterfactual instances. Full counterfactual sets are available with our data and code release.
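The construction and labeling steps above can be sketched as follows (the model is abstracted as a black-box predict function; all names are ours):

```python
from itertools import product

def build_neighborhood(tokens, prop_positions, substitutes):
    """Enumerate every assignment of substitute values to the property-token
    positions. Two positions and two candidate values yield the four-example
    neighborhood of Figure 2 (the base example is the identity assignment)."""
    out = []
    for combo in product(substitutes, repeat=len(prop_positions)):
        t = list(tokens)
        for pos, val in zip(prop_positions, combo):
            t[pos] = val
        out.append(t)
    return out

def label_z(predict, base, perturbed):
    """z = 1 if the model's answer changes on any perturbation, else z = 0."""
    base_answer = predict(base)
    return int(any(predict(d) != base_answer for d in perturbed))
```

For the Figure 2 example, a model answering "yes" on all four members of the neighborhood receives z = 0: a faithful attribution should then place little weight on the property tokens.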

Connecting Explanation and Hypothesis
To make a judgment about z, we extract a factor f based on the importance of the set of property tokens P. For token attribution-based methods, we define f as the sum of the attributions of the tokens in P: f = Σ_{i∈P} s_i. For feature interaction-based methods producing pairwise attributions s_ij, we compute f by pooling the scores of all interactions involving the property tokens: f = Σ_{i∈P ∨ j∈P} s_ij. We then predict z = 1 if the factor f is above a threshold, and evaluate how well the factor indicates the model's high-level behavior using the best simulation accuracy it can achieve (S-ACC) and the AUC score (S-AUC); note that the threshold is set separately for each attribution method to achieve its best accuracy.

Results First, we show that using attributions can indeed help predict the model's behavior. In Table 1, our approach (LATATTR) is the best, achieving a simulation accuracy of 84%. That is, with a properly set threshold, we can successfully predict whether the model's predictions change when perturbing the properties in the original example 84% of the time. The attributions therefore give us the ability to simulate our model's behavior better than the other methods here. Our approach also improves substantially over the vanilla ATATTR method.
The best token-level attribution based approaches obtain an accuracy of 72%, significantly lagging the best pairwise technique. This indicates token attribution based methods are less effective in the HOTPOTQA Yes-No setting; we hypothesize that this is due to the importance of token interaction in this RC setting.
In this setting, DIFFMASK performs poorly, typically because it assigns high attribution to many tokens, since it determines which tokens need to be kept rather than distinguishing fine-grained importance (see the appendix for examples). It's possible that other heuristics or models learned on large numbers of perturbations could more meaningfully extract predictions from this technique.

Hotpot Bridge Questions
We also evaluate the explanation approaches on so-called bridge questions from the HOTPOTQA dataset, described in Yang et al. (2018). Figure 5 shows an example explanation of a bridge question. From the attribution scores, we find the most salient connection is between the span "what government position" in the question and the span "United States Ambassador" in the context. This attribution directly highlights a reasoning shortcut (Jia and Liang, 2017). If we inject an additional sentence "Hillary Clinton is an American politician, who served as the United States secretary of the state from 2009 to 2013" into the context, the model will be misled and predict "United States secretary" as the new answer. This sentence could easily have been part of another document retrieved in the retrieval stage, so we consider its inclusion to be a realistic counterfactual. We further define the primary question, i.e., the span of the question containing wh-words, with heavy modifier phrases and embedded clauses dropped, following the decomposition principle from Min et al. (2019). In Figure 5, the primary question is "What government position is held by the woman."

Hypothesis & Counterfactuals
The hypothesis H we investigate is: the model is using correct reasoning and not a shortcut driven by the primary question.
We construct counterfactuals following the same idea applied in our example. We view bridge questions as consisting of two single hop questions, the primary part and the secondary part. For a given question, we add an adversarial sentence based on the primary question so as to alter the model prediction. The added adversarial sentence contains context leading to a spurious answer to only the primary question, but does not change the gold answer.
We do this twice, yielding a set D = {D_0, D_1, D_2} consisting of the base example and two perturbations. We define the label of D to be z = 0 if the model's prediction changes under the perturbations, and z = 1 otherwise. We show one example in Figure 4c. More examples and the full counterfactual set can be found in the appendix.
We randomly sample 50 base data points from the development set and two authors each write an adversarial sentence, giving 150 data points total.

Connecting Explanation and Hypothesis
For this setting, we use a factor describing the importance of the primary question normalized by the importance of the entire question. Namely, let P = {p_i} be the set of tokens in the primary question, and Q = {q_i} be the set of tokens in the entire question. We define the factor f as the importance of P normalized by the importance of Q, where the importance calculation is the same as in Section 5.1. A higher factor means the model relies more heavily on the primary question alone, and hence the prediction has a better chance of being successfully attacked.
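A sketch of this factor computation (importance pooling as in Section 5.1; names are illustrative):

```python
def token_importance(scores, tokens):
    """Token attributions: importance of a token set is the sum of its s_i."""
    return sum(scores[i] for i in tokens)

def pair_importance(pair_scores, tokens, n):
    """Pairwise attributions: pool every interaction s_ij touching the set."""
    return sum(pair_scores[i][j]
               for i in range(n) for j in range(n)
               if i in tokens or j in tokens)

def primary_question_factor(scores, primary, question):
    """f = importance(primary part) / importance(whole question); a higher
    value suggests the model leans on the primary question alone."""
    return token_importance(scores, primary) / token_importance(scores, question)
```

For pairwise attributions, the same normalization applies with pair_importance in place of token_importance.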
Results According to the simulation accuracy scores in Table 1, token-level attributions are somewhat more successful at indicating model behavior in this setting than in the yes/no setting. Our approach, the best feature interaction-based technique, achieves a simulation accuracy of 78%, slightly outperforming the best token attribution approach.

SQuAD Adversarial
Hypothesis & Counterfactuals Our hypothesis H is: the model can resist adversarial attacks of the ADDSENT variety (Jia and Liang, 2017). For each original example D_0 from a portion of the SQUAD-ADV development set, Jia and Liang (2017) create 5 adversarial attacks, which are paraphrased and filtered by Turkers to give 0 to 5 valid attacks per example, yielding our set D. We define the label of D to be z = 1 if the model resists all the adversarial attacks posed on D_0 (i.e., the predictions for D are all the same). To ensure the behavior is precisely profiled by the counterfactuals, we only keep base examples with more than 3 valid attacks, resulting in a total of 276 (D, z) pairs (1,506 data points).

Connecting Explanation and Hypothesis
We use a factor f indicating the importance of the essential keywords extracted from the question using POS tags (proper nouns and numbers). E.g., for the question "What Florida stadium was considered for Super Bowl 50", we extract "Florida", "Super Bowl", and "50". If the model considers all the essential keywords mentioned in the question, it should not be fooled by distractors containing irrelevant information. We show a set of illustrative examples in the appendix. We compute the importance scores in the same way as described in Section 5.1. In addition to the scores provided by the various explanation techniques, we also use the model's confidence in its original prediction as a baseline.
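A sketch of the keyword extraction (it assumes Penn Treebank tags from any off-the-shelf POS tagger; merging adjacent proper nouns into one keyword, as in "Super Bowl", is our illustrative choice):

```python
def essential_keywords(tagged_tokens):
    """Keep proper nouns (NNP/NNPS) and numbers (CD) from a list of
    (token, tag) pairs, merging runs of adjacent proper nouns."""
    keywords, run = [], []
    for tok, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            run.append(tok)
            continue
        if run:                      # a proper-noun run just ended
            keywords.append(" ".join(run))
            run = []
        if tag == "CD":              # cardinal numbers are kept as-is
            keywords.append(tok)
    if run:
        keywords.append(" ".join(run))
    return keywords
```

On the tagged question "What Florida stadium was considered for Super Bowl 50", this yields "Florida", "Super Bowl", and "50", matching the example above.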
Results We show results in Table 2. The best approaches (ATATTR and LATATTR) achieve a simulation accuracy of around 70%, 10 points above the baseline based on raw model confidence. This shows the model is indeed overconfident in its predictions; our hypothesis about robustness, together with our technique, can successfully expose the vulnerability of some of the model's predictions.
There is room to improve on these results; our simple heuristic cannot perfectly connect the explanations to the model behavior in all cases. We note that there are other orthogonal approaches (Kamath et al., 2020) to calibrate the confidence of QA models' predictions by looking at statistics of the adversarial examples. Because our goal is to assess attributions rather than optimize for calibration, our judgment is made purely based on the original example, and does not exploit learning to refine our heuristic.

Discussion and Limitations
We show that feature attributions can reveal known dataset biases and reasoning shortcuts in HotpotQA without having to perform a detailed manual analysis. This confirms the suitability of our attribution methods for at least this use case: model designers can look at them in a semi-automated way and determine how robust the model is going to be when faced with counterfactuals.
Our analysis also highlights the limitations of current explanation techniques. We experimented with other counterfactuals that permute the order of the paragraphs in the context, which often gave rise to different predictions. We believe the model's predictions were in these cases impacted by biases in positional embeddings (e.g., the answer tends to occur in the first retrieved paragraph), which cannot be indicated by current attribution methods. We believe this is a useful avenue for future investigation: by first thinking about what kinds of counterfactuals and what kinds of behaviors we want to explain, we can motivate the development of new explanation techniques to serve these needs.

Related Work
We focus on several prominent token attribution techniques, but there are other related methods as well, including other methods based on Shapley values (Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017), contextual decomposition (Jin et al., 2020), and hierarchical explanations (Chen et al., 2020). These formats can also be evaluated using our framework if connected to model behavior with an appropriate heuristic. Other work explores so-called concept-based explanations (Mu and Andreas, 2020; Bau et al., 2017; Yeh et al., 2019). These provide another pathway towards building explanations of high-level behavior; however, they have been explored primarily for image recognition tasks and cannot be directly applied to QA, where defining these sorts of concepts is challenging. Finally, textual explanations (Hendricks et al., 2016) are another popular alternative, but they are difficult to evaluate in our framework, as it is very difficult to bridge from a free-text explanation to an approximation of a model's computation.
Probing techniques aim to discover what intermediate representations have been learned in neural models (Tenney et al., 2019;Conneau et al., 2018;Hewitt and Liang, 2019;Voita and Titov, 2020). Internal representations could potentially be used to predict behavior on contrast sets similar to this work; however, this cannot be done heuristically and larger datasets are needed to explore this.
Other work on evaluating explanations is primarily based on how well explanations help humans predict model decisions on a given example (Doshi-Velez and Kim, 2017; Chandrasekaran et al., 2018; Nguyen, 2018; Hase and Bansal, 2020). We are the first to consider building contrast sets for this purpose. Similar ideas have been used in other contexts (Kaushik et al., 2020; Gardner et al., 2020), but our work focuses on the evaluation of explanations rather than general model evaluation.

Conclusion
We have presented a new methodology that uses explanations to understand model behavior on realistic counterfactuals. We show that explanations can indeed be connected to model behavior, and that we can therefore compare explanations to understand which ones truly give actionable insights into what our models are doing.
We have showcased how to apply our methodology to several RC tasks, leveraging either semi-automatically curated counterfactual sets or existing resources. We generally find that pairwise interaction methods perform better than the best token-level attribution methods in our analysis. More broadly, we see our methodology as a useful evaluation paradigm that could be extended across a range of tasks, leveraging either existing contrast sets or a small amount of effort devoted to creating targeted counterfactual sets, as in this work.

A Details of Hotpot Yes-No Counterfactuals
Figure 6 shows several more examples illustrating our process of generating counterfactuals for the Hotpot Yes-No setting. As stated in Section 5.1, most Hotpot Yes-No questions follow one of two templates: Are A and B both __? (Figure 6, abc) and Are A and B of the same __? (Figure 6, def). The property tokens that match the blank in the template are highlighted in Figure 6.
Recall our two steps to construct a neighborhood for a base data point: (1) manually extract the property tokens in the context; (2) replace each property token with a substitute, forming a set of four counterfactuals exhibiting non-identical ground truths. When the properties associated with the two entities differ from each other, we directly use the extracted properties as the substitutes (Figure 6, abf); otherwise, we add a new property candidate of the same class (Figure 6, cde). We annotated randomly sampled examples from the Hotpot Yes-No questions, skipping several that compared abstract concepts with no explicit property tokens. For instance, we skipped the question Are both Yangzhou and Jiangyan District considered coastal cities?, whose given context does not explicitly state whether the cities are coastal cities. We looked through 61 examples in total and obtained annotations for 50, so such discarded examples constitute a relatively small fraction of the dataset. Overall, this resulted in 200 counterfactual instances. The prediction of a ROBERTA QA model changed on 52% of the base data points when perturbed.
Figure 7 shows more examples of our annotations for generating counterfactuals for Hotpot Bridge examples. We first decompose the bridge questions into two single-hop questions (Min et al., 2019): the primary part (marked in Figure 7) and the secondary part. The primary part is the main body of the question, whereas the secondary part is usually a clause used to link the bridge entity (Min et al., 2019).
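The two-step Yes-No neighborhood construction can be sketched in code. This is a simplified illustration (not the paper's annotation tooling): it assumes the context is split into one sentence per entity, and the function and argument names are ours, not from the paper.

```python
def build_yes_no_neighborhood(sent_a, sent_b, prop_a, prop_b, extra):
    """Step 2 of the Yes-No neighborhood construction (a hypothetical sketch).

    sent_a / sent_b: context sentences describing entities A and B.
    prop_a / prop_b: the manually extracted property tokens (step 1).
    extra: a hand-selected same-class property, used as the substitute
           when the two entities already share the same property.
    Returns four (context, gold) counterfactuals with non-identical golds:
    the gold answer is "yes" iff the two entities end up with matching
    properties.
    """
    # substitute for each entity's property: the other entity's property,
    # or the hand-picked same-class candidate when the two already match
    sub_a = prop_b if prop_a != prop_b else extra
    sub_b = prop_a if prop_a != prop_b else extra
    neighborhood = []
    for a in (prop_a, sub_a):
        for b in (prop_b, sub_b):
            context = sent_a.replace(prop_a, a, 1) + " " + sent_b.replace(prop_b, b, 1)
            neighborhood.append((context, "yes" if a == b else "no"))
    return neighborhood
```

For example, swapping "American" and "French" between two entity sentences yields two "yes" and two "no" counterfactuals, matching the non-identical ground truths described above.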

B Details of Hotpot Bridge Counterfactuals
Next, we write adversarial sentences to confuse the model, following a method similar to the one used to generate SQUAD adversarial examples (Jia and Liang, 2017). Specifically, we look only at the primary part and write a sentence that answers the primary question with an entity different from the one in the secondary question. This introduces a spurious answer but does not change the gold answer. We also write the sentences in the same Wikipedia style as the original context where possible; some of the sentences are modified from Wikipedia text (e.g., Figure 7).

Figure 6: Examples (contexts are truncated for brevity) of our annotations on Hotpot Yes-No base data points. We find the property tokens in the context and build realistic counterfactuals by replacing them with substitutes that are properties extracted in the base data point or similar properties hand-selected by us.

(a) Question: What is the name of the fight song of the university whose main campus is in Lawrence, Kansas and whose branch campuses are in the Kansas City metropolitan area?
Context: Scott Parkin (born 1969, Garland, Texas) is an anti-war, environmental and global justice organizer, former community college history instructor, and a founding member of the Houston Global Awareness Collective. He has been a vocal critic of the American invasion of Iraq, and of corporations such as Exxonmobil and Halliburton. The Halliburton Company is an American multinational corporation. One of the world's largest oil field service companies, it has operations in more than 70 countries.
Adv Sent 1: Visa is a corporation that has operations in more than 200 countries.
Adv Sent 2: The Ford Motor Company is an American multinational corporation with operations in more than 100 countries.

(e) Question: In 1991 Euromarché was bought by a chain that operated how many hypermarkets at the end of 2016?
Context: Carrefour S.A. is a French multinational retailer headquartered in Boulogne-Billancourt, France, in the Hauts-de-Seine Department near Paris. It is one of the largest hypermarket chains in the world (with 1,462 hypermarkets at the end of 2016). Euromarché was a French hypermarket chain. In June 1991, the group was rebought by its rival, Carrefour, for 5.2 billion francs.
Adv Sent 1: Walmart Inc is a multinational retail corporation that operates a chain of hypermarkets that owns 4,700 hypermarkets within the United States at the end of 2016.
Adv Sent 2: Trader Joe's is an American chain of grocery stores headquartered in Monrovia, California. By the end of 2016, Trader Joe's had over 503 stores nationwide in 42 states.

Figure 7: Examples (contexts are truncated for brevity) of primary questions and adversarial sentences for creating Hotpot Bridge counterfactuals.

(a) Explanations generated by our approach (left) and DIFFMASK (right) for a bridge example from the HOTPOTQA dataset. The model resists the adversarial sentences posed for this example. Our explanation highlights certain tokens in the latter part of the question ("American animated television" and "Teen Titans"), suggesting the model's prediction is less likely to be flipped by adversarial attacks targeted at this example, which aligns with the observed model behavior.
(b) Explanations generated by our approach (left) and DIFFMASK (right) for a comparison example from the HOTPOTQA dataset. When we perturb the nationalities in the context, the model changes its prediction. Both explanations suggest the model makes its decision by looking at the nationalities associated with the two entities, which is congruent with the model behavior.
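The adversarial probing used for the Bridge counterfactuals can be sketched as follows; this is a minimal illustration under our own naming, with `predict` standing in for a real QA model's (question, context) -> answer inference call. Because the adversarial sentence introduces only a spurious answer, a flipped prediction indicates the model was distracted by it.

```python
def probe_with_adversarial_sentences(question, context, adv_sents, predict):
    """Append each hand-written adversarial sentence to the context and
    record whether the model's prediction flips (a hypothetical sketch).

    `predict` is a placeholder for any QA inference function; the gold
    answer is unchanged by construction, so a flip marks a distracted model.
    """
    base = predict(question, context)
    results = []
    for adv in adv_sents:
        perturbed = context + " " + adv  # insert the distractor sentence
        results.append((adv, predict(question, perturbed) != base))
    return base, results
```

For instance, with the Euromarché example above, a model distracted by the Walmart distractor sentence would flip from "1,462" to "4,700", and this probe would record that flip.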