Contrastive Explanations for Model Interpretability

Contrastive explanations clarify why an event occurred in contrast to another. Humans find them inherently intuitive both to produce and to comprehend. We propose a method to produce contrastive explanations in the latent space, via a projection of the input representation, such that only the features that differentiate two potential decisions are captured. Our modification restricts model behavior to contrastive reasoning only, and uncovers which aspects of the input are useful for and against particular decisions. Our contrastive explanations can additionally answer the question: for which label, and against which alternative label, is a given input feature useful? We produce contrastive explanations via both high-level abstract concept attribution and low-level input token/span attribution for two NLP classification benchmarks. Our findings demonstrate the ability of label-contrastive explanations to provide fine-grained interpretability of model decisions.


Introduction
Explanations in machine learning attempt to uncover the causal factors leading to a model's decisions. Methods for producing model explanations often seek all causal factors at once, which makes them difficult to comprehend, or organize the factors via heuristics, such as gradient saliency (Simonyan et al., 2013; Li et al., 2016). However, it remains unclear what makes a particular collection of causal factors a good explanation.
Studies in social science establish that human explanations, conversely, are typically "contrastive" (Miller, 2019): they rely on the causal factors that explain why an event occurred instead of an alternative event (Lipton, 1990).
Our work seeks to design explanations that are explicitly contrastive, thereby revealing fine-grained aspects of model decisions, while being more representative of human comprehension (§2). We introduce a novel framework for deriving contrastive explanations applicable to any neural classifier (§3). Our method operates on the input representation space, and produces a latent, contrastive representation. We accomplish this by projecting the latent representations of the input to the space that minimally separates two decisions in the model. We additionally propose a measure of contrastiveness (§3.4) by computing changes to model behavior before and after the projection.
Our experiments consider two well-studied NLP classification benchmarks: MultiNLI (Williams et al., 2018; §4) and BIOS (De-Arteaga et al., 2019; §5). In each, we study explanations in the form of high-level concept features or low-level textual highlights in the input. Our contrastive explanations uncover input features useful for and against particular decisions, and can identify the alternative label against which a decision was made; this has potential implications for model debugging. Overall, we find that a contrastive perspective aids fine-grained understanding of model behavior.

Contrastive Explanations
Explanations can be considered as answers to the question: "why P?" where P, the fact, is the event to be explained. Consider the question (Q1; Fig. 2): Why was it decided to hire [Person X]? The explanation behind decision P (the hiring decision) comprises the causal chain of events that led to P; a 'reasonable' answer to the question may cite that Person X has relevant degrees or professional experience. However, explaining the complete causal chain (A1; Fig. 2) is both burdensome to the explainer and cognitively demanding to the explainee (Hilton and Slugoski, 1986; Hesslow, 1988). For instance, the hiring event above is caused by Person X's application for the role, yet a reasonable explainer will likely omit this factor from the explanation for simplicity, thus reducing the cognitive load. But which factors should be omitted, and which should not?
The theory of contrastive explanations provides a solution common in human explanations, which inherently answer the question: "why P, rather than Q?" (Hilton, 1988), where Q (the foil) is some alternative event. Therefore, the explanation to the hiring decision might answer (Q2; Fig. 2): Why was it decided to hire [Person X], rather than not hiring them? Since the decision not to hire the candidate can also be traced back to the fact that they applied to the role, the explanation can be simplified by omitting this factor (A2; Fig. 2); this illustrates contrastive causal attribution (§3.3) in the explanation process. Similarly, given a different foil (Q3; Fig. 2): Why was it decided to hire [Person X], rather than hiring [Person Y]? the explainer might find it unnecessary to include attributes (such as a "relevant college degree") common to both Person X and Person Y (A3; Fig. 2). For the rest of the paper, we will refer to explanations which are not explicitly contrastive, such as A1 in Fig. 2, as non-contrastive explanations.
Implications for model explanations: Model explanations can benefit from explicit contrastiveness in two ways. First, model decisions are complex and noisy statistical processes, so 'complete' explanations are difficult (Jacovi and Goldberg, 2020b). Contrastive explanations make model decisions easier to explain by omitting many factors, given the relevant foils. This reduces the burden on the explaining algorithm to interpret and communicate the 'complete' reasoning. Second, humans tend to inherently, and often implicitly, comprehend (model) explanations contrastively. However, the foil perceived by the explainee may not match the one truly entailed by the explanation (e.g., Kumar et al., 2020); contrastive explanations make the foil explicit, and are therefore easier to understand.

Contrastive Explanations Framework
We present our framework for producing contrastive explanations. We describe the preliminaries for our framework ( §3.1), and outline an interventionist approach for causal attribution ( §3.2). Next, we introduce our projection-based method to produce latent contrastive representations ( §3.3) and a measure of the resulting behavioral change ( §3.4).

Preliminaries
Candidate Factor Space: All causal factors that could possibly lead to the model's decision. These could include discrete features in the input (textual highlights; Lei et al., 2016; see Fig. 1), abstract input features (concepts; Kim et al., 2017; see Table 4 for an example of gender as a concept), or influential examples in the training set for a prediction (Koh and Liang, 2017; Han et al., 2020). [3] Once defined, a subset of factors from the candidate factor space can then be causally attributed to the model decision. Our framework is agnostic to the type of factor in the candidate factor space, so long as it is possible to intervene on their presence (§3.2). Our work focuses on textual highlights and concepts.
Event Space: The union (as a discrete set) of all possible model decision classes. [4] This includes the event we attempt to explain (the fact), i.e. a trained model's decision [5], as well as all other classes (the foils), considered independently.

Causal Attribution via Interventions
Given a candidate factor space, we seek to attribute factors with causality over the decision process, i.e. select factors which caused the model's decision. We adopt an interventionist approach (Woodward, 2003): this involves determining the causality of a factor by intervening on it, thereby producing a counterfactual. The importance of the intervened factor is determined by the change in model behavior under this counterfactual.
Most interventions we consider are amnesic, i.e. use counterfactuals which omit the candidate factors under consideration. [6] Factors in the form of highlights and concepts involve different kinds of interventions. For the former, we simply replace each highlighted token with a 'mask' token (§4.2, §5.2), and train models where such masked data is in distribution (Zintgraf et al., 2017; Kim et al., 2020), i.e. pre-trained masked language models, such as RoBERTa (Liu et al., 2019). For conceptual interventions, we employ an amnesic operation to remove a concept from the input representation (§4.1, §5.1). Following Elazar et al. (2021), this amnesic operation uses a null-space projection to iteratively remove all linear directions correlated with the concept, until it is not possible to linearly discover the concept information from the latent vector (Ravfogel et al., 2020). Training an amnesic probe requires labels indicating presence of the concept for each example (App. A.1). Where possible, we also employ a conceptual intervention via manually-annotated counterfactuals (Kaushik et al., 2020; Gardner et al., 2020) which involve textual modifications to existing data instances (§4.1; Hyp-Negation).
[3] Such spaces can also be a hybrid or partial combination of the above homogeneous spaces we consider in this work.
[4] We do not consider explanations for other aspects of model behavior, such as why a particular class was assigned a particular probability, or why a particular neuron received a particular activation. Additionally, for the foil we only consider a single other class (aside from the fact), rather than some subset of classes.
[5] The fact is not strictly required to be the model prediction; it could alternatively be the model probability, for instance.
[6] An alternative is to replace the candidate factor with other causal factors. We leave extensive comparisons of amnesic and non-amnesic interventions to future work.
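A single step of the null-space projection described above can be sketched as follows. This is an illustrative sketch only, not the implementation of Elazar et al. (2021) or Ravfogel et al. (2020): it assumes the concept direction `w` is already given (in the iterative procedure it would be the weight vector of a linear probe trained to predict the concept, re-trained after every projection), and the names `dot`, `nullspace_project`, `h`, and `w` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def nullspace_project(h, w):
    """Remove from h its component along w (project h onto the null space of w)."""
    scale = dot(w, h) / dot(w, w)  # assumes w is not the zero vector
    return [hi - scale * wi for hi, wi in zip(h, w)]


# Hypothetical 3-dimensional representation and concept direction.
h = [2.0, 1.0, -1.0]
w = [1.0, 0.0, 1.0]

h_amnesic = nullspace_project(h, w)

# A linear probe along w now reads nothing from the projected vector.
residual = dot(w, h_amnesic)
```

Repeating this step with freshly trained probe directions, until no linear probe can recover the concept, gives the iterative variant the text refers to.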

Contrastive Attribution
While the aforementioned interventions select the causal factors from the candidate factor space for model explanations, we are specifically interested in selecting factors that yield contrastive explanations. Given a space of discovered causal factors, we therefore need an additional intervention to attribute contrastive behavior to a subset of these factors. We propose a method below to produce contrastive explanations in the form of a dense representation of the input in the latent space. This representation is given by a projection operation, such that only the components that distinguish the fact from the foil are preserved.
Formally, let $x$ be the text input to be classified as one of $K$ output classes $Y = \{y_1, \ldots, y_K\}$. Consider the model class $f$, which commonly uses an arbitrarily deep neural encoder $\mathrm{enc}(\cdot)$ that transforms the input $x$ into a vector $h_x \in \mathbb{R}^d$. Once encoded, a final linear layer $W \in \mathbb{R}^{K \times d}$ can then be applied to the encoded input to yield the logits of the model over the $K$ classes, such that $f(x) = W\,\mathrm{enc}(x) = W h_x$. Let $y^* = \arg\max_{y \in Y} f(x)$ be the model prediction (the fact), $y'$ be an alternative prediction of interest (the foil), and $p := \mathrm{softmax}(W h_x)$ the normalized model probabilities.
Recall that the output of the model, $f(x) = W h_x$, is linear in the latent input representation $h_x$. Let $w_i$ be the row in $W$ corresponding to class $i$. The logits for classes $y'$ and $y^*$, given by the dot products $w_{y'}^T h_x$ and $w_{y^*}^T h_x$ respectively, are thus unrelated to any other row in the prediction matrix. These dot products are un-normalized projections of the representation $h_x$ on the directions $w_{y^*}, w_{y'} \in \mathbb{R}^d$. While the representation $h_x$ can be high-dimensional, only two directions (components) in this high-dimensional space are relevant for each contrastive decision. Furthermore, we are interested in the prediction of one class over the other, as opposed to the logits. Hence, we can replace these two directions with a single contrastive direction $u \in \mathbb{R}^d$, by defining $u = w_{y^*} - w_{y'}$. Given that the model favors $y^*$ over $y'$, i.e., $p_{y^*} > p_{y'}$, if and only if $u^T h_x > 0$, the projection of $h_x$ onto $u$ precisely yields the linear direction in $\mathbb{R}^d$ which the model uses to differentiate between classes $y^*$ and $y'$. We refer to the span of $u$, that is, the collection of all vectors $\alpha u$ for a scalar $\alpha$, as the contrastive space for $y^*$ and $y'$. We define the contrastive transformation $C(h_x)_{y',y^*}$ to be the orthogonal projection onto this subspace:
$$C(h_x)_{y',y^*} = P_u h_x \qquad (1)$$
where $P_u := \frac{u u^T}{u^T u}$ is a projection matrix onto $u$. The resulting representation $C(h_x)_{y',y^*}$ is a latent vector of the same dimensions as $h_x$; computing this can be understood as a contrastive intervention. Intuitively, it captures (precisely) the latent features in $h_x$ which are used by the model to differentiate the fact from the foil, where $h_x$ is the hidden representation of $x$ before the final classifying layer.
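The contrastive projection can be illustrated with a small numeric sketch. The weight matrix `W`, the representation `h`, and all function names below are hypothetical toy values chosen for illustration, not taken from the experiments; the sketch only checks the property the text relies on, namely that projecting onto the contrastive direction leaves the fact-versus-foil logit difference unchanged.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_projection(h, w_fact, w_foil):
    """Orthogonally project h onto u = w_fact - w_foil (assumes u is non-zero)."""
    u = [a - b for a, b in zip(w_fact, w_foil)]
    scale = dot(u, h) / dot(u, u)
    return [scale * ui for ui in u]


# Hypothetical final layer (3 classes, d = 3) and latent representation.
W = [
    [1.0, 0.0, 2.0],  # class 0
    [0.0, 1.0, 0.0],  # class 1
    [1.0, 1.0, 1.0],  # class 2
]
h = [0.5, -1.0, 1.5]

logits = [dot(w, h) for w in W]
fact = max(range(len(W)), key=lambda k: logits[k])  # predicted class
foil = 1                                            # chosen alternative class

c = contrastive_projection(h, W[fact], W[foil])

# The fact-vs-foil logit difference is preserved by the projection.
diff_before = logits[fact] - logits[foil]
diff_after = dot(W[fact], c) - dot(W[foil], c)
```

The projected vector `c` has the same dimensionality as `h`, so it can be passed through the same final layer as the original representation.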

Measuring Contrastive Behavior
We consider a measure of contrastive behavior, based on our interventionist approach. Let $q := \mathrm{softmax}(W h_{x'})$ be the model probabilities following an intervention (contrastive or otherwise) which produces counterfactual $x'$. Our measure of contrastive model behavior is simply the difference between normalized probabilities of the fact before and after the intervention, given by:
$$\delta^{\text{contr}}_{p,q} = \frac{p_{y^*}}{p_{y^*} + p_{y'}} - \frac{q_{y^*}}{q_{y^*} + q_{y'}} \qquad (2)$$
Here, the normalization ensures we consider only the fact ($y^*$) and foil ($y'$), other classes being irrelevant. This measure can be applied to both contrastive interventions, involving $C(h_x)_{y',y^*}$, or just causal ones, involving $h_x$. Since $-1 \le \delta^{\text{contr}}_{p,q} \le 1$, our metric is bounded; the magnitude indicates the degree of contrastive behavior. Our metric is reminiscent of statistical parity metrics used in algorithmic fairness (Zemel et al., 2013).
We use the contrastive measure in two settings:
1. Ranking factors (controlled foil): given a fixed foil, and the candidate factor space, we rank each factor by how contrastively useful it is to the model for choosing the fact, against the given foil.
2. Ranking foils (controlled factor): given a causal factor, we rank the set of available foils in the event space by how much the said factor is contrastively used by the decision process between the fact and the foils.
The above is a relative and continuous perspective on contrastive selection, compared to a discrete and binary one discussed in §2. We leave a possible discretization of this process to future work.
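The measure can be computed directly from the two probability vectors. The sketch below follows the verbal definition above (normalized probability of the fact before the intervention, minus the same quantity after it); the logits are hypothetical toy values and the function names are illustrative.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def delta_contr(p, q, fact, foil):
    """Change in the fact's probability, renormalized over {fact, foil} only."""
    before = p[fact] / (p[fact] + p[foil])
    after = q[fact] / (q[fact] + q[foil])
    return before - after


# Hypothetical logits before and after an intervention (3 classes).
p = softmax([2.0, 1.0, 0.0])
q = softmax([1.0, 1.0, 0.0])

d = delta_contr(p, q, fact=0, foil=1)
print(round(d, 4))
```

A positive value means the intervened factor was evidence for the fact against the foil; an intervention that leaves the probabilities unchanged yields exactly zero.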

Case Study I: Analyzing NLI
We apply our contrastive framework to the natural language inference (NLI) task (Dagan et al., 2005), on two datasets: MultiNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015). Given a premise sentence, the NLI task classifies whether a hypothesis sentence entails, contradicts or is neutral to the premise. Our experiments are based on a RoBERTa-large model (Liu et al., 2019) fine-tuned on MultiNLI (obtaining 90.1% accuracy on dev-mismatched), unless otherwise specified; see more details in Appendix A.2.
Sanity Checks. Our first set of experiments uses a controlled setup to verify that our method works as expected. We train a MultiNLI model on modified instances with label-specific "stains" (input tokens inserted to contrastively distinguish classes; Sippy et al., 2020). We intervene on the stains (as highlights; §3.2) before our contrastive projection (§3.3), and then rank highlights and foils by $\delta^{\text{contr}}_{p,q}$ (§3.4). The most contrastive foils and highlights indeed correspond to the stains, verifying our methodology; see results and details in App. B.

The Role of NLI Concepts
Extensive prior work supports the presence of spurious correlations or artifacts in popular NLI datasets (Poliak et al., 2018; Gururangan et al., 2018). For example, instances with high lexical overlap between premise and hypothesis tend to correlate with the entailment class, and those containing negations in the hypothesis with the contradiction class. As a result, models trained on these datasets tend to rely on these features to make accurate predictions, regardless of other semantic signals. Moreover, Gururangan et al. (2018) show that a model can ignore the premise altogether and still make an accurate prediction based only on the hypothesis. Table 1 shows the distribution of the above three concepts in MultiNLI. While prior work provides considerable evidence that NLI model decisions rely heavily on these concepts, we investigate whether this reliance is contrastive by nature. We consider each concept independently:
Overlap. Instances with the overlap concept are those where all of the content words in the hypothesis also exist in the premise (in any order). Prior work has shown that the overlap concept is highly relevant in the model's reasoning process for predicting entailment (Naik et al., 2018; McCoy et al., 2019). We intervene on the overlap concept via amnesic probing (§3.2), followed by our contrastive intervention (§3.3) to measure behavioral changes (§3.4). Foil ranking results in Table 2 show that when predicting entailment, the overlap concept is overwhelmingly contrastive against neutral. This aligns with Table 1 statistics, which show that the concept is highly correlated with entailment (64.4%) and against neutral (6.3%) predictions. The overlap concept is thus contrastively important.
Hypothesis. Motivated by the finding that hypothesis-only models achieve high accuracy in the NLI task (Gururangan et al., 2018; Poliak et al., 2018), we consider the 'hypothesis' concept: a collection of all concepts existing only in the hypothesis. This concept is realized in instances accurately predicted by a hypothesis-only baseline. We use binary concept labels based on accurate / inaccurate predictions of a hypothesis-only RoBERTa baseline to train an amnesic probe for causal intervention (§3.2). The amnesic probe and our contrastive intervention (§3.3) are then applied to the full-input model. Results in Table 2 show that when predicting entailment in MultiNLI, this concept is not strongly contrastive to either foil (-0.005 vs. -0.031). This aligns with Table 1 statistics, since the concept is similarly distributed with contradiction and neutral (24.7% vs. 26.2%). However, when applied to the SNLI dataset, we see stronger contrastive behavior with respect to contradiction (0.505) than neutral (0.463). This could perhaps be explained by the higher hypothesis-only bias in SNLI, compared to MultiNLI (Gururangan et al., 2018).
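The overlap concept label (every content word of the hypothesis also appears in the premise, in any order) can be sketched as a simple set test. The exact definition of content words used in the experiments is not reproduced in this section, so the stop-word list and whitespace tokenization below are assumptions made purely for illustration, and the function names are hypothetical.

```python
# Hypothetical stop-word list; the definition of "content words" used in the
# experiments may differ.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "of", "in", "on", "to", "and"}


def content_words(sentence):
    """Lower-cased tokens with surrounding punctuation stripped, minus stop words."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in sentence.split()]
    return {t for t in tokens if t and t not in STOP_WORDS}


def has_overlap(premise, hypothesis):
    """True if every content word of the hypothesis also occurs in the premise."""
    return content_words(hypothesis) <= content_words(premise)


example_positive = has_overlap(
    "The doctor visited the lawyer in the morning.",
    "The lawyer visited the doctor.",
)
example_negative = has_overlap("A man is sleeping.", "A man is running.")
```

The first pair illustrates why the concept can be spurious: the hypothesis reuses the premise's words but swaps their roles, so full lexical overlap does not imply entailment.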
Hyp-Negation. This concept is realized in instances containing negation words (e.g., 'no') in the hypothesis. The presence of this concept is highly indicative of an NLI model's prediction to be the contradiction class regardless of other NLI semantics (Gururangan et al., 2018).
Here, we use manually annotated counterfactuals to intervene on the concept [12]. Given an example without negations in the hypothesis, and predicted by the model as entailment or neutral, two of the authors manually paraphrased the hypothesis to include a negation without altering the semantics; see Appendix C for examples. [13] We then proceed to probe the model for behavioral changes ($\delta^{\text{contr}}_{p,q}$) between the instances before and after the intervention, treating the negated example as a counterfactual to compute $q$. Foil ranking results in Table 2 show that, on average, the model utilizes the negation concept as evidence for contradiction in contrast to entailment (0.195), as opposed to neutral (0.051).
In summary, while it is known that the above three concepts in NLI are useful, we investigated whether they are useful against a particular foil. Two concepts (overlap and hyp-negation) do have a prominent contrastive role, while the hypothesis concept is contrastive in only one setting, indicating that there might be concepts which are not explicitly contrastive.

Ranking Highlights for Debugging NLI
Contrastive explanations can help humans understand model errors. Our goal is to answer: what factor led to the model's incorrect prediction in contrast to the gold label? We achieve this by treating an erroneous model prediction as the fact, and the gold label as the foil. For this fact-foil pair, we rank different factors based on their contrastive behavior (§3.4). [14] We consider all unigrams and bigrams in the hypothesis as highlight factors in the NLI task. As before, we intervene on each factor, apply our contrastive intervention, and measure the change in behavior (§3.4). Table 3 presents some SNLI examples where we report the most contrastive highlight factor (by $\delta^{\text{contr}}_{p,q}$) for each foil. We also report a non-contrastive explanation (§2) resulting from only the causal (highlight) intervention (i.e. no contrastive projection). We see that one of the contrastive explanations (usually with the gold foil) agrees with the non-contrastive explanation, indicating that the latter might simply be reflecting an implicit contrastive explanation. The last row shows a case where the non-contrastive explanation does not agree with the contrastive explanation for the gold foil. However, the model's reasoning appears to be correct since the premise may or may not entail the hypothesis; additionally, 'bar' seems the correct reasoning for choosing neutral over entailment. Thus, contrastive explanations can provide insight on why the model specifically preferred its prediction over the gold label.
[12] Initial experiments with amnesic probing for this concept were inconclusive. We suspect that although useful for the model, the concept is perhaps not detectable in the last layer, which is the only conclusion the amnesic probing method can draw.
[13] Our collection of 90 such instances for each of entailment and neutral will be released upon publication.
[14] We report a BIOS factor ranking experiment in App. D.
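The highlight-ranking loop can be sketched as follows: enumerate every unigram and bigram span, mask it, and score the resulting change with the contrastive measure of §3.4. The classifier here is a hypothetical stand-in (`toy_probs`) that merely reacts to the word "not"; in practice it would be the fine-tuned model. Because the measure is normalized over the fact and foil only, the sketch works on output probabilities and omits the explicit projection step.

```python
import math

MASK = "<mask>"


def candidate_highlights(tokens):
    """All unigram and bigram spans as (start, end) pairs, end exclusive."""
    n = len(tokens)
    return [(i, i + k) for k in (1, 2) for i in range(n - k + 1)]


def mask_span(tokens, span):
    """Replace every token inside the span with the mask token."""
    start, end = span
    return [MASK if start <= i < end else t for i, t in enumerate(tokens)]


def toy_probs(tokens):
    """Hypothetical classifier over (entailment, contradiction, neutral)."""
    logits = [1.0, 0.5 + (2.0 if "not" in tokens else 0.0), 0.0]
    e = [math.exp(v - max(logits)) for v in logits]
    return [v / sum(e) for v in e]


def delta_contr(p, q, fact, foil):
    """Normalized fact probability before minus after the intervention."""
    return p[fact] / (p[fact] + p[foil]) - q[fact] / (q[fact] + q[foil])


def rank_highlights(tokens, fact, foil, predict=toy_probs):
    """Score each span by the measure after masking it; highest score first."""
    p = predict(tokens)
    scored = []
    for span in candidate_highlights(tokens):
        q = predict(mask_span(tokens, span))
        scored.append((delta_contr(p, q, fact, foil), span))
    return sorted(scored, reverse=True)


tokens = "the man is not sleeping".split()
# Fact: contradiction (index 1); foil: entailment (index 0).
ranking = rank_highlights(tokens, fact=1, foil=0)
best_delta, best_span = ranking[0]
```

With this toy model, only spans covering "not" receive a non-zero score, so the top-ranked highlight always contains that token.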
Future work might explore using contrastive explanations to detect labeling errors in NLI (Swayamdipta et al., 2020).

Case Study II: Analyzing BIOS
We apply our framework to the BIOS dataset (De-Arteaga et al., 2019), which contains individuals' biographies labeled with their professions and binary gender (see Table 4). The task involves classifying a biography as one of 27 professions, without explicitly considering the gender attribute. We analyze RoBERTa-large (Liu et al., 2019) fine-tuned on BIOS, with test performance of 87.52%.

Biography / Profession / Gender
She also works as a Restitution Specialist while being the liaison to the Victim Compensation Board. Ms. Azevedo was named an OVSRS Outstanding Partner due to her dedication to providing critical information to staff so victims can obtain their court-ordered restitution while offenders can be held accountable. / paralegal / F
Peter also has substantial experience representing clients in government investigations, including criminal and regulatory investigations, and internal investigations conducted on behalf of clients. / attorney / M

The Role of the Gender Concept
Ravfogel et al. (2020) showed that a (binary) gender concept is valuable for BIOS model predictions.
We use amnesic probing (§3.2) to intervene on the presence of the gender concept, trained using the binary gender labels in BIOS, followed by our contrastive projection (§3.3). Table 5 reports the top-5 fact/foil pairs based on the contrastive measure $\delta^{\text{contr}}_{p,q}$, among all pairs of professions, along with the respective binary gender percentages in BIOS. The top-scoring pairs tend to be semantically similar while dissimilar in their gender proportions (e.g. paralegal and attorney). This confirms that the model is indeed leveraging the gender concept to differentiate between otherwise semantically-similar classes.
Amnesic probing for a concept results in a representation from which it is not possible to distinguish whether the concept was present in an instance. It also yields a final "concept vector", $r$, which maintains only the information relevant to the concept. In Table 5, we additionally report the sign of the cosine similarity between $u$ and $r$, where $u = w_{y^*} - w_{y'}$ (see §3.3). This indicates whether the concept (in this case, "male") is present in the fact (+) or not (−). The results align with intuition: the "male" concept is supportive of attorney over paralegal, and accountant over psychologist, which are male-majority professions in the BIOS dataset.
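The sign reported alongside each pair can be sketched as the sign of a cosine similarity. All vectors below are hypothetical toy values: `w_fact` and `w_foil` stand in for two rows of the final layer, and `r` for a concept vector obtained from amnesic probing.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def concept_sign(w_fact, w_foil, r):
    """'+' if the concept vector points toward the fact, '-' otherwise."""
    u = [a - b for a, b in zip(w_fact, w_foil)]
    return "+" if cosine(u, r) > 0 else "-"


# Hypothetical classifier rows and concept vector (d = 3).
w_fact = [0.9, 0.1, 0.3]
w_foil = [0.2, 0.4, 0.3]
r = [1.0, -0.5, 0.0]

sign = concept_sign(w_fact, w_foil, r)
flipped = concept_sign(w_foil, w_fact, r)  # swapping fact and foil flips the sign
```

Swapping the roles of fact and foil negates the contrastive direction, which is why each pair of professions carries opposite signs depending on which one is treated as the fact.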

The Role of Demographic Highlights
The demographic attributes of individuals, as encoded by their pronouns and personal names, can be spuriously correlated with their professions, as is often manifest in the BIOS dataset (De-Arteaga et al., 2019;Romanov et al., 2019). For instance, paralegals in BIOS are overwhelmingly women (roughly 90%); female names and pronouns might be very predictive of this profession, albeit for incorrect reasons. Further, names might reveal other demographics (e.g., Azevedo is a common Portuguese surname) potentially predictive of certain professions. Table 4 shows pronouns and person name highlights which are candidate causal factors of interest. We investigate the contrastive importance of these factors, by asking: which classes does the model use the pronouns and names contrastively against when making its decision?
We intervene on pronoun and name highlights by masking (§3.2), followed by computing the contrastive measure (Eq. 2) for every possible profession (foil) in contrast to the model prediction (fact). Table 6 shows the most and least contrastive foils for five professions (facts), where the foils are ranked by $\delta^{\text{contr}}_{p,q}$, and aggregated across BIOS dev. For example, on paralegal predictions, attorney is the most relevant foil, indicating that the model uses demographic information for that distinction. This indicates that the model leverages demographic attributes as evidence for decisions between classes which are semantically similar but demographically different. Unlike $\delta^{\text{contr}}_{p,q}$ values obtained from amnesic probes, the sign of $\delta^{\text{contr}}_{p,q}$ obtained after highlight interventions can be meaningful. When $\delta^{\text{contr}}_{p,q} > 0$, the model uses highlights as evidence for the fact and against the foil; negative values indicate evidence for the foil against the fact. For attorney predictions, paralegal is an important foil as expected, even though $\delta^{\text{contr}}_{p,q} < 0$.
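Aggregating the measure over a dataset to rank foils per fact can be sketched as below. The probability vectors are invented toy numbers, not results from BIOS, and two choices are assumptions of this sketch rather than details stated in the text: the per-pair aggregate is a plain mean, and foils are ordered by the magnitude of that mean.

```python
from collections import defaultdict


def delta_contr(p, q, fact, foil):
    """Normalized fact probability before minus after the intervention."""
    return p[fact] / (p[fact] + p[foil]) - q[fact] / (q[fact] + q[foil])


def rank_foils(records, labels):
    """records: (p, q) pairs of probabilities before/after masking highlights."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, q in records:
        fact = max(range(len(p)), key=lambda k: p[k])  # model prediction
        for foil in range(len(p)):
            if foil != fact:
                sums[(fact, foil)] += delta_contr(p, q, fact, foil)
                counts[(fact, foil)] += 1
    ranking = defaultdict(list)
    for (fact, foil), total in sums.items():
        ranking[labels[fact]].append((total / counts[(fact, foil)], labels[foil]))
    # Assumption: order foils by magnitude of the mean, largest first.
    return {fact: sorted(pairs, key=lambda t: abs(t[0]), reverse=True)
            for fact, pairs in ranking.items()}


labels = ["paralegal", "attorney", "dj"]
# Hypothetical probabilities for two instances, before and after masking.
records = [
    ([0.6, 0.3, 0.1], [0.4, 0.5, 0.1]),
    ([0.7, 0.2, 0.1], [0.5, 0.4, 0.1]),
]
result = rank_foils(records, labels)
```

In this toy data, masking moves probability mass from paralegal to attorney, so attorney ranks above dj as a foil for paralegal predictions.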

Contrastive-Only Interventions
We are additionally interested in measuring the degree of contrastive behavior change without considering causal features such as highlights / concepts. We can treat the contrastive projection $C(h_x)_{y',y^*}$ (Eq. 1) as an intervention, and measure the change in behavior following just this intervention. Since the contrastive intervention, by construction, precisely maintains contrastive behavior, $\delta^{\text{contr}}_{p,q}$ is no longer appropriate. We thus use a symmetrized Kullback-Leibler divergence, $D_{KL}(p \,\|\, q) + D_{KL}(q \,\|\, p)$, which gives us the global behavior change across the dataset after the contrastive intervention. Table 7 reports results on the $D_{KL}$ metric for BIOS, applying a similar methodology as for Table 6, but without the highlights intervention. When predicting the fact, the contrast between the fact and a highly impactful foil does not significantly impact the decision (since removing the non-contrastive information greatly affects the decision), and vice versa. For example, the contrast between the related professions of attorney and paralegal does not substantially affect the decision to predict attorney (2.195). As expected, the trend is reversed for attorney and dj (0.528), two distant professions. The measurement is inversely related to the degree to which the differentiation dominates the decision process.
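A contrastive-only intervention and the symmetrized divergence can be sketched together. The weight matrix and representation are hypothetical toy values; the sketch shows why the measure of §3.4 is uninformative here (the fact-versus-foil ratio is unchanged by the projection) while the symmetrized KL divergence still registers the change in the full distribution.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]


def contrastive_projection(h, w_fact, w_foil):
    """Orthogonally project h onto u = w_fact - w_foil (assumes u is non-zero)."""
    u = [a - b for a, b in zip(w_fact, w_foil)]
    scale = dot(u, h) / dot(u, u)
    return [scale * ui for ui in u]


def symmetric_kl(p, q):
    """D_KL(p || q) + D_KL(q || p); assumes strictly positive probabilities."""
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return forward + backward


# Hypothetical final layer (3 classes, d = 3) and latent representation.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
h = [0.5, -1.0, 1.5]
fact, foil = 0, 1

p = softmax([dot(w, h) for w in W])
c = contrastive_projection(h, W[fact], W[foil])
q = softmax([dot(w, c) for w in W])

ratio_before = p[fact] / (p[fact] + p[foil])
ratio_after = q[fact] / (q[fact] + q[foil])
divergence = symmetric_kl(p, q)
```

Here `ratio_before` and `ratio_after` coincide up to floating-point error, whereas `divergence` is strictly positive because the probability of the third class changes.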

Highlight Ranking in BIOS
Analogous to the MultiNLI highlight ranking procedure and results presented in Section 4.2, we present highlight ranking for the BIOS task. Here, our candidate factor space considers all word unigram and bigram highlights, for simplicity. We derive the model decision after intervening on each candidate, and measure the change in behavior. We apply this technique towards understanding model errors, by selecting examples of model mistakes and assigning the foil to be the gold label. Qualitative examples in Table 13 (Appendix D) show the top-ranking highlight for answering the question: which unigram or bigram was most relevant for the model in making its prediction rather than the gold label?; see Table caption for a detailed discussion.

Related Work
The interventionist approach to causality in our work follows several recent works in NLP (Giulianelli et al., 2018; Meyes et al., 2020; Vig et al., 2020; Elazar et al., 2021; Feder et al., 2021), and is justified by accumulating empirical evidence for the inability to draw causal interpretation from statistical associations alone (Hewitt and Liang, 2019; Tamkin et al., 2020; Ravichander et al., 2021; Elazar et al., 2021). Our contrastive interventions follow an amnesic operation, similar to Feder et al. (2021) who assess the causality of concepts, by adversarial removal of information guided by causal graphs. While we share the amnesic method, we focus on contrastive explanation, while they focused on the influence of concepts on model performance.
Contrastive explanations are a relatively new area in NLP. Recently, Jacovi and Goldberg (2020a) proposed to derive highlights containing the portion of the input which flips the model decision; others propose similar flips via minimal edits (Ross et al., 2021) and conditional generation (Wu et al., 2021). These can be viewed as other interventions orthogonal to our work, since our contrastive framework can be used to understand such interventions. Additionally of interest are adversarial perturbations (Ganin et al., 2017), which are usually implemented as gradient-based interventions. In contrast, our work relies on the identification and erasure, using linear algebra, of linear subspaces that are associated with a given concept. Subspace-based interventions have the advantage of being more interpretable and controlled when compared with gradient-based interventions, which, albeit expressive, are quite opaque, not to mention of unclear efficacy (Elazar and Goldberg, 2019).
Rathi (2019) proposes a model-agnostic contrastive explanation scheme based on Shapley values. It offers a local explanation, unlike our global method. In addition, our approach employs behavioral interventions, while that of Rathi (2019) does not. Others have raised concerns regarding feature importance methods based on Shapley values (Kumar et al., 2020); the implicit foil of such methods can be unintuitive to human explainees.
In computer vision, many have studied the generation of counterfactual explanations or counterfactual data points. Freiesleben (2020) provides a unifying theoretical framework around the relationship between adversarial examples and counterfactual explanations. Hendricks et al. (2018) proposed a method that provides natural language counterfactual explanations of image classification decisions. They rely on a model that proposes potential counterfactual evidence, followed by a verifier that is based on human-provided image descriptions. As their method relies on a pre-existing explanation model and human descriptions, there is no guarantee that the explanations it provides are related to the model's reasoning process. Sharmanska et al. (2020) used GANs to generate examples representing minority groups, to improve fairness measures. This work, like other works in vision, relies on continuous input, which is not present in natural-language applications. For more information, see Stepin et al. (2021) for a survey of counterfactual explanations.

Conclusion
We introduce a novel framework for producing contrastive explanations for model decisions, via a projection of the input representation to a contrastive space for the prediction and an alternate label. We also propose a measure of the degree of contrastive behavior, following a contrastive intervention. Our experiments with English text classification benchmarks on BIOS and NLI demonstrate our framework's ability to rank model decisions, as well as features responsible for the decision, contrastively. Our quantitative and qualitative evaluations show the fine granularity of contrastive explanations, which could be useful for debugging model behavior. Our framework is general enough to extend other (interventionist) explanation methods to produce contrastive explanations.
Contrastive explanations in NLP and ML are relatively novel; future research could explore variations of interventions and evaluation metrics for the same. This paper presented a formulation designed for contrastive relationships between two specific classes; future work that contrasts the fact with a combination of foils could explore a formulation involving a projection into the subspace containing features from all other classes.

A Implementation Details

A.1 Interventions
We utilized three interventions in this work: masking of highlights, amnesic probing of concepts, and erasure of non-contrastive information.
Masking. The masking intervention involved replacing each token in the highlight with a predefined mask token. As we used a pre-trained masked language model for our initialization, we have used that model's mask token, which is <mask> in the case of RoBERTa-Large.
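The masking intervention can be illustrated with a minimal sketch. It assumes whitespace-tokenized input and a highlight given as a (start, end) token span; the helper name `mask_highlight` is hypothetical, and an actual RoBERTa pipeline would operate on subword tokens rather than whitespace tokens.

```python
from typing import List, Tuple

MASK_TOKEN = "<mask>"  # RoBERTa's mask token, as stated in the text


def mask_highlight(tokens: List[str], span: Tuple[int, int],
                   mask_token: str = MASK_TOKEN) -> List[str]:
    """Return a copy of `tokens` with tokens[start:end] replaced by the mask token.

    One mask token is emitted per original token, so the length is preserved.
    """
    start, end = span
    if not 0 <= start <= end <= len(tokens):
        raise ValueError(f"span {span} is out of range for {len(tokens)} tokens")
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:]


# Whitespace tokenization is an assumption made for this illustration only.
tokens = "She is a top medicine student".split()
masked = mask_highlight(tokens, (3, 5))  # mask the bigram "top medicine"
print(" ".join(masked))
```

Emitting one mask token per masked token keeps positions aligned between the original and the intervened input, which makes before/after comparisons easier to read.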
Amnesic probing. We used the publicly available implementation provided by Elazar et al. (2021), who originally proposed the algorithm. Specifically, we train an iterative null-space projection probe until convergence, which captures the linear directions that correlate with the given concept, and then project the model's latent representation onto the null-space of this probe. Please refer to Elazar et al. (2021) for more details.
Contrastive projection. In Section 5.3 we propose to use contrastive projection as a standalone intervention to probe for the magnitude of the contrastive reasoning process in the model. As mentioned in the main text, this is simply $\mathrm{softmax}(W\,C(h_x)_{y,y^*})$, where the original model output is $\mathrm{softmax}(W h_x)$.
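To make the comparison between the original output softmax(W h_x) and the output after a contrastive projection concrete, the sketch below uses a toy final-layer matrix. It rests on an assumption not spelled out in this appendix: that the projection C maps h_x onto the single direction separating the two labels, u = W[fact] - W[foil]. The function names and toy values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_projection(h: Vector, W: List[Vector], fact: int, foil: int) -> Vector:
    """Project h onto u = W[fact] - W[foil] (ASSUMED definition of C)."""
    u = [a - b for a, b in zip(W[fact], W[foil])]
    norm_sq = dot(u, u)
    if norm_sq == 0.0:
        raise ValueError("fact and foil rows of W are identical")
    scale = dot(h, u) / norm_sq
    return [scale * x for x in u]


def predict(W: List[Vector], h: Vector) -> Vector:
    return softmax([dot(row, h) for row in W])


# Toy classifier with three labels and a three-dimensional representation.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0],
     [0.5, 0.5, 0.0]]
h_x = [0.3, -0.2, 0.8]

original = predict(W, h_x)                        # softmax(W h_x)
h_c = contrastive_projection(h_x, W, fact=0, foil=1)
contrastive = predict(W, h_c)                     # softmax(W C(h_x))

# Under this assumed projection, the fact-vs-foil logit gap is unchanged.
gap_before = dot(W[0], h_x) - dot(W[1], h_x)
gap_after = dot(W[0], h_c) - dot(W[1], h_c)
print(original, contrastive, gap_before, gap_after)
```

Under this assumption, the projection discards every component of h_x that does not affect the fact-versus-foil logit difference, so any change between `original` and `contrastive` reflects the model's reliance on non-contrastive information.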

A.2 Models and Training
Our experiments were implemented in AllenNLP (Gardner et al., 2018) version 1.2.0rc1. The models used were RoBERTa-Large fine-tuned on the BIOS and MultiNLI training sets, and the models with the best dev-set performance among twenty epochs were chosen for analysis. The models otherwise used the default training configurations of AllenNLP, and we provide these configurations in the repository to be included with this work. In all of our training and analysis experiments, we excluded instances labeled with the 'model' class in BIOS from the training, dev and test sets, due to an observation on noisy labeling made by Ravfogel et al. (2020), which we verified.

B Sanity Checks via Data Staining in NLI
We present a stress-test evaluation to verify the validity of our contrastive framework via data staining (Sippy et al., 2020). It involves "staining" a training instance with a feature (e.g. an inserted token) guaranteed to be useful for the task, and then attempting to recover this feature via the analysis. We modify data staining to evaluate contrastive explanations by introducing stains which are only contrastively useful to the model; the "stain" is some feature useful to differentiate between a specific fact and a foil. The model is thus encouraged to exploit this feature in its decision making. Our stain is added to the NLI hypothesis during training as shown in Table 10. The stain can be used by a model to perfectly distinguish a class from the others, i.e. the stain is contrastively useful for or against only the stained class. See Appendix B for illustrative examples with the stains.
We analyze a RoBERTa-Large model fine-tuned on the stained MultiNLI train set. The entire MultiNLI dataset (train, dev-matched, dev-mismatched) was stained during the experiment, and we masked the stain for 10% of our training data to ensure such examples are in distribution for the model, enabling us to use masked-stain examples in the analysis step. We repeat our experiment thrice, considering one of the three NLI classes as the stained class each time. In all three cases, the stained models achieve high predictive performance on the mismatched dev set of the stained MultiNLI (above 97% accuracy). This high performance is expected, and indicates that the models indeed exploit the stain features.
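The staining procedure can be sketched as follows. Several details here are assumptions made for illustration: the stain token string (the actual stains are those of Table 10), the choice to prefix only hypotheses of the stained class, the per-instance sampling of the 10% masking, and the dictionary field names.

```python
import random
from typing import Dict, List

STAIN_TOKEN = "stain"   # hypothetical placeholder; the real stains are in Table 10
MASK_TOKEN = "<mask>"   # RoBERTa's mask token


def stain_dataset(examples: List[Dict[str, str]], stained_label: str,
                  mask_rate: float = 0.1, seed: int = 0) -> List[Dict[str, str]]:
    """Prefix hypotheses of `stained_label` with a stain token.

    With probability `mask_rate`, the stain is replaced by the mask token so
    that masked-stain inputs stay in distribution for the trained model.
    ASSUMPTION: only instances of the stained class receive a prefix.
    """
    rng = random.Random(seed)
    stained = []
    for ex in examples:
        new_ex = dict(ex)  # do not mutate the caller's data
        if ex["label"] == stained_label:
            prefix = MASK_TOKEN if rng.random() < mask_rate else STAIN_TOKEN
            new_ex["hypothesis"] = f"{prefix} {ex['hypothesis']}"
        stained.append(new_ex)
    return stained


data = [
    {"premise": "A man is sleeping.", "hypothesis": "A man is awake.",
     "label": "contradiction"},
    {"premise": "A dog runs.", "hypothesis": "An animal moves.",
     "label": "entailment"},
]
always_stained = stain_dataset(data, "contradiction", mask_rate=0.0)
always_masked = stain_dataset(data, "contradiction", mask_rate=1.0)
print(always_stained[0]["hypothesis"])
print(always_masked[0]["hypothesis"])
```

Placing the stain as the first hypothesis token matches the later expectation that the first token is the most contrastively important evidence for the stained class.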
To recover the stains on the MultiNLI dev-set using our methodology, we apply highlight masking interventions. We define our candidate factor space to be all tokens in the hypothesis, and we expect the first token (the stain) to be the most contrastively important evidence when either the fact or the foil is the stained class. We report the accuracy of recovering the stain as the salient factor (ranking factors, cf. §3.4) when the fact or foil is the stained class, or recovering a non-stain word when the stained class is neither the fact nor the foil. We perform the experiment for a random sample of 1000 test-set MultiNLI instances. The re-

"Allison Smith is a model of tenacity and perseverance. She has battled several serious illnesses, undergone multiple surgeries, and endured life-changing procedures, yet embraces her life with joyful exuberance and optimism. Allison inspires all those who have the pleasure of meeting her and hear her incredible story. Her book has the capability of changing the lives of her readers - both those who endure chronic illnesses, as well as the caretakers (families and friends) who walk the journeys with them." -Karen

Justin Bieber is a role model to the people who enjoy his music. The vast majority of these people are children, 8-16 year old girls, therefore, for him to smoke marijuana is a horrible example for the children that look up to him. It's the same as when a child sees their big sibling smoking a cigarette and feels the compulsive need to be like them.

Table 9 shows a demonstration of the data staining experiment.
Additionally, if we treat the stain as a highlight, intervene on it, apply a contrastive intervention, and measure the behavioral change, we would expect the stained class to be the most affected (ranking foils, cf. §3.4). This is indeed the case, as shown in the results in Table 11.
20 Since the model is not guaranteed to make optimal use of the stain in every case, a high but sub-optimal accuracy is within expectations.

C Hyp-Negation Concept in NLI
In the Hyp-Negation experiments, as mentioned, we opted to produce counterfactuals of the hypothesis negation concept, instead of doing so via amnesic probing. Table 12 contains examples of the 180 instances that we will make available online.

D Highlight Ranking in BIOS
Due to space constraints in the main text, we present here, in Table 13, some qualitative examples for the BIOS highlight ranking described in Section 5.4.

| Fact (prediction) | Foil (gold) | Input with Highlights |
| --- | --- | --- |
| attorney | no explicit foil | Harris said the abuse had been inflicted by both "hands and items," and, according to evidence, since near the time of Kairissa's arrival in Mt. Juliet. |
| attorney | physician | Harris said the abuse had been inflicted by both "hands and items," and, according to evidence, since near the time of Kairissa's arrival in Mt. Juliet. |
| attorney | no explicit foil | He has been involved in land transport for the past 13 years and has worked on various transport projects in Malta. He was appointed as chief officer for land transport within the Authority for Transport in Malta in 2010 where he was responsible for the regulation of driver training, testing and licensing, vehicle registration, goods transport, and passenger transport. In 2015 he moved back to the private sector and took on the role of General Manager of the local bus company, Malta Public Transport with his main responsibility being to oversee the transformation of the public transport service. |
| attorney | accountant | He has been involved in land transport for the past 13 years and has worked on various transport projects in Malta. He was appointed as chief officer for land transport within the Authority for Transport in Malta in 2010 where he was responsible for the regulation of driver training, testing and licensing, vehicle registration, goods transport, and passenger transport. In 2015 he moved back to the private sector and took on the role of General Manager of the local bus company, Malta Public Transport with his main responsibility being to oversee the transformation of the public transport service. |
| physician | no explicit foil | She is a top medicine student whose academic achievements are sweet fruit of her labor. All seemed well until she reached her junior internship year, when one of her patients died under her watch. She was publicly humiliated in the aftermath. Her closest friends and family tried to lift her spirits up, but to no avail. She thought she was a failure. All she felt was the immense pressure boiling inside of her, and she can no longer contain it. Thus, on one fateful night, on the rooftop of her apartment, she decides to end her misery by taking her own life. |
| physician | surgeon | She is a top medicine student whose academic achievements are sweet fruit of her labor. All seemed well until she reached her junior internship year, when one of her patients died under her watch. She was publicly humiliated in the aftermath. Her closest friends and family tried to lift her spirits up, but to no avail. She thought she was a failure. All she felt was the immense pressure boiling inside of her, and she can no longer contain it. Thus, on one fateful night, on the rooftop of her apartment, she decides to end her misery by taking her own life. |

Table 13: Qualitative examples for the BIOS highlight ranking described in Section 5.4. The top-1 results of ranking highlights contrastively given a particular foil, compared to doing so generally (i.e., where the foil is all other classes together) for examples where the model made a mistake. We consider the space of highlights to be all unigrams and bigrams, and rank the space by the change in behavior (via difference of normalized logits) following a masking intervention on the highlight. In the example in the second row (for which the model mistakenly predicted physician), the model is generally most affected by the bigram "top medicine". However, this is not a particularly useful feature to favor physician rather than surgeon, since surgeons also entail medicine studies; when we repeat the experiment in contrast to surgeon, the top highlight changes to "patients died", indicating that this bigram is a better differentiator for those classes in the trained BIOS model.
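The ranking procedure described in the caption of Table 13 can be sketched with a toy stand-in for the classifier. Everything model-related here is hypothetical: `toy_predict` and its weights are invented for illustration, and the behavior measure (p(fact) - p(foil) when a foil is given, p(fact) alone otherwise) is only an assumed proxy for the "difference of normalized logits".

```python
from typing import Callable, Dict, List, Optional, Tuple

MASK_TOKEN = "<mask>"
Probs = Dict[str, float]

# Hypothetical keyword weights standing in for a trained BIOS classifier.
TOY_WEIGHTS = {
    "top": {"physician": 1.0, "surgeon": 1.0},
    "medicine": {"physician": 6.0, "surgeon": 6.0},
    "patients": {"physician": 1.0},
    "died": {"physician": 1.0},
}
TOY_BIAS = {"physician": 1.0, "surgeon": 1.0, "attorney": 5.0}


def toy_predict(tokens: List[str]) -> Probs:
    """Sum per-token label weights and normalize them into probabilities."""
    scores = dict(TOY_BIAS)
    for tok in tokens:
        for label, weight in TOY_WEIGHTS.get(tok, {}).items():
            scores[label] += weight
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def candidate_spans(n_tokens: int) -> List[Tuple[int, int]]:
    """All unigram and bigram spans as (start, end), end exclusive."""
    unigrams = [(i, i + 1) for i in range(n_tokens)]
    bigrams = [(i, i + 2) for i in range(n_tokens - 1)]
    return unigrams + bigrams


def behavior(probs: Probs, fact: str, foil: Optional[str]) -> float:
    """Assumed proxy: p(fact) - p(foil) with a foil, p(fact) without one."""
    return probs[fact] - (probs[foil] if foil is not None else 0.0)


def rank_highlights(tokens: List[str], predict: Callable[[List[str]], Probs],
                    fact: str, foil: Optional[str] = None) -> List[Tuple[float, str]]:
    """Rank spans by the drop in behavior after masking them (largest first)."""
    base = behavior(predict(tokens), fact, foil)
    scored = []
    for start, end in candidate_spans(len(tokens)):
        masked = tokens[:start] + [MASK_TOKEN] * (end - start) + tokens[end:]
        drop = base - behavior(predict(masked), fact, foil)
        scored.append((drop, " ".join(tokens[start:end])))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


tokens = "a top medicine student whose patients died".split()
general_top = rank_highlights(tokens, toy_predict, fact="physician")[0][1]
contrastive_top = rank_highlights(tokens, toy_predict, fact="physician",
                                  foil="surgeon")[0][1]
print(general_top, "|", contrastive_top)
```

In this toy setup the general ranking favors the span shared by both medical classes, while the ranking against the surgeon foil favors the span that only supports physician, mirroring the qualitative pattern discussed in the caption.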