Explaining NLP Models via Minimal Contrastive Editing (MiCE)

Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks--binary sentiment classification, topic classification, and multiple-choice question answering--show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development--debugging incorrect model outputs and uncovering dataset artifacts--and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.


Introduction
Cognitive science and philosophy research has shown that human explanations are contrastive (Miller, 2019): People explain why an observed event happened rather than some counterfactual event called the contrast case. This contrast case plays a key role in modulating what explanations are given. Consider Figure 1. When we seek an explanation of the model's prediction "by train," we seek it not in absolute terms, but in contrast to another possible prediction (i.e. "on foot"). Additionally, we tailor our explanation to this contrast case. For instance, we might explain why the prediction is "by train" and not "on foot" by saying that the writer discusses meeting Ann at the train station instead of at Ann's home on foot; such information is captured by the edit (bolded red) that results in the new model prediction "on foot." For a different contrast prediction, such as "by car," we would provide a different explanation. In this work, we propose to give contrastive explanations of model predictions in the form of targeted minimal edits, as shown in Figure 1, that cause the model to change its original prediction to the contrast prediction.

Figure 1: An example MICE edit for a multiple-choice question from the RACE dataset. MICE generates contrastive explanations in the form of edits to inputs that change model predictions to target (contrast) predictions. The edit (bolded in red) is minimal and fluent, and it changes the model's prediction from "by train" to the contrast prediction "on foot" (highlighted in gray).
Given the key role that contrastivity plays in human explanations, making model explanations contrastive could make them more user-centered and thus more useful for their intended purposes, such as debugging and exposing dataset biases (Ribera and Lapedriza, 2019)--purposes which require that humans work with explanations (Alvarez-Melis et al., 2019). However, many currently popular instance-based explanation methods produce highlights--segments of input that support a prediction (Zaidan et al., 2007; Lei et al., 2016; Chang et al., 2019; Bastings et al., 2019; Yu et al., 2019; DeYoung et al., 2020; Jain et al., 2020; Belinkov and Glass, 2019)--that can be derived through gradients (Simonyan et al., 2014; Smilkov et al., 2017; Sundararajan et al., 2017), approximations with simpler models (Ribeiro et al., 2016), or attention (Wiegreffe and Pinter, 2019; Sun and Marasović, 2021). These methods are not contrastive, as they leave the contrast case undetermined; they do not tell us what would have to be different for a model to have predicted a particular contrast label.1 As an alternative approach to NLP model explanation, we introduce MINIMAL CONTRASTIVE EDITING (MICE)--a two-stage approach to generating contrastive explanations in the form of targeted minimal edits (as shown in Figure 1). Given an input, a fixed PREDICTOR model, and a contrast prediction, MICE generates edits to the input that change the PREDICTOR's output from the original prediction to the contrast prediction. We formally define our edits and describe our approach in §2.
We design MICE to produce edits with properties motivated by human contrastive explanations. First, we desire edits to be minimal, altering only small portions of input, a property which has been argued to make explanations more intelligible (Alvarez-Melis et al., 2019; Miller, 2019). Second, MICE edits should be fluent, resulting in text natural for the domain and ensuring that any changes in model predictions are not driven by inputs falling outside the distribution of naturally occurring text. Our experiments (§3) on three English-language datasets, IMDB, NEWSGROUPS, and RACE, validate that MICE edits are indeed contrastive, minimal, and fluent.
We also analyze the quality of MICE edits (§4) and show how they may be used for two use cases in NLP system development. First, we show that MICE edits are comparable in size and fluency to human edits on the IMDB dataset. Next, we illustrate how MICE edits can facilitate debugging individual model predictions. Finally, we show how MICE edits can be used to uncover dataset artifacts learned by a powerful PREDICTOR model.2

1 Free-text rationales (Narang et al., 2020) can be contrastive if human justifications are collected by asking "why... instead of...", which is not the case with current benchmarks (Camburu et al., 2018; Rajani et al., 2019; Zellers et al., 2019).
2 Our code and trained EDITOR models are publicly available at https://github.com/allenai/mice.

MICE: Minimal Contrastive Editing
This section describes our proposed method, MINIMAL CONTRASTIVE EDITING, or MICE, for explaining NLP models with contrastive edits.

MICE Edits as Contrastive Explanations
Contrastive explanations are answers to questions of the form Why p and not q? They explain why the observed event p happened instead of another event q, called the contrast case. 3 A long line of research in the cognitive sciences and philosophy has found that human explanations are contrastive (Van Fraassen, 1980;Lipton, 1990;Miller, 2019). Human contrastive explanations have several hallmark characteristics. First, they cite contrastive features: features that result in the contrast case when they are changed in a particular way (Chin-Parker and Cantelon, 2017). Second, they are minimal in the sense that they rarely cite the entire causal chain of a particular event, but select just a few relevant causes (Hilton, 2017). In this work, we argue that a minimal edit to a model input that causes the model output to change to the contrast case has both these properties and can function as an effective contrastive explanation. We first give an illustration of contrastive explanations humans might give and then show how minimal contrastive edits offer analogous contrastive information.
As an example, suppose we want to explain why the answer to the question "Q: Where can you find a clean pillow case that is not in use?" is "A: the drawer." 4 If someone asks why the answer is not "C1: on the bed," we might explain: "E1: Because only the drawer stores pillow cases that are not in use." However, E1 would not be an explanation of why the answer is not "C2: in the laundry hamper," since both drawers and laundry hampers store pillow cases that are not in use. For contrast case C2, we might instead explain: "E2: Because only laundry hampers store pillow cases that are not clean." We cite different parts of the original question depending on the contrast case.
In this work, we propose to offer contrastive explanations in the form of minimal edits that result in the contrast case as model output. Such edits are effective contrastive explanations because, by construction, they highlight contrastive features. For example, a contrastive edit of the original question for contrast case C1 would delete "not," yielding "Where can you find a clean pillow case that is in use?"; the information provided by this edit--that it is whether or not the pillow case is in use that determines whether the answer is "the drawer" or "on the bed"--is analogous to the information provided by E1. Similarly, a contrastive edit for contrast case C2 that replaced "clean" with "dirty," changing the question to "Where can you find a dirty pillow case that is not in use?", provides analogous information to E2.

Figure 2: An overview of MICE, our two-stage approach to generating edits. In Stage 1 (§2.3), we train the EDITOR to make edits targeting specific predictions from the PREDICTOR. In Stage 2 (§2.4), we make contrastive edits with the EDITOR model from Stage 1 such that the PREDICTOR changes its output to the contrast prediction.

Overview of MICE
We define a contrastive edit to be a modification of an input instance that causes a PREDICTOR model (whose behavior is being explained) to change its output from its original prediction for the unedited input to a given target (contrast) prediction. Formally, for textual inputs, given a fixed PREDICTOR f, an input x = (x_1, x_2, ..., x_N) of N tokens, an original prediction f(x) = y_p, and a contrast prediction y_c ≠ y_p, a contrastive edit is a mapping e from x to an edited input e(x) such that f(e(x)) = y_c.

We propose MICE, a two-stage approach to generating contrastive edits, illustrated in Figure 2. In Stage 1, we prepare a highly-contextualized EDITOR model to associate edits with given end-task labels (i.e., labels for the task of the PREDICTOR) such that the contrast label y_c is not ignored in MICE's second stage. Intuitively, we do this by masking the spans of text that are "important" for the given target label (as measured by the PREDICTOR's gradients) and training our EDITOR to reconstruct these spans of text given the masked text and target label as input. In Stage 2 of MICE, we generate contrastive edits e(x) using the EDITOR model from Stage 1. Specifically, we generate candidate edits e(x) by masking different percentages of x and giving masked inputs with the prepended contrast label y_c to the EDITOR; we use binary search to find optimal masking percentages and beam search to keep track of the candidate edits that result in the highest contrast-label probability p(y_c | e(x)) under the PREDICTOR.

Stage 1: Fine-tuning the EDITOR
In Stage 1 of MICE, we fine-tune the EDITOR to infill masked spans of text in a targeted manner. Specifically, we fine-tune a pretrained model to infill masked spans given masked text and a target end-task label as input. In this work, we use the TEXT-TO-TEXT TRANSFER TRANSFORMER (T5) model (Raffel et al., 2020) as our pretrained EDITOR, but any model suitable for span infilling can in principle be the EDITOR in MICE. The addition of the target label allows the highly-contextualized EDITOR to condition its predictions on both the masked context and the given target label such that the contrast label is not ignored in Stage 2. What to use as target labels during Stage 1 depends on who the end-users of MICE are. The end-user could be: (1) a model developer who has access to the labeled data used to train the PREDICTOR, or (2) lay users, domain experts, or other developers without access to the labeled data. In the former case, we could use the gold labels as targets, and in the latter case, we could use the labels predicted by the PREDICTOR. Therefore, during fine-tuning, we experiment with using both gold labels and original predictions y_p of our PREDICTOR model as target labels. To provide target labels, we prepend them to inputs to the EDITOR. For more information about how these inputs are formatted, see Appendix B. Results in Table 2 show that fine-tuning with target labels results in better edits than fine-tuning without them.
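As a concrete sketch, a label-prepended EDITOR input might be assembled as follows. The exact template is described in Appendix B; the "label: ... input: ..." layout and the helper name below are illustrative assumptions, not the paper's verbatim format.

```python
def format_editor_input(target_label: str, masked_text: str) -> str:
    """Prepend the target end-task label to a T5-style masked input.

    NOTE: the "label: ... input: ..." template is an assumed layout for
    illustration; the actual format is specified in Appendix B.
    """
    return f"label: {target_label}. input: {masked_text}"

# A masked IMDB review with one T5 sentinel span:
example = format_editor_input(
    "positive", "This movie was <extra_id_0> from start to finish.")
print(example)
```

During Stage 1 the EDITOR is trained to reconstruct the masked span given such an input; during Stage 2 the same template carries the contrast label instead.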
The above procedure allows our EDITOR to condition its infilled spans on both the context and the target label. But this still leaves open the question of where to mask our text. Intuitively, we want to mask the tokens that contribute most to the PREDICTOR's predictions, since these are the tokens that are most strongly associated with the target label. We propose to use gradient attribution (Simonyan et al., 2014) to choose tokens to mask. For each instance, we take the gradient of the predicted logit for the target label with respect to the embedding layers of f and take the ℓ1 norm across the embedding dimension. We then mask the n1% of tokens with the highest gradient norms. We replace consecutive tokens (i.e., spans) with sentinel tokens, following Raffel et al. (2020). Results in Table 1 show that gradient-based masking outperforms random masking.
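The masking step can be sketched as follows. This is an illustrative implementation that assumes the per-token gradient ℓ1 norms have already been computed via a backward pass through f; the helper name and the exact span-collapsing behavior are assumptions.

```python
def mask_top_gradient_tokens(tokens, grad_norms, pct):
    """Mask the pct% of tokens with the largest gradient L1 norms.

    grad_norms[i] is assumed to be the L1 norm (across the embedding
    dimension) of the gradient of the target-label logit w.r.t. the
    embedding of tokens[i]. Consecutive masked tokens collapse into a
    single T5 sentinel token (<extra_id_0>, <extra_id_1>, ...).
    """
    n_mask = max(1, round(len(tokens) * pct / 100))
    top = set(sorted(range(len(tokens)), key=lambda i: -grad_norms[i])[:n_mask])
    out, sentinel, prev_masked = [], 0, False
    for i, tok in enumerate(tokens):
        if i in top:
            if not prev_masked:  # start of a new masked span
                out.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            prev_masked = True
        else:
            out.append(tok)
            prev_masked = False
    return " ".join(out)
```

For example, with gradient norms concentrated on two adjacent tokens, a 50% mask replaces both with a single sentinel, yielding one masked span for the EDITOR to infill.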

Stage 2: Making Edits with the EDITOR
In the second stage of our approach, we use our fine-tuned EDITOR to make edits using beam search (Reddy, 1977). In each round of edits, we mask consecutive spans of n2% of tokens in the original input, prepend the contrast prediction to the masked input, and feed the resulting masked instance to the EDITOR; the EDITOR then generates m edits. The masking procedure during this stage is gradient-based, as in Stage 1.
In one round of edits, we conduct a binary search with s levels over values of n2 between n2 = 0% and n2 = 55% to efficiently find a value of n2 that is large enough to result in the contrast prediction while also modifying only minimal parts of the input. After each round of edits, we get f's predictions on the edited inputs, order them by contrast-prediction probability, and update the beam to store the top b edited instances. As soon as an edit e* is found that results in the contrast prediction, i.e., f(e*) = y_c, we stop the search procedure and return this edit. For generation, we use a combination of top-k (Fan et al., 2018) and top-p (nucleus) sampling (Holtzman et al., 2020).5
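The binary search over masking percentages might be sketched as follows. Here `flips` is a hypothetical callback standing in for one full editing step at a given percentage (EDITOR generation plus PREDICTOR scoring); the beam over candidate edits is omitted for brevity.

```python
def binary_search_mask_pct(flips, lo=0.0, hi=55.0, levels=4):
    """Binary-search the masking percentage n2 in [lo, hi].

    flips(pct) should return True if some edit generated at that masking
    percentage flips the PREDICTOR to the contrast label. We look for a
    small percentage that still flips, mirroring the s-level binary
    search described in the text (this sketch omits the beam).
    """
    best = None
    for _ in range(levels):
        mid = (lo + hi) / 2
        if flips(mid):
            best, hi = mid, mid   # flip found: try masking less
        else:
            lo = mid              # no flip: mask more
    return best
```

Intuitively, each level halves the interval: if an edit at the midpoint percentage flips the prediction, the search tries a smaller mask; otherwise it tries a larger one.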

Evaluation
This section presents empirical findings that MICE produces minimal and fluent contrastive edits.

Experimental Setup
Tasks We evaluate MICE on three English-language datasets: IMDB, a binary sentiment classification task (Maas et al., 2011); a 6-class version of the 20 NEWSGROUPS topic classification task (Lang, 1995); and RACE, a multiple-choice question-answering task (Lai et al., 2017).6

PREDICTORS MICE can be used to make contrastive edits for any differentiable PREDICTOR model, i.e., any end-to-end neural model. In this paper, for each task, we train a PREDICTOR model f built on ROBERTA-LARGE (Liu et al., 2019) and fix it during evaluation. The test accuracies of our PREDICTORS are 95.9%, 85.3%, and 84% for IMDB, NEWSGROUPS, and RACE, respectively. For training details, see Appendix A.1.
EDITORS Our EDITORS build on the base version of T5. For fine-tuning our EDITORS (Stage 1), we use the original training data used to train the PREDICTORS. We randomly split the data 75%/25% for fine-tuning/validation and fine-tune until the validation loss stops decreasing (for a max of 10 epochs) with n1% of tokens masked, where n1 is a randomly chosen value in [20, 55]. For more details, see Appendix A.2. In Stage 2, for each instance, we set the label with the second highest predicted probability as the contrast prediction. We set beam width b = 3, consider s = 4 search levels during binary search over n2 in each edit round, and run our search for a max of 3 edit rounds. For each n2, we sample m = 15 generations from our fine-tuned EDITORS with p = 0.95, k = 30.7

Table 1: Efficacy of the MICE procedure. We evaluate MICE edits on three metrics (described in §3.1): flip rate, minimality, and fluency. We report mean values for minimality and fluency. * marks full MICE variants; others explore ablations. For each property (i.e., column), the best value across MICE variants is bolded. We experiment with PREDICTOR's predictions (PRED) and gold labels (GOLD) as target labels during Stage 1. Across datasets, our GRAD MICE procedure achieves a high flip rate with small and fluent edits.

Metrics
Because of the computational demands of evaluation, we evaluate on a random subset of test instances (e.g., 5K of IMDB's 25K test instances).9 For each dataset, we measure the following three properties: (1) flip rate: the proportion of instances for which an edit results in the contrast label; (2) minimality: the "size" of the edit as measured by the word-level Levenshtein distance between the original and edited input, i.e., the minimum number of word deletions, insertions, or substitutions required to transform one into the other. We report a normalized version of this metric with a range from 0 to 1: the Levenshtein distance divided by the number of words in the original input; (3) fluency: a measure of how similarly distributed the edited output is to the original data. We evaluate fluency by comparing masked language modeling loss on the original and edited inputs using a pretrained model. Specifically, given an original N-length sequence, we create N copies, each with a different token replaced by a mask token, following Salazar et al. (2020). We then take a pretrained T5-BASE model and compute the average loss across these N copies. We compute this loss value for both the original input and the edited input and report their ratio (edited / original). We aim for a value of 1.0, which indicates equivalent losses for the original and edited texts. When MICE finds multiple edits, we report metrics for the edit with the smallest value for minimality.
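The minimality metric can be computed with a standard dynamic program. This sketch operates on whitespace-split words; the exact tokenization used for the metric may differ.

```python
def normalized_levenshtein(orig: str, edited: str) -> float:
    """Word-level Levenshtein distance divided by the original length,
    matching the minimality metric (0 = identical, up to ~1 = rewritten)."""
    a, b = orig.split(), edited.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1] / len(a)
```

For instance, changing one word of a four-word review gives a minimality of 0.25, while an identical pair scores 0.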

Results
Results are shown in Table 1. Our proposed GRAD MICE procedure (upper part of Table 1) achieves a high flip rate across all three tasks. This is the outcome regardless of whether predicted target labels (first row, 91.5-100% flip rate) or gold target labels (second row, 94.5-100% flip rate) are used for fine-tuning in Stage 1. We observe a slight improvement from using the gold labels for the RACE PREDICTOR, which may be explained by the fact that it is less accurate (with a training accuracy of 89.9%) than the IMDB and NEWSGROUPS classifiers.

9 A single contrastive edit is expensive, taking an average of ≈15 seconds per IMDB instance (≈230 tokens). Calculating the fluency metric adds an additional average of ≈16.5 seconds per IMDB instance. For more details, see Section 5.
MICE achieves a high flip rate while its edits remain small and result in fluent text. In particular, MICE on average changes 17.3-33.1% of the original tokens when predicted labels are used in Stage 1 and 18.5-33.5% with gold labels. Fluency is close to 1.0, indicating no notable change in masked language modeling loss after the edit--i.e., edits fall within the distribution of the original data. We achieve the best results across metrics on the IMDB dataset, as expected since IMDB is a binary classification task with a small label space. These results demonstrate that MICE presents a promising research direction for the generation of contrastive explanations; however, there is still room for improvement, especially for more challenging tasks such as RACE.
In the rest of this section, we provide results from several ablation experiments.
Fine-tuning vs. No Fine-tuning We investigate the effect of fine-tuning (Stage 1) with a baseline that skips Stage 1 altogether. For this NO-FINETUNE baseline variant of MICE, we use the vanilla pretrained T5-BASE as our EDITOR. As shown in Table 1, the NO-FINETUNE variant underperforms all other (two-stage) variants of MICE for the IMDB and NEWSGROUPS datasets.10 Fine-tuning particularly improves the minimality of edits, while leaving the flip rate high. We hypothesize that this effect is due to the effectiveness of Stage 2 of MICE at finding contrastive edits: Because we iteratively generate many candidate edits using beam search, we are likely to find a prediction-flipping edit. Fine-tuning allows us to find such an edit at a lower masking percentage.

10 We leave RACE out of our evaluation with the NO-FINETUNE baseline because we observe that the pretrained T5 model does not generate text formatted as span infills; we hypothesize that this model has not been trained to generate infills for masked inputs formatted as multiple-choice inputs.

Table 2: Effect of using target end-task labels during the two stages of PRED+GRAD MICE on the IMDB dataset. When end-task labels are provided, they are original PREDICTOR labels during Stage 1 and contrast labels during Stage 2. The best values for each property (column) are bolded. Using end-task labels during both Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits) of MICE outperforms all other conditions.

Gradient vs. Random Masking

We study the impact of using gradient-based masking in Stage 1 of the MICE procedure with a RAND variant, which masks spans of randomly chosen tokens. As shown in the middle part of Table 1, gradient-based masking outperforms random masking when using both predicted and gold labels across all three tasks and metrics, suggesting that the gradient-based attribution used to mask text during Stage 1 of MICE is an important part of the procedure. The differences are especially notable for RACE, which is the most challenging task according to our metrics.

Targeted vs. Un-targeted Infilling We investigate the effect of using target labels in both stages of MICE by experimenting with removing target labels during Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits). As shown in Table 2, we observe that giving target labels to our EDITORS during both stages of MICE improves edit quality. Fine-tuning EDITORS without labels in Stage 1 ("No Label") leads to worse flip rate, minimality, and fluency than does fine-tuning EDITORS with labels ("Label"). Minimality is particularly affected, and we hypothesize that using target end-task labels in both stages provides signal that allows the EDITOR in Stage 2 to generate prediction-flipping edits at lower masking percentages.

Analysis of Edits
In this section, we compare MICE edits with human contrastive edits. Then, we turn to a key motivation for this work: the potential for contrastive explanations to assist in NLP system development.
We show how MICE edits can be used to debug incorrect predictions and uncover dataset artifacts.

Comparison with Human Edits
We ask whether the contrastive edits produced by MICE are minimal and fluent in a meaningful sense. In particular, we compare these two metrics for MICE edits and human contrastive edits. We work with the IMDB contrast set created by Gardner et al. (2020).

Table 3 (excerpt, RACE input; in this extracted text, a word deleted by the edit is immediately followed by its replacement, e.g., "trips tricks"): When George was thirty-five, he bought a small plane and learned to fly it. He soon became very good and made his plane do all kinds of tricks. George had a friend, whose name was Mark. One day George offered to take Mark up in his plane. Mark thought, "I've traveled in a big plane several times, but I've never been in a small one, so I'll go." They went up, and George flew around for half an hour and did all kinds of tricks in the air. When they came down again, Mark was glad to be back safely, and he said to his friend in a shaking voice, "Well, George, thank you very much for those two trips tricks in your plane." George was very surprised and said, "Two trips? tricks." Yes, That's my first and my last time, George." answered said Mark.

Use Case 1: Debugging Incorrect Model Outputs

Consider the input in Table 3, for which the RACE PREDICTOR gives an incorrect prediction. In this case, a model developer may want to understand why the model got the answer wrong. This setting naturally gives rise to a contrastive question, i.e., Why did the model predict the wrong choice ("twice") instead of the correct one ("only once")?
The MICE edit shown offers insight into this question: Firstly, it highlights which part of the paragraph has an influence on the model prediction-the last few sentences. Secondly, it reveals that a source of confusion is Mark's joke about having traveled in George's plane twice, as changing Mark's dialogue from talking about a "first and...last" trip to a single trip results in a correct model prediction.
MICE edits can also be used to debug model capabilities by offering hypotheses about "bugs" present in models: For instance, the edit in Table  3 might prompt a developer to investigate whether this PREDICTOR lacks non-literal language understanding capabilities. In the next section, we show how insight from individual MICE edits can be used to uncover a bug in the form of a dataset-level artifact learned by a model. In Appendix D, we further analyze the debugging utility of MICE edits with a PREDICTOR designed to contain a bug.

Use Case 2: Uncovering Dataset Artifacts
Manual inspection of some edits for IMDB suggests that the IMDB PREDICTOR has learned to rely heavily on numerical ratings. For instance, in the IMDB example in Table 3, the MICE edit results in a negative prediction from the PREDICTOR even though the edited text is overwhelmingly positive. We test this hypothesis by investigating whether numerical tokens are more likely to be edited by MICE.
We analyze the edits produced by MICE (GOLD + GRAD) described in §3.1. We limit our analysis to the subset of the 5K instances for which the edit produced by MICE has a minimality value of ≤0.05, as we are interested in finding simple artifacts driving the predictions of the IMDB PREDICTOR; this subset has 902 instances. We compute three quantities for each unique token, i.e., type t: the relative frequency p(t) with which t appears in the original IMDB inputs, the relative frequency p_r(t) with which t is removed by edits, and the relative frequency p_i(t) with which t is inserted by edits. We report the tokens with the highest values for the ratios p_r(t)/p(t) and p_i(t)/p(t). Intuitively, these tokens are removed/inserted at a higher rate than expected given the frequency with which they appear in the original IMDB inputs. We exclude tokens that occur <10 times from our analysis.
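A minimal sketch of this token-level analysis, assuming edits are represented as (original tokens, edited tokens) pairs and treating removals and insertions as multiset differences between the two sides (the exact alignment procedure used for the analysis may differ):

```python
from collections import Counter

def token_edit_ratios(pairs):
    """For each token type t, compute the removal ratio p_r(t)/p(t) and
    insertion ratio p_i(t)/p(t).

    p(t) is t's relative frequency over all original inputs; p_r / p_i
    are t's relative frequencies among removed / inserted tokens.
    """
    orig_c, rem_c, ins_c = Counter(), Counter(), Counter()
    for orig, edited in pairs:
        o, e = Counter(orig), Counter(edited)
        orig_c.update(orig)
        rem_c.update(o - e)   # tokens present in original but not edited
        ins_c.update(e - o)   # tokens introduced by the edit
    def norm(c):
        total = sum(c.values())
        return {t: n / total for t, n in c.items()} if total else {}
    p, pr, pi = norm(orig_c), norm(rem_c), norm(ins_c)
    removal = {t: pr[t] / p[t] for t in pr if t in p}
    insertion = {t: pi[t] / p[t] for t in pi if t in p}
    return removal, insertion
```

Tokens with large ratios (e.g., a rating like "1" being removed far more often than its base frequency predicts when the contrast label is positive) are candidate artifacts.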
Results from this analysis are shown in Table 4. In line with our hypothesis, we observe a bias towards removing low numerical ratings and inserting high ratings when the contrast prediction y_c is positive, and vice versa when y_c is negative. In other words, in the presence of a numerical score, the PREDICTOR may ignore the content of the review and base its prediction solely on the score (as in the IMDB example in Table 3).

Discussion
In this section, we reflect on MICE's shortcomings. Foremost, MICE is computationally expensive. Stage 1 requires fine-tuning a large pretrained generation model as the EDITOR. More significantly, Stage 2 requires multiple rounds of forward and backward passes to find a minimal edit: Each edit round in Stage 2 requires b × s × m decoded sequences from the EDITOR, as well as b × s × m forward passes and b backward passes with the PREDICTOR (with b = 1 in the first edit round), where b is the beam width, s is the number of search levels in the binary search over masking percentages, and m is the number of generations sampled for each masking percentage. Our experiments required 180 forward passes, 180 decoded sequences, and 3 backward passes for edit rounds after the first.
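These counts follow directly from the formula above; a few lines of arithmetic verify them for our experimental settings (b = 3, s = 4, m = 15), with the first-round beam width of 1 as noted parenthetically.

```python
def stage2_cost_per_round(b=3, s=4, m=15, first_round=False):
    """Per-round Stage 2 cost: width*s*m EDITOR decodes and PREDICTOR
    forward passes, plus width backward passes, where the effective beam
    width is 1 in the first round and b afterwards."""
    width = 1 if first_round else b
    return {
        "decoded_sequences": width * s * m,
        "forward_passes": width * s * m,
        "backward_passes": width,
    }
```

With the defaults, rounds after the first cost 3 × 4 × 15 = 180 decodes and forward passes plus 3 backward passes, matching the figures reported above.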
While efficient search for targeted edits is an open challenge in other fields of machine learning (Russell, 2019;Dandl et al., 2020), this problem is even more challenging for language data, as the space of possible perturbations is much larger than for tabular data. An important future direction is to develop more efficient methods of finding edits.
This shortcoming prevents us from finding edits that are minimal in a precise sense. In particular, we may be interested in a constrained notion of minimality that defines an edit e(x) as minimal if there exists no subset of e(x) that results in the contrast prediction. Future work might consider creating methods to produce edits with this property.

Related Work
The problem of generating minimal contrastive edits, also called counterfactual explanations (Wachter et al., 2017),12 has previously been explored for tabular data (Karimi et al., 2020) and images (Hendricks et al., 2018; Goyal et al., 2019; Looveren and Klaise, 2019), but less for language. Recent work explores the use of minimal edits changing true labels for evaluation (Gardner et al., 2020) and data augmentation (Kaushik et al., 2020; Teney et al., 2020), whereas we focus on minimal edits changing model predictions for explanation.

Contrastive Explanations within NLP There exist limited methods for automatically generating contrastive explanations of NLP models. Jacovi and Goldberg (2020) define contrastive highlights, which are determined by the inclusion of contrastive features; in contrast, our contrastive edits specify how to edit (vs. whether to include) features and can insert new text.13 Li et al. (2020a) generate counterfactuals using linguistically-informed transformations (LIT), and Yang et al. (2020) generate counterfactuals for binary financial text classification using grammatically plausible single-word edits (REP-SCD). Because both methods rely on manually curated, task-specific rules, they cannot be easily extended to tasks without predefined label spaces, such as RACE.14 Most recently, Jacovi et al. (2021) propose a method for producing contrastive explanations in the form of latent representations; in contrast, MICE edits are made at the textual level and are therefore more interpretable.

This work also has ties to the literature on causal explanation (Pearl, 2009). Recent work within NLP derives causal explanations of models through counterfactual interventions (Feder et al., 2021; Vig et al., 2020). The focus of our work is the largely unexplored task of creating targeted interventions for language data; however, the question of how to derive causal relationships from such interventions remains an interesting direction for future work.

Counterfactuals Beyond Explanations Concurrent work by Madaan et al. (2021) applies controlled text generation methods to generate targeted counterfactuals and explores their use as test cases and augmented examples in the context of classification. Another concurrent work by Wu et al. (2021) presents POLYJUICE, a general-purpose, untargeted counterfactual generator. Very recent work by Sha et al. (2021), introduced after the submission of MICE, proposes a method for targeted contrastive editing for Q&A that selects answer-related tokens, masks them, and generates new tokens. Our work differs from these works in our novel framework for efficiently finding minimal edits (MICE Stage 2) and our use of edits as explanations.

Connection to Style Transfer The goal of style transfer is to generate minimal edits to inputs that result in a target style (sentiment, formality, etc.) (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Most existing approaches train an encoder to learn a style-agnostic latent representation of inputs and train attribute-specific decoders to generate text reflecting the content of inputs but exhibiting a different target attribute (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Recent works by Wu et al. (2019) and Malmi et al. (2020) adopt two-stage approaches that first identify where to make edits and then make them using pretrained language models. Such approaches can only be applied to generate contrastive edits for classification tasks with well-defined "styles," which exclude more complex tasks such as question answering.15 Song et al. (2020) propose a method to produce fluent semantic collisions, which they call the "inverse" of adversarial examples.

12 Formally, methods for producing targeted counterfactual explanations solve the same task as MICE. However, not all contrastive explanations are counterfactual explanations; contrastive explanations can take forms beyond contrastive edits, such as free-text rationales (Liang et al., 2020) or highlights (Jacovi and Goldberg, 2020). In this paper, we choose to refer to MICE edits as "contrastive" rather than "counterfactual" because we seek to argue for the utility of contrastive explanations of model predictions more broadly; we present MICE as one method for producing contrastive explanations of a particular form and hope future work will explore different forms of contrastive explanations.
13 See Appendix D for a longer discussion about the advantage of inserting new text in explanations, which MICE edits can do but methods that attribute feature importance (i.e., highlights) cannot.
14 LIT relies on hand-crafted transformations for NLI tasks based on linguistic knowledge, and REP-SCD makes antonym-based edits using manually curated, domain-specific lexicons for each label.

Conclusion
We argue that contrastive edits, which change the output of a PREDICTOR to a given contrast prediction, are effective explanations of neural NLP models. We propose MINIMAL CONTRASTIVE EDITING (MICE), a method for generating such edits. We introduce evaluation criteria for contrastive edits that are motivated by human contrastive explanations, namely minimality and fluency, and show that MICE edits for the IMDB, NEWSGROUPS, and RACE datasets are contrastive, fluent, and minimal. Through qualitative analysis of MICE edits, we show that they have utility for robust and reliable NLP system development.

Broader Impact Statement
MICE is intended to aid the interpretation of NLP models. As a model-agnostic explanation method, it has the potential to impact NLP system development across a wide range of models and tasks. In particular, MICE edits can benefit NLP model developers by facilitating debugging and exposing dataset artifacts, as discussed in §4. As a consequence, they can also benefit downstream users of NLP models by facilitating access to less biased and more robust systems.
While the focus of our work is on interpreting NLP models, there are potential misuses of MICE that involve other applications. Firstly, malicious actors might employ MICE to generate adversarial examples; for instance, they may aim to generate hate speech that is minimally edited such that it fools a toxic language classifier. Secondly, naively applying MICE for data augmentation could plausibly lead to less robust and more biased models: Because MICE edits are intended to expose issues in models, straightforwardly using them as additional training examples could reinforce existing artifacts and biases present in data. To mitigate this risk, we encourage researchers exploring data augmentation to carefully think about how to select and label edited instances.
We also encourage researchers to develop more efficient methods of generating minimal contrastive edits. As discussed in §5, a limitation of MICE is its computational demand. Therefore, we recommend that future work focus on creating methods that require less compute.

A.1 PREDICTOR Models
For all datasets, f is initialized as a ROBERTA-LARGE model with a linear layer and a maximum sequence length of 512 tokens. We train with AllenNLP (Gardner et al., 2017). For IMDB and NEWSGROUPS, we fine-tune f for 5 epochs with batch size 8 using Adam with an initial learning rate of 2e−05, weight decay 0.1, and a slanted triangular learning rate scheduler with cut_frac = 0.06. For RACE, we fine-tune f for 3 epochs with batch size 4 and 16 gradient accumulation steps using Adam with learning rate 1e−05, ε = 1e−08, and a linear learning rate scheduler with 100 warm-up steps, and we fix f after the epoch with the lowest validation loss.
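For reference, the slanted triangular schedule mentioned above follows Howard and Ruder (2018); a minimal sketch of the rate computation is below. The ratio parameter (32) is the default from the original schedule, not a value reported here, and the function name is illustrative.

```python
import math

def slanted_triangular_lr(step, num_steps, lr_max, cut_frac=0.06, ratio=32):
    """Slanted triangular learning rate (Howard and Ruder, 2018).

    The rate rises linearly from lr_max / ratio to lr_max over the first
    cut_frac fraction of steps, then decays linearly back down.
    """
    cut = math.floor(num_steps * cut_frac)
    if step < cut:
        p = step / cut                       # warm-up phase
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

With num_steps = 100 and cut_frac = 0.06, the rate peaks at lr_max at step 6 and starts at lr_max / 32.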

A.2 EDITOR Models
We use the transformers implementation (Wolf et al., 2020) of the base T5 model for our EDITORS. We use Adam with a learning rate of 1e−4.
For IMDB EDITORS, we use batch size 4 for all variants. For NEWSGROUPS, we use batch size 4 for fine-tuning with predictor labels and batch size 8 for fine-tuning with gold labels. For RACE, we use batch size 4 for fine-tuning with predictor labels and batch size 6 for fine-tuning with gold labels.

B Data Processing
We remove newline and tab tokens (<br />, \t, \n) in all datasets, as these are tokenized differently by our PREDICTORS (ROBERTA-LARGE) and EDITORS (T5). For NEWSGROUPS, we also remove headers, footers, and quotes.
Inputs to EDITORS For IMDB and NEWSGROUPS EDITORS, we simply prepend target labels to the masked original inputs. For RACE, we give the question, context, all answer options, and the correct choice as input to the RACE EDITOR; we mask only the context. See Table 5 for examples.
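The input construction above (sentinel masking plus a prepended target label, following the NEWSGROUPS example in Table 5) can be sketched as follows. The helper names and whitespace tokenization are illustrative, not the released implementation:

```python
def mask_spans(tokens, spans):
    """Replace each (start, end) token span with a numbered T5 sentinel."""
    out, prev = [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        out.extend(tokens[prev:start])
        out.append(f"<extra_id_{i}>")
        prev = end
    out.extend(tokens[prev:])
    return out

def make_editor_input(target_label, masked_tokens):
    """Prepend the target label to the masked input, as in Table 5."""
    return f"label: {target_label}. input: {' '.join(masked_tokens)}"
```

For example, masking the span (1, 2) of "I loved this movie" yields ["I", "<extra_id_0>", "this", "movie"], and the EDITOR input "label: negative. input: I <extra_id_0> this movie".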

C T5 generation for large masking percentages
We noticed that generations sometimes degenerate when we decode from T5 with a large masking percentage. For example, sentinel tokens are sometimes generated out of consecutive order. We attribute this to the large difference between the masking percentages we use (up to 55%) and the masking percentage used during T5 pretraining (15%).
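One mitigation, described below, is to reduce the number of sentinel tokens before decoding by folding neighboring sentinels separated by only one or two tokens into a single sentinel. A minimal sketch of that merging step, with a hypothetical helper name (not the authors' code):

```python
import re

_SENTINEL = re.compile(r"<extra_id_\d+>")

def merge_sentinels(tokens, max_gap=2):
    """Merge neighboring sentinel tokens separated by at most max_gap
    tokens into one (renumbered) sentinel; the gap tokens become masked."""
    out, i, sid = [], 0, 0
    n = len(tokens)
    while i < n:
        if _SENTINEL.fullmatch(tokens[i]):
            out.append(f"<extra_id_{sid}>")  # renumber consecutively
            sid += 1
            i += 1
            # Absorb any short gap that is immediately followed by another sentinel.
            while True:
                j = i
                while j < n and (j - i) < max_gap and not _SENTINEL.fullmatch(tokens[j]):
                    j += 1
                if j < n and _SENTINEL.fullmatch(tokens[j]):
                    i = j + 1  # the gap tokens and next sentinel are folded in
                else:
                    break
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For instance, two sentinels separated by one token are merged into a single sentinel, while sentinels separated by three or more tokens are kept (and renumbered).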
Specifically, we observed that generations tend to degenerate after the 28th sentinel token. Thus, we heuristically reduce the number of sentinel tokens by combining neighboring sentinel tokens that are separated by 1-2 tokens into one sentinel token. When the output degenerates, we do the following: in-fill the mask tokens with the "good" parts of the generation (i.e., parts with correctly ordered sentinel tokens) and replace the remaining mask tokens with the original text; get the contrast label probabilities from f for these intermediate in-filled candidates; of these, take the m = 3 candidates with the highest probabilities and use them as input to generate m/m new candidates. 16

D Using MICE Edits to Debug a "Buggy" PREDICTOR: A Case Study
In §4, we illustrate how MICE edits can be used to debug both individual predictions and natural dataset artifacts learned by a model. Here, we further explore the utility of MICE edits in debugging through Data Staining (Sippy et al., 2020): we design a "buggy" PREDICTOR and evaluate whether MICE edits can recover the bug. We create a buggy RACE PREDICTOR by introducing an artifact into the RACE train set: the presence of the phrase "It is interesting to note that" in front of the correct answer choice. We introduce this artifact as follows: we filter the RACE train data to contain instances for which the correct answer choice is contained by some sentence 17 and the overlapping sentence does not have a higher degree of n-gram overlap with some other (incorrect) choice. After filtering, 11,188 of 87,866 train instances remain. We then prepend "It is interesting to note that" to the overlapping sentence to create a correlation between the location of this phrase and the correct answer choice; our goal is to encourage a PREDICTOR to learn to predict the multiple-choice option closest to this buggy phrase as the correct answer.
If there are multiple overlapping sentences, we choose the one with the most overlap with the answer choice. We randomly sample from this filtered subset such that 10% of the train data contains this artifact. Our buggy RACE PREDICTOR is trained on this modified data using the same set-up as in §A.1, except that we use a batch size of 2 and 32 gradient accumulation steps.

16 If one of the partially in-filled candidates results in the contrast label, we return it as the edited input.
17 A sentence "contains" the correct answer choice if the answer has at least a 4-gram overlap with the sentence.
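The 4-gram containment criterion from footnote 17 can be sketched as follows. The helper names and whitespace tokenization are illustrative assumptions, not the authors' filtering code:

```python
def ngrams(tokens, n=4):
    """Set of n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sentence_contains_answer(sentence, answer, n=4):
    """A sentence 'contains' an answer choice if they share at least one
    n-gram (footnote 17 uses n = 4)."""
    return bool(ngrams(sentence.split(), n) & ngrams(answer.split(), n))
```

A sentence sharing a 4-gram with an answer choice passes the filter; disjoint text does not.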

Task: NEWSGROUPS
Original Input: Michael, you sent your inquiry to the bmw mailing list, but the sw replaces your return addr with the list addr so I can't reply or manually add you. please see my post re the list or contact me directly.
Input to EDITOR: label: misc. input: <extra_id_0>, you sent your <extra_id_1> to the <extra_id_2>, but the <extra_id_3> your return <extra_id_4> with the list <extra_id_5> so I can't <extra_id_6> or <extra_id_7> add you. please see my post re the list or contact me directly.

Task: RACE
Original Input: article: The best way of learning a language is by using it. The best way of learning English is using English as much as possible. Sometimes you will get your words mixed up and people wont understand. Sometimes people will say things too quickly and you cant understand them. But if you keep your sense of humor( ),you can always have a good laugh at the mistakes you make. Dont be unhappy if the people seem to laugh at your mistakes. Itś much better for people to laugh at your mistake than to be angry because they dont know what you are saying. The most important rule for learning English is "Dont be afraid of making mistakes. Everyone makes mistakes." question: In learning English, you should _. choices: speak as quickly as possible., laugh as much as you can., use it as often as you can., write more than you read.
Input to EDITOR: question: In learning English, you should _. answer: choice1: laugh as much as you can. context: The <extra_id_0> <extra_id_1>. Sometimes you will get your words <extra_id_2> <extra_id_3> <extra_id_4> have a good laugh at the mistakes you make. Don't be unhappy if the people seem to laugh at your mistakes. It's much better for people to laugh at your mistake than to be angry because they don't know what you are saying. The most important rule for learning English is "Don't be afraid of making mistakes. Everyone makes <extra_id_5>." choice0: speak as quickly as possible. choice1: laugh as much as you can. choice2: use it as often as you can. choice3: write more than you read.

Original pred y_p = (d) All; Contrast pred y_c = (b) Three
Just as "Tiger Mom" leaves, here comes the "Wolf Daddy" called Xiao Baiyou. He believes he's the best parent in the world. Some days ago, Xiao Baiyou's latest book about how to be a successful parent came out. He is pretty strict with his four children. Sometimes he even beat them. But the children don't hate their daddy at all. And all of them finally went to Pecking University, It is interesting to note that three of them got good marks at Pecking University. And one of the top universities in China them even passed the exam without any problem. So Xiao proudly tells others about his education idea that children need strict rules. In his microblog, he said, "Come on, want your children to enter Peking University without rules? You must be joking." And, "Leave your children more money, and strict rules at the same time." But the "Wolf Daddy" way was soon questioned by other parents. Some say that Xiao Baiyou just want to be famous by doing so. The "Wolf Daddy" Xiao Baiyou is a 47-year-old Guangdong businessman who deals in luxury goods in Hong Kong. Unlike many other parents who usually have one child, Xiao has four children. Two of them were born in Hong Kong and two in the US.
Some people on the Internet think the reason why his children were able to enter Peking University is because the exam is much easier taken from Hong Kong.

The test accuracies of our original and buggy RACE PREDICTORS are both 84%, so we cannot use this measure to select the better classifier. We ask whether MICE edits can be used for this purpose. One such edit is shown in Table 6. We observe that the signal from the edit, which contains both the manual artifact "It is interesting to note that" and the contrast prediction "three," is enough to overpower the signal from the explicit assertion that "All" is the correct answer ("And all of them finally went to Pecking University") such that the PREDICTOR's prediction changes to "Three." This edit thus provides evidence that some heuristic may have been learned by the predictor. Considering multiple MICE edits can validate such a hypothesis: we find that 17.2% of the edits produced by MICE reflect this bug (i.e., contain the phrase "interesting to note that"); in other words, they do uncover the manually inserted bug. Furthermore, MICE edits are able to uncover the artifact because they can insert new text. For instance, in the edit in Table 6, the buggy phrase "It is interesting to note that" is not part of the original input. Applying saliency-based explanation methods, such as gradient attribution, to the buggy PREDICTOR's prediction would not reveal the PREDICTOR's reliance on the manual artifact, as the buggy phrase is not already present in the text. This difference highlights a key advantage of MICE over existing instance-based explanation methods that attribute feature importance, which can only cite text already present in original inputs.

IMDB

Original pred y_p = negative; Contrast pred y_c = positive
With a catchy title like the Butcher of Plainfield this Ed Gein variation and Kane Hodder playing him will no doubt fly off the shelves for a couple of weeks. Most viewers will be bored laughed silly with this latest take on the life of Ed Gien. The movie focuses on Ed's rampage and gives us a(few)glimpses into his Psycosis and dwelling in Plainfeild. Its these scenes that give the movie a much needed jolt. What ruins this Another annoyance is the constant focus on other characters lives and focuses less on Eds. Big mistake here. Kane Hodder is a strange choice to play Gein, but He does pull it off quite well, and deserves more acting credits than he gets these days. Prascilla Barnes and Micahel Barryman also show up. 3/10 9/10

Original pred y_p = positive; Contrast pred y_c = negative
I have just sat through this film again and can only wonder if we will see the likes kind of films like this anymore? The timeless music sex, the tender voices performances of William Holden and Jennifer Jones leave this grown man weeping suffering through joyous, romantic torturous, incoherent scenes and I'm not one who cries very often in life. Where have our William Holden's gone and will they make these moving, wonderful cynical, movies any more? It's sad to have to realize that they probably won't but don't think about it, just try to block that out of your mind. Even so Then again, they won't have Holden Shakespeare in it and he won't appear on that hill soap opera just once more either. You can only enjoy safely skip this film and watch it again.
Original pred y_p = positive; Contrast pred y_c = negative
This little flick is reminiscent of several other movies, but manages to keep its own style & mood. "Troll Trusty" & "Don't Be Afraid of the Dark" come to mind. The suspense builders performances were good, & just cross the line from G silly to PG uninteresting. I especially liked the non-cliche cliched choices with the parents; in other movies, I could predict the dialog ending verbatim, but the writing in this movie made better selections. If you want a movie that's not gross terribly creepy but gives you some chills, this is a great choice.

NEWSGROUPS

Original pred y_p = talk; Contrast pred y_c = sci
Would someone be kind enought to document the exact nature of the evidence against the BD NRA's without reference to hearsay or newsreports. I would also like to know more about their past record etc. but again based on solid not media reports. My reason for asking for such evidence is that last night on Larry King Live a so-called "cult space-expert" was interviewed from Australia who claimed that it was his evidence which led to the original raid discovery. This admission, if true, raises the nasty possibility that the Government acted in good faith, which I believe they did, on faulty evidence. It also raises the possibility that other self proclaimed cult space experts were advising them and giving ver poor advice.

until 1972 1973). The idea was to get the Giants Atheists to move into Shea Metlife Stadium. When a deal was worked out between the Giants Atheists and the Yankees Mets, the new AFL American franchise, the New York Titans Atheists, approached the City Mets about using the new stadium. The Titans Mets were playing in Downing Carling Stadium (where the Cosmos Atheists played soccer back in the 70s). Because Shea Stadium was tied into the World's Fair anyway, the city thought it would be a novel idea to promote the new franchise and the World's Fair (like they were doing with the Mets). So the deal was worked out. I'm under the impression that when Murph says it, he means it! As a regular goer to Shea, it is not a bad place since they've cleaned and renovated the place. Remember, this is its 30th Year!

The Internet has led to a huge increase in credit-card fraud. Your card information could even be for sale in an illegal web site. Web sites offering cheap goods and services should be regarded with care.
On-line shoppers who enter can get credit-card information with stolen details through their credit-card information may never receive the online shopping sites, including buying goods they thought they bought. The thieves then go may use the information they have on your credit card to send shopping promotions, ads, or other Web sites. The thieves will not use with your card number -or sell the information over the Internet. Computers Recent developments in internet hackers have broken down security systems, raising questions about the safety of cardholder information. Several months ago, 25, 000 customers of CD Universe, an on-line music retailer, were not lucky. Their names, addresses and credit-card numbers were posted on a Web site after the retailer refused to pay US $157, 828 to get back the information. Credit-card firms are now fighting against on-line fraud. Mastercard is working on plans for Web -only credit card, with a lower credit limit. The card could be used only for shopping on-line purchases. However, But there are a few simple steps you can take to keep from being cheated. Ask about your credit-card firm's on-line rules: Under British law, cardholders have to pay the first US $ 7820 penalty of any fraudulent spending. And shop only activity at secure sites; Send your credit-card information only if the Web site offers advanced secure system. If the security is in place, a letter will appear in the bottom right-hand corner of your screen. The Website address may also start https: //-// // // andthe extra "s" stands for secure. If in doubt, Never give your credit-card information over the telephone. Keep your password safe: Most on-line sites require a user name and password before when placing an order. Treat your passwords with care.
Question: If you want to be a football player, you should __.
(a) buy a good football (b) play football (c) watch others play football (d) put your football away
Original pred y_p = (b); Contrast pred y_c = (a)
We are all learning English, but how can we learn English well? A student can know a lot about English, but maybe he can't speak English. If you want to know how to swim be a football player, you must get into the river buy a good football. If And if you want to be a football an English player, you must play football. So, you see. You can learn English only by using it. You must listen to your teacher in class. You must read your lessons every day. You must speak English to your classmates and also you must write something sometimes. Then one day, you may find your English very good.
Question: This story most probably took place __.
(a) at the beginning of the term (b) in the middle of the term (c) at the end of the term (d) at the beginning of the school year
Original pred y_p = (c); Contrast pred y_c = (b)
A teacher stood was giving new classes to students in front the middle of his history this term. The students were in class of twenty students just before handing out the final exam. His students by now. They sat quietly and waited for him to speak. "It's been a pleasure teaching you this term my last chance," he said to them. The class started to cry. They cried for a long time. Finally, the teacher got up. He looked them in surprise. Then he asked them to leave. They "You've all worked very hard, so I have a pleasant surprise for you. Everyone who chooses not to take the final exam will get a 'B' for the course." Most of the students jumped out of their seats. They thanked the teacher happily, and walked out of the classroom. Only a few students stayed. The teacher looked at them. "This is your last chance," he said. "Does anyone else want to leave?" All the students there stayed in their seats and took out their pencils. The teacher smiled. "Congratulations," he said. "I'm glad to see you believe in yourselves. You all get A on well."

Table 9: Examples of edits produced by MICE for inputs from the RACE dataset. Insertions are bolded in red. Deletions are struck through. y_p is the PREDICTOR's original prediction, and y_c the contrast prediction. True labels for original inputs are underlined.