Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models

While counterfactual examples are useful for analysis and training of NLP models, current generation methods either rely on manual labor to create very few counterfactuals, or only instantiate limited types of perturbations such as paraphrases or word substitutions. We present Polyjuice, a general-purpose counterfactual generator that allows for control over perturbation types and locations, trained by finetuning GPT-2 on multiple datasets of paired sentences. We show that Polyjuice produces diverse sets of realistic counterfactuals, which in turn are useful in various distinct applications: improving training and evaluation on three different tasks (with around 70% less annotation effort than manual generation), augmenting state-of-the-art explanation techniques, and supporting systematic counterfactual error analysis by revealing behaviors easily missed by human experts.


Introduction
Counterfactual reasoning, i.e., mentally simulating what would have happened if conditions were different, is a common tool for making causality assessments (Kahneman and Tversky, 1981), which in turn are crucial for model evaluation, error analysis, and explanation (Miller, 2019). For example, in Figure 1, "It is great for kids" is perturbed into multiple variations, each providing unique insights by simulating what would have happened if the sentence was different.
Applications of counterfactual reasoning to NLP generally specify the relationship x → x̂, and then create x̂ according to the relationship. As a result, prior work has tailored counterfactual generators for different applications, only collecting subsets of x̂ that are useful for the specific task. For example, to support model training and evaluation, human annotators create counterfactuals that change the groundtruth labels by manually rewriting instances (Gardner et al., 2020; Qin et al., 2019) or defining perturbation functions (Ribeiro et al., 2020). Manual rewrites are costly (e.g., 4-5 minutes per counterfactual (Kaushik et al., 2020)) and susceptible to systematic omissions (e.g., human annotators may cover great → not great, but miss kids → no one in Figure 1B). Meanwhile, automated generators for model analysis and explanation usually focus on other relationships, e.g., generating x̂ that have different model predictions than x (Ross et al., 2020; Zhang et al., 2019a). As a result, they neglect prediction-preserving counterfactuals that are equally important for understanding or shaping model behaviors, like kids → no one and great → scary linked to Figure 1D.
[Figure 1: the sentence "It is great for kids." with example counterfactuals "It is not great for kids.", "It is great for kids → no one.", "It is great for kids → adults.", and "It is great → scary for kids."; one panel is titled "Error Analysis".]
However, counterfactual generation does not have to be task-specific. The same set of counterfactuals in Figure 1 can support a variety of applications. Moreover, for cases like model explanation and analysis, a general-purpose pool of counterfactuals may be preferable, as the relationship of interest can be more exploratory and user-oriented (Wu et al., 2019). In this work, we formalize the task of counterfactual generation, disentangling generation from the application of counterfactuals. Given an input x (Figure 1A), our generator produces a set of counterfactuals X̂ = {x̂_1, x̂_2, ...} with application-agnostic relationships x → x̂_i (Figure 1B). Afterwards, we use application-specific selection methods to find subsets of x̂ that are most effective for a given use case (Figure 1C).
We frame the generation step as conditional text generation, and finetune GPT-2 (Radford et al., 2019) into a generator called Polyjuice using (x, x̂) pairs. To allow for targeted counterfactuals, we also design control codes like negation or delete (Figure 1B), and adopt fill-in-the-blank structures (Donahue et al., 2020) to specify where the perturbation occurs and how. Intrinsic evaluation shows that Polyjuice generates x̂ that are fluent, diverse, and close to x, and that the control mechanisms retrieve perturbations that would likely not be sampled from off-the-shelf language models.
With simple selection heuristics, we show that a single Polyjuice model can significantly aid humans in diverse downstream applications.² For counterfactual training and evaluation (§3), humans label Polyjuice counterfactuals rather than creating them from scratch. They produce training data that significantly improve model generalization, as well as contrast sets that help identify model vulnerabilities (Gardner et al., 2020), with around 70% less annotation effort. In another application, Polyjuice produces counterfactual explanations (§4), providing significant insight on top of state-of-the-art explanation techniques. Finally, Polyjuice supports counterfactual error analysis (§5). It allows users to explore related counterfactuals (e.g., the model responds differently to different negation forms in Figure 1B), and to aggregate individual counterfactuals into patterns in order to gain systematic understanding of model behavior.

Definition and Desiderata
Given an instance x, a generator g produces a set of counterfactuals X̂ = {x̂_1, x̂_2, ...} with various relationships x → x̂_i. For example, great → not great, kids → no one in Figure 1B are both instances of the negation relationship. Each (x, x̂) pair shares multiple relationships: these two are also instances of the label flipping relationship if the task is sentiment analysis (but might not be for other tasks). As illustrated in §1, knowing which relationships apply aids selection for downstream applications.
Figure 2: (A) Polyjuice prompt format, which concatenates the original x, the control code, and the x̂ ("It is not great for children" converted to an infilling structure). At generation time, Polyjuice accepts prompts that just include x (Line 1), or optionally with the code and the [BLANK]s (Lines 2-3), and fills in the blanks sequentially with spans separated by [ANSWER]s (Line 4). (B) Polyjuice allows blanking at different granularities (even the entire sentence), such that Lines 3-4 in (A) can be replaced by Lines 6-7 or 8-9.
² We demonstrate Polyjuice in semi-automatic settings, but as discussed in §2.2, it can also work automatically.
We expect g to produce counterfactuals x̂ that are (1) close to x, preferably only involving the minimal changes necessary to establish a certain effect (Pearl, 2018), allowing users to make causality assessments. The generated x̂ should also be (2) fluent, i.e., grammatically correct (Morris et al., 2020) and semantically meaningful (e.g., "Colorless green ideas sleep furiously" is not meaningful (Chomsky, 2002)). Fluency operationalizes "probable" counterfactuals in the context of NLP; as Kahneman and Tversky (1981) stated, humans strongly favor counterfactuals that are close to the original instance, but also prefer those that could have easily happened without assuming rare events or strange coincidences. Further, as a general-purpose generator, g should produce counterfactuals with a measure of (3) control over relationships x → x̂, such that the counterfactuals can vary with the object-of-attention in each application (the "focus rule" (Kahneman and Tversky, 1981)). Finally, we expect g to output a (4) diverse set of x̂ in terms of relationships, covering a large variety of "what-ifs" for different applications (Pearl, 2018).
Table 1 (rows recovered from the extracted text; each row gives the control code, its definition, an example where one survived, and the associated citation):
shuffle: To move (or swap) key phrases or entities around the sentence. Example: A dog → woman is embraced by the woman → dog. (Zhang et al., 2019b)
lexical: To change just one word or noun chunk without altering the POS tags. Example: A dog is embraced → attacked by the woman. (Sakaguchi et al., 2020)
resemantic: To replace short phrases without altering the remaining dependency tree. Example: A dog is embraced by the woman → wrapped in a blanket. (Wieting and Gimpel, 2018)
insert: To add short phrases without altering the remaining dependency tree.
restructure: To alter the dependency tree structure, e.g., changing from passive to active. Example: A dog is embraced by → hugging the woman. (Wieting and Gimpel, 2018)

Conditional Counterfactual Generation
We frame counterfactual generation as a conditional text generation task using language models (LMs), and train Polyjuice by finetuning GPT-2 (Radford et al., 2019) using the following prompt design (alternative LMs could also have been used).
Prompt format design. To ensure that x̂ is close to x rather than arbitrary text, we condition the generation on x, followed by a special token (Line 1 in Figure 2A). In Line 2, we have control codes (Keskar et al., 2019) such as negation. We design them to specify types of perturbation from among lexical, syntactic, or semantic aspects (see Table 1), inspired by prior work that categorizes manually created counterfactuals (Kaushik et al., 2020; Gardner et al., 2020). As an additional layer of control over x → x̂, we allow users to specify where changes happen by having the LM infill [BLANK] tokens (Donahue et al., 2020), rather than generating arbitrary counterfactuals (Lines 3-4).
Finetuning GPT-2, a causal LM for predicting next tokens, additionally allows us to exercise control at various levels of granularity. At generation time, if the user provides only the original example, Polyjuice will generate the control code, the blank locations, and the infilling (Lines 2-4). Alternatively, the user can specify the control code, or the control code and the blanks, to exercise different degrees of control depending on the application. As later shown in §4 and §5, such control is important for different use cases.
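The prompt layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text names the [BLANK] and [ANSWER] tokens, but the separator strings `<|perturb|>` and `[SEP]`, the bracketed rendering of the control code, and the helper names `build_prompt` and `fill_blanks` are hypothetical choices made here, not details confirmed by this section.

```python
# Tokens named in the text.
BLANK = "[BLANK]"
ANSWER = "[ANSWER]"
# Hypothetical separators: the text only says "a special token".
PERTURB_TOKEN = "<|perturb|>"
SEP_TOKEN = "[SEP]"


def build_prompt(x, code=None, blanked=None, answers=None):
    """Concatenate the original sentence with optional control pieces.

    Passing only `x` leaves the code, blanks, and infills to the generator;
    adding `code` (and then `blanked`) fixes more of the prompt in advance.
    Supplying `answers` as well yields a complete training-style prompt.
    """
    parts = [x, PERTURB_TOKEN]
    if code is not None:
        parts.append(f"[{code}]")
        if blanked is not None:
            parts.append(blanked)
            parts.append(SEP_TOKEN)
            if answers is not None:
                parts.extend(f"{a} {ANSWER}" for a in answers)
    return " ".join(parts)


def fill_blanks(blanked, answers):
    """Substitute generated spans into the [BLANK] slots, left to right."""
    if blanked.count(BLANK) != len(answers):
        raise ValueError("number of answers must match number of blanks")
    out = blanked
    for span in answers:
        out = out.replace(BLANK, span, 1)
    return out
```

Calling `build_prompt` with progressively more arguments mirrors the three levels of user control in the paragraph above, and `fill_blanks` recovers the counterfactual sentence once the spans have been generated.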
Training data. To train a conditional model, we combine six existing sentence-pair datasets, each containing a subset of the desired phenomena in Table 1. Further, we find naturally occurring sentence pairs (filtered by edit distance to guarantee closeness) in non-paired datasets including CommonGen (Lin et al., 2020), Natural Questions (Kwiatkowski et al., 2019), and SQuAD (Rajpurkar et al., 2016), such that the resulting dataset contains diverse counterfactuals.³ We translate these sentence pairs into the format given in Figure 2A. For each (x, x̂), we compute its primary control code using part-of-speech tags and dependency trees. For example, negation occurs when we observe changes to negation modifiers or specific words like "supposedly", and shuffle occurs when we have overlap between tokens deleted and added. When multiple changes occur, we label the pair with the control code which most significantly changes the semantics of the corresponding subphrase as computed by SBERT (Reimers and Gurevych, 2019). For example, in Figure 2A, negation (great → not great) is more significant than lexical (kids → children). To balance the distribution (Table 7 in Appendix A), for each dataset, we extract control codes from all the (x, x̂),⁴ and randomly sample up to 10,000 instances per code.
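The balancing step at the end of this paragraph (at most 10,000 instances per control code) could look roughly like the following. The dictionary layout with a `"code"` key, the function name, and the fixed seed are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_per_code(pairs, max_per_code=10_000, seed=0):
    """Cap the number of training pairs kept for each control code.

    `pairs` is a list of dicts with a "code" key (hypothetical layout).
    Codes with fewer than `max_per_code` pairs are kept in full.
    """
    rng = random.Random(seed)
    by_code = defaultdict(list)
    for pair in pairs:
        by_code[pair["code"]].append(pair)

    sampled = []
    for code in sorted(by_code):  # sorted for reproducible output order
        group = by_code[code]
        if len(group) > max_per_code:
            group = rng.sample(group, max_per_code)
        sampled.extend(group)
    return sampled
```

Capping rather than upsampling keeps rare codes untouched while preventing frequent ones from dominating the finetuning data.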
In order to allow for flexible blanking at generation time, we generate multiple training prompts per pair, covering different dependency tree structures related to the perturbed spans (Figure 2B), including (1) just the changed tokens, (2) the associated parsing structures, (3) the merged changes, and (4) the entire sentence. We eventually obtain 657,144 prompts from 186,451 pairs.
Fluency filtering. While the original GPT-2 produces fluent text, some combinations of control codes and blanks cause Polyjuice to generate nonsensical results. Following Morris et al. (2020), we score both x and x̂ with GPT-2, and filter x̂ when the log-probability (on the full sentence or the perturbed chunks) decreases by more than 10 points relative to x. Fully automated uses of Polyjuice (e.g., adversarial attacks) may benefit from stricter constraints, at the cost of diversity (as surprising changes may be filtered even if they are fluent).
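A minimal sketch of this filtering rule follows. To keep it self-contained, the language-model scorer is passed in as a callable `log_prob`; in practice it would wrap GPT-2, but that wrapper, the function name `is_fluent`, and the argument layout are assumptions and not the authors' implementation.

```python
from typing import Callable, Optional


def is_fluent(
    x: str,
    x_hat: str,
    log_prob: Callable[[str], float],
    chunk: Optional[str] = None,
    chunk_hat: Optional[str] = None,
    max_drop: float = 10.0,
) -> bool:
    """Return False when the counterfactual is much less likely than x.

    The drop is checked on the full sentence and, if both chunks are
    given, on the perturbed chunks as well; exceeding `max_drop` on
    either one filters the counterfactual out.
    """
    drops = [log_prob(x) - log_prob(x_hat)]
    if chunk is not None and chunk_hat is not None:
        drops.append(log_prob(chunk) - log_prob(chunk_hat))
    return all(drop <= max_drop for drop in drops)
```

Lowering `max_drop` corresponds to the stricter constraint mentioned for fully automated uses, at the cost of discarding surprising but fluent changes.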

Intrinsic Evaluation
We evaluate Polyjuice on closeness and diversity by comparing its perturbations on 300 randomly selected sentences with baselines that use more or less context from x: (1) non-finetuned GPT-2, (2) token-infilling RoBERTa (Liu et al., 2019) and (3) span-infilling T5 (Raffel et al., 2020). As shown in Table 2, Polyjuice generates counterfactuals that are close to the original instance, measured by syntactic tree (Zhang and Shasha, 1989) and Levenshtein edit distance (Levenshtein, 1966). In contrast, non-finetuned GPT-2 generates arbitrary text instead of perturbations when given the starting tokens of a sentence, as it only leverages context in a single direction. As for infilling models, Polyjuice counterfactuals are more diverse (measured by self-BLEU (Zhu et al., 2018)) than RoBERTa ones, which are restricted to word substitution. Meanwhile, T5 displays higher diversity but less closeness, probably due to the fact that it does not consider the original masked tokens when generating x̂. For example, for "It is great for kids" in Figure 1, T5 replaces "for kids" with "idea" or "to meet you," whereas Polyjuice generates "for kids yet adults can enjoy" or "for any audience."
We evaluate controllability by comparing Polyjuice with T5 as well as with GPT-2 finetuned on prompts without codes. We verify that the codes improve the success rate of generating counterfactuals with the desired perturbation types set out in Table 1 by as much as 42% for perturbations such as negation and insert. For example, given "It is [BLANK] great for kids," baselines generate "also" or "fun and," rather than "not" (negation).
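To make the Levenshtein closeness measure used in this evaluation concrete, here is a small token-level sketch. Whitespace tokenization and normalizing by the longer sequence length are assumptions chosen for illustration; the section does not state which tokenization or normalization was used.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(x, x_hat):
    """Token-level edit distance scaled to [0, 1] (assumed normalization)."""
    tokens_x, tokens_hat = x.split(), x_hat.split()
    if not tokens_x and not tokens_hat:
        return 0.0
    return levenshtein(tokens_x, tokens_hat) / max(len(tokens_x), len(tokens_hat))
```

Under this sketch, inserting the single token "not" into "It is great for kids" gives a raw distance of 1, so minimal perturbations score close to 0 while unrelated text scores close to 1.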

Counterfactual Evaluation & Training
We ask crowdworkers to label Polyjuice-generated counterfactuals for Sentiment, NLI, and QQP, for the purposes of evaluation and training.⁵ In each labeling round, the worker is presented with an original x and its label, and asked to annotate the groundtruth for three x̂, rejecting non-fluent ones (details and interface in Appendix B.1).
We use a simple heuristic to select which counterfactuals are presented for labeling, aimed at increasing diversity. Representing each x̂ by its token changes, control code, and dependency tree structure, we greedily select the ones that are least similar to those already selected for labeling. This avoids redundancy in the labeling set, e.g., common perturbation patterns such as black → white.
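One way to read this greedy heuristic is sketched below. Each candidate is represented by a set of feature strings (token changes, control code, and so on), and Jaccard similarity stands in for the unspecified similarity measure; both the representation and the measure are assumptions made for this sketch.

```python
def jaccard(a, b):
    """Similarity between two feature sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_diverse(candidates, k):
    """Greedily pick up to k candidates that differ from those already picked.

    `candidates` is a list of (text, feature_set) tuples. At every step we
    take the candidate whose closest already-selected neighbor is least
    similar; ties go to the earliest candidate.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def closest_selected(i):
            return max(
                (jaccard(candidates[i][1], candidates[j][1]) for j in selected),
                default=0.0,
            )
        best = min(remaining, key=closest_selected)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i][0] for i in selected]
```

With two candidates that both encode a black → white substitution and one that encodes a negation, the second pick skips the redundant substitution in favor of the negation.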

Evaluation with Contrast Sets
We verify whether Polyjuice counterfactuals can be used to create contrast sets (Gardner et al., 2020), i.e., evaluation sets where each instance has a nearby counterfactual with a different groundtruth, to better evaluate model decision boundaries. We construct these sets by simply filtering out counterfactuals that are labeled the same as their original instances (40%-63% depending on the task).
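The filtering step, and the comparison of accuracy on originals versus their counterfactuals, can be illustrated as follows. The record layout (`x`, `label`, `x_hat`, `cf_label`) and the function names are hypothetical, and `predict` stands for any classifier mapping text to a label.

```python
def build_contrast_set(labeled_pairs):
    """Keep only counterfactuals whose human label differs from the original's."""
    return [p for p in labeled_pairs if p["cf_label"] != p["label"]]


def accuracy(predict, texts, labels):
    """Fraction of texts for which predict(text) matches the gold label."""
    if not texts:
        return 0.0
    correct = sum(predict(t) == y for t, y in zip(texts, labels))
    return correct / len(texts)


def performance_gap(predict, contrast_set):
    """Return (accuracy on originals, accuracy on their counterfactuals)."""
    original = accuracy(
        predict,
        [p["x"] for p in contrast_set],
        [p["label"] for p in contrast_set],
    )
    contrast = accuracy(
        predict,
        [p["x_hat"] for p in contrast_set],
        [p["cf_label"] for p in contrast_set],
    )
    return original, contrast
```

A model that relies on shallow cues tends to score well on the originals but noticeably lower on the label-flipping counterfactuals, which is the gap a contrast set is meant to expose.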
For each task, we test multiple classifiers open-sourced by Huggingface (Wolf et al., 2020), and report the best performing model for each⁶ in Table 3 (results for other models are analogous). Polyjuice contrast sets display performance gaps consistent with those of Gardner et al. (2020), where the sets are constructed manually by NLP researchers, even though we use non-expert annotators who only label examples rather than creating them.
To distinguish the benefit of counterfactuals from that of just adding more data, we further add a baseline that uses n + m original examples (m-baseline). In addition to in-domain test set accuracy, we measure models' generalization on out-of-domain datasets, as well as contrast sets and challenge sets. We also evaluate model capabilities with CheckList (Ribeiro et al., 2020) for Sentiment and QQP. Reported model performances are averaged across multiple data samples and random seeds (Appendix B.2).

Training with Counterfactuals
For Sentiment, we select random Polyjuice counterfactuals regardless of their labels, as long as an original x has at least one x̂ that flips the label. For NLI and QQP, we observed in a pilot study that randomly chosen counterfactuals may not be more effective than the same amount of additional data. We suspect that Polyjuice lacks domain knowledge and context for identifying critical perturbations, and therefore brings benefits redundant with pretraining (Longpre et al., 2020). Thus, we use the slicing functions of Chen et al. (2019) to find patterns of interest (e.g., prepositions in NLI), and perturb those patterns by placing [BLANK]s on the matched spans. For example, "His surfboard is beneath him" becomes "His surfboard is [BLANK] him", and Polyjuice generates counterfactuals such as "His surfboard is beneath → next to him."
Results. Tables 4-6 indicate that Polyjuice augmentation is effective in all tasks: m-polyjuice maintains in-domain accuracy while consistently improving or maintaining generalization accuracy in various out-of-domain and challenge sets. On NLI, Polyjuice counterfactuals are as effective or more effective than counterfactuals created from scratch (m-CAD). Notably, we obtain the largest gains on challenge and contrast sets (e.g., Break and DNC in Table 5) or when the out-of-domain dataset is sufficiently different from the training domain (e.g., Senti140 and SemEval in Table 4). Polyjuice also improves results on CheckList tests that previously had high error rates: it significantly lowers the error rates on 11 out of 27 QQP tests,⁷ making 2/27 tests worse. For Sentiment, it improves the model on 5 out of 15 tests, hurting 1. Here, we only report a low m/n ratio (<10% for NLI and QQP) to show that a small amount of augmentation is already beneficial.
The results are similar for other combinations we explored (see Appendix B.2), except when the ratio of counterfactual to original data was too high (e.g., m = n may decrease vocabulary diversity or induce additional data bias, echoing Khashabi et al. (2020)).

Discussion
We show that Polyjuice counterfactuals are useful for evaluation, and more effective than additional (non-counterfactual) data for training in a variety of tasks. In contrast to prior work where humans generate counterfactuals from scratch, we only ask them to label automatically generated ones, while still achieving similar or better results.
We believe our approach is more effective than manual creation (although both are beneficial): in terms of implementation effort, the process of just labeling counterfactuals is the same as labeling original examples, such that no additional annotator training or separate pipelines are required; in contrast, Kaushik et al. (2020) set up two separate crowdsourcing tasks for creating and labeling the counterfactuals. Further, annotator effort is much lower, as evaluating examples is easier than creating them: Kaushik et al. (2020) report an average of ≈2 minutes per NLI counterfactual prior to quality validation, while our median time was 10 seconds per counterfactual. Even after our quality validation (removing noisy annotators, disregarding non-fluent counterfactuals), our rate for NLI is ≈36 seconds per counterfactual (used in Table 5).
Table 6: Polyjuice with n=20,000 and m=1,911 improves accuracy on PAWS-QQP (Zhang et al., 2019b).
In terms of the utility per counterfactual, manual creation and Polyjuice may be complementary. Manual annotation may be unreliable or incomplete for certain forms of counterfactuals (Ribeiro et al., 2018), whereas Polyjuice can miss more complex or context-dependent changes, and could benefit from targeted perturbations that compensate for its lack of domain knowledge (targeted guidance is also helpful for human annotators (Huang et al., 2020)). Thus, it may be important to mix both approaches (Khashabi et al., 2020). Polyjuice's flexibility opens up possibilities for hybrids between human creation and human verification of targeted, machine-generated counterfactuals.
Figure 3 (start of caption lost in extraction): [...], perturbed Q2 is Duplicate (=) at 98.2% confidence, with SHAP importance weights for tokens in Q2. Counterfactual explanations complement SHAP with concrete examples and surprising behaviors, e.g., (B) shows that friend → woman surprisingly flips the prediction to Non-Duplicate, despite the low weight on "friend."

Counterfactual Explanations
A popular way of explaining NLP models is to attribute importance weights to the input tokens, either using attention scores (Wiegreffe and Pinter, 2019) or by summarizing the model behavior on perturbed instances (e.g., LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017)). Though ubiquitous, token scores may not always reflect their real importance (Pruthi et al., 2020). Popular packages like LIME or SHAP estimate scores by masking words, and therefore may not reflect model behavior on natural counterfactual cases. For example, the token "friend" in Figure 3A is not considered important even though a natural substitution in Figure 3B flips the prediction. The opposite happens to "in depression," where a significant change makes no difference to the model's prediction (Figure 3C). Even perfect importance scores may be too abstract for users to gain real understanding (Miller, 2019), e.g., users may not grasp the significance of a low importance score for the token "help" without concrete examples such as the one in Figure 3D.
Since presenting a large number of concrete counterfactuals would be overwhelming, we propose a hybrid approach, displaying feature attributions as a high-level summary, together with a judicious selection of Polyjuice counterfactuals that make behaviors concrete and highlight potential limitations. Following Miller (2019)'s observation that people look for explanations revealing unexpected behavior, we select surprising counterfactuals.⁸ That is, we estimate the expected change in prediction with feature attributions, and select counterfactuals that violate these expectations, i.e., examples where the real change in prediction is large even though importance scores are low (Figure 3B), and examples where the change is small but importance scores are high (Figure 3C). Of course, users can also view additional counterfactuals that perturb tokens of particular interest, a technique that we explore in the next section.
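The selection rule can be illustrated with a rough sketch. The exact estimator is deferred to the appendix and is not reproduced here, so the expected change below is simply assumed to be the summed absolute attribution weight of the perturbed tokens; the record layout and function names are likewise hypothetical.

```python
def surprise(weights, changed_tokens, p_orig, p_cf):
    """Gap between the actual and the attribution-implied prediction change.

    `weights` maps tokens to importance scores (e.g., from SHAP); tokens
    without a score count as zero. Using the summed absolute weight as the
    expected change is an assumption of this sketch.
    """
    expected = sum(abs(weights.get(tok, 0.0)) for tok in changed_tokens)
    actual = abs(p_cf - p_orig)
    return abs(actual - expected)


def most_surprising(candidates, weights, p_orig, k=2):
    """Return the k counterfactuals that most violate expectations.

    Each candidate is a dict with "text", "changed" (perturbed tokens of
    the original), and "prob" (model probability on the counterfactual).
    """
    ranked = sorted(
        candidates,
        key=lambda c: surprise(weights, c["changed"], p_orig, c["prob"]),
        reverse=True,
    )
    return ranked[:k]
```

Both cases from the paragraph score highly under this measure: a low-weight token whose replacement flips the prediction, and a high-weight token whose replacement barely moves it.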
User evaluation. We study the scenario where an expert has access to a model and local explanations, and evaluate the additional benefit of showing counterfactuals, i.e., whether they bring new insights. We evaluate three ways of generating counterfactuals: (1) Polyjuice-random, a baseline where we show random Polyjuice counterfactuals, (2) Expert-surprise, where two graduate students (non-participants) were given access to the model and instructed to create counterfactuals that are surprising given the associated SHAP scores, and (3) Polyjuice-surprise, which uses the selection procedure described in the previous paragraph.
We recruited 13 participants (graduate students with experience in model explanation), and had them analyze the aforementioned QQP model. In each round, users were shown an example, the model prediction, and a SHAP explanation, as in Figure 3A. Users were instructed to create up to 10 counterfactuals in order to better understand model behavior around the example, for which model predictions were given (users created 6 on average). Finally, users simulated what the model would do on six counterfactuals (Hase and Bansal, 2020), two from each condition (in random order). Counterfactuals where users make mistakes are preferable, as displaying these would add information that users do not already have. As shown in Figure 4, humans simulated model behavior on Polyjuice-surprise counterfactuals only slightly better than random guessing (45% ± 6%), i.e., these examples display model behavior that is surprising to users even after seeing explanations and creating their own counterfactuals. Expert-surprise also had a high error rate, but at a much higher cost: generating these for just 20 original instances took 1.5-2 hours of expert labor.
Figure 4: Simulation error rates per condition (higher the better). Polyjuice-surprise has the highest error rate, indicating these counterfactuals would add the most information to users if displayed. [Axes: Error Rate against Conditions (Polyjuice-random, Expert-surprise, Polyjuice-surprise).]
⁸ Details in Appendix C.1.
While high error rates could be achieved with unrelated or nonsensical examples, all counterfactuals under evaluation were close to the original examples, when measured by syntactic tree edit (≈1.0) or Levenshtein distance (≈0.2), Polyjuice-surprise being the closest on both. An independent rater labeled 95% of Polyjuice-surprise counterfactuals as "likely written by a native speaker," in contrast to 85% for Expert-surprise, indicating that experts sometimes resorted to ungrammatical or nonsensical sentences to find surprising behaviors.
Qualitatively, the study participants tended to create counterfactuals by perturbing the token with the highest weights (84% of their x̂ perturbed tokens in the top 15% quantile of weights), not gaining a real understanding of how the other tokens impact predictions. Participants also made a significant number of mistakes even for tokens they had inspected, e.g., a participant perturbed the example in Figure 3A by replacing help → play with, yielding a Non-Duplicate model prediction. When faced with help → find in Figure 3D, they incorrectly assumed the behavior would be the same.
These results indicate that Polyjuice counterfactuals complement feature attribution explanations by displaying information that users often miss, even after they have manually explored the model behavior beyond explanations. Moreover, Polyjuice counterfactuals for this application were more surprising and fluent than Expert-surprise, despite being computed automatically.
[Figure 5 (caption largely lost in extraction): an NLI instance whose hypothesis H is perturbed through [negation]. P: A woman is holding a baby by a window. H: This woman is looking out the window.]

Interactive Analysis
While our use of Polyjuice has so far relied on automatic selection of counterfactuals, we show in this section how an analyst can benefit from multiple counterfactuals per x, make use of controlled generation for more advanced analysis, and extract general patterns from individual observations. Our use case is counterfactual error analysis (Wu et al., 2019) of RoBERTa finetuned on NLI (used in §3.1), although the techniques are generally applicable.
There is a known correlation between the label Contradiction and hypotheses with negation in NLI datasets (Gururangan et al., 2018), which may cause models to fail on non-contradiction negations. We explore this in Figure 5A by generating counterfactual hypotheses for a random Neutral instance, conditioning only on the original x and the negation control code. While the first two counterfactuals display this failure mode, there is a surprising inconsistency in model behavior between "not" and "n't". We note that manual analysis may not explore these three negation forms, and thus not surface this puzzling behavior.
To verify if the pattern is widespread, we generate counterfactuals with the negation control code for a random set of instances correctly predicted as Neutral (n = 895). To generalize individual changes into patterns, we extract frequent counterfactual templates with Tempura (details in Appendix D.2), shown in Figure 5B. The model flips its prediction from Neutral to Contradiction with roughly the same frequency (≈43%) whether the negation word is "not" or "n't", but flips much more frequently with a different negation pattern where a determiner is replaced with "no" (92.8%). While these behaviors may be correct in some instances, they often are not (e.g., Figure 5A), and thus would warrant further exploration, and potential mitigation strategies (e.g., counterfactual training, §3). Tangentially, the impact of DET → no might lead the analyst to explore the impact of perturbing the subject of hypotheses, which we do in Figure 6 by placing a [BLANK] on the subject rather than using a control code. This leads to the discovery of unstable and erroneous behaviors regarding quantifiers, which we analyze in more detail in Appendix D.1.
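The aggregation behind such per-template flip frequencies can be sketched as below. This does not use Tempura itself: templates are assumed to be already extracted and are represented as plain strings, and the record layout and function name are hypothetical.

```python
from collections import Counter


def flip_rates(records, source="neutral", target="contradiction"):
    """Per-template fraction of predictions that move from source to target.

    `records` is an iterable of (template, prediction_before,
    prediction_after) tuples; only instances originally predicted as
    `source` are counted.
    """
    totals, flips = Counter(), Counter()
    for template, before, after in records:
        if before != source:
            continue
        totals[template] += 1
        if after == target:
            flips[template] += 1
    return {template: flips[template] / totals[template] for template in totals}
```

Comparing the resulting rates across templates is what turns individual observations, such as a single surprising negation, into a pattern that can be checked over hundreds of instances.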
Discussion. Polyjuice is a powerful tool for interactive analysis. Generating multiple counterfactuals per instance leads to insights that might be missed by manual analysis, and the steering provided by control codes and [BLANK]s allows for analyses that would be non-trivial to do manually (Wu et al., 2019) or with masked language models (e.g., Figure 5B places negations in various parts of sentences, and Figure 6 replaces spans with other spans of varying lengths). Besides error analysis, an analogous interactive use of Polyjuice may be suitable for test creation (Ribeiro et al., 2020) and forms of data augmentation that are more controlled than what we presented in §3.

Related Work
Some prior work in training and evaluation relies on humans to generate counterfactuals from scratch (Gardner et al., 2020;Teney et al., 2020;Kaushik et al., 2020). Our experiments in §3 indicate that asking humans to label Polyjuice counterfactuals yields similar or better results at a lower cost, which motivates an exploration of a mixture of manual and semi-automated generation. Similarly, prior work on analysis relies on experts to create individual counterfactuals or perturbation functions (Wu et al., 2019;Ribeiro et al., 2020). In §5, we show that Polyjuice enhances current practice by generating multiple counterfactuals that might have been overlooked, and by providing abstractions that allow for new kinds of analyses.
Prior work on automatically generating counterfactuals typically has a narrower scope in terms of the relationships x → x̂. For example, adversarial generators aim to maintain semantics while changing model predictions (Ribeiro et al., 2018; Iyyer et al., 2018; Li et al., 2021), whereas concurrent work to our own (Madaan et al., 2021; Ross et al., 2020) automatically generates x̂ that change predictions for explanation or analysis, with no constraints on semantics. However, as shown in §3-§5, a mix of label-preserving and label-flipping counterfactuals generated by Polyjuice is quite useful for training, evaluation, explanation, and analysis. Further, general-purpose counterfactuals may lead to serendipitous discoveries (§5), especially as Polyjuice is not fine-tuned to the target domain (and thus less liable to merely replicate what is already there). Finally, by allowing control through control codes and [BLANK]s, Polyjuice supports human-generator collaboration, where a person specifies desired changes (e.g., perturb the sentence subject). Such collaboration is hard to imagine using automatic generators with no control, or with coarser control through predefined style attributes or labels (

Conclusion and Future Work
We propose Polyjuice, a general-purpose generator that produces fluent and diverse counterfactuals, allowing for control over the kinds and locations of perturbations. With simple, task-specific selection heuristics, Polyjuice supports various downstream tasks on different domains, including counterfactual data augmentation, contrast set generation, counterfactual explanation, and error analysis.
While Polyjuice is broadly applicable, it is not bias-free: control codes are pre-defined and certainly not exhaustive, and the model is fine-tuned on a collection of paired datasets where certain perturbations are more or less likely (e.g., we observe that words with negative sentiment tend to be slightly more likely than positive ones in some contexts). Collecting naturally occurring counterfactuals is an important area of future research, as is the development of generators that allow for control even without a-priori control codes.
Besides improving the generators, further work is needed to improve the value of counterfactuals. For example, while Polyjuice shows consistent gains across tasks in data augmentation, the improvements on some datasets are not as significant. This aligns with observations in prior work that even manual counterfactuals can be only marginally beneficial (Kaushik et al., 2020; Huang et al., 2020), possibly because the original data is already diverse enough, or the perturbed signal in counterfactuals is too subtle to affect the model (e.g., when only a single word is changed in a long sentence). We hope to perform more thorough experiments on tuning the amount and the distribution of counterfactual augmentation, as well as other ways of incorporating counterfactuals, such as having explicit terms in the loss function for contrasting counterfactuals with original data (Teney et al., 2020), or other forms of contrastive learning.
Although our applications all involved people, the human-Polyjuice collaboration in labeling and explanations could benefit from richer interaction mechanisms. We believe Polyjuice motivates future research on more expressive forms of counterfactual training, where users generate counterfactuals together with Polyjuice, and label counterfactual patterns rather than individual instances. Similarly, interactive explanations and analysis are exciting directions, especially as we develop new ways of selecting, presenting, and aggregating counterfactuals for various analysis objectives. Having noted these opportunities, we believe Polyjuice is already a powerful tool for counterfactual reasoning, in particular for tasks where people are directly involved. Polyjuice is open-source and available at https://github.com/tongshuangwu/polyjuice.

Ethical Considerations
Our work includes labeling counterfactuals on crowdsourcing platforms, as well as conducting user studies with graduate students. As detailed in Appendix B.1 and C.2, we compensated the MTurk workers $2.5 for ≈15 minutes of labeling, and the graduate students $20 for the user study (one hour), above the U.S. federal minimum wage. The studies were conducted with IRB approval.
We only finetune GPT-2 rather than training it from scratch, such that our compute costs are relatively low (around 8 hours for finetuning, Appendix A). All of our finetuning experiments involved finetuning RoBERTa on smaller datasets.
More critically, with most of our demonstrated applications using a human-generator hybrid mechanism, we stress that the interaction between the two deserves careful consideration. It has long been reported that algorithms interacting with humans can negatively impact the human. In our case, the concern might be that users can develop an over-reliance on Polyjuice (Bansal et al., 2021) and hastily accept its generations. Not only can this decrease users' creativity (Green et al., 2014), but it may bias their analysis process: as discussed in §7, Polyjuice generation is not exhaustive, and may favor some perturbation patterns over others in unpredictable ways. In the short term, we plan to highlight these limitations as part of the model documentation, while future research should identify interaction mechanisms, so as to ensure that Polyjuice or other counterfactual generators support humans, rather than hindering their performance.

A GPT-2 as Counterfactual Generator

A.1 Training Data and Parameters
We combine several datasets to finetune Polyjuice.
Contrast set. Authors of 10 existing NLP datasets each manually perturbed 100-1,000 instances to change the gold label, so as to inspect a model's local decision boundary (Gardner et al., 2020). The perturbation patterns vary based on the tasks and the annotators, allowing us to learn diverse strategies. To make sure we can use the contrast set to evaluate the Sentiment model, we excluded the IMDb movie reviews from training.
Counterfactually-augmented data (CAD). Kaushik et al. (2020) crowdsourced counterfactuals for IMDb movie reviews (1.7k), which we split into paired sentences to match the text length of other datasets. CAD's perturbation patterns also vary based on the task, but can especially contribute to negation. As NLI is one of our demonstration applications, we did not use their 6.6k SNLI counterfactuals.
WinoGrande is a large-scale dataset of 44k instances for testing common sense problems (Sakaguchi et al., 2020). It contains sentences that differ only by one trigger word (e.g., one noun), making it most suitable for learning lexical exchanges.
ParaNMT-50M contains 50 million English-English sentential paraphrase pairs, covering various domains and styles of text, as well as different sentence structures (Wieting and Gimpel, 2018).
PAWS (Zhang et al., 2019b) contains pairs with high text overlaps, created through controlled word swapping, best demonstrating shuffle and restructure. We used its 49k Wikipedia parts.
HANS (McCoy et al., 2019), a challenge set for NLI, contains 10k pairs of premises and hypotheses created based on 10 heavily fallible syntactic templates, and therefore compensates for rarer structural changes that may be missed by PAWS.
Crawled. We additionally crawl naturally occurring sentence pairs from non-paired datasets to boost some specific patterns and increase lexical diversity. This includes (1)  , whose paragraphs involve Wikipedia knowledge. We estimate close pairs using edit distance, and broadly accept those with less than 60% editing. To exclude tricky cases (e.g., "how do I not be" can be incorrectly regarded as negation for "how do I recover it"), we only augment the most determined patterns: lexical, insert, delete, and shuffle.
To balance the distribution (Table 7), for each dataset, we extract control codes from all the $(x, \hat{x})$ pairs, and randomly sample up to 10,000 instances per code. Still, quantifier and negation have less training data compared to other codes. Fortunately, these codes tend to be limited to more specific patterns ("more than", "not", "never") when compared to "broad" codes like lexical, and thus even a small sample is enough to learn them.
We finetuned an off-the-shelf GPT-2 model from Wolf et al. (2020) for 10 epochs with an initial learning rate of 5e-5, a batch size of 8, and a sequence length of 120 (but any LM can potentially be used). We select the best epoch based on the evaluation loss on a holdout set of size 5,000. The training took around 8 hours on two Titan RTXs.
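The per-code balancing step can be illustrated with a minimal sketch. It assumes the control codes have already been extracted, so that one dataset is available as `(code, x, x_hat)` triples; the function name `balance_by_code` and the triple layout are hypothetical and not taken from the Polyjuice codebase.

```python
import random
from collections import defaultdict

def balance_by_code(pairs, cap=10_000, seed=0):
    """Group (code, x, x_hat) triples by control code and keep at most
    `cap` randomly sampled pairs per code (hypothetical helper)."""
    rng = random.Random(seed)
    by_code = defaultdict(list)
    for code, x, x_hat in pairs:
        by_code[code].append((x, x_hat))
    balanced = {}
    for code, items in by_code.items():
        # Frequent codes are downsampled; rare codes are kept in full.
        balanced[code] = rng.sample(items, cap) if len(items) > cap else items
    return balanced
```

Under this sketch, rare codes such as quantifier and negation keep every available pair, which matches the observation that they still end up with less training data than broad codes.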

A.2.1 Closeness and Diversity
Similar to Madaan et al. (2021), we compare the diversity and closeness of Polyjuice with alternative generators, i.e., RoBERTa and T5, representing masked language models that prioritize word and span substitution, and original GPT-2, representing the standard generative model not conditioned on $x$. For a given $x$ and its counterfactuals $\hat{X}$, we approximate diversity using self-BLEU (Zhu et al., 2018) within $\hat{X}$. Meanwhile, closeness is the average distance between $x$ and every $\hat{x} \in \hat{X}$, measured both with the normalized word-level Levenshtein edit distance (Levenshtein, 1966; used in MiCE (Ross et al., 2020)) and with the syntactic tree edit distance (Zhang and Shasha, 1989; used in GYC (Madaan et al., 2021)).
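As an illustration of the word-level closeness measure, the sketch below computes a Levenshtein distance over whitespace tokens and averages it over a set of counterfactuals. Two choices here are assumptions rather than details from the paper: tokens are obtained by whitespace splitting, and the distance is normalized by the length of the longer sentence.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over whitespace-separated tokens."""
    s, t = a.split(), b.split()
    prev = list(range(len(t) + 1))
    for i, sw in enumerate(s, start=1):
        curr = [i]
        for j, tw in enumerate(t, start=1):
            cost = 0 if sw == tw else 1
            curr.append(min(prev[j] + 1,          # delete a token of s
                            curr[j - 1] + 1,      # insert a token of t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_distance(x, x_hat):
    """Edit distance divided by the longer token length (assumed normalization)."""
    n = max(len(x.split()), len(x_hat.split()))
    return word_edit_distance(x, x_hat) / n if n else 0.0

def closeness(x, counterfactuals):
    """Average normalized distance between x and each counterfactual."""
    if not counterfactuals:
        return 0.0
    return sum(normalized_distance(x, c) for c in counterfactuals) / len(counterfactuals)
```

For example, `normalized_distance("It is great for kids.", "It is not great for kids.")` counts one inserted token over six, so a lower value means the counterfactual stays closer to the original.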
We run the three generators on 300 sentences in total. In GPT-2, we take the first two words of an $x$ as the input context (prompt), limit the length of the generation to be similar to $x$, and collect 10 counterfactuals. As for RoBERTa and T5, we perturb $x$ three times, each time randomly placing up to three [MASK] tokens, and ask the generator to generate 5 counterfactuals through beam search, following Ribeiro et al. (2020). Polyjuice uses the same blank (mask) placement as in RoBERTa and T5, but we additionally enumerate through all control codes. For each $x$, we randomly sample 5 counterfactuals to form $\hat{X}$ per generator.
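The random mask placement can be sketched as follows. This is only an approximation under stated assumptions: each mask replaces exactly one whitespace token (span-level blanks are not modeled), and the helper name `place_masks` is hypothetical.

```python
import random

def place_masks(x, max_masks=3, rounds=3, seed=0, mask="[MASK]"):
    """Return `rounds` masked copies of x, each with between one and
    `max_masks` randomly chosen tokens replaced by `mask`."""
    rng = random.Random(seed)
    tokens = x.split()
    if not tokens:
        return []
    prompts = []
    for _ in range(rounds):
        k = rng.randint(1, min(max_masks, len(tokens)))
        positions = set(rng.sample(range(len(tokens)), k))
        prompts.append(" ".join(mask if i in positions else tok
                                for i, tok in enumerate(tokens)))
    return prompts
```

Each returned prompt would then be passed to the generator, which fills the masked positions through beam search.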
As shown in Table 2, Polyjuice achieves a balance between diversity and closeness. Ideally, we would also like to compare Polyjuice with concurrent work (Madaan et al., 2021;Ross et al., 2020), but these are yet to be open-sourced and require extensive implementation or finetuning.

A.2.2 Controllability
To evaluate controllability, we compare Polyjuice with T5, and with GPT-2 finetuned on prompts without codes (called Polyjuice-a), such that both baselines consider sufficient context. For each control code, we compare the control success rate of Polyjuice and Polyjuice-a on 300 prompts. For each prompt, we generate counterfactuals through beam search (beam = 5), and recompute the codes on the top three generated $\hat{x}$. We deem the control successful if at least one of the three recomputed codes matches the desired control code (though in Polyjuice-a, we only measure whether the code naturally occurs in the uncontrolled generation). The success rate increases by 26% ± 13% across all control codes, ranging from quantifier (increasing 6%, from 50% to 56%) to negation (42%, from 5% to 47%). Non-finetuned T5 also achieves less control (success rate decreases by 33% on average).
Common failure cases include: (1) the control codes conflict with the blanks, e.g., "a dog is embraced by a [BLANK]" would not respond to negation; (2) $x$ does not have a corresponding pattern, e.g., shuffle is not applicable to "the movie is good."; (3) certain salient patterns dominate the generation probability, e.g., the model tends to perturb the quantifier "two" in "two dogs are running," regardless of the code.
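The success criterion used for controllability can be written as a small sketch. It assumes the codes have already been recomputed for each prompt's ranked generations; the function name and the input layout are hypothetical.

```python
def control_success_rate(results, top_k=3):
    """results: list of (desired_code, recomputed_codes), where
    recomputed_codes is ordered by beam rank. A prompt counts as a
    success if the desired code appears among the top_k recomputed codes."""
    if not results:
        return 0.0
    hits = sum(1 for desired, codes in results if desired in codes[:top_k])
    return hits / len(results)
```

Only the first `top_k` generations count, so a matching code at rank four or lower does not make the prompt a success.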

B Additional Train & Eval Details, §3
B.1 MTurk Labeling Details
Procedure. The study started with an introduction that explained the context and tasks. To familiarize crowdworkers with the task, we asked them to complete 1-2 training rounds, and explained the expected labels. Each annotator then completed 22 tasks, labeling 3 counterfactuals of a single example in each round, as in Figure 7. The 22 rounds consisted of 20 actual labeling tasks and 2 extra "gold rounds" with known correct labels. The gold cases later served to filter low-quality crowdworkers. The median annotation time was around 15 minutes, and participants received $2.5.
Participants. We recruited participants from MTurk, limiting the pool to subjects from within the US with a prior task approval rating of at least 97% and a minimum of 1,000 approved tasks.
Data quality. We applied two filtering strategies: (1) High-quality worker. We only kept data from participants whose median labeling time per round was more than 18 seconds and who correctly labeled at least 4 gold counterfactuals (out of 6), or who correctly labeled all gold ones. (2) Majority vote labeling. We collected two annotations per counterfactual, and only kept those that at least one annotator deemed valid, and for which both annotators agreed on a particular class label. One of the authors labeled a subset of 100 $\hat{x}$ on 100 $x$ in Sentiment, and reached high agreement with the majority-voted results (κ = 0.77, raw labeling agreement 88%).

Figure 8: The accuracy trend on two Sentiment datasets, as the total training data size (m+n) varies. The blue line shows an augmentation of m = 2k counterfactuals, and the other line represents the corresponding m-baseline. Though the counterfactuals remain useful on datasets like SemEval across all m+n, it appears that too many counterfactuals may be harmful (Amzbook).
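The two filtering rules described under Data quality can be sketched as simple predicates. The data layout is an assumption: a worker is summarized by per-round labeling times (in seconds) and a count of correctly labeled gold counterfactuals, and each annotation is a `(is_valid, label)` tuple. The function names are hypothetical.

```python
from statistics import median

def keep_worker(round_times, gold_correct, n_gold=6, min_median=18.0, min_gold=4):
    """Keep a worker who labeled all gold counterfactuals correctly, or whose
    median time per round exceeds `min_median` seconds and who labeled at
    least `min_gold` gold counterfactuals correctly."""
    if gold_correct == n_gold:
        return True
    return median(round_times) > min_median and gold_correct >= min_gold

def keep_counterfactual(ann_a, ann_b):
    """Keep a counterfactual if at least one of its two annotators deemed it
    valid and both annotators chose the same class label."""
    (valid_a, label_a), (valid_b, label_b) = ann_a, ann_b
    return (valid_a or valid_b) and label_a == label_b
```

For instance, a fast worker with a 12-second median is kept only if all six gold labels are correct.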

B.2 Training Details & m/n Ratios, for §3.2
For each (m, n), we created three samples of training data. Each sample was further averaged over four random seeds. For each run, we heuristically picked initial learning rates of 1e-5, 2e-5, and 2e-5 for Sentiment, NLI, and QQP, respectively, and trained for 20 epochs with a dropout rate of 0.1 and a batch size of 16. We selected the epoch that had the highest accuracy on the corresponding validation set, which takes 1/5 of the training data size, with the same ratio of m/n counterfactual and original examples.
We further explore ratios of added counterfactuals. Take Sentiment as an example: while the counterfactuals remain effective on most datasets, they hurt the model performance on Amzbook when they make up a large proportion of the training data (Figure 8; Yelp followed a similar but milder trend). We suspect that swapping out too much original data affects the data diversity, and in turn decreases the model performance. Similarly, Huang et al. (2020) asserted that augmenting n = 1.7k NLI data with m = 6.6k counterfactuals did not improve model generalization accuracy.
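The bookkeeping behind this protocol can be sketched in a few lines: a validation set one fifth the size of the training data with the same m/n composition, epoch selection by validation accuracy, and averaging over samples and seeds. All function names are hypothetical, and the accuracy values in the usage note are placeholders, not results from the paper.

```python
def validation_sizes(m, n, divisor=5):
    """Counterfactual and original counts for a validation set that is
    1/divisor of the training size and keeps the same m/n ratio."""
    return m // divisor, n // divisor

def select_epoch(val_accuracies):
    """Index and value of the epoch with the highest validation accuracy."""
    best = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return best, val_accuracies[best]

def average_runs(acc_by_sample):
    """Average over seeds within each data sample, then over samples."""
    per_sample = [sum(seeds) / len(seeds) for seeds in acc_by_sample]
    return sum(per_sample) / len(per_sample)
```

With m = n = 2,000, `validation_sizes` gives 400 counterfactual and 400 original validation examples; `select_epoch([0.80, 0.86, 0.84])` picks the second epoch (index 1).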

C Additional Explanation Details, §4
C.1 Selection Methods
Because SHAP weights reflect the average effect of masking a token t, we also focus on word features that are abnormal on average.
More concretely, we define the expected change-in-prediction for perturbing a token $t$ to be the SHAP importance on it, $\mathbb{E}[D_f(t, x)] = s(t)$. In Figure 3, $s(t = \text{depression}) = 0.276$. The actual prediction change $D_f(t, x)$ is the weighted average of $|f_p(x) - f_p(\hat{x})|$ for all the $\hat{x}$ that affect $t$ (depression $\rightarrow$ trouble, depression $\rightarrow$ a mood), where $f_p(x)$ is the prediction probability of $f$ on $x$. The weight corresponds to the number of words modified in $\hat{x}$: if $e(\hat{x})$ denotes the set of edited words in $\hat{x}$, then $w(\hat{x}) = 1/|e(\hat{x})|$. Intuitively, the more words changed in $\hat{x}$, the less impact each word has; in Figure 3D, we regard "depression" to be responsible for half of the impact in "in depression $\rightarrow$ suicidal". We group $\hat{x}$ based on their affected words, $G_t = \{\hat{x} \mid t \in e(\hat{x})\}$. $D_f(t, x)$ then becomes:

$$D_f(t, x) = \frac{1}{|G_t| + 1} \left( s(t) + \sum_{\hat{x} \in G_t} w(\hat{x}) \cdot |f_p(x) - f_p(\hat{x})| \right)$$

The additional SHAP weight $s(t)$ acts as a smoothing factor to penalize outliers. Then the gap between the expectation and reality is:

$$\Delta D_f(t, x) = D_f(t, x) - \mathbb{E}[D_f(t, x)] = D_f(t, x) - s(t)$$

We first find the abnormal tokens: (1) $t$ with small SHAP weight, but $\hat{x}$ that change $t$ experience large prediction change on average: $t_L = \arg\max_{t \in x} \Delta D_f(t, x)$, and (2) $t$ with large SHAP weight, but $\hat{x}$ with $t$ changed usually have intact prediction: $t_U = \arg\max_{t \in x} -\Delta D_f(t, x)$.
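The token-selection step can be illustrated with a short sketch. It rests on an assumption about the exact form of the smoothed average, namely $D_f(t, x) = \big(s(t) + \sum_{\hat{x} \in G_t} w(\hat{x}) \, |f_p(x) - f_p(\hat{x})|\big) / (|G_t| + 1)$ with $w(\hat{x}) = 1/|e(\hat{x})|$, and on the gap being $D_f(t, x) - s(t)$. The function names and the input layout are hypothetical.

```python
def actual_change(shap_weight, group):
    """Smoothed weighted average of prediction changes for one token.
    group: list of (n_edited_words, abs_prediction_change), one entry
    per counterfactual that edits this token (assumed formula)."""
    total = shap_weight + sum(change / n_edited for n_edited, change in group)
    return total / (len(group) + 1)

def gap(shap_weight, group):
    """Actual change minus the expected change (the SHAP weight)."""
    return actual_change(shap_weight, group) - shap_weight

def abnormal_tokens(shap, groups):
    """Return (t_L, t_U, gaps): t_L maximizes the gap (small SHAP weight but
    large actual change); t_U maximizes the negated gap (large SHAP weight
    but little actual change)."""
    gaps = {t: gap(shap[t], groups.get(t, [])) for t in shap}
    t_low = max(gaps, key=gaps.get)
    t_up = min(gaps, key=gaps.get)
    return t_low, t_up, gaps
```

A usage example with made-up prediction changes (only the SHAP weight 0.276 for "depression" comes from the text): `abnormal_tokens({"depression": 0.276, "help": 0.02}, {"depression": [(1, 0.05), (2, 0.10)], "help": [(1, 0.60)]})` flags "help" as $t_L$ and "depression" as $t_U$. A token with no counterfactuals gets a gap of zero, which is one effect of including $s(t)$ as a smoothing term.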
Then, we use the most extreme cases within the groups of $G_{t_L}$ and $G_{t_U}$ as the concrete counterfactual explanations, based on their prediction change $|f_p(x) - f_p(\hat{x})|$, and the aggregated SHAP weights of all the changed tokens:

Figure 9 shows the sample interface. Participants started by just seeing the reference example and the model query box on the left-hand side. When they chose to start the task or after they had exhausted their ten query chances, the query box was disabled, the tasks on the right were displayed, and the participants completed the tasks. We compensated participants $20 for the one-hour study.

D.1 Additional Case Study: Quantifiers
As a follow-up to Figure 6, we slice the data to find entailment instances that have numbers in the hypothesis sentence, and perturb their quantifiers.