Generating Realistic Natural Language Counterfactuals

Counterfactuals are a valuable means for understanding decisions made by ML systems. However, the counterfactuals generated by the methods currently available for natural language text are either unrealistic or introduce imperceptible changes. We propose CounterfactualGAN: a method that combines a conditional GAN and the embeddings of a pretrained BERT encoder to model-agnostically generate realistic natural language text counterfactuals for explaining regression and classification tasks. Experimental results show that our method produces perceptibly distinguishable counterfactuals, while outperforming four baseline methods on fidelity and human judgments of naturalness, across multiple datasets and multiple predictive models.


Introduction
The increase of machine learning (ML) applications in high-stakes domains has led to a proliferation of Explainable AI (XAI) and Interpretable ML approaches, aimed at making models (global explanations) or individual decisions (local explanations) more understandable (Doshi-Velez and Kim, 2017; Tomsett et al., 2018). Output explanations explain individual decisions by understanding the (local) behavior around the output (Guidotti et al., 2019). However, in practice individuals may not always have access to the models they want explained (e.g. because of intellectual property) (Edwards and Veale, 2017). To overcome this access problem, model-agnostic approaches (sometimes called post-hoc approaches (Lipton, 2016)) only require access to the model outputs for provided instances, with the added benefit of being applicable to explain any model for a type of ML task (Ribeiro et al., 2016a). Prominent model-agnostic output explanations are local surrogate models (Ribeiro et al., 2016b), feature importances (Lundberg and Lee, 2017; Fong and Vedaldi, 2017), example-based explanations (Kim et al., 2016) and counterfactual explanations (Wachter et al., 2018).
Counterfactual explanations express what might have happened instead (Roese and Olson, 1995): certain values in an input instance are perturbed (e.g. the age of a defendant) while keeping other values the same, in order to observe how that influences the output (e.g. they would not have been convicted). Each of the output-changing perturbations is a counterfactual, where the difference between the counterfactual and the original instance provides insights into how the inputs affect the outputs, and can be used to pinpoint fairness issues and to reach a desired output (Wachter et al., 2018). As these counterfactuals are a valuable means to understand the behavior of a system, in recent years the same technique has been applied for explaining ML decisions, mainly for structured data (e.g. (Russell, 2019; Ustun et al., 2019; Wachter et al., 2018)) and image data (e.g. (Dhurandhar et al., 2018; Guidotti et al., 2019; Poyiadzi et al., 2020)).
Unlike structured and image data, counterfactuals for natural language text data have largely been disregarded. For text classification, Martens and Provost (2014) proposed the removal of words as a means to measure their contribution to the output. This paradigm was later adopted for constructing model-agnostic local surrogate models (Ribeiro et al., 2016a) and to determine which words have to be necessarily present for a classification decision (Ribeiro et al., 2018). Yet, this paradigm fails to create realistic counterfactuals (as illustrated in Figure 1 for sentiment analysis). Realisticness is an important property for how humans create and accept counterfactuals (Byrne, 2019; Miller, 2018) and may help prevent misleading explanations produced by model-agnostic explanation methods (Slack et al., 2020). Recently, human-in-the-loop approaches were proposed (Ribeiro et al., 2020; Wu et al., 2021) to support explainees in forming realistic counterfactuals.
In this work, we propose CounterfactualGAN: a method able to model-agnostically generate realistic, targeted counterfactuals for natural language text regression and classification without explainee intervention. Our method (i) generates realistic counterfactuals for the text domain, ensuring dataset-specific realisticness by adversarially training on a training set, (ii) uses a single model to provide counterfactuals for any instance and any (classification or regression) target of a black-box model, (iii) generates counterfactuals with a single pass after training, and (iv) does not require explainee intervention to do so.

Counterfactuals for machine learning
In the literature, several properties of counterfactual generation methods for ML classification and/or regression models have been suggested. First, the generation of counterfactuals is either targeted (i.e. to a specific class or regression output) or untargeted (any target other than the original) (Zhang et al., 2020). Second, while generally viewed as being part of the post-hoc XAI approaches, some methods assume white-box access to the model rather than viewing it as a black box (e.g. (Russell, 2019; Ustun et al., 2019)). For example, if the model is a linear classifier and its weights are accessible, these weights can be used to effectively find (targeted) counterfactuals. Third, for each original instance one or multiple counterfactuals can be found, providing either a single explanation or elucidating the various ways in which decisions may change (Wachter et al., 2018). When selecting multiple counterfactuals, approaches are typically concerned with the diversity of the counterfactuals to ensure maximal coverage with a sparse counterfactual set (Russell, 2019; Karimi et al., 2020).
While these properties impact the approach for obtaining counterfactuals, further strategies are required to confine the search space of possible counterfactuals to the ones that hold the best explanatory value. Some approaches select the nearest counterfactuals (e.g. (Wachter et al., 2018)), such that minimal changes are required to reach the counterfactual. However, more recently authors have addressed the issue of implausibility: "[...] the counterfactuals generated may not be valid datapoints in the domain or they may suggest feature changes that are difficult-to-impossible" (Keane and Smyth, 2020, pp. 166-167). Implausibility has been tackled with various strategies: either enforcing user-imposed feasibility constraints (e.g. excluding explanations where one needs to lower their age to lower the risk of recidivism) (Poyiadzi et al., 2020; Karimi et al., 2020) or using automated methods, such as selection based on closeness to a class prototype (Van Looveren and Klaise, 2021), ensuring that no other outputs are encountered on the path from the factual to the counterfactual (Laugel et al., 2019), or selecting instances from the training set as counterfactuals instead of generating them (Keane and Smyth, 2020).

Counterfactuals for text
Many counterfactual generation methods mainly consider structured and image data. Two methods that do support the creation of counterfactuals for natural language text require humans to determine where to apply changes in the instance: CheckList (Ribeiro et al., 2020) suggests input perturbations to a user, who can then choose from these perturbations and test how they affect the output, while PolyJuice (Wu et al., 2021) uses control codes in a finetuned GPT-2 model to form counterfactuals. MiCE (Ross et al., 2021) can generate counterfactuals using a finetuned T5 model with white-box access to a predictive model. However, optimizing textual perturbations to find counterfactuals with black-box access in a fully automated manner poses specific problems, as encountered in the related areas of adversarial ML (seeking semantically imperceptible changes to text that change the black-box label) and style transfer (changing linguistic attributes of ground-truth texts, while retaining content). First, it is non-trivial how to define distance measures between the original instance and its perturbations, as they typically are discrete objects (Belinkov and Glass, 2019). Second, minimizing this distance cannot easily be formulated as an optimization problem, as this requires computing gradients on discrete inputs (Belinkov and Glass, 2019). The adversarial attack and style transfer literature tackles these problems by either (i) leveraging a combination of NLP algorithms and knowledge-engineered perturbation rules (e.g. (Li et al., 2019; Jin et al., 2020)), or (ii) finding perturbations in a latent space of an autoencoder model and decoding these into natural language instances (e.g. (Melnyk et al., 2017)). We draw from insights in these areas, especially adversarial attacks, which are able to craft instances changing a black-box prediction. However, we observe that their goal of imperceptibility of the perturbation (Zhang et al., 2020) (i.e. humans being unable to tell the difference between two instances and assigning the same label) is at odds with the goal of counterfactuals used in explanations to be human-understandable (Wachter et al., 2018). Counterfactuals provide explainees (e.g. developers, lay-users or examiners (Tomsett et al., 2018)) with meaningful information regarding a model's prediction, without necessitating technical know-how (Wachter et al., 2018). This requires an explainee to perceive the difference between the original instance and counterfactual (e.g. which words have changed), potentially even with limited technical or domain expertise. We note this perceptibility does not have to align with human judgments of distinguishing factors in a task: changing 'him' to 'her' in sentiment analysis may be equally perceptible as changing 'good' to 'bad'.

Realistic Counterfactuals
For natural language text, we propose a new strategy to create plausible counterfactuals in an automated manner, by applying a realisticness constraint to the generated counterfactuals. This approach has several benefits over the strategies tackling implausibility in Section 2.1: (i) no domain-specific assumptions have to be made by users, (ii) the perturbation path between an original instance and its counterfactual is not constrained and (iii) counterfactuals are not restricted to only training instances. What realisticness entails may depend on the type of language expected within a certain context, e.g. realistic movie reviews typically differ in use of language and grammatical correctness from Tweets. Therefore, we deem an instance realistic if it is indistinguishable from other instances (expected) in a dataset from a specific domain.
Definitions. Let us assume we are able to provide inputs to a black-box ML model $f: X \rightarrow Y$ and get the corresponding predictions, and have an instance $x \in X$ for which to find counterfactuals. $f(\cdot)$ was trained on a dataset containing $N$ labeled or unlabeled instances (represented by $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ or $\{x^{(i)}\}_{i=1}^{N}$, respectively). Instance $x$ is represented as a feature vector $x = (x_1, x_2, \ldots, x_n)$, where each $x_i$ denotes a feature value. In our work, $x_i$ is either a word-level token or is absent. A second index $j$ ($x_{i,j}$) is used sometimes when tokens are represented by a probability distribution over a vocabulary of $M$ tokens (including a special token indicating absence).
A perturbed instance $\tilde{x} = x + \delta$ is formed by applying one or more valid perturbations $\delta$ to $x$ (Dhurandhar et al., 2018). A valid perturbation transforms the feature values in $x$ such that $\tilde{x} \in X$. For example, valid perturbations are word replacements or removing a word by setting it to absent. We estimate instance realisticness by determining if it is indistinguishable from the original data distribution $p(x)$. In practice we estimate this indistinguishability with a discriminator model $g_X : X \rightarrow [0, 1]$ (trained on $X$) indicating the likelihood that an instance $x$ could have come from $p(x)$. A low score indicates out-of-distribution instances.
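As a minimal illustration of such valid perturbations, consider instances as token lists with a special absence marker (all names here are ours, not from the paper's implementation):

```python
from typing import List

ABSENT = "<absent>"  # hypothetical marker for a removed (absent) token

def replace_word(tokens: List[str], position: int, replacement: str) -> List[str]:
    """Valid perturbation: substitute the token at `position`."""
    perturbed = list(tokens)
    perturbed[position] = replacement
    return perturbed

def remove_word(tokens: List[str], position: int) -> List[str]:
    """Valid perturbation: remove a word by setting it to absent."""
    return replace_word(tokens, position, ABSENT)

x = ["the", "movie", "was", "good"]
x_tilde = remove_word(replace_word(x, 3, "bad"), 0)  # x~ = x + delta
```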
We define the set of counterfactuals (CFs) of instance $x$ as $\mathrm{CF}(x) = \{\tilde{x} \in X \mid f(\tilde{x}) \neq f(x)\}$ (Karimi et al., 2020), i.e. all instances with an output different to $x$. Targeted counterfactuals (TCFs) are instances that change the fact (the current output $y = f(x)$) to the foil, an output of interest $y' \in Y$. Instead of requiring the output $f(\tilde{x})$ of perturbed instance $\tilde{x}$ to be exactly equal to $y'$, we generalize the assumption $f(\tilde{x}) = y'$ and obtain $\mathrm{TCF}(x, y') = \{\tilde{x} \in X \mid d(f(\tilde{x}), y') \leq \epsilon\}$, where $d(\cdot, \cdot)$ measures how close $f(\tilde{x})$ is to output of interest $y'$ and $\epsilon \geq 0$ is a user-defined threshold for the strictness in including instances.¹ A realistic (targeted) counterfactual is a (targeted) counterfactual that also maximizes realisticness model $g_X(\cdot)$.
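A direct reading of the TCF definition as a filter over candidate perturbations might look as follows (a sketch; the absolute-difference distance is an illustrative choice for regression-style outputs):

```python
from typing import Any, Callable, Iterable, List

def targeted_counterfactuals(candidates: Iterable[Any],
                             f: Callable[[Any], float],
                             y_prime: float,
                             epsilon: float = 0.0) -> List[Any]:
    """TCF(x, y'): all perturbed instances whose prediction lies within
    epsilon of the output of interest y'. With epsilon = 0 this reduces
    to the strict condition f(x~) == y'."""
    return [x for x in candidates if abs(f(x) - y_prime) <= epsilon]
```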

Method: CounterfactualGAN
We operationalize model-agnostic, realistic, targeted counterfactuals for text regression and classification with our proposed method CounterfactualGAN. Our method works across predictive models by crafting counterfactuals in the form of a string. To ensure counterfactual realisticness, we combine insights from pretrained language models (LMs) and generative adversarial networks (GANs).

Generative Adversarial Networks (GANs).
GANs use a generator G and discriminator D to learn latent representations or mappings in an unsupervised or semi-supervised manner (Creswell et al., 2018). By letting these networks compete, they are forced to jointly improve their performance. After training, generator G is used to generate realistic synthetic instances.
Rather than having G unconditionally generate realistic instances, Conditional GANs aim to create realistic mappings for specific inputs (Creswell et al., 2018). For example, an unconditional GAN may be able to create sentences with positive sentiment, while a Conditional GAN can turn the positive sentiment of an input sentence into a negative sentiment counterpart. First popularized in the image domain, approaches such as pix2pix (Isola et al., 2017) are able to e.g. convert grayscale photographs to full color and convert day scenes to night scenes. CycleGAN (Zhu et al., 2017) is able to do this without ground-truth pairs with example mappings between domains by jointly training two generators $G_{ab}$ and $G_{ba}$, where $G_{ab}$ learns a mapping from domain $a$ (e.g. positive sentiment) to $b$ (e.g. negative sentiment), while $G_{ba}$ learns a mapping from domain $b$ to $a$. An important contribution of CycleGAN is ensuring a minimal reconstruction loss $G_{ba}(G_{ab}(x)) \approx x$, helping the network to preserve relevant input features.
The downside of using CycleGAN is that in the multi-domain case (e.g. multi-class or a continuous domain) conditional generation requires training many separate generators and discriminators. StarGAN (Choi et al., 2018) mitigates this shortcoming of CycleGAN by using one generator $G$ that takes both the input instance and a target domain (e.g. positive/negative sentiment) as inputs, and a single discriminator $D$ that predicts (1) whether the instance is real [$D_{adv}$] and (2) which target domain it belongs to [$D_{tgt}$].
Language models (LMs). While GANs have shown promising results, their application to natural language text has been limited. The discrete nature of text makes propagating the gradient from the discriminator back to the generator infeasible. We therefore opt to use the approach of finding a mapping in latent (embedding) space $Z$ (in a similar vein to e.g. (Melnyk et al., 2017)), but in our case use a pretrained LM for the autoencoder. Pretrained LMs have proven to greatly improve the state-of-the-art performance on a plethora of down-stream tasks (e.g. semantic similarity, reading comprehension and commonsense reasoning) (Radford and Salimans, 2018). By using a pretrained LM, we leverage its adeptness in encoding syntax and semantic content, even beyond the training data.
CounterfactualGAN. Our method combines the encoder-decoder architecture of an LM and the generator-discriminator architecture of a GAN for finding counterfactuals. Discriminator $D$ is responsible for determining the realisticness of an instance $x$, while we use the predictions of black-box model $f(\cdot)$ to determine if a counterfactual is of output $y'$.
In practice, we use BERT (Devlin et al., 2019) as an LM encoder to create an embedding $z$. Decoder $\mathrm{Dec}(\cdot)$ is then tasked with mapping $z$ back to original instance $x$. For the GAN, we use a StarGAN (Choi et al., 2018), with a single generator $G$ (provided with a target $y'$) and one discriminator with two heads, tasked with determining whether the instance was real or fake ($D_{adv}$) and how well the instance corresponds to the target ($D_{tgt}$). To ensure that the output is similar after mapping to the target domain and back, the reconstruction loss is not only calculated on the embeddings $z$ and $\tilde{z} = G(G(z, y'), y)$, but also on $x$ and the token predictions according to the decoder $\tilde{x} = \mathrm{Dec}(\tilde{z})$.
As our goal is to provide counterfactual explanations, unlike the original StarGAN, $D_{tgt}$ is trained on the predicted labels $y = f(x)$ of the black-box decision function we aim to explain rather than ground-truth labels. Because black-box $f(\cdot)$ uses instances $x \in X$ to make its predictions rather than embedding $z \in Z$, $D_{tgt}$ is first trained to distinguish target outcomes using embedding $z$. In addition, as our method relies on a highly accurate mapping of the encoder-decoder part of the model, the encoder and decoder are pretrained on the training data as well. To incorporate these two requirements, we propose to train CounterfactualGAN in two phases, both illustrated in Figure 2.
Phase 1 (Figure 2a) starts with a pretrained LM encoder and decoder, and has the goal of (i) ensuring that the decoder accurately reconstructs embedding $z$ into the encoded instance $x$, and (ii) ensuring that $D_{tgt}$ accurately mimics black-box $f(\cdot)$. Next, Phase 2 (Figure 2b) fixes the encoder and decoder weights, and introduces generator $G$ to find the mapping in embedding space $Z$. Generator $G$ first maps $z$ to $z'$ with target $y'$, and then back to original outcome $y$, resulting in $\tilde{z}$. For the discriminator, $z$ is marked as a real instance of target $y$, $z'$ as a fake instance of target $y'$ and $\tilde{z}$ as a fake instance of original target $y$. CounterfactualGAN uses a three-layer Transformer decoder (Vaswani et al., 2017) for generator $G$, while discriminator $D$ uses a two-layer GRU to combine embedding $z$ into a single low-dimensional embedding for the entire input, used by the two heads $D_{adv}$ and $D_{tgt}$. After both phases, a counterfactual is generated from $x$ with target $y'$ by running its encoding through generator $G$ and decoding the generated embedding: $\tilde{x} = \mathrm{Dec}(G(\mathrm{Enc}(x), y'))$. During this generation, a top-$k$ of counterfactuals is generated for each instance, from which the string (i.e. the counterfactual) most similar to target $y'$ according to $f(\cdot)$ is returned to the end-user.
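This single-pass generation step can be sketched as follows (all names are illustrative assumptions; `dec_topk` stands for decoding $k$ candidate strings from an embedding, and $f$ is treated as returning a scalar for simplicity):

```python
def counterfactual(x: str, y_target, enc, G, dec_topk, f, k: int = 5) -> str:
    """Single-pass generation after training: encode x, map its embedding
    to the target with G, decode k candidate strings, and return the one
    whose black-box prediction is closest to y_target."""
    z_prime = G(enc(x), y_target)      # x~ = Dec(G(Enc(x), y'))
    candidates = dec_topk(z_prime, k)  # top-k decoded strings
    return min(candidates, key=lambda c: abs(f(c) - y_target))
```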
Training objectives. To generate instances that are indistinguishable from real instances, discriminator $D$ uses an adversarial loss
$$\mathcal{L}_{adv} = \mathbb{E}_{z,r}\left[ r \log D_{adv}(z) + (1 - r) \log\left(1 - D_{adv}(z)\right) \right],$$
where $z$ is a (generated) embedding and $r$ a value indicating whether the instance was real (1) or fake (0). $G$ tries to minimize this objective, while $D$ tries to maximize it. Next to the adversarial loss we also include a target loss $\mathcal{L}_{tgt}$ to ensure that the generated instances (embedded as $v = z'$) resemble the target domain, while the original instances (embedded as $v = z$) resemble the original domain (Choi et al., 2018). To handle multiple types of black-box methods, we further distinguish between classification and regression targets:
$$\mathcal{L}_{tgt} = \begin{cases} -\log D_{tgt}(w \mid v) & \text{for classification,} \\ \left( D_{tgt}(v) - w \right)^2 & \text{for regression.} \end{cases}$$
Here, $w$ is either the original label $y$ (in case of $v = z$) or the target label $y'$ (in case of $v = z'$) corresponding to that respective instance. We indicate the version used by the discriminator (with $z$ and $y$) as $\mathcal{L}_{D_{tgt}}$, while we indicate the version used by the generator (with generated embedding $z' = G(z, y')$ and $y'$ as target label) as $\mathcal{L}_{G_{tgt}}$. Both $G$ and $D$ try to minimize this objective.
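Expressed as PyTorch loss terms, a sketch could look as follows (assuming a standard binary cross-entropy adversarial formulation; the exact functional forms are not spelled out in the text above):

```python
import torch
import torch.nn.functional as F

def adversarial_loss(d_adv_score: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """L_adv over discriminator scores in [0, 1]; r is 1.0 for real
    embeddings z and 0.0 for generated ones."""
    return F.binary_cross_entropy(d_adv_score, r)

def target_loss(d_tgt_out: torch.Tensor, w: torch.Tensor, task: str) -> torch.Tensor:
    """L_tgt: cross-entropy on class logits for classification targets,
    squared error for regression targets. w is the original label y for
    v = z and the target label y' for v = z'."""
    if task == "classification":
        return F.cross_entropy(d_tgt_out, w)         # w: class indices
    return F.mse_loss(d_tgt_out.squeeze(-1), w)      # w: regression values
```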
Lastly, the reconstruction loss (Zhu et al., 2017) $\mathcal{L}_{rec} = \frac{1}{2}\mathcal{L}_{rec,x} + \frac{1}{2}\mathcal{L}_{rec,z}$ ensures that only domain-relevant parts of the inputs are changed when constructing a counterfactual. Here, we use a cross-entropy loss $\mathcal{L}_{rec,x}$ between original instance $x$ and the cycle-reconstructed instance $\tilde{x} = \mathrm{Dec}(\tilde{z})$, and the $L_2$-norm $\mathcal{L}_{rec,z}$ of their respective embeddings $z$ and $\tilde{z} = G(G(\mathrm{Enc}(x), y'), y)$:
$$\mathcal{L}_{rec,x} = -\sum_{i} \sum_{j} x_{i,j} \log \tilde{x}_{i,j}, \qquad \mathcal{L}_{rec,z} = \left\lVert z - \tilde{z} \right\rVert_2,$$
where $y$ is the original label, $y'$ the target label, and $x_{i,j}$ and $\tilde{x}_{i,j}$ (with $j$ indexing the $j$-th token in a vocabulary) are corresponding elements of sequences $x$ and $\tilde{x}$, respectively. $G$ aims to minimize this objective.
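A sketch of the reconstruction term with one-hot token targets (the averaging over positions and batch is our assumption):

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(x_onehot: torch.Tensor,   # (batch, t, vocab) tokens of x
                        x_logits: torch.Tensor,   # (batch, t, vocab) Dec(z~)
                        z: torch.Tensor,          # (batch, t, h) original embedding
                        z_cycled: torch.Tensor    # (batch, t, h) G(G(Enc(x), y'), y)
                        ) -> torch.Tensor:
    """L_rec = 1/2 L_rec,x + 1/2 L_rec,z: token-level cross-entropy plus
    the L2 norm between original and cycle-reconstructed embeddings."""
    l_rec_x = -(x_onehot * F.log_softmax(x_logits, dim=-1)).sum(-1).mean()
    l_rec_z = (z - z_cycled).norm(p=2, dim=-1).mean()
    return 0.5 * l_rec_x + 0.5 * l_rec_z
```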
In Phase 1 (finetuning) we train the encoder-decoder part with a language modelling loss $\mathcal{L}_{lm}$ and pretrain discriminator $D$ by jointly training $D_{adv}$ ($\mathcal{L}_{adv}$) and $D_{tgt}$ ($\mathcal{L}_{D_{tgt}}$). To increase the informativeness of instances during finetuning, in some instances words are randomly swapped or commonly used tokens belonging to ground-truth instances with the approximate target values are inserted. The goal for Phase 1 is to minimize $\mathcal{L}_{finetune}$ using the aforementioned loss functions:
$$\mathcal{L}_{finetune} = \mathcal{L}_{adv} + \lambda_{D_{tgt}} \mathcal{L}_{D_{tgt}} + \lambda_{lm} \mathcal{L}_{lm},$$
where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{D_{tgt}}$ the discriminator target loss and $\mathcal{L}_{lm}$ the language modelling loss. $\lambda_{D_{tgt}}$ and $\lambda_{lm}$ are user-defined hyperparameters indicating the objectives' relative weights.
In Phase 2, the generator and discriminator are jointly optimized in an adversarial setting. They are trained using objective $\mathcal{L}_{GAN}$:
$$\mathcal{L}_{GAN} = \mathcal{L}_{adv} + \lambda_{G_{tgt}} \mathcal{L}_{G_{tgt}} + \lambda_{rec} \mathcal{L}_{rec},$$
where $G$ tries to minimize objective $\mathcal{L}_{GAN}$ and $D$ tries to maximize it. Generator loss $\mathcal{L}_G$ comprises the adversarial loss $\mathcal{L}_{adv}$, the target loss $\mathcal{L}_{G_{tgt}}$ for the generator, and reconstruction loss $\mathcal{L}_{rec}$ (responsible for ensuring minimal change when mapping to the target label and back). Again, $\lambda_{G_{tgt}}$ and $\lambda_{rec}$ are user-defined hyperparameters. The implementation details for our method are included in Appendix A.
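Putting the Phase 2 objective together as one alternating update, a sketch building on the loss helpers above (the `enc`/`dec`/`G` callables, the discriminator heads `D_adv`/`D_tgt`, the optimizers, the lambda values and the regression-task choice are all assumptions):

```python
import torch

def phase2_step(x, x_onehot, y, y_prime, enc, dec, G, D_adv, D_tgt,
                opt_G, opt_D, lam_tgt=1.0, lam_rec=1.0):
    """One Phase 2 update: D is pushed to score real embeddings high and
    generated ones low while fitting the black-box targets; G is pushed to
    fool D_adv, hit the target via D_tgt, and reconstruct the input."""
    with torch.no_grad():
        z = enc(x)                                 # encoder frozen in Phase 2

    # Discriminator step: real z vs. generated z', plus target fit on z.
    z_fake = G(z, y_prime).detach()
    d_loss = (adversarial_loss(D_adv(z), torch.ones_like(D_adv(z)))
              + adversarial_loss(D_adv(z_fake), torch.zeros_like(D_adv(z_fake)))
              + lam_tgt * target_loss(D_tgt(z), y, task="regression"))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: L_adv + lam_tgt * L_G_tgt + lam_rec * L_rec.
    z_prime = G(z, y_prime)                        # map to target y'
    z_cycled = G(z_prime, y)                       # and back to original y
    g_loss = (adversarial_loss(D_adv(z_prime), torch.ones_like(D_adv(z_prime)))
              + lam_tgt * target_loss(D_tgt(z_prime), y_prime, task="regression")
              + lam_rec * reconstruction_loss(x_onehot, dec(z_cycled), z, z_cycled))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```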

Experiments
CounterfactualGAN was evaluated against four baselines using a quantitative validation and a human evaluation in the form of a user experiment.

Predictive models and datasets
We evaluated the generation methods using three task-specific datasets: one regression analysis task, one binary classification task and one multi-class classification task. To assess model-agnosticism, for each of these tasks three models were devised.
Datasets & tasks. We used three well-known NLP datasets for the training and evaluation of the predictive models and generation methods. These datasets cover various domains of NLP regression and classification tasks. HATESPEECH (Davidson et al., 2017) is a Twitter dataset used for hatespeech identification, where for our purposes the three class labels hatespeech (1.0), offensive language (.4) and neither (.0) were recoded to be used in a regression analysis of hatespeech severity. During preprocessing, @mentions in Tweets were anonymized with the string '@user'. The Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) contains movie reviews with either positive or negative sentiment. SNLI (Bowman et al., 2015) is a textual entailment dataset, where the goal is to determine whether a hypothesis entails, contradicts or is neutral to a premise.
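The label recoding and mention anonymization could be sketched as follows (the exact regex and label strings in the released dataset are assumptions):

```python
import re

LABEL_TO_SEVERITY = {"hatespeech": 1.0, "offensive language": 0.4, "neither": 0.0}

def preprocess(tweet: str, label: str):
    """Recode the three HATESPEECH class labels to a severity score for
    regression, and anonymize @mentions with '@user'."""
    return re.sub(r"@\w+", "@user", tweet), LABEL_TO_SEVERITY[label]
```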
Each of these datasets was split into a training set (used for predictive/counterfactual generation model training), development set (used for hyperparameter optimization) and test set (used for evaluation). An overview of each dataset is provided in Table 1, including the task description, size of the dataset and its mean number of words.²

Predictive models. The predictive methods result in a model $f(\cdot)$, which gives predictive values (regression values or class probabilities) for each instance. First, we include a hand-crafted white-box model (WB) where ground-truth counterfactuals can be deduced.³ In addition, we used two recent popular approaches that have shown competitive performance on several text regression and classification tasks as black-box models: InferSent (IS) (Conneau et al., 2017) and BERT (BE) (Devlin et al., 2019). Both models were finetuned on the specific dataset and corresponding task. The performance of each method on each task is shown in Table 1, where the performance is measured with MSE (lower is better) for HATESPEECH and macro-averaged $F_1$ (higher is better) for SST-2 and SNLI.

Counterfactual generation methods
We compared CounterfactualGAN to four baseline model-agnostic counterfactual generation methods. Each method creates a single counterfactual $\tilde{x} = \mathrm{TCF}(x, y')$ for each instance $x \in X_{test}$.

SEDC. Search for Explanations for Document Classification (Martens and Provost, 2014) aims to find the minimal set of words so that removing these words changes the decision from the current output.

PWWS+. PWWS uses synonym substitution based on WordNet (Miller, 1995) to craft untargeted adversarial examples for classification. We extend this method for targeted counterfactual generation by (i) also allowing for antonym substitutions, (ii) including support for regression analysis and (iii) returning an instance close to the target.

TextFooler. TextFooler (Jin et al., 2020) provides a competitive baseline for semantic adversarial attacks for text classification and entailment. It replaces the most sensitive words with synonyms with an equal part-of-speech to craft adversarial instances, while ensuring maximal semantic similarity. For our purposes, we extend TextFooler to regression analysis by considering two predictions $y$ and $y'$ approximately equal when $|y - y'| \leq 0.2$.

Evaluation
A realistic, targeted counterfactual generation method should produce counterfactuals that (i) accurately mimic black-box $f(\cdot)$, (ii) are realistic for a given dataset (see Section 3) and (iii) are perceptibly distinguishable from the original instance (see Section 2.2). To capture these aspects, we evaluated the generation methods on each predictive model and task using three metrics: fidelity, naturalness and perceptibility. Fidelity determines how accurately the method captures $f(\cdot)$, high naturalness indicates realisticness and perceptibility quantitatively estimates if the difference between $x$ and $\tilde{x}$ is sufficient to be used for forming explanations. The quantitative metrics fidelity and perceptibility were evaluated on test set $X_{test}$⁴, while we selected a representative subset of 30 instances in $X_{test}$ to evaluate naturalness in a human experiment. We assigned a random target $y'$ to each instance $x \in X_{test}$, and used this to generate a corresponding targeted counterfactual $\tilde{x} = \mathrm{TCF}_f(x, y')$ using each generation method. This procedure was repeated for five random targets per instance, resulting in five counterfactuals for each instance. For CounterfactualGAN, the counterfactual for each instance was selected by generating the top-5 counterfactuals and selecting the one where $f(\tilde{x})$ was closest to target $y'$. Table 2 shows example targeted counterfactuals. Additional examples are included in Appendix B.
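Procedurally, this evaluation loop can be sketched as follows (function and variable names are illustrative):

```python
import random

def evaluate(tcf, f, X_test, targets, repeats=5):
    """Evaluation protocol sketch: assign a random target y' to each test
    instance, generate x~ = TCF_f(x, y'), and repeat five times per
    instance; comparing f(x~) to y' afterwards yields fidelity."""
    records = []
    for x in X_test:
        for _ in range(repeats):
            y_prime = random.choice(targets)
            x_cf = tcf(x, y_prime)
            records.append((x, y_prime, x_cf, f(x_cf)))
    return records
```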
Fidelity. Fidelity evaluates how well the generation method estimates the true behavior of the black-box predictive model (Ribeiro et al., 2016b). A generation method accurately mimicking its black-box will be able to produce counterfactuals that are of target class $y'$ according to predictive model $f(\cdot)$. For classification, the fidelity is oftentimes captured using the label flip score (Wu et al., 2021), i.e. how often the predicted label 'flips' to the target label. To generalize this notion beyond classification, for each instance we compared output $f(\tilde{x})$ to target $y'$ and measured its performance. The measures used are the same as for overall predictive model performance for each task. Table 3 reports on the results for the fidelity evaluation, showing that our method outperforms the baselines on 6 out of 9 model-dataset pairs. A one-way ANOVA shows a significant difference between methods for HATESPEECH [$F(4, 70)$ ...].

[Table 2 rows (SNLI example counterfactuals): Masked-LM: "a blond young woman in a black blue shirt is standing being a counter seenly seated, in a large chair." (neutral); TextFooler: "a blond blonds woman lady in a black negra shirt jumper is standing stands behind a counter counteract." (contradiction); CounterfactualGAN: "a blond big woman man in a black shirt is standing behind a counter." (contradiction).]

Perceptibility. To be used in explanations, the generation methods should produce counterfactuals $\tilde{x}$ that are perceptibly distinguishable from their corresponding instances $x$ (see Section 2.2). We quantitatively estimate perceptibility by taking semantic similarity estimated by the Universal Sentence Encoder (USE) (Cer et al., 2018), where we measure perceptibility with semantic distance $1 - \mathrm{USE}(x, \tilde{x})$. Unlike semantic adversarial examples, which have the goal of minimizing this semantic distance, we aim to have a higher score such that the difference (e.g. positive words in a review to negative ones) can be easily perceived, while being far enough from a completely unrelated counterfactual (score of 1). A computational sketch of this metric follows at the end of this subsection.

Human experiment: naturalness. We qualitatively determined which of the generation methods produces the most natural counterfactuals according to 196 native English speakers sampled from crowd-sourcing platform Prolific (https://prolific.co). Naturalness indicates how realistic an utterance is for a given context ('movie reviews', 'Tweets' or 'reading comprehension'). Participants were provided with pairs of counterfactuals generated for the same instance and predictive model, and asked which utterance was more natural in that context. A natural instance is one that could have been produced by a human (Novikova et al., 2017). Note that unlike other experiments humans were not asked to judge if the counterfactual correctly belongs to the target (e.g. positive/negative reviews), as the counterfactual explains the model behavior on the data, which may not correspond with human interpretation of the distinguishing factors. Each participant received a random subset of 50 pairs in which they chose whether they preferred the first utterance (generated by one counterfactual generation method), the second (generated by another), or had no preference. The participants were urged to choose between the utterances even with a slight preference. All instances for each predictive model, dataset and generation method were shown at least five times to varying participants. Participants had excellent inter-rater reliability [Krippendorff's $\alpha = .84$]. Appendix C expands further on the experimental procedure. The results, reported in Table 5, show that our method is preferred (wins) regarding naturalness across all datasets and predictive models.
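Returning to the perceptibility metric defined above, it reduces to one minus a sentence-embedding similarity; a sketch on precomputed embeddings (the paper obtains them with the Universal Sentence Encoder, whose similarity we approximate here with cosine similarity):

```python
import numpy as np

def perceptibility(e_x: np.ndarray, e_cf: np.ndarray) -> float:
    """Semantic distance 1 - USE(x, x~), approximated as one minus the
    cosine similarity of precomputed sentence embeddings."""
    cos = float(np.dot(e_x, e_cf) / (np.linalg.norm(e_x) * np.linalg.norm(e_cf)))
    return 1.0 - cos
```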

Conclusion
In this paper, we proposed CounterfactualGAN: a counterfactual generation method providing realistic counterfactuals to explain natural language text regression and classification black-box models, using a combination of pretrained LMs and a StarGAN to craft counterfactuals. Experimental results showed that our counterfactual generation method outperforms baselines across predictive models and datasets in finding human-perceptible, targeted counterfactuals, which remained natural according to human judgments.

[Table 4: Mean perceptibility of each counterfactual generation method (5 run average). The best score (↑) for each column is highlighted in bold. Table 5: Percentage (%) of wins (W), ties (T) and losses (L) of our method against four baseline methods according to human judgments of naturalness. The methods deemed most natural are highlighted in bold.]
CounterfactualGAN greatly improves the quality of natural language counterfactuals, potentially having a profound effect on the explanation quality of model-agnostic XAI methods using perturbations to form explanations (e.g. local surrogates (Ribeiro et al., 2016b) and counterfactual explanations (Wachter et al., 2018)). For future work, we intend to (i) assess for which XAI methods and in which contexts realisticness is most beneficial, (ii) determine what level of perceptibility is optimal for human-understandable explanations, and (iii) extend CounterfactualGAN to languages other than English and more ML task types (e.g. multi-label classification).

A Implementation details
CounterfactualGAN is implemented in PyTorch (Python 3.7.5) and trained on a Tesla V100 GPU with CUDA 10.2. BERT is used as a pretrained language model (LM) encoder $\mathrm{Enc}(\cdot)$, implemented using prajjwal1/bert-small of the Transformers package (Wolf et al., 2020), allowing more effective training of the method parts than a larger model would. bert-small is a small, uncased version of the BERT model from the official Google repository⁶ with a hidden dimension size of 512, 4 attention heads and 4 layers that total 29.1M parameters (Turc et al., 2019).
The encoder transforms the instance into an embedding of size $t \times h$, where $t$ is the maximum number of tokens (the special [CLS] token is placed before the first token, [SEP] after the last token, and the remainder is filled with [PAD] tokens) and $h$ the hidden dimension size. Decoder $\mathrm{Dec}(\cdot)$ is a fully-connected linear layer with bias, transforming an embedding $z$ into a tensor of size $t \times v$ containing logits for each token in a vocabulary of size $v$. We extract $k$ token sequences from these using nucleus sampling with $p = .9$ (Holtzman et al., 2020). Nucleus sampling selects the top logits for each position such that their softmax probabilities sum to $p$. The chosen tokens are then recombined into strings using the Penn Treebank detokenizer in NLTK⁷ for each of the top-$k$ counterfactuals. These were then fed back into $f(\cdot)$ to calculate their true target, after which the one most similar to the provided target $y'$ was selected as the counterfactual.
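A sketch of nucleus sampling for a single position's logits, following the "keep the top tokens until the cumulative mass reaches p" rule described above:

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Nucleus (top-p) sampling: keep the smallest set of highest-probability
    tokens whose softmax mass reaches p, then sample one token id from the
    renormalized distribution."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < p   # the top token is always kept
    sorted_probs[~keep] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```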
During Phase 1, we increase instance informativeness by including copies of a batch where (i) in each instance 15% of the words are replaced by a [MASK] token, (ii) in each instance 15% of the tokens are randomly swapped and (iii) for 15% of the tokens random tokens belonging to the target (regression value bin or class) are inserted into the token sequence. In each case, this is done non-destructively to ensure that the special tokens [CLS] and [SEP] are not replaced.
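A compact sketch of these three corruptions (the paper applies each to a separate copy of a batch; here they are shown in one pass on a token list, with hypothetical names, assuming at least two non-special tokens):

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def augment(tokens, target_tokens, rate=0.15):
    """Phase 1 informativeness corruptions, never touching special tokens.
    `target_tokens` are tokens commonly used in ground-truth instances
    with the approximate target value (regression bin or class)."""
    out = list(tokens)
    eligible = [i for i, t in enumerate(out) if t not in SPECIAL]
    for i in random.sample(eligible, max(1, int(rate * len(eligible)))):
        out[i] = MASK                                 # (i) masking
    i, j = random.sample(eligible, 2)
    out[i], out[j] = out[j], out[i]                   # (ii) random swap
    out.insert(random.choice(eligible), random.choice(target_tokens))  # (iii)
    return out
```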
Generator $G$ is a three-layer Transformer decoder with three attention heads, which receives the embedding $\mathrm{Enc}(x)$ as a shared input. The target tokens are the same embedding, except that the first token is replaced with an embedding of size $h$ that contains the target for that instance. Discriminator $D$ is a two-layer gated recurrent unit (GRU) with 10% dropout, which transforms an embedding of size $t \times h$ into an embedding of size $1 \times h$ that is used by the $D_{tgt}$ and $D_{adv}$ heads. Both $D_{adv}$ (determining realisticness of instances) and $D_{tgt}$ (predicting what the value of black-box $f(\cdot)$ for the embedding is) are single-layer feed-forward neural networks.
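A minimal PyTorch sketch of this wiring (one reading of the description above; note that nn.TransformerDecoderLayer requires the hidden size to be divisible by the number of heads, so we use four heads here instead of the reported three, and all remaining choices are assumptions):

```python
import torch
import torch.nn as nn

H, T = 512, 64  # hidden size (bert-small) and an assumed max token count

# Generator G: three-layer Transformer decoder over the shared encoder memory.
gen_layer = nn.TransformerDecoderLayer(d_model=H, nhead=4, batch_first=True)
G = nn.TransformerDecoder(gen_layer, num_layers=3)

# Discriminator D: two-layer GRU pooling t x h into 1 x h, then two heads.
gru = nn.GRU(H, H, num_layers=2, dropout=0.1, batch_first=True)
D_adv = nn.Linear(H, 1)  # realisticness head
D_tgt = nn.Linear(H, 1)  # black-box mimic head (regression variant shown)

def generate(z: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """The target tokens are the input embedding with its first token
    replaced by an embedding of the target for that instance."""
    tgt = z.clone()
    tgt[:, 0, :] = target_emb
    return G(tgt=tgt, memory=z)

def discriminate(z: torch.Tensor):
    _, h_n = gru(z)   # h_n: (num_layers, batch, H)
    pooled = h_n[-1]  # final hidden state of the last layer (1 x h per input)
    return torch.sigmoid(D_adv(pooled)), D_tgt(pooled)

z = torch.randn(2, T, H)  # stand-in for Enc(x)
z_prime = generate(z, torch.randn(2, H))
real_score, tgt_pred = discriminate(z_prime)
```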
For reproducibility purposes, the hyperparameters of the model with the highest fidelity for each predictive model-dataset pair are reported in Table 6.

Time usage. The training and inference time vary by dataset due to their different sizes (see Table 1). For the hardware setup reported previously (PyTorch on a single core of the Tesla V100 GPU), we report the mean wall time for training and the wall time for crafting counterfactuals for the whole test set (inference).

Predictive models. InferSent (Conneau et al., 2017) (IS) uses a bi-directional LSTM on word-level GloVe embeddings (Pennington et al., 2014) to create semantic representations of sentences. In BE the [CLS] token in the final layer is used as a sentence representation. In both cases, these were then input into a linear layer to produce the prediction for the black-box predictive model.

B Example counterfactuals

Table 8 compares illustrative examples from all datasets for all counterfactual generation methods. In addition, we first include the original instance, and describe its original prediction and the counterfactual target. Note that all instances and generated counterfactuals have been lowercased. Moreover, in HATESPEECH users in mentions are replaced by '@user' to ensure anonymity.

C Human Experiment
For our human experiment, we recruited 199 participants from crowd-sourcing platform Prolific. All users remained anonymous. Their Prolific ID was only used for participant pay-out (they were awarded £2.50 for 20 minutes of their time, providing a good hourly rate according to https://prolific.co in January 2021), after which it was discarded in further processing. Participants were randomly assigned to us, where the selection criteria we provided were that their first spoken language is English (self-reported) and that they had an approval rate of at least 80%. First, participants were introduced to the task to determine the naturalness of two utterances. We defined naturalness in our study as "[...] an utterance is more natural if it is more likely that it was produced by a human. Aspects you could consider are the type of language used in a context, grammatical correctness and semantically meaningful sentences." Next, they were asked to agree to a GDPR-compliant informed consent form before continuing their participation. We generated pairwise comparisons by sampling 30 instances from each test set, and for each explanation method picking the corresponding counterfactuals from the run with the best fidelity score (highest $F_1$ or lowest MSE). Participants were provided with 50 pairwise comparisons, where for each question they were asked "Which of the following {context} is more natural?" The provided text in the {context} placeholder depended on the dataset these counterfactuals were generated for, namely Tweets for HATESPEECH, movie reviews for SST-2 and reading comprehension sentences for SNLI. Figure 3 provides an example question. The 50 questions were randomly drawn from all pairwise comparisons, and shown in a random order. To check the quality of each submission, we included two quality control mechanisms: (i) we recorded the time of each survey completion and (ii) we included two control pairwise comparisons before and after the pairwise comparisons. The estimated completion time of the survey was 20 minutes, with a true average completion time of 14 minutes and 42 seconds. The options in the control questions compared true instances to ones with the lowest word-level edit distance by any generation method, one for HATESPEECH and one for [...]. Excluding participants with completion times ≤ 5 minutes (n = 2) and participants choosing the non-natural answer for both control questions (n = 1), the final sample size was n = 196.

Table 8: Original instances and generated counterfactuals for each dataset and method.

HATESPEECH (black-box IS, higher to lower hatespeech score)
Original: rt @user: i aint gonna text first cus pride b*tch
SEDC: @user i aint cus
PWWS+: rt user: i gonna text first cus humility
Masked-LM: rt @user: i destroyedt gonna costumes & natural turner promising b*tch bree
TextFooler: rt @user: i aint gonna text first cus pride bitch
Ours (top-1): rt @user: i aint na text first cus pride
Ours (top-3): rt @user: i aint gonna text first cus my pride
Ours (top-5): rt @user: i aint gonna text first cus my pride

HATESPEECH (black-box WB, medium to higher hatespeech score)
Original: rt @user: sonia never criticises kejriwal. kejriwal, who trashes every other leader, never criticises sonia. touching.
SEDC: : sonia kejriwal, who .other leader criticises sonia. .
PWWS+: rt @user: sonia ever criticises kejriwal ., who trashes every other leader, never criticises touching.
Masked-LM: rt @user: load defend extensionss ke auditionsri ore .ld charged willy rod, honest trashes prevented liu
TextFooler: rt @subscriptions: sonia never criticises kejriwal. kejriwal, who trashes each other executives, never criticises sonia. touching.
Ours (top-1): rt @user: sonia never criticises kejriwal─ kejriwal, who trashes every other leader, never criticises sonia. touching.
Ours (top-3): rt @user: sonia criticises kej wall. kejriwal, who trashes every other leader, never criticises sonia. touching b*tch.
Ours (top-5): rt @user: sonia criticises kejriwal. kejriwal, who trashes every b*tch leader, never criticises sonia. b*tch touching b*tch.

SST-2 (black-box BE, from positive to negative)
Original: the movie has lots of dancing and fabulous music
SEDC: movie lots and music
PWWS+: the movie lack lots of fabulous music
Masked-LM: the movie has 295 of dancing andial music
TextFooler: the photo possesses parcel of cheer and amazing symphonic
Ours (top-1): the movie has loads of dancing and fabulous music
Ours (top-3): the movie has loads of dancing and terrible music
Ours (top-5): the movie has a loss of music and terrible music

SST-2 (black-box WB, from positive to negative)
Original: the problem with concept films is that if the concept is a poor one, there's no saving the movie
SEDC: the with concept films is that if concept is a, there no saving the
PWWS+: the problem with concept films is that if the concept is a rich one, there saving the movie
Masked-LM: the problem with concept films is that if the concept is a ≥ one, there's no saving the movie
TextFooler: the matters with concepts movie is that if the concepts is a poorer one, there's no save the film
Ours (top-1): the problem with concept films is that if the concept is a good one, there's no
Ours (top-3): the problem with films is that if the concept is a good one, there's no saving
Ours (top-5): the thing with concept films is that if the concept is a good one, there's saving

SNLI (black-box WB, from neutral to contradicting; edits are only applied to the premise)
Original: Premise: a man is posing on a ski board with snow in the background. Hypothesis: a naked man is posing on a ski board with snow in the background.
SEDC: a man is on ski board snow in.
PWWS+: a man is a ski board snow in the play up.
Masked-LM: a man is posing on a ski board with snow in the background.
TextFooler: a friend is parading on a slalom juries with blizzards in the wellspring.
Ours (top-1): a man is posing on a ski board with snow in the background.
Ours (top-3): a girl is sitting on a ski board with snow in the background.
Ours (top-5): a girl is posing on a ski with snow in the background.