Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis

While state-of-the-art NLP models have achieved excellent performance on a wide range of tasks in recent years, important questions are being raised about their robustness and their underlying sensitivity to systematic biases that may exist in their training and test data. Such issues manifest as performance problems when models face out-of-distribution data in the field. One recent solution has been to use counterfactually augmented datasets in order to reduce any reliance on spurious patterns that may exist in the original data. Producing high-quality augmented data can be costly and time-consuming, as it usually involves human feedback and crowdsourcing efforts. In this work, we propose an alternative by describing and evaluating an approach to automatically generating counterfactual data for the purpose of data augmentation and explanation. A comprehensive evaluation on several different datasets, using a variety of state-of-the-art benchmarks, demonstrates how our approach can achieve significant improvements in model performance when compared to models trained on the original data, and even when compared to models trained with the benefit of human-generated augmented data.


Introduction
Deep neural models have recently made remarkable advances on sentiment analysis (Devlin et al., 2018; Xie et al., 2020). However, their implementation in practical applications still encounters significant challenges. Of particular concern, these models tend to learn intended behavior that is often associated with spurious patterns (artifacts) (Jo and Bengio, 2017; Slack et al., 2020a). As an example, in the sentence "Nolan's films always shock people, thanks to his superb directing skills", the most influential word for the prediction of a positive sentiment should be "superb" instead of "Nolan" or "film". The issue of spurious patterns also partially affects the out-of-domain (OOD) generalization of models trained on independent and identically distributed (IID) data, leading to performance decay under distribution shift (Quiñonero-Candela et al., 2009; Sugiyama and Kawanabe, 2012; Ovadia et al., 2019).
Researchers have recently found that such concerns about model performance decay and social bias in NLP arise out-of-domain because of a sensitivity to semantically spurious signals (Gardner et al., 2020), and recent studies have uncovered a problematic tendency for gender bias in sentiment analysis (Zmigrod et al., 2019; Maudslay et al., 2019; Lu et al., 2020). To this end, one possible solution is data augmentation with counterfactual examples (Kaushik et al., 2020) to ensure that models learn real causal associations between the input text and labels. For example, a sentiment-flipped counterfactual of the previous example could be "Nolan's movies always bore people, thanks to his poor directorial skills." When added to the original set of training data, such counterfactually augmented data (CAD) have shown their benefits for learning real causal associations and improving model robustness in recent studies (Kaushik et al., 2020, 2021; Wang and Culotta, 2021). Unlike gradient-based adversarial examples (Wang and Wan, 2019; Zang et al., 2020), which cannot provide a clear boundary between positive and negative instances to humans, counterfactuals can provide "human-like" logic to show a modification to the input that makes a difference to the output classification (Byrne, 2019).
Recent attempts at generating counterfactual examples (also known as minimal pairs) rely on human-in-the-loop systems. Kaushik et al. (2020) proposed a human-in-the-loop method to generate CAD by employing human annotators to generate sentiment-flipped reviews. The human labeler is asked to make minimal and faithful edits to produce counterfactual reviews. Similarly, Srivastava et al. (2020) presented a framework to leverage strong prior (human) knowledge to understand the possible distribution shifts for a specific machine learning task; they use human commonsense reasoning as a source of information to build a model that is more robust against spurious patterns. Although useful for reducing sensitivity to spurious correlations, collecting enough high-quality human annotations is costly and time-consuming.
The theory behind the ability of CAD to improve model robustness in sentiment analysis is discussed by Kaushik et al. (2021), where researchers present a theoretical characterization of the impact of noise in causal and non-causal features on model generalization. However, methods for automatically generating CAD have received less attention. The only existing approach (Wang and Culotta, 2021) has been tested on the logistic regression model only, despite the fact that recent state-of-the-art methods for sentiment classification are driven by neural models. Also, their automatically generated CAD cannot produce competitive performance compared to human-generated CAD. We believe that their method does not sufficiently leverage the power of pre-trained language models and fails to generate fluent and effective CAD. In addition, the relationships between out-of-domain generalization and sensitivity to spurious patterns were not explicitly investigated by Wang and Culotta (2021).
To address these issues, we use four benchmark datasets (IMDB movie reviews for the hold-out test, and the Amazon, Yelp, and Twitter datasets for the out-of-domain generalization test) to further explore the efficacy of CAD for sentiment analysis. First, we conduct a systematic comparison of several different state-of-the-art models (Wang and Culotta, 2021). This reveals how large Transformer-based models (Vaswani et al., 2017) with larger parameter sizes may improve the resilience of machine learning models. Specifically, we have found that for increasing parameter spaces, CAD's performance benefit tends to decrease, regardless of whether CAD is controlled manually or automatically. Second, we introduce a novel masked language model to help improve the fluency and grammatical correctness of the generated CAD. Third, we add a fine-tuned model as a discriminator for automatically evaluating the edit-distance, using data generated with minimal and fluent edits (the same requirements as for human annotators in Kaushik et al. (2020)) to ensure the quality of generated counterfactuals. Experimental results show that this leads to significant prediction benefits on both hold-out tests and generalization tests.
To the best of our knowledge, we are the first to automatically generate counterfactuals for use as augmented data to improve the robustness of neural classifiers, which can outperform existing, state-of-the-art, human-in-the-loop approaches. We will release our code and datasets on GitHub.

Related Work
This work mainly touches on three important areas: approaches to evaluation that go beyond traditional accuracy measures (Bender and Koller, 2020; Warstadt et al., 2020), the importance of counterfactuals in eXplainable AI (XAI) (Byrne, 2019; Keane and Smyth, 2020), and out-of-domain generalization in sentiment analysis (Kim and Hovy, 2004; Zhang et al., 2018). There has been an increasing interest in the role of Robustness Causal Thinking in ML, often by leveraging human feedback. Recently, some of the standard benchmark datasets have been challenged (Gardner et al., 2020; Ribeiro et al., 2020), with model performance significantly lower on contrast sets than on original test sets; a difference of up to 25% in some cases. Researchers propose counterfactual data augmentation approaches for building robust models (Maudslay et al., 2019; Zmigrod et al., 2019; Lu et al., 2020), and find that spurious correlations threaten the model's validity and reliability. In an attempt to address this problem, Kaushik et al. (2020) explore opportunities for developing human-in-the-loop systems by using crowd-sourcing to generate counterfactual data from original data, for data augmentation. Teney et al. (2020) show the continuous effectiveness of CAD in computer vision (CV) and NLP.
Figure 1: Overview of previous CAD methods (left) and the pipeline of our method (right). Hierarchical RM-CT (removing the causal terms) and Hierarchical REP-CT (replacing the causal terms) are our two methods for automatically generating CAD. SCD denotes sampling and sensitivity of contextual decomposition. Sentiment Dictionary refers to the opinion lexicon published by Hu and Liu (2004). MoverScore is used to control the minimal edits of the automatically generated CAD.
The idea of generating counterfactuals in XAI also shares important conceptual features with our work. Human counterfactual explanations are minimal in the sense that they select a few relevant causes (Byrne, 2019; Keane and Smyth, 2020), which mirrors the requirement of minimal edits in our generation process. This has been explored more in the field of CV (Kenny and Keane, 2021), but investigated less in NLP. Recent work (Jacovi and Goldberg, 2020) highlights explanations of a given causal format, and Yang et al. (2020a) generate counterfactuals for explaining the prediction of financial text classification. We pose a related but different research question, namely whether automatically generated counterfactuals can be used for data augmentation to build more robust models, which has not been considered by previous methods in XAI (Pedreschi et al., 2019; Slack et al., 2020b; Yang et al., 2020b; Ding et al., 2020).
In the case of sentiment analysis, most previous works report experiments using a hold-out test on the IID dataset (Liu, 2012; Yang et al., 2016; Johnson and Zhang, 2017). The current state-of-the-art methods make use of large pre-trained language models (e.g., BERT (Devlin et al., 2018), RoBERTa, and SMART-RoBERTa (Jiang et al., 2020)) for calculating input representations. It has been shown that these methods can suffer from spurious patterns (Kaushik et al., 2020; Wang and Culotta, 2021). Very recently, Wang and Culotta (2021) provided a starting point for exploring the efficacy of automatically generated CAD for sentiment analysis, but their study is still based on IID hold-out tests only. However, spurious patterns in the training and test sets could be tightly coupled, which may limit the possibility of observing their attendant accuracy issues using a hold-out test methodology. For this reason, we designed an indirect method for evaluating the robustness of models, by comparing the performance of models trained on original and augmented data using out-of-domain data. The prediction benefit for out-of-domain data should provide some evidence about whether a model's sensitivity to spurious patterns has been successfully mitigated. The resulting counterfactuals can be used for data augmentation and can also provide contrastive explanations for classifiers, an important and desirable consideration given the recent move towards more XAI (Ribeiro et al., 2016; Lundberg and Lee, 2017; Lipton, 2018; Pedreschi et al., 2019; Slack et al., 2020b).

Detailed Implementation
We propose a new approach for automatically generating counterfactuals to enhance the robustness of sentiment analysis models by inverting the sentiment of causally important terms according to Algorithm 1 and based on the following stages: 1. The identification of genuine causal terms using self-supervised contextual decomposition (Section 3.1).
The end result will be a set of counterfactuals that can be used to augment an existing dataset.

Identifying Causal Terms
To identify causally important terms, we propose a hierarchical method, based on the sampling and sensitivity of contextual decomposition technique from Jin et al. (2019), that incrementally removes words from a sentence in order to evaluate the model's sensitivity to these words. Significant changes in model outputs suggest the removal of important terms. For example, removing the word "best" from "The movie is the best that I have ever seen." is likely to alter a model's sentiment prediction more than the removal of other words from the sentence; thus "best" is an important word with respect to this sentence's sentiment. In a similar way, phrases beginning with negation words will likely be important; for instance, "not satisfy you" is important in "This movie could not satisfy you". Given a word (or phrase starting with negative limitations) w in the sentence s, the importance of w can be calculated as in Equation 1, where s_β\p denotes the sentence that results after masking out a single word (or a negative phrase as above). We use l(s_β\p; s) to represent the model prediction after replacing the masked-out context, while s_β is an input sequence sampled from the input s. The operator \p indicates masking out the phrase p in an input document D from the training set. The specific candidate causal terms found by this masking operation vary for different prediction models.
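The masking-based importance score described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes that the importance of a phrase is the average drop in a model score when the phrase is masked out of randomly sampled context windows, in the spirit of the technique of Jin et al. (2019). The names `find_phrase`, `mask_phrase`, `sample_window`, `importance`, the `<pad>` token, and the lexicon-based `toy_score` are all hypothetical stand-ins; a real pipeline would plug in the classifier's own scoring function.

```python
import random
from typing import Callable, List, Sequence

PAD = "<pad>"  # hypothetical padding token used to mask out a phrase


def find_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> int:
    """Return the start index of the first occurrence of phrase in tokens."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(phrase):
            return i
    raise ValueError("phrase not found in tokens")


def mask_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> List[str]:
    """Replace the first occurrence of phrase with padding tokens."""
    start = find_phrase(tokens, phrase)
    out = list(tokens)
    out[start:start + len(phrase)] = [PAD] * len(phrase)
    return out


def sample_window(tokens, phrase, radius: int, rng: random.Random) -> List[str]:
    """Sample a context window around the phrase (a stand-in for s_beta)."""
    start = find_phrase(tokens, phrase)
    left = rng.randint(0, radius)
    right = rng.randint(0, radius)
    return list(tokens[max(0, start - left):start + len(phrase) + right])


def importance(tokens, phrase, score: Callable[[List[str]], float],
               n_samples: int = 20, radius: int = 5, seed: int = 0) -> float:
    """Average score drop caused by masking the phrase in sampled windows."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        window = sample_window(tokens, phrase, radius, rng)
        total += score(window) - score(mask_phrase(window, phrase))
    return total / n_samples


# Toy scorer (assumption): counts positive lexicon words in the window.
POSITIVE = {"best", "good", "superb"}


def toy_score(tokens: List[str]) -> float:
    return float(sum(t in POSITIVE for t in tokens))


sentence = "the movie is the best that i have ever seen".split()
best_importance = importance(sentence, ["best"], toy_score)
movie_importance = importance(sentence, ["movie"], toy_score)
print(best_importance, movie_importance)
```

With this toy scorer, masking "best" always lowers the score while masking "movie" never does, which mirrors the example in the text; with a neural classifier the scores would of course be graded rather than binary.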

Generating Human-like Counterfactuals
This approach and the scoring function in Equation 1 are used in Algorithm 1 in two ways, to generate two types of plausible counterfactuals. First, it is used to identify words to remove from a sentence to produce a plausible counterfactual. This is referred to as RM-CT and is performed by lines 3-5 in Algorithm 1; for a sentence S^(i), its correctly labeled sentiment words are identified (line 3) and sorted based on Equation 1 (line 4) with classifier C, and the most important of these words is removed from S^(i) to produce S_rep. Second, for REP-CT, for each word w we use a masked language model (MLM) to generate a set of plausible replacements, W_p (line 8), and keep a subset of these, W_c, as replacement candidates if their sentiment is different from the sentiment of S^(i), which is given by Y_i (line 9). Here we use BERT-base-uncased as the pre-trained MLM for the SVM and BiLSTM models. The number of candidate substitutions taken from the MLM output is set to 100 for all models. Then, W_c is sorted in descending order of importance using Equation 1, and the most important candidate is selected and used to replace w in S_rep^(i) (line 10). Algorithm 1 continues in this fashion to generate counterfactual sentences using RM-CT and REP-CT for each sentence in each paragraph of the target document. It returns two counterfactual documents, which correspond to the documents produced from the RM-CT and REP-CT sentences; see lines 15-18.
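The candidate-selection step of REP-CT can be illustrated with a short sketch. It is written under stated assumptions rather than taken from Algorithm 1: the MLM proposals are passed in as a plain list, a small dictionary `polarity` stands in for the sentiment dictionary, and `importance_fn` stands in for the Equation 1 scorer. The function name `rep_ct_replace` and all toy values are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def rep_ct_replace(tokens: List[str], index: int, candidates: List[str],
                   polarity: Dict[str, int], label: int,
                   importance_fn: Callable[[List[str], str], float]
                   ) -> Optional[List[str]]:
    """Replace tokens[index] with the most important opposite-polarity candidate.

    candidates    : plausible replacements (W_p), e.g. proposed by an MLM
    polarity      : word -> sentiment label from a sentiment lexicon
    label         : sentiment label of the original sentence (Y_i)
    importance_fn : scores a candidate inside the edited sentence
    """
    # W_c: keep only candidates whose lexicon polarity differs from the label.
    w_c = [c for c in candidates if c in polarity and polarity[c] != label]
    if not w_c:
        return None  # no usable replacement for this word

    def swapped(candidate: str) -> List[str]:
        return tokens[:index] + [candidate] + tokens[index + 1:]

    # Pick the candidate with the highest importance in the edited sentence.
    best = max(w_c, key=lambda c: importance_fn(swapped(c), c))
    return swapped(best)


# Toy inputs (assumptions): 0 = negative, 1 = positive.
toy_polarity = {"badly": 0, "poorly": 0, "well": 1, "superbly": 1}
toy_strength = {"well": 0.6, "superbly": 0.9}

original = ["it", "is", "badly", "directed"]
revised = rep_ct_replace(
    original, 2, ["well", "poorly", "superbly", "the"],
    toy_polarity, label=0,
    importance_fn=lambda toks, w: toy_strength.get(w, 0.0),
)
print(revised)
```

Passing the scorer in as a function keeps the sketch independent of any particular classifier, which matches the observation in the text that candidate causal terms vary across prediction models.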
The above approach is not guaranteed to always generate counterfactuals. Typically, reviews that cannot be transformed into plausible counterfactuals contain spurious associations that interfere with the model's predictions. For example, in our method, the negative review "The film is pretty bad, and her performance is overacted" will be first modified as "The film is pretty good, and her performance is lifelike". The revised review's prediction will remain negative. Meanwhile, the word "her" will be identified as a potential causal term. To alleviate this problem, we further conduct the substitution of synonyms for those instances that have been already modified with antonym substitution by using causal terms. As an example, we will continue replacing the word "her" with "their" until the prediction has been flipped; see also Zmigrod et al. (2019) for related ideas.
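The fallback described above, substituting synonyms for suspected spurious terms until the prediction flips, might look like the following sketch. It is only an illustration under assumptions: `predict` is a toy weighted-lexicon classifier in which "her" carries a spurious negative weight, and `synonyms`, `ranked_terms`, and `flip_with_synonyms` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional


def flip_with_synonyms(tokens: List[str], target: int,
                       predict: Callable[[List[str]], int],
                       synonyms: Dict[str, str],
                       ranked_terms: List[str]) -> Optional[List[str]]:
    """Substitute synonyms for ranked terms until predict() returns target."""
    current = list(tokens)
    for term in ranked_terms:
        if predict(current) == target:
            break  # prediction already flipped, stop editing
        if term in synonyms:
            current = [synonyms[t] if t == term else t for t in current]
    return current if predict(current) == target else None


# Toy classifier (assumption): "her" has a spurious negative weight.
WEIGHTS = {"good": 1.0, "lifelike": 1.0, "bad": -1.0, "her": -3.0}


def predict(tokens: List[str]) -> int:
    return 1 if sum(WEIGHTS.get(t, 0.0) for t in tokens) > 0 else 0


edited = "the film is pretty good and her performance is lifelike".split()
fixed = flip_with_synonyms(edited, target=1, predict=predict,
                           synonyms={"her": "their"}, ranked_terms=["her"])
print(predict(edited), fixed)
```

In this toy setting the antonym-edited review is still predicted negative because of the weight on "her"; replacing it with "their" flips the prediction, as in the example in the text.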
In conclusion, the final augmented dataset consists of three parts: (1) counterfactuals generated by RM-CT; (2) counterfactuals generated by REP-CT; (3) adversarial examples generated by synonym substitutions.

Ensuring Minimal Changes
When generating plausible counterfactuals, it is desirable to make minimal changes so that the resulting counterfactual is as similar as possible to the original instance (Miller, 2019; Keane and Smyth, 2020). To evaluate this for the approach described, we use MoverScore (Zhao et al., 2019), an edit-distance scoring metric originally designed for machine translation. It confirms that the MoverScore for the automatic CAD instances is marginally higher than for human-generated counterfactuals, indicating greater similarity between counterfactuals and their original instances. The MoverScore between human-generated counterfactuals and original reviews is 0.74 on average (minimum value of 0.55), and our augmented data results in a slightly higher average score than human-generated data for all models. The generated counterfactuals and synonym substitutions that achieve a MoverScore above 0.55 are combined with the original dataset for training robust classifiers.
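The thresholding step can be sketched as a simple filter. Note the substitution: the sketch does not compute MoverScore. As a clearly labeled stand-in it uses the token-level `difflib.SequenceMatcher` ratio from the Python standard library, and the names `similarity` and `filter_minimal_edits` are hypothetical. Only the 0.55 threshold comes from the text.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (original review, generated counterfactual)


def similarity(original: str, counterfactual: str) -> float:
    """Stand-in similarity in [0, 1]; NOT MoverScore, just token overlap."""
    return SequenceMatcher(None, original.split(), counterfactual.split()).ratio()


def filter_minimal_edits(pairs: List[Pair], threshold: float = 0.55,
                         score: Callable[[str, str], float] = similarity
                         ) -> List[Pair]:
    """Keep only pairs whose similarity score is above the threshold."""
    return [(o, c) for o, c in pairs if score(o, c) > threshold]


pairs = [
    ("it is badly directed badly acted and boring",
     "it is well directed well acted and entertaining"),
    ("great movie",
     "a completely unrelated sentence about cooking"),
]
kept = filter_minimal_edits(pairs)
print(len(kept))
```

Swapping in a real MoverScore implementation would only require passing a different `score` function; the filtering logic itself would stay the same.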

Datasets
Our evaluation uses three different kinds of datasets: in-domain data, challenge data, and out-of-domain data.

In-domain Data
We first adopt two of the most popular benchmark datasets, SST-2 and IMDB (Maas et al., 2011), to show the recent advances in sentiment analysis with the benefit of pre-trained models. However, we mainly focus on the robustness of various models for sentiment analysis in this work, rather than in-domain accuracy. Hence, following Wang and Culotta (2021)

Challenge Data
Based on the in-domain IMDB data, Kaushik et al. (2020) employ crowd workers not to label documents, but to revise each movie review to reverse its sentiment, without making any gratuitous changes. We directly use the human-generated counterfactuals of Kaushik et al. (2020) as our challenge data, enforcing a 50:50 class balance.

Out-of-domain Data
We also evaluate our method on different out-of-domain datasets, including Amazon reviews (Ni et al., 2019) from six genres (beauty, fashion, appliances, gift cards, magazines, and software), a Yelp review dataset, and the Semeval-2017 Twitter dataset (Rosenthal et al., 2017). These have all been sampled to provide a 50:50 label split. The size of the training data has been kept the same for all methods, and the results reported are the average from five runs to facilitate a direct comparison with baselines (Kaushik et al., 2020, 2021).
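Sampling a dataset down to a 50:50 label split, as described above, can be done by downsampling every class to the size of the smallest one. The sketch below is one simple way to do this and is not taken from the authors' code; the function name `balance_labels` and the toy reviews are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (text, label)


def balance_labels(examples: List[Example], seed: int = 0) -> List[Example]:
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label: Dict[int, List[Example]] = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    n = min(len(items) for items in by_label.values())
    balanced: List[Example] = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    rng.shuffle(balanced)  # avoid all examples of one class coming first
    return balanced


toy_reviews = [(f"positive review {i}", 1) for i in range(5)]
toy_reviews += [(f"negative review {i}", 0) for i in range(2)]
balanced_reviews = balance_labels(toy_reviews)
print(len(balanced_reviews))
```

Fixing the seed makes the sample reproducible, which matters when results are averaged over several runs.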

Results and Discussions
We first describe the performance of the current state-of-the-art methods on sentiment analysis based on the SST-2 and IMDB benchmark datasets. Next, we discuss the performance benefits of using our automatically generated counterfactuals on an in-domain test. We further compare our method, the human-label method, and two state-of-the-art style-transfer methods (Sudhakar et al., 2019; Madaan et al., 2020) in terms of model robustness on the generalization test. Finally, we provide an ablation study to discuss the influence of edit-distance on the performance benefits.
Table 2: The accuracy of various models for sentiment analysis using different datasets, including the human-generated counterfactual data and counterfactual samples generated by our pipeline. O denotes the original IMDB review dataset, CF represents the human-revised counterfactual samples, C denotes the combined dataset consisting of the original and human-revised datasets, and AC denotes the original dataset combined with automatically generated counterfactuals. C and AC contain the same number of training samples (3.4K).

State-of-the-art Models
As the human-generated counterfactuals (Kaushik et al., 2020) are sampled from Maas et al. (2011), the results in Table 1 cannot be directly compared with Table 2 (we can only get the human-generated counterfactual examples of Kaushik et al. (2020) sampled from the IMDB dataset). As shown in Table 1, comparing Bi-LSTM to Transformer-based methods shows that remarkable advances in sentiment analysis have been achieved in recent years. On SST-2, SMART-RoBERTa (Jiang et al., 2020) outperforms Bi-LSTM by 10.8% accuracy (97.5% vs. 86.7%), and a similar improvement is observed on IMDB (96.3% vs. 86.0%).
According to these results, we select the following models for our experiments, which cover a spectrum of statistical, neural, and pre-trained neural methods: SVM (Suykens and Vandewalle, 1999), Bi-LSTM (Graves and Schmidhuber, 2005), BERT-Base (Devlin et al., 2018), RoBERTa-Large, and XLNet-Large.
The SVM model for sentiment analysis is from scikit-learn and uses TF-IDF (Term Frequency-Inverse Document Frequency) scores, while the Transformer-based models are built on the Pytorch-Transformer package. We keep the prediction models the same as Kaushik et al. (2020), except for Naive Bayes, which has been abandoned due to the high-variance performance it showed in our experiments.
In the following experiments, we only care about whether the robustness of models has been improved when training on the augmented dataset (original data & CAD). Different counterfactual examples have been generated for different models in terms of their own causal terms in practice, while the hyper-parameters for different prediction models are all identified using a grid search conducted over the validation set.

Comparison with Original Data
On the Influence of Spurious Patterns. As shown in Table 2, we find that the linear model (SVM) trained on the original and challenge (human-generated counterfactuals) data can achieve 80% and 91.2% accuracy testing on the IID hold-out data, respectively. However, the accuracy of the SVM model trained on the original set when testing on the challenge data drops dramatically (91.2% vs. 51%), and vice versa (80% vs. 58.3%). Similar findings were reported by Kaushik et al. (2020), where a similar pattern was observed in the Bi-LSTM model and BERT-base model. This provides further evidence supporting the idea that the spurious association in machine learning models is harmful to the performance on the challenge set for sentiment analysis.
On the Benefits of Robust BERT. As shown in Table 3, we also test whether the sensitivity to spurious patterns has been eliminated in the robust BERT model. We notice that the correlations of the real causal associations "superb" and "poor" improve from 0.213 to 0.627 and from -0.551 to -0.999, respectively. Meanwhile, the correlation of the spurious association "film" decreases from 0.446 to 0.019 and from -0.257 to -7e-7 on the positive and negative samples, respectively. This shows that the model trained with our CAD data does provide robustness against spurious patterns.
On the Influence of Model Size. Previous works (Kaushik et al., 2021; Wang and Culotta, 2021) have not investigated the performance benefits on larger pre-trained models. We therefore conduct experiments on various Transformer-based models with different parameter sizes to explore whether larger Transformer-based models can still enjoy the performance benefits of CAD (Table 2). We observe that although test accuracy increases with parameter size (the best being 94.9% using XLNet), the performance benefits brought by human-generated CAD and auto-generated CAD decline continuously as the parameter size increases. For example, the BERT-base-uncased model trained on the auto-generated combined dataset gains 3.2% accuracy (90.6% vs. 87.4%), while accuracy increases by only 0.6% (91.8% vs. 91.2%) for WWM-BERT-Large. This suggests that larger pre-trained Transformer models may be less sensitive to spurious patterns.
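The gains quoted in this paragraph are differences in accuracy, in percentage points, between a model trained on the automatically augmented data and the same model trained on the original data. The small check below uses only the two pairs of figures quoted above; the dictionary layout and the helper name `gain` are illustrative assumptions.

```python
# (accuracy trained on original data, accuracy trained on original + auto CAD)
reported = {
    "BERT-base-uncased": (87.4, 90.6),
    "WWM-BERT-Large": (91.2, 91.8),
}


def gain(original_acc: float, augmented_acc: float) -> float:
    """Accuracy gain in percentage points, rounded to one decimal place."""
    return round(augmented_acc - original_acc, 1)


gains = {model: gain(*accs) for model, accs in reported.items()}
for model, points in gains.items():
    print(f"{model}: +{points} points")
```

The smaller gain for the larger model is the pattern the paragraph describes; two data points illustrate it but do not by themselves establish a trend.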

Comparison with Human CAD
Robustness in the In-domain Test. All of the models trained on automatic CAD (shown as AC in Table 2) outperform those trained on human-generated CAD when testing on the original data, by a margin that varies with the model (AC/O vs. C/O): SVM (+1.1%), Bi-LSTM (+0.7%), BERT-base-uncased (+2.1%), BERT-Large (+0.8%), XLNet-Large (+1.0%), and RoBERTa-Large (+0.5%). With the automatic CAD (AC), we note a distinct improvement in Table 2 over models trained on the challenge data, 11.3% on average across all models (AC/O vs. CF/O), whereas the human-generated CAD achieves an average accuracy improvement of 10.2% (C/O vs. CF/O). It is noteworthy that the human-generated CAD slightly outperforms our method when testing on the human-generated (CF) data; this may be because the training and test sets of the human-generated (CF) data are produced by the same group of labelers.
Robustness in the Generalization Test. We explore how our approach makes prediction models more robust out-of-domain in Table 4. For a direct comparison between our method and the human-generated method, we adopt the fine-tuned BERT-base model trained with the augmented dataset (original & automatically revised data). The fine-tuned model is directly tested on out-of-domain data without any adjustment. As shown in Table 4, only our method and the human-label method outperform the BERT model trained on the original data, with average accuracy improvements of 6.5% and 5.3%, respectively. Our method also offers performance benefits over three datasets even when compared to the human-label method on BERT.
Neural Method vs. Statistical Method. As shown in Table 4, the performance of the SVM model with automatic CAD is more robust than other automated methods (Sudhakar et al., 2019; Madaan et al., 2020) across all datasets. However, with the SVM model, the human-labeled CAD improves accuracy on Amazon reviews by 0.7% compared to our method. This indicates that human-generated data may lead to more performance benefits on a statistical model.

Table 5. Types of Algorithms | Examples
Hierarchical RM-CT: Remove negative limitations
Ori: Some films just simply should not be remade. This is one of them. In and of itself it is not a bad film.
Rev: Some films just simply should be remade. This is one of them. In and of itself it is a bad film.
Hierarchical REP-CT: Replacing the causal terms
Ori: It is badly directed, badly acted and boring.
Rev: It is well directed, well acted and entertaining.
Combined method
Ori: This movie is so bad, it can only be compared to the all-time worst "comedy": Police Academy 7. No laughs throughout the movie.
Rev: This movie is so good, it can only be compared to the all-time best "comedy": Police Academy 7. Laughs throughout the movie.

Comparison with Automatic Methods
Automatic CAD vs. Style-transfer Methods. As shown in Table 4, the style-transfer results are consistent with Kaushik et al. (2021). We find that the sentiment-flipped instances generated by style-transfer methods degrade the test accuracy for all models on all kinds of datasets, whereas our method achieves the best performance in all settings. This suggests that our method has a clear advantage for data augmentation in sentiment analysis when compared to state-of-the-art style-transfer models.
Our Methods vs. Implausible CAD. The authors of the only existing approach for automatically generating CAD (Wang and Culotta, 2021) report that their methods are not able to match the performance of human-generated CAD. Our methods consistently outperform human-labeled methods on both in-domain and out-of-domain tests. To provide further quantitative evidence of the influence of edit-distance in automatic CAD, we present an ablation study in Table 6. The result shows that the quality of the generated CAD, which is ignored in the previous work (Wang and Culotta, 2021), is crucial when training robust classifiers. In particular, the BERT model fine-tuned with implausible CAD (below the threshold) yields negative results comparable to those of the style-transfer samples, with performance decreasing on all datasets except Twitter.

Case Study and Limitations
The three most popular kinds of edits are shown in Table 5. These are negation word removal, sentiment word replacement, and the combination of the two. It can be observed from these examples that we ensure the edits on original samples are minimal and fluent, as was required previously of human-annotated counterfactuals (Kaushik et al., 2020). As shown in Table 5, we flipped the model's prediction by replacing the causal terms in the phrase "badly directed, badly acted and boring" to give "well directed, well acted and entertaining", or by changing "No laughs throughout the movie." to "Laughs throughout the movie" for a movie review. We also noticed that our method may struggle when handling more complex reviews. For example, the sentence "Watch this only if someone has a gun to your head ... maybe." is clearly a negative review to a human reader. However, it is hard for our algorithm to flip the sentiment of such reviews, which contain no explicit causal terms. Techniques for sarcasm and irony detection may help in dealing with this challenge.

Conclusion
We proposed a new framework to automatically generate counterfactually augmented data (CAD) for enhancing the robustness of sentiment analysis models. By combining the automatically generated CAD with the original training data, we can produce more robust classifiers. We further show that our methods can achieve better performance even when compared to models trained with human-generated counterfactuals. More importantly, our evaluation based on several datasets has demonstrated that models trained on the augmented data (original & automatic CAD) appear to be less affected by spurious patterns and generalize better to out-of-domain data. This suggests there exists a significant opportunity to explore the use of CAD in a range of tasks (e.g., natural language inference, natural language understanding, and social bias correction).

Impact Statement
Although the experiments in this paper are conducted only on the sentiment classification task, this study could be a good starting point for investigating the efficacy of automatically generated CAD for building robust systems in many NLP tasks, including Natural Language Inference (NLI), Named Entity Recognition (NER), and Question Answering (QA).