How Does Counterfactually Augmented Data Impact Models for Social Computing Constructs?

As NLP models are increasingly deployed in socially situated settings such as online abusive content detection, it is crucial to ensure that these models are robust. One way of improving model robustness is to generate counterfactually augmented data (CAD) for training models that can better learn to distinguish between core features and data artifacts. While models trained on this type of data have shown promising out-of-domain generalizability, it is still unclear what the sources of such improvements are. We investigate the benefits of CAD for social NLP models by focusing on three social computing constructs: sentiment, sexism, and hate speech. Assessing the performance of models trained with and without CAD across different types of datasets, we find that while models trained on CAD show lower in-domain performance, they generalize better out-of-domain. We unpack this apparent discrepancy using machine explanations and find that CAD reduces model reliance on spurious features. Leveraging a novel typology of CAD to analyze its relationship with model performance, we find that CAD that acts directly on the construct, or a diverse set of CAD, leads to higher performance.


Introduction
Dataset design is receiving increasing attention, especially in response to concerns about the generalizability of machine learning-based NLP models. Recent critiques argue that models trained for NLP tasks may end up "learning the dataset" rather than a particular construct (Bras et al., 2020), i.e., the intangible measure, like sentiment or stance, that is the ultimate goal of the learning task (Jacobs and Wallach, 2021). In particular, in the process of inferring the mapping between an input space and an output space, models may learn cues in the dataset which are spuriously correlated with the construct (Schlangen, 2020). For example, sentiment models trained on movie reviews tend to learn more about movies than about sentiment, thereby failing to measure sentiment as accurately in, e.g., news media (Puschmann and Powell, 2018). This potential learning of spurious cues over meaningful manifestations of the construct makes it especially difficult to foresee how even small differences in the context of deployment would affect the performance of NLP models, with undesirable consequences for their applicability at large. The issue of model robustness is all the more crucial for social computing NLP models, particularly for constructs like hate speech and sexism, which are often deployed in detecting abusive content on online platforms (Jigsaw, 2021). In such settings, there is a risk of high societal and human harm, such as sanctioning marginalized voices due to model misclassification and bias (Guynn, 2019). Even in contexts other than online governance, such as using social NLP models to detect abuse faced by certain subpopulations on a particular online space, we incur the risks and consequences of mismeasurement (Pine and Liboiron, 2015; Wagner et al., 2021).
One suggested solution to the issue of spurious features is counterfactually augmented data (CAD) - instances generated by human annotators, minimally edited to flip their label - and its variations such as iterative benchmark design (Potts et al., 2020), contrast data generation (Gardner et al., 2020), and their combination (Vidgen et al., 2020). Drawing on the rich history of counterfactuals (Pearl, 2018; Lewis, 2013; Kasirzadeh and Smart, 2021), the promise of CAD is to offer a causality-based framework where only cues that are meaningfully associated with the construct are edited, which is expected to be conducive to models learning fewer spurious features. Indeed, recent work has shown that models trained on CAD generalize better out of domain (Kaushik et al., 2020; Samory et al., 2021). Yet, it is not well understood why or how these counterfactuals are effective, especially for social NLP tasks: do they reduce dependence on spurious features, and to what extent?
This work.
We analyze how CAD affects social NLP models. Unlike previous work, we leverage multiple related social computing constructs to avoid confounds that may arise from the specific settings of a single construct. We conduct our experiments on three text classification tasks: sentiment, sexism, and hate speech identification. Sentiment has been thoroughly analyzed in past NLP robustness work, and abusive content has been widely studied in NLP (Schmidt and Wiegand, 2017; Vidgen and Derczynski, 2020; Jurgens et al., 2019; Sarwar et al., 2021). However, sexism and hate speech have not been studied in as much detail in the specific context of training on CAD. The multifaceted nature of these constructs warrants further investigation, especially in the context of developing models with fewer spurious features.
First, we ask: (RQ1) do models trained on CAD outperform models trained on original, unaltered data? We assess the overall performance of these two types of models and find that while models trained on original data outperform those trained on CAD in-domain, the opposite is true out-of-domain-models trained on CAD are more robust out-of-domain.
Next, we analyze (RQ2) the characteristics of effective counterfactuals, categorizing CAD according to their generation strategy, e.g., whether a negation was added or a gender word removed. Using this typology, we distinguish between construct-driven CAD, generated by directly acting on the construct (e.g., removing gender identity terms in sexism), versus construct-agnostic ones, generated by other strategies (e.g., negating a clause). We find that construct-driven counterfactuals are more effective than construct-agnostic ones, especially for sexism.

[1] Counterfactually augmented data and contrast sets refer to the same concept - making minimal changes to flip labels - but have different conceptual grounding: causality for CAD and modeling decision boundaries for contrast sets.
We unpack the gain in out-of-domain performance by analyzing (RQ3) whether models trained on CAD rely on fewer spurious features. Complementing prior work, which has focused on the overall performance of models trained on CAD, we use explainability techniques to understand what models have learned. We find that models trained on CAD promote core (non-spurious) features more than models not trained on CAD.
Overall contributions. Whereas previous work mainly assessed how much CAD affects model performance, we focus on why counterfactually augmented data improves performance for social computing NLP models. Our work has several implications for dataset design and data augmentation, especially with respect to the benefits of different types of CAD. We release our code and collated data, with CAD-type labels for all three constructs, to facilitate future research here: https://github.com/gesiscss/socialCAD.

Motivation
For a given text with an associated label, say a positive tweet, a counterfactual example is obtained by making minimal changes to the text in order to flip its label, i.e., into a negative tweet. Table 1 shows original-counterfactual pairs for the three types of NLP constructs studied in this paper. Counterfactual examples in text have the interesting property that, since they were generated with minimal changes, they allow one to focus on the manifestation of the construct; in our example, what makes a tweet have positive sentiment.

Task Setting
Formally, we have a model f(x) = y, where y is an application task label and x is an instance drawn either from the original dataset or from the set of counterfactual data (in which case f(x_c) = y_c, the flipped label); f is a learned feature representation. During learning, we optimise the binary cross-entropy loss l(f, x, y). There are different ways of incorporating counterfactuals; here, we simply treat them as ordinary training instances, which means any text classification model can be used for training on CAD. We learn feature representations either on fully original data (non-counterfactual, or nCF, models) or on a combination of counterfactuals and original data (counterfactual, or CF, models).
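To make the objective concrete, the binary cross-entropy loss above can be written as a short numerical sketch; this is a generic illustration, not the paper's actual training code:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy loss l(f, x, y).

    y_true: array of gold labels in {0, 1}
    y_pred: array of predicted probabilities f(x) in (0, 1)
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

Counterfactual instances enter this loss exactly like original ones, which is what allows any off-the-shelf classifier to be trained on CAD.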
We use different sampling strategies - random and stratified sampling in different proportions - to ensure the various counterfactual generation strategies are represented equally. To ensure a fair comparison between CF and nCF models, we train both types of models on equal-sized datasets: for CF models, we simply substitute a portion of the original data with CAD. We either randomly sample the CAD (RQ1, RQ3) or sample based on CAD type (RQ2). Table 2 summarizes the datasets used in this work.

In- vs. out-of-domain. We consider two types of non-synthetic datasets per construct: in-domain (ID) and out-of-domain (OOD). Models are both trained and tested on in-domain data, while out-of-domain data is fully held out for testing. For the in-domain data, we use the same train-test splits as the original work, except for sexism, where a test set is not provided, so we do a stratified 70-30 train-test split. The out-of-domain data is exclusively used for testing. The EXIST data[5] also contains Spanish data, but we restrict ourselves to English content in this work, as the in-domain data used for training is in English.

Counterfactually augmented data. All in-domain datasets we consider come with counterfactually augmented data, annotated by trained crowdworkers (Kaushik et al., 2020; Samory et al., 2021) or expert annotators (Vidgen et al., 2020).[2] Note that, since previous work has shown that models trained on CAD tend to perform well on counterfactual examples (Kaushik et al., 2020; Samory et al., 2021), we do not include counterfactual examples in any of the test sets, to prevent reporting inflated performance.
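The substitution sampling described above can be sketched as follows; the helper name and data format are hypothetical, not taken from the released code:

```python
import random

def mix_with_cad(original, cad, cad_fraction=0.5, seed=0):
    """Replace a fraction of the original training set with CAD,
    keeping the total training-set size constant (hypothetical helper
    mirroring the substitution strategy described above).

    original, cad: lists of (text, label) pairs.
    """
    rng = random.Random(seed)
    n_cad = int(len(original) * cad_fraction)
    kept = rng.sample(original, len(original) - n_cad)  # retained originals
    sampled_cad = rng.sample(cad, n_cad)                # injected CAD
    return kept + sampled_cad
```

Sampling by CAD type (RQ2) would differ only in restricting `cad` to pairs carrying a particular strategy label.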

Datasets
Following Kaushik et al. (2020), for sentiment and hate speech, the CF models are trained on 50% original and 50% CAD, while for sexism, which has CAD only for non-sexist examples, models are trained on 50% original sexist data, 25% original non-sexist data, and 25% counterfactual non-sexist data (Samory et al., 2021).[3]

Adversarial test set. To further assess model robustness, in addition to evaluating on in-domain and out-of-domain data, we generate automated adversarial examples which do not flip the label using textattack (Morris et al., 2020). These are of two types: one which replaces words with synonyms (adv_swap; Wei and Zou, 2019) and another which replaces named entities with other named entities (adv_inv; Ribeiro et al., 2020). Both are generated by perturbing the in-domain dataset. Note that, due to the nature of these perturbations, adversarial data can only be generated for a subset of the training data; e.g., if an example does not contain any named entities, then we cannot generate an adv_inv version of it.

[2] Only Samory et al. (2021) generate more than one counterfactual example per original, but to keep things consistent across all constructs, we randomly sample one counterfactual-original pair for sexism. Vidgen et al. (2020) generate different types of synthetic data, including CAD, as a part of dynamic benchmarking for collecting hate speech data; we only use the original-counterfactual pairs from their dataset.
[3] We assess the effect of the CAD proportion on model performance in Appendix 10.1.
[4] https://www.kaggle.com/c/tweet-sentiment-extraction This dataset also contains tweets with neutral labels, but in this work we restrict ourselves to positive and negative tweets only.
[5] From the EXIST 2021 shared task on sexism detection (Rodríguez-Sánchez et al., 2021), available at http://nlp.uned.es/exist2021/
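As an illustration of the label-preserving adv_swap-style perturbation (the actual experiments use textattack recipes), here is a minimal sketch with a toy synonym table standing in for a WordNet-backed resource:

```python
# Toy illustration of an adv_swap-style, label-preserving perturbation.
# The synonym table is a hypothetical stand-in for a WordNet-backed
# resource; real adversarial data is generated with textattack.
SYNONYMS = {
    "movie": "film",
    "great": "excellent",
    "bad": "awful",
}

def adv_swap(text):
    """Lowercase the text and replace known words with synonyms,
    leaving the gold label unchanged."""
    tokens = text.lower().split()
    swapped = [SYNONYMS.get(tok, tok) for tok in tokens]
    return " ".join(swapped)
```

A text containing none of the table's words is returned unchanged, which mirrors why adversarial versions exist only for a subset of the data.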

Text Classification Methods.
We use two different text classification models: logistic regression (LR) and fine-tuned BERT (Devlin et al., 2019). We do so because we want to contrast a basic model trained from scratch, which only learns simple features directly observed in the dataset (LR), with one which encodes a combination of background knowledge and application-dataset knowledge and is capable of learning complex inter-dependencies between features (BERT). We train LR with TF-IDF bag-of-words feature representations using sklearn (Pedregosa et al., 2011), while the BERT base model is fine-tuned in conjunction with its subword tokenizer using HuggingFace Transformers (Wolf et al., 2020).
Each model is trained using 5-fold cross-validation, and we use grid search for hyperparameter tuning. We conduct 5 runs for all models to reduce variance. We report the hyperparameters of all our models and their bounds in Appendix 9.2.
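A minimal sketch of this setup for the LR baseline, assuming sklearn; the hyperparameter grid shown is illustrative, not the paper's actual search space (see its Appendix 9.2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# TF-IDF bag-of-words features feeding a logistic regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; tuned with 5-fold cross-validated grid search.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
```

Calling `search.fit(texts, labels)` then selects the best `C` by macro F1 and refits on the full training data.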

Experiments
We start by assessing overall performance on different types of data (RQ1); we then introduce a typology of CAD to understand whether certain strategies for generating CAD are better for model performance (RQ2); and we end by using explanations to understand which features the CF models promote (RQ3). Unless specified otherwise, we report results for BERT, and include the results for LR in the appendix for completeness (Appendix 10). We measure performance using macro F1 and positive-class F1; the latter metric is particularly informative for constructs like sexism and hate speech, where the positive class is the one of interest. Table 4 shows results with BERT on adversarial data. Recall that since we can only generate adversarial examples for a subset of the original data, we also include results on the original data for fair comparison. Results for LR models follow a similar trend and are included in Appendix 10.
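The two evaluation metrics can be computed with sklearn as follows; this is a generic sketch, not the paper's evaluation code:

```python
from sklearn.metrics import f1_score

def report_scores(y_true, y_pred):
    """Return (macro F1, positive-class F1).

    Positive-class F1 matters most for constructs like sexism and hate
    speech, where the positive (abusive) class is the one of interest.
    """
    return (
        f1_score(y_true, y_pred, average="macro"),
        f1_score(y_true, y_pred, pos_label=1, average="binary"),
    )
```

Macro F1 averages the per-class F1 scores, so it weights the (often rarer) positive class equally with the negative one.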

Results
The overall results indicate that counterfactual models outperform non-counterfactual models on out-of-domain data, while results are mixed for in-domain data. There are several possible explanations for this: on one hand, the lower performance on the in-domain data could be due to the prevalence of spurious or domain-specific features in the nCF models as opposed to the CF models. On the other hand, CF models tend to learn fewer domain-specific features and more 'general' features, which leads to performance gains in the other domains in which the construct manifests (as we explore in RQ3).
As for adversarial data, CF models appear to perform worse on it than their nCF counterparts in absolute terms. Note, though, that the adversarial data is automatically generated from the in-domain data, which gives nCF models an advantage: nCF models might be picking up artifacts in the in-domain data that are also present in the adversarial examples (see Section 3.1). On the other hand, we do not find CF models' performance degrading on adversarial data any more than nCF models', and in certain cases CF models have smaller gaps between original and adversarial performance than their nCF counterparts.

Whereas the previous analyses assess whether CF models are more robust, we now turn to the question of whether all CAD is equally effective in improving classifier performance. Armed with a minimal set of instructions, annotators use several different strategies for generating CAD. Are some better than others? We aim to answer this question by categorizing counterfactuals based on the strategy used to generate them. Then, to understand the 'power' of different types of CAD, we assess the overall performance of models trained on each type.
A Typology of Counterfactuals. Previous work has manually assessed samples of counterfactuals to understand the strategies used to generate them, such as introducing negation or distancing the speaker (Kaushik et al., 2020; Vidgen et al., 2020). Yet, to the best of our knowledge, there is no categorization of an entire dataset of counterfactuals. Inspired by causal inference, particularly the notion of direct and indirect mediation (Pearl, 2014; Frölich and Huber, 2014), we describe two distinct types of counterfactual data generation: construct-driven and construct-agnostic. Construct-driven CAD are generated by directly acting on the construct, e.g., replacing the gender word in sexism, or altering the affect-laden word in sentiment. On the other hand, construct-agnostic CAD are generated by indirectly acting on the construct, through general-purpose strategies such as introducing sarcasm or negation, which yield CAD for several constructs (see Table 5). Since construct-driven CAD directly act on the construct, we hypothesize that construct-driven strategies are more effective.
To determine which instances represent which modification strategy, we use a simple lexicon-based automatic annotation approach. Based on strategies manually assessed in previous literature (Kaushik et al., 2020; Vidgen et al., 2019), we devise five specific strategies: affect, gender, identity, hedges, and negation. The first three are construct-driven strategies for sentiment, sexism, and hate speech, respectively, while the last two are construct-agnostic.[4] We use a set of lexica for discerning each strategy: a lexicon of positive and negative words for affect (Hu and Liu, 2004), a list of gender words, and a list of identity-based hateful terms and slurs (Silva et al., 2016). For negation, we use the list compiled by Ribeiro et al. (2020), and for hedges, we use Islam et al. (2020). Table 5 enumerates the different types of CAD. We consider any counterfactual that does not fall under the construct-driven category to be construct-agnostic; e.g., 21% of the CAD for sexism is construct-agnostic (as 79% is construct-driven).
To determine whether a CAD sample is construct-driven or construct-agnostic, we first compute the difference between the original datapoint and its counterfactually augmented counterpart and retrieve the additions and deletions from that difference. We then check whether the additions or deletions contain any of the words in the strategy-associated lexicon.[4] Note that a single counterfactual example can span multiple strategies; e.g., the tweet "It was horrible, I could not watch it", with the counterfactual "It was excellent, I could watch it many times", reflects a change in both affect and negation. We sample 100 random original-counterfactual pairs across all constructs to validate our automatic categorization and find that we correctly label the generation strategy in 89 cases. Errors include misspellings of slurs, or creative distancing strategies like "[identity] stink" changed to "awful graffitti I saw today: '[identity] stink' ".

[4] A construct-driven strategy for one construct can be construct-agnostic for another; e.g., changing affect words is a construct-agnostic strategy for sexism and hate speech.

Figure 1: Performance (macro F1) of BERT models trained on different types of CAD over different injection proportions on the out-of-domain data. nCF model performance is included as a reference. Construct-driven CAD performs well especially for sexism, while for hate speech, diverse CAD is better.
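The diff-based strategy labeling above can be sketched as follows, with hypothetical mini-lexica standing in for the cited resources (e.g., the Hu and Liu affect lexicon):

```python
import difflib

# Hypothetical mini-lexica; the actual analysis uses the full cited
# resources (affect, gender, identity, hedges, negation lists).
LEXICA = {
    "affect": {"horrible", "excellent", "good", "bad"},
    "negation": {"not", "never", "no"},
}

def edit_strategies(original, counterfactual):
    """Label a CAD pair with generation strategies by diffing the pair
    and checking added/deleted tokens against each lexicon."""
    diff = difflib.ndiff(original.lower().split(), counterfactual.lower().split())
    # '+ ' marks additions and '- ' marks deletions in ndiff output.
    changed = {tok[2:] for tok in diff if tok.startswith(("+ ", "- "))}
    return {name for name, lexicon in LEXICA.items() if changed & lexicon}
```

A single pair can (and often does) receive multiple strategy labels, as in the affect-plus-negation example above.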

Models trained on different types of CAD.
We train models on the different types of counterfactuals (see Table 5). Specifically, we train three types of models: (a) models trained on just construct-driven counterfactuals (CF_c); (b) models trained on just construct-agnostic counterfactuals (CF_r); and (c) models trained on equal proportions of both (CF_a).[8] We measure the macro F1 of each of these model types on the out-of-domain data. Since we have almost negligible construct-agnostic CAD for sentiment, we conduct the analyses for RQ2 on sexism and hate speech only.[9] Furthermore, because some CAD types make up less than 50% of the data, instead of a 50% injection we vary the proportion between 10% and 20%.

[8] We train the last type with equal proportions, instead of a random set of CAD like the CF models in RQ1 and RQ3, since construct-driven CAD makes up the majority for sexism.
[9] One reason for the low proportion of construct-agnostic CAD in sentiment is the nature of the in-domain data: while for sexism and hate speech the in-domain data consists of tweets or short single-sentence utterances, the data for sentiment comes from movie reviews, which are much longer and have multiple edits made throughout. It is natural to find reviews which have a negation injected while also having an affect word changed.

Results
We show the macro F1 of these three types of models on out-of-domain data over different CAD proportions in Figure 1. We obtain mixed results for RQ2. First, we see that performance increases with the CAD proportion, except for hate speech at 20% (complementing our analysis in Appendix 10.1). Our results indicate that models trained on construct-driven CAD (CF_c) are more effective than other types for sexism, especially at higher injection proportions. On the other hand, for hate speech, CF_a, the diverse set of counterfactuals, is better. Models trained on construct-agnostic CAD (CF_r) have mixed efficacy.

RQ3: Do models trained on CAD rely on fewer artifacts?
While the overall performance gains can help us understand the improvements brought by counterfactual data, we still do not know how or why these gains come about. To that end, we apply explainability techniques to shed light on the models' inner workings and pinpoint what changes were brought about by the counterfactual data. While explainability for transformer models like BERT is an active area of research, explanation methods for them usually operate at the level of individual predictions (local explanations). In this work, as we wish to assess how CAD holistically impacts social NLP models, we are primarily interested in model understanding over prediction understanding. Therefore, we need a way to aggregate local explanations into global features, a non-trivial task (van der Linden et al., 2019). Furthermore, explanations generated in an unsupervised way are not always faithful (Atanasova et al., 2020), and BERT does not learn weights for words but for subwords, making it difficult to find the importance of words. Therefore, as we cannot ascertain the reliability of BERT-generated global features, and since LR and BERT models show similar trends in overall performance, for this analysis we use the built-in feature weights of the LR models to compute the top-k global important features for CF and nCF models. We experiment with BERT explanations and include the results in Appendix 14, but we leave a detailed analysis of aggregation strategies for local BERT explanations to future work.
Quantitative analysis. Since the goal of training on CAD is to reduce reliance on spurious features, we hypothesize that CF models have higher proportions of core (non-spurious) features in their feature ranking. 'Core' features are those that are consequential manifestations of the construct (e.g., the word 'happy' for sentiment), while spurious features are those that happen to be correlated with the construct in a particular dataset while not being truly indicative of it ('movie' for sentiment). Core features of a particular construct therefore span multiple domains or datasets of that construct. Besides manually inspecting the top-20 global features, we also quantitatively assess the presence of spurious features in the global feature importances, i.e., we check the proportion of core features in the top feature rankings.
Identifying core features. To answer RQ3, we need a source of core features, i.e., words associated with each of our constructs. We use two sources: (a) lexica and (b) pivot words. For the first, we use the same lexica used to identify the construct-driven modification strategies in RQ2, i.e., affect words for sentiment, gender words for sexism, and identity-based hate words for hate speech. Note that, while for sentiment we have a list of core features for both classes, for sexism and hate speech we only have core features for the positive (sexist and hateful) class, and not for the non-sexist and non-hateful cases. For the second source, we turn to the literature on domain adaptation, particularly work on pivot words (Blitzer et al., 2007). Concretely, for a given construct, we find words that are highly frequent in both domains; we then compute their correlation with the out-of-domain dataset labels to reduce the inclusion of in-domain artifacts. We rank these words by mutual information and use the first 100 as the set of core features. The list of pivot words is in Appendix 11.

Results
We manually inspect the top 20 features ranked most important by each model.
The non-counterfactual models tend to learn more domain-specific features, such as 'script' (sentiment), 'football' (sexism), and 'wrong' (hate speech), which prevents them from generalizing to other domains. The counterfactual models show fewer spurious features among their most important features, instead having more affect words (sentiment), gendered words (sexism), and identity-based slurs (hate speech). The top-20 features are listed in Appendix 12.
To scale this analysis, we use lexica and pivot words as proxies for core, i.e., non-spurious, features. We plot the proportion of core features in the top positive feature ranking. Figure 2 shows that LR CF models rank core features more highly, especially based on the core-feature list from lexica; this is strongly evident for sentiment, and present to a lesser degree for sexism and hate speech. Our analysis thus indicates that training on CAD reduces reliance on spurious features while promoting core features. In contrast to lexica words, for pivot words the gap between CF and nCF models is much smaller for sentiment and sexism, while for hate speech the nCF models tend to have a higher proportion of core pivot-word features after a certain k. We include the results for the proportion of negative features in Appendix 13.
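The quantity plotted above, the proportion of core features among a model's top-k ranked features, reduces to a one-line computation; a minimal sketch with hypothetical inputs:

```python
def core_fraction(ranked_features, core_set, k):
    """Fraction of a model's top-k ranked features that appear in the
    core-feature set (a lexicon or the pivot-word list).

    ranked_features: features sorted by descending importance.
    """
    top = ranked_features[:k]
    return sum(1 for feature in top if feature in core_set) / len(top)
```

Sweeping k and plotting `core_fraction` for CF versus nCF models yields curves of the kind shown in Figure 2.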

Related Work
Our work connects the area of learning with counterfactuals to improve NLP models' robustness with the area of social NLP.
Counterfactuals in NLP. Counterfactuals in NLP have been used for model testing and explanation, but in this work we are interested in using them for training models. Counterfactuals can be used to augment training data; previous research, focused on sentiment and NLI, has shown that models trained on such augmented data are more robust to data artifacts (Kaushik et al., 2020; Teney et al., 2020). Counterfactuals need not always be label-flipping, but they usually entail making minimal changes to the original data, and can be generated either manually or automatically (Nie et al., 2020). Recent work has also addressed automatic CAD generation through lexical or paraphrase changes (Garg et al., 2019; Iyyer et al., 2018), templates (Nie et al., 2020), and controlled text generation (Wu et al., 2021; Madaan et al., 2021). Concurrent with and closely related to our work, Joshi and He assess the efficacy of CAD for Natural Language Inference and Question Answering, and find that diverse CAD is crucial for improving generalizability, in line with our findings. On the other hand, CAD generated by human annotators has not been analyzed in detail to see which strategies are used for generating counterfactuals, nor which strategies are more effective, particularly for social computing NLP tasks.
In this work, we focus on human generated, label-flipping counterfactuals for relatively understudied constructs in this domain -sexism and hate speech, while more importantly focusing on how CAD impacts models. Inspired by causal mediation (Pearl, 2014), we put forth a typology of construct-driven and construct-agnostic CAD. Complementing previous research on overall performance, we take a deeper dive into which features CAD promotes, and which types are effective.
Social Computing and Online Abuse Detection. Even though sentiment, sexism, and hate speech can all be considered social computing tasks, the latter two, and generally NLP tasks related to abuse detection (Schmidt and Wiegand, 2017; Jurgens et al., 2019; Nakov et al., 2021; Vidgen and Derczynski, 2020; Sarwar et al., 2021), differ from tasks like sentiment and NLI because of their subjective nature and the relatively higher risk of social harms incurred by deploying spurious and non-robust models for decision making. Previous work has shed light on several dimensions of hate speech data that prevent generalisation, such as imprecise construct specification (Samory et al., 2021), biased data collection (Ousidhoum et al., 2020), and annotation artifacts (Waseem, 2016). Several solutions have been proposed for these issues, such as adversarial data generation (Dinan et al., 2019), dynamic benchmarking (Kiela et al., 2021), and debiasing techniques (Nozza et al., 2019).
Building on these threads of research, we aim to understand the benefits of different types and proportions of CAD in training social NLP models.
Discussion: Designing Counterfactually Augmented Data

NLP models are now embedded in many real-world applications, and understanding their limits and robustness is of the utmost importance, especially for social computing applications. In this work, using a detailed and systematic set of analyses, we establish convergent validity of the use of counterfactually augmented data for improving the reliability of datasets, particularly for learning social constructs like sentiment, sexism, and hate speech. Through extensive testing on different types of data, including adversarial data, we corroborate and strengthen previous findings that training on CAD leads to robust models (Kaushik et al., 2020; Samory et al., 2021). While it is promising that CF models do not fall prey to adversarial perturbations any more than their nCF counterparts, the disparity in out-of-domain performance and the lack thereof on adversarial examples might indicate that adversarial examples are not strong testbeds for detecting model robustness on out-of-domain data.
Having established this, we assessed whether all CAD are equally effective. Using a fine-grained categorization of counterfactual generation strategies, we find that this is not the case: for sexism, examples generated by directly acting on the construct are more effective in improving overall performance. Our results indicate that different strategies have different strengths, and model designers can prioritize certain strategies over others based on their needs. Finally, using explainability techniques, we establish that models trained on CAD tend to rely on and promote core features over spurious ones, as measured with lexica and domain-agnostic pivot words.
Limitations. The main limitation of our paper is that we rely on lexica and automated methods for several prongs of our analyses -for detecting core or non-spurious features and for classifying the different types of counterfactuals. Although manual vetting of both reveals that the results are sound, we caution against using them outside of this particular context. As we are limited in our computational resources, we further did not compare different explanation generation methods.
The second limitation of our work is using explanations from a bag-of-words LR model, which is motivated by two factors. First, since we want to understand how counterfactuals affect ML models holistically, we require precise and faithful global explanations, making the feature importances from LR an ideal choice. Second, explanation methods are an active area of research for Transformer models, and aggregating local explanations to global ones remains challenging (van der Linden et al., 2019). As we could not guarantee that the aggregated BERT explanations would reflect the model's internal decision-making mechanism, we default to the LR models for this particular analysis.
Future Work. We used lexica to detect types of counterfactuals; however, lexica have several drawbacks, such as limited recall. A more sophisticated and accurate supervised classification approach could be a step forward. On the other hand, such an approach would have to grapple with the complexities of the task of identifying counterfactual types, since the input is a paired original-counterfactual example rather than a single document. Furthermore, a labeled dataset of sufficient size and careful feature engineering would be needed, which could be tackled in future work.
The use of counterfactuals for training data augmentation is fairly recent, beginning with the work of Kaushik et al. in 2019, and even more so for social computing constructs. Therefore, there are several open questions about their properties as training data, including the notion of minimality of a counterfactual, i.e., what constitutes a minimal edit in generating CAD, whether measured through quantitative means such as lexical distance, qualitative approaches, or a combination of the two. Recent work has also attempted to automatically generate CAD (Wu et al., 2021; Madaan et al., 2021). However, comparing automated and human-generated counterfactuals as training data remains an open question, and the analysis conducted in our work could be reused for this comparison.
Finally, the measurement of all three constructs in this work was modeled as a binary classification task. Indeed, the counterfactual generation framework implicitly assumes binary labels, as the approach asks annotators to flip the label. Nevertheless, social constructs are multifaceted and could be modeled as multiclass (or even multilabel) classification tasks. Future work could extend the current binary setup of counterfactual generation to accommodate multiclass classification, for example through a one-vs-rest approach.

Conclusion
We take a deeper dive into the utility of training on counterfactually augmented data (CAD) for improving the robustness of social NLP models. For three text classification constructs — sentiment, sexism, and hate speech — we train LR and BERT models with and without counterfactual data. For the counterfactual models, we experiment with different sampling strategies to understand how different types of CAD affect model performance. First, we corroborate previous findings on using counterfactual data, showing that models trained on CAD have higher out-of-domain performance. Our work's core novelty is that we study different strategies for CAD generation, finding that examples generated by acting on the construct are effective for sexism, while a diverse set is better for hate speech. Finally, we show that models trained on CAD promote core (non-spurious) features over spurious ones. Taken together, our analysis serves as a blueprint for assessing the potential of CAD, and our findings can help dataset and model designers create better CAD for social NLP tasks.

Acknowledgments
We thank members of the CopeNLU group and the anonymous reviewers for their constructive feedback. Isabelle Augenstein's research is partially funded by a DFF Sapere Aude research leader grant.

Ethical Considerations
In this work, we attempt to understand the connection between training on counterfactually augmented data and increased model robustness. Our work centers on social NLP constructs like sexism and hate speech, whose manifestations in data can be harmful and potentially traumatizing to researchers. Furthermore, the sensitive nature of this data has the potential of victimizing or revictimizing the people referred to in it. Therefore, in accordance with ethical guidelines (Vitak et al., 2016; Zimmer and Kinder-Kurlanda, 2017; Vidgen and Derczynski, 2020), we conduct our analyses on aggregate data only and do not infer any attributes of the speakers in the data. We release a dataset which contains only the IDs of the original data and the typology labels we annotate.
Following common practice in NLP, we use a gendered lexicon that contains gendered words based only on the gender binary. We acknowledge that this practice is exclusionary towards non-binary individuals. We alleviate this to a certain extent by having a broader and more detailed list of identity terms, which also contains hateful terms and slurs directed towards non-binary people. In future work, we hope to adopt a more intersectional perspective which is more inclusive of the sexism faced by trans and non-binary people (Serano). We use lexica to determine core features of sexism and hate speech, but we acknowledge that both may manifest in context-dependent ways and that there is no single objective determinant of hate speech or sexism (or even sentiment). Furthermore, promoting features like identity terms can increase the risk of misclassifying non-hate content containing such terms, such as disclosures or reports of facing hate speech, leading to unintended bias (Blodgett et al., 2020).
We do not undertake any further data generation or data annotation by human subjects, as we use data made available by previous researchers and use lexica for annotating counterfactual types. Nonetheless, as we show the potential of CAD in improving some aspects of model robustness, we hope that the community will adopt annotation guidelines that factor in the risk of harm that annotators and CAD designers working on abusive language might face (Vidgen and Derczynski, 2020).
We aim to understand how CAD improves model robustness, but we acknowledge and caution that these types of data augmentation can also be used to poison NLP models and cause them to have several harmful properties (Wallace et al., 2020;Sun et al., 2018).

Appendix
Here is the appendix for our paper, "How Does Counterfactually Augmented Data Impact Models for Social Computing Constructs?". The appendix contains details for facilitating reproducibility, the LR results supplementing the BERT results in the paper, the entire list of pivot words, the global top-20 features, the results for negative features in RQ3, and the BERT explanations. Caution: the appendix contains examples of terminology found to be discriminative of hate speech and sexism, which is therefore of an offensive nature.

Reproducibility

Compute Infrastructure

All models were trained or finetuned on a 40-core Intel(R) Xeon(R) CPU E5-2690 (without GPU).
We use grid search to determine hyperparameters, with macro F1 as the selection metric. Run times and hyperparameter configurations for the best performance of all CF models (with randomly sampled 50% data) and nCF models (RQ1) are included in Table 6. The hyperparameters and run times for the CF models trained on different types of CAD (RQ2) are in Table 7.

Metrics
The evaluation metrics used in this paper are macro-averaged F1 and positive-class precision for RQ1 and RQ2. We used the sklearn implementation of these metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html. For RQ3, we compute the fraction of core features in a feature list based on its intersection with the lexica and the pivot words (included in the appendix). The code for computing this metric is included in our code (uploaded with the submission).
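The two metrics can be sketched as follows, with illustrative labels and toy lexica (not our actual data or feature lists):

```python
# Sketch of the evaluation metrics described above, on illustrative labels.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Macro-averaged F1 (RQ1, RQ2): F1 computed per class, then averaged.
_, _, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")

# Positive-class precision (RQ1, RQ2).
prec, _, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)

# RQ3 metric: fraction of a model's top features that are "core",
# i.e. appear in the lexica or pivot-word list (toy sets here).
top_features = ["women", "movie", "hate", "imdb", "sexist"]
core_lexicon = {"women", "hate", "sexist"}
core_fraction = len(set(top_features) & core_lexicon) / len(top_features)

print(macro_f1, prec, core_fraction)  # 0.666... 0.666... 0.6
```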

Model Parameters
Model parameters are included in Table 8.

LR Results
Here we show the results for the LR models. While the BERT models have much higher performance than LR, both families of models show similar trends, indicating that CAD is beneficial across model families. We show the results for LR on adversarial examples in Table 9. We also experiment with different proportions of CAD and measure their effect on performance in Figure 3. Finally, we include the performance of the LR models trained on different types of CAD in Figure 4.

Injection Analysis.
In the main paper, we replaced half of the original data with CAD (25% for sexism) and saw that it improves out-of-domain performance. But is there a limit to CAD's benefits? We investigate what amount of counterfactually augmented data is effective by assessing how different proportions of injected counterfactual examples affect overall performance in Figure 3. While substituting original training data with counterfactually augmented data leads to reduced performance in-domain, where the decrease is proportional to the amount of counterfactually augmented data, the trends are dissimilar out-of-domain. Models trained on counterfactually augmented data perform better out-of-domain, but only up to a certain point, after which performance begins degrading, potentially due to the models learning CAD-specific cues; these limits differ across constructs. Our analysis implies that while injecting counterfactually augmented data can indeed be effective for out-of-domain data, using an equal proportion of counterfactual and normal data achieves the best performance.

Figure 4: Performance (macro F1) of LR models trained on different types of counterfactually augmented data over different injection proportions on the out-of-domain data. Construct-driven CAD performs well, especially for sexism (as for the BERT models), while for hate speech there is more variance.
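The substitution setup behind the injection analysis can be sketched as below. The function and data names are hypothetical placeholders, not our released code; the point is only that CAD replaces, rather than adds to, a fraction of the original training set, keeping the total size fixed.

```python
import random

def inject_cad(original, counterfactual, proportion, seed=0):
    """Replace `proportion` of the original training set with the same
    number of counterfactual examples, keeping the total size constant."""
    rng = random.Random(seed)
    n_swap = int(len(original) * proportion)
    kept = rng.sample(original, len(original) - n_swap)
    injected = rng.sample(counterfactual, n_swap)
    return kept + injected

# Placeholder documents standing in for the original and CAD training sets.
orig = [f"orig_{i}" for i in range(100)]
cad = [f"cad_{i}" for i in range(100)]

# 50% injection: the equal-proportion setting found most effective above.
train = inject_cad(orig, cad, proportion=0.5)
print(sum(x.startswith("cad_") for x in train), "of", len(train), "are CAD")
```

Sweeping `proportion` over a grid (e.g. 0.0 to 1.0) reproduces the kind of injection curve shown in Figure 3.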
Pivot Words

Top 20 Words by Construct for CF and nCF Models

LR Negative Class Features
Complementing Figure 2 in the main paper, we plot the proportion of core features among the most important negative features of the LR CF and nCF models in Figure 5. This analysis demonstrates an interesting distinction between sentiment and the other two constructs, also seen in the top-20 global feature importances. Since it is difficult to envision negative features for constructs like sexism and hate speech, there is very little difference in the rankings of global features for the negative class between the two types of models, as opposed to sentiment, where there is a clear difference between CF and nCF models.

BERT Explanations
As we state in the main paper, we use LR feature weights to understand whether CF models tend to rely on less spurious features. The reason for using LR is the purported unreliability of Transformer-based models' explanations (Jain and Wallace, 2019). As an exploratory step, we complement the LR explanations with explanations for BERT, using Integrated Gradients (Sundararajan et al., 2017), where input importance is measured using the gradients computed with respect to the inputs. Previous research has found that gradient-based methods outperform perturbation- or model simplification-based approaches. As we are interested in model understanding rather than prediction understanding, we convert local explanations for BERT into a global feature ranking by aggregating the weights for every token across local explanations. While the trends are similar for sexism and sentiment, the disparity between CF and nCF models is much smaller compared to the LR results; we caution against making concrete inferences from these results due to the potential unreliability of global BERT explanations.

Table 12: We enumerate the top 20 global feature importances for hate speech detection. Spurious features are marked in red. We find that the counterfactual models learn fewer spurious or in-domain-specific features. Note that we only mark the spurious positive features, because it is difficult to ascertain spurious features for the negative class.
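The local-to-global aggregation step described above can be sketched as follows. The attribution scores here are toy values standing in for per-token Integrated Gradients output (which in practice would come from an attribution library such as Captum); the aggregation itself is simply summing each token's weight across the documents it appears in.

```python
# Sketch: aggregating local token attributions (e.g. from Integrated
# Gradients) into a global feature ranking. Attribution values are toy
# numbers, not real model output.
from collections import defaultdict

# One (tokens, attributions) pair per explained document.
local_explanations = [
    (["the", "film", "was", "great"],    [0.0, 0.1, 0.0, 0.8]),
    (["a",   "great", "cast"],           [0.0, 0.7, 0.2]),
    (["the", "plot", "was", "terrible"], [0.0, 0.1, 0.0, -0.9]),
]

totals = defaultdict(float)
for tokens, attrs in local_explanations:
    for tok, a in zip(tokens, attrs):
        totals[tok] += a  # sum each token's weight across documents

global_ranking = sorted(totals.items(), key=lambda t: t[1], reverse=True)
print(global_ranking[0])  # ('great', 1.5)
```

As the caveat above notes, the resulting ranking inherits any unreliability of the underlying local explanations, which is why we treat it as exploratory.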