NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation

While counterfactual data augmentation offers a promising step towards robust generalization in natural language processing, producing a set of counterfactuals that offer valuable inductive bias for models remains a challenge. Most existing approaches for producing counterfactuals, manual or automated, rely on small perturbations via minimal edits, resulting in simplistic changes. We introduce NeuroCounterfactuals, designed as loose counterfactuals, allowing for larger edits which result in naturalistic generations containing linguistic diversity, while still bearing similarity to the original document. Our novel generative approach bridges the benefits of constrained decoding with those of language model adaptation for sentiment steering. Training data augmentation with our generations yields both in-domain and out-of-domain improvements for sentiment classification, outperforming even manually curated counterfactuals under select settings. We further present detailed analyses to show the advantages of NeuroCounterfactuals over approaches involving simple, minimal edits.


Introduction
Despite the enormous successes in natural language processing, out-of-domain (OOD) generalization still poses a challenge for even the most powerful models, which achieve remarkable performance in-domain (Recht et al., 2019; Torralba and Efros, 2011). This can be attributed to the models' reliance on spurious biases (Geirhos et al., 2020; McCoy et al., 2019; Gururangan et al., 2018), i.e., features which co-occur with the ground truth without any causal dependence (Simon, 1954). Adopting methods from causal inference (Pearl, 2009; Feder et al., 2022), training data augmentation with counterfactuals (CFs) has been proposed for NLP as one potential solution (Levesque et al., 2012; Kaushik et al., 2019). Counterfactuals are alternate versions of an input which receive a different (target) label, following an intervention (e.g., altering a causal feature), typically in the form of edits to the input text (Khashabi et al., 2020; Andreas, 2020). Training data augmentation with counterfactuals can thus provide strong inductive biases to help with robustness against spurious biases, resulting in improved OOD generalization (Vig et al., 2020; Eisenstein, 2022).

Figure 1: Illustration of our approach. (1) We extract tokens from an Original (negative) movie review that evoke concepts from ConceptNet (§2.1). (2) We use a GPT-2 model adapted to only reviews with the opposite (positive) polarity as a sentiment steer (§2.2). (3) Finally, to ensure that the generation is similar to the original, we use NeuroLogic, a constrained decoding approach (§2.3; Lu et al., 2021), where the constraints are the tokens extracted in (1). This results in NeuroCounterfactuals, which are loose counterfactuals of the original, but are more naturalistic (§4; Tab. 1) compared to minimal edit counterfactuals (bottom). (4) When used to augment training data for sentiment classification, our generations are valuable for OOD generalization (§3).
However, designing the appropriate interventions to produce counterfactuals can be challenging. Indeed, most counterfactuals are produced via basic edits to the input text, either manually (Gardner et al., 2020; Kaushik et al., 2019) or automatically (Yang et al., 2021; Wang and Culotta, 2021; Wu et al., 2021), such that the target label changes. These minimal edits are made via substitution, insertion, or deletion of tokens in the original sentence, resulting in simplistic generations which are often unrealistic and lack linguistic diversity.1 As a result, counterfactuals via minimal edits often fail to provide adequate inductive biases to promote robustness (Khashabi et al., 2020; Huang et al., 2020; Joshi and He, 2022).
In this paper, we investigate the potential of more realistic and creative counterfactuals, which go beyond simple token-level edits, towards improving robust generalization. While allowing larger edits reduces proximity to the original sentence, we believe this is a worthwhile trade-off for more realistic and creative counterfactuals, which offer greater flexibility in sentiment steering, increasing the likelihood that the counterfactual possesses the desired label. We propose a novel approach that can generate diverse counterfactuals via concept-controlled text generation, illustrated in Figure 1. In particular, our approach combines the benefits of domain adaptive pretraining (Gururangan et al., 2020) for soft steering of the target label (Liu et al., 2021) with those of NeuroLogic decoding (Lu et al., 2021), an unsupervised, inference-time algorithm that generates fluent text while strictly satisfying complex lexical constraints. As constraints, we use tokens that evoke salient concepts derived from ConceptNet (Speer et al., 2017). Our resulting generations, called NeuroCounterfactuals, provide loose counterfactuals to the original, while demonstrating nuanced linguistic alterations to change the target label (§2).
Compared to minimal-edit counterfactuals, our counterfactuals are more natural and linguistically diverse, resulting in syntactic, semantic, and pragmatic changes which alter the label while preserving relevance to the original concepts (Table 1). In experiments with training data augmentation for sentiment classification, our approach achieves better performance than competitive baselines using minimal edit counterfactuals (§3). Our performance even matches baselines using human-annotated counterfactuals in some settings, while avoiding the cost of human annotation.

1 For instance, the minimal edit counterfactual in Figure 1 contains the phrase "loose collection of intelligible analogies", a somewhat unnatural construction for a positive movie review.

NeuroCounterfactuals
We describe our methodology for the automatic generation of loose counterfactuals, NeuroCFs, for sentiment classification. The key idea underlying our approach is the need to retain concepts to ensure content similarity to the original text, while steering the sentiment to the opposite polarity. Our method, illustrated in Figure 1, combines a concept-constrained decoding strategy with a sentiment-steered language model. First, we detail our approach for extracting the salient concepts from a document (§2.1). Next, we discuss language model adaptation to produce sentiment-steered LMs (§2.2). Finally, we provide an overview of the NeuroLogic decoding algorithm for controlled text generation, and how it can be adapted to the task of generating sentiment counterfactuals (§2.3).

Extracting Salient Concepts
Our first step is to extract concepts from the original document which, when used as constraints during decoding (§2.3), can be used to reconstruct its content. Specifically, we aim to identify a set of constraints which require the counterfactual to be similar in content to the original sentence while still allowing the generation to be steered towards the opposite polarity. Using extracted concepts as constraints achieves this because the concepts consist of content-bearing noun phrases, as opposed to sentiment-bearing adjectives. For example, for the original sentence in Figure 1, we seek to constrain our generated counterfactual to contain concept-oriented phrases such as "movie", "analogy", and "plot devices", without explicitly requiring the presence of other tokens which may indicate the sentiment (e.g., "unintelligible", "ill-conceived").
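To make the constraint-selection step concrete, the following toy sketch matches n-grams in a review against a small hypothetical concept lexicon. The actual system links spans to ConceptNet nodes with COCO-EX rather than using a fixed word list, so both the lexicon and the matching heuristic here are illustrative assumptions.

```python
# Toy sketch of salient-concept extraction. The paper uses COCO-EX to link
# spans to ConceptNet nodes; here we match against a tiny made-up lexicon.

CONCEPT_LEXICON = {"movie", "analogy", "analogies", "plot devices", "collection"}

def extract_concepts(text, lexicon=CONCEPT_LEXICON, max_ngram=2):
    """Return lexicon entries found in `text`, scanning longer n-grams first."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    found = []
    for n in range(max_ngram, 0, -1):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            if span in lexicon and span not in found:
                found.append(span)
    return found

review = ("The movie is a loose collection of unintelligible analogies "
          "and ill-conceived plot devices.")
print(extract_concepts(review))
```

Note that sentiment-bearing adjectives such as "unintelligible" are simply absent from the lexicon, so only content-bearing phrases survive as constraints.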
We achieve this mapping by linking tokens and phrases in the document to nodes in the ConceptNet knowledge graph (Speer et al., 2017), thus evoking the corresponding concepts. We primarily use COCO-EX for its ability to identify meaningful concepts, but also explore the use of the links to related concepts it provides in Section 4.4. We also compare with a baseline using noun chunks as constraints in App. C.2.

Table 1: Original reviews and counterfactuals from each source (neg = negative, pos = positive, ?? = unclear).

Label | Review
Original (neg) | But this film decided to throw away the talents of the people involved in a simpering version so watered down from the source material that it's amazing they had the guts to call it Wuthering Heights at all.
W&C (neg) | But this film decided to throw away the talents of the people involved in a simpering version so watered down from the source material that it s unimpressive they had the guts to call it wuthering heights at all
Y. et al. (neg) | But this film decided to throw away the talents of the people involved in a simpering version so watered down from the source material that it's amazing they had the guts to call it wuthering heights at all.
NeuroCFs-1g (pos) | But the film guts its source material, and it does so with a version of the heights of artistry that people have come to expect from the talents of jean renoir.
NeuroCFs-np (pos) | But this film decided to take the talents of the source material and make them its own, and it's a gutsier version of the people we know and love from the heights.
Original (pos) | Unfortunately I had to rent a Dreamcast to play it, but even though I did beat it I can't wait to buy it for PS2.
W&C (pos) | Fortunately i had to rent a dreamcast to play it but even though i did beat it i can t wait to buy it for ps2
Y. et al. (??) | Unfortunately i had to rent a dreamcast to play it, but even though i did beat it i can't wait to buy it for ps2.
NeuroCFs-1g (neg) | Unfortunately it's not nearly as good as the dreamcast ps2 version.
NeuroCFs-np (neg) | Unfortunately i had to rent a dreamcast to play it but even though i did beat it i can't recommend it for ps2 or xbox.

Steering Sentiment via LM Adaptation
The second component of our method is a sentiment "steer", i.e., an autoregressive language model which has been trained or adapted via finetuning (Gururangan et al., 2020) exclusively on sentences with a single (negative or positive) polarity. Specifically, we use one steer for each sentiment label: one which models positive-sentiment text (denoted p_θ^+), and another which models negative-sentiment text (denoted p_θ^−), where θ indicates the parameters of the adapted language model. In contrast to the hard predicate constraints over specific tokens given by the extracted concepts in §2.1, our selective use of steering LMs can be viewed as a softer type of constraint which biases the generations towards text containing the desired sentiment polarity (Liu et al., 2021).
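As a toy illustration of what single-polarity adaptation buys, the sketch below fits two add-one-smoothed unigram models on tiny single-polarity corpora. The paper finetunes GPT-2 Large instead; the corpora, function names, and unigram assumption here are purely illustrative.

```python
from collections import Counter

# Toy sentiment "steers": two unigram language models, each estimated from
# single-polarity text. (The paper adapts GPT-2 Large; this is a cartoon.)

def fit_unigram_lm(corpus):
    counts = Counter(w for sent in corpus for w in sent.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    # Add-one smoothing so unseen words still receive nonzero probability.
    return lambda w: (counts[w] + 1) / (total + vocab)

pos_lm = fit_unigram_lm(["a great and moving film", "great acting , great plot"])
neg_lm = fit_unigram_lm(["a dull and tedious film", "dull acting , weak plot"])

# The positive steer assigns higher probability to positive-polarity tokens,
# which is exactly the "soft constraint" used during decoding.
print(pos_lm("great") > neg_lm("great"))
```

Plugging such a polarity-specific model into the decoder biases generation toward the desired sentiment without any hard lexical requirement.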

Decoding with Conceptual Constraints
Our method utilizes NeuroLogic Decoding (Lu et al., 2021), a controlled text generation algorithm which generates fluent text satisfying a set of lexical constraints from a pretrained language model. Given predicates D(a, y) which are true iff the subsequence a appears in the generated sequence y, NeuroLogic accepts a set of clauses {C_i | i = 1, ..., m}, each consisting of one or more predicates, specified in Conjunctive Normal Form (CNF):

    (D_1 ∨ D_2 ∨ ...) ∧ ... ∧ (... ∨ D_n),

where each predicate D_i is a positive constraint, D(a_i, y), which is satisfied (i.e., evaluates as true) if the subsequence a_i appears in the generated sequence y.
NeuroLogic employs a beam search approximation of an objective function which maximizes the probability of the generated sequence while penalizing deviations from the set of m clauses:

    ŷ = argmax_y  p_θ(y | x) − λ Σ_{i=1}^{m} (1 − C_i),    (1)

where λ ≥ 0 penalizes deviations from the set of constraints. Candidates are scored at each stage t of beam search according to their partial or full satisfaction of the constraints:

    f(y_{≤t}) = log p_θ(y_{≤t} | x) + β · max_â (|â| / |a|),    (2)

where â represents a subsequence of a in the current generation and the maximum is taken over all unsatisfied constraints consisting of more than one token. This has the effect of preferring candidates which at least partially satisfy multi-token constraints; for example, a generated sequence y_{≤t} = "The boy climbs an apple" would be rewarded for partially satisfying the constraint a = "apple tree" via its subsequence â = "apple". Unlike the top-k selection strategy used in traditional beam search, NeuroLogic performs pruning, grouping, and selection steps to identify the best candidates which satisfy the given constraints. Specifically, candidates which irreversibly violate one or more constraints are pruned, and the remaining candidates are grouped according to their number of satisfied clauses in order to encourage diversity. The best candidate within each group is then selected according to the scoring function in Equation 2.
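The partial-satisfaction reward in Equation 2 can be sketched as follows. The function name and the prefix/suffix matching heuristic are ours, simplified from the full NeuroLogic implementation, which operates over token ids in the beam.

```python
# Sketch of NeuroLogic's in-progress reward: credit a candidate with the
# largest matched fraction |a_hat|/|a| of any *unsatisfied* multi-token
# constraint. Names and matching heuristic are ours, not the paper's code.

def partial_match_reward(generated_tokens, constraints):
    best = 0.0
    joined = " ".join(generated_tokens)
    for a in constraints:
        a_toks = a.split()
        if len(a_toks) < 2:
            continue            # single-token constraints get no partial credit
        if a in joined:
            continue            # already fully satisfied -> not rewarded here
        # Longest proper prefix of the constraint ending the current generation.
        for k in range(len(a_toks) - 1, 0, -1):
            if generated_tokens[-k:] == a_toks[:k]:
                best = max(best, k / len(a_toks))
                break
    return best

# The example from the text: "apple" partially satisfies "apple tree".
print(partial_match_reward("The boy climbs an apple".split(), ["apple tree"]))
```

This reward is then added to the candidate's log-probability, scaled by the β hyperparameter mentioned in §3.1.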
Each word or phrase in the original example which is linked to a ConceptNet node (§2.1) becomes a clause in our constraint set used with NeuroLogic. We allow each clause to be satisfied by the lowercase or capitalized form of the concept via an OR constraint. For the example in Figure 1, this constraint set would be specified in CNF as follows:

    (Movie ∨ movie) ∧ (Plot Devices ∨ plot devices) ∧ (Collection ∨ collection) ∧ (Analogies ∨ analogies).

Once the constraints have been identified in the original, we substitute the sentiment-steered LMs (§2.2) into Equation 1, using the polarity opposite to the original:

    ŷ = argmax_y  p_θ^i(y | x) − λ Σ_{j=1}^{m} (1 − C_j).    (3)

Here, p_θ^i = p_θ^+ when we aim to generate a positive-sentiment example, and p_θ^i = p_θ^− for a negative-sentiment example. The resulting generation, ŷ, is a NeuroCounterfactual (NeuroCF).
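A minimal sketch of how a CNF constraint set of this form can be built and checked: the case-insensitive OR is modeled with a lowercase/title-case pair, and a CNF formula holds iff every clause has at least one satisfied literal. The helper names are ours.

```python
# Sketch: build the CNF constraint set from extracted concepts, where each
# clause is an OR over the lowercase and capitalized form of one concept.

def build_clauses(concepts):
    return [(c.lower(), c.title()) for c in concepts]

def satisfied(clauses, text):
    """A CNF formula holds iff every clause has at least one literal in `text`."""
    return all(any(lit in text for lit in clause) for clause in clauses)

clauses = build_clauses(["movie", "plot devices", "collection", "analogies"])
print(clauses[1])
print(satisfied(clauses, "This Movie has a collection of analogies and plot devices."))
```

During decoding, NeuroLogic tracks exactly this kind of clause satisfaction incrementally rather than re-scanning the full string, but the logical semantics are the same.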
In Eq. 3, the generation is conditioned on x, a prompt comprising a prefix of the original input; we investigate two variants for x. When x is a unigram (1g) comprising the first token of the original input, we call the generations NeuroCFs-1g. When x is the longest neutral prefix of the original input, we call the generations NeuroCFs-np; these are slightly tighter NeuroCFs containing a greater portion of the original input. Table 1 provides examples showing the original sentence and our generated NeuroCFs, highlighting words in the original that were included in the concept-oriented constraint set for NeuroLogic decoding. Note that NeuroCFs are not guaranteed to avoid introducing new concepts beyond the specifications of the constraint set. See App. §A for further examples.

Data Augmentation with NeuroCFs
Our experiments compare NeuroCFs to CFs from minimal edit approaches, for augmentation of sentiment classification training data.

Experimental Setup
Sentiment Steer  Our positive and negative sentiment steers are based on a GPT-2 Large model (Radford et al., 2019), finetuned on the positive and negative subsets, respectively, of the Stanford Sentiment Treebank (SST-2; Socher et al., 2013) corpus, including train, test, and validation splits.

NeuroLogic  For decoding with NeuroLogic, we use a beam size of 20, a length penalty of 0.3, and an n-gram size of 2 for preventing repetitions. We use β = 1.25 as the reward factor for in-progress constraint satisfaction and set the constraint satisfaction tolerance to 2. Please refer to Lu et al. (2021) for details on these hyperparameters.
For the generation of NeuroCFs-np, we identify the longest neutral prefix of the original input. As candidates, we consider all prefixes containing at least 4 tokens, such that the rest of the review contains at least one identified concept. We select the longest candidate predicted as neutral by an off-the-shelf 5-way sentiment classifier.

Following prior work (Kaushik et al., 2019), we generate NeuroCFs for a subset of movie reviews from the Internet Movie Database (IMDB; Pang and Lee, 2005), comprising 2440 examples randomly sampled and split into 70% training, 10% validation, and 20% test partitions, for a sentiment classification task (Maas et al., 2011). We augment the training data of a sentence-level version of this dataset (IMDB-S), introduced by Wang and Culotta (2021); see App. B.1 for details.
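The neutral-prefix selection can be sketched as below. Here `is_neutral_stub` is a hypothetical stand-in for the off-the-shelf 5-way sentiment classifier used in the paper, and the constraint that the remainder still contain a concept is simplified to leaving at least one token.

```python
# Sketch of NeuroCFs-np prompt selection: among prefixes of >= 4 tokens, keep
# the longest one labeled neutral. `is_neutral_stub` is a made-up word-list
# classifier standing in for the paper's off-the-shelf 5-way model.

def is_neutral_stub(text):
    polar = {"unfortunately", "amazing", "terrible", "great", "love", "hate"}
    return not any(w.strip(".,").lower() in polar for w in text.split())

def longest_neutral_prefix(tokens, classify=is_neutral_stub, min_len=4):
    # Scan from the longest candidate down; leave at least one token for the
    # rest of the review (which must still carry an identified concept).
    for end in range(len(tokens) - 1, min_len - 1, -1):
        prefix = " ".join(tokens[:end])
        if classify(prefix):
            return prefix
    return None  # no neutral prefix of sufficient length

tokens = "I had to rent a console to play this amazing game".split()
print(longest_neutral_prefix(tokens))
```

The returned prefix then serves as the prompt x in Eq. 3, so the generation shares a neutral opening with the original review.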

Baselines: Other CF sources
We compare with multiple sentiment classification baselines employing counterfactuals for training data augmentation. Kaushik et al. (2019) crowdsource counterfactuals for IMDB by soliciting minimal revisions which maintain coherence while flipping the sentiment, creating both a counterfactually augmented train set as well as a test set. We also consider two approaches that produce automatically generated counterfactuals. Wang and Culotta (2021) generate counterfactuals by automatically identifying causal words in the original example and substituting them with their antonyms, ensuring minimal edits. Similarly, Yang et al. (2021) automate counterfactual generation through the identification of causal terms which are either removed or replaced; they then filter candidates using MoverScore (Zhao et al., 2019) to ensure minimal edits were made to the original example. For all the baselines above, we train on sentence-level IMDB reviews, as well as sentence-level variants of the counterfactuals.

Sentiment Classifier  We compare several models, each based on the RoBERTa-base architecture (Liu et al., 2019). Each model is trained on a counterfactually augmented dataset, where the CFs are obtained either via the baselines above or via our approaches (§2). We additionally train a baseline only on the original IMDB-S training data, without any CFs. Details on model training are provided in App. B.3.
Evaluation  We report classification accuracy on a combination of in-domain and out-of-domain test sets. As in-domain test sets, we evaluate on the IMDB test set. We also evaluate on CFs for IMDB crowdsourced by Kaushik et al. (2019). In addition, we evaluate on contrast sets (Gardner et al., 2020), which are expert-annotated CFs for IMDB test data. As another in-domain test set, we evaluate on the SST-2 movie reviews test set. Wu et al. (2021) produce task-agnostic, minimal edit counterfactuals with fine-grained semantic controls over different types of perturbations, followed by human labeling; we also evaluate on these so-called Polyjuice CFs for the SST-2 test set. While SST-2 differs from IMDB in terms of word length and style, we nevertheless consider it in-domain for the purpose of our evaluations because both datasets are comprised of movie reviews.
For the OOD test sets, we consider the following binary sentiment classification datasets:
• The Amazon dataset (Ni et al., 2019) consists of consumer product reviews in the categories of software, fashion, appliances, beauty, magazines, and gift cards.
• The Twitter dataset (Rosenthal et al., 2017) from SemEval-2017 Task 4 contains social media posts collected from Twitter.
• The Yelp dataset contains consumer reviews originating from the Yelp dataset challenge.

NeuroCFs for Train Data Augmentation
Table 2 shows our results. NeuroCFs outperform alternative methods for automatic CF generation across every in-domain as well as OOD setting, including performance on CF test sets. The only exception is the IMDB test set, where we match the performance of the best approach (up to standard deviation). Across most CF and OOD test sets, the magnitude of our improvements is similar to or greater than the amount by which existing methods improve on the no-counterfactual baseline. Furthermore, most of these improvements are statistically significant (p ≤ 0.05) relative to the results of both Yang et al. (2021) and Wang and Culotta (2021). NeuroCFs even surpass the performance of augmentation with crowdsourced counterfactuals from Kaushik et al. (2019) in most OOD settings. However, training on manual CFs results in higher performance when tested on human-written CFs; this might be attributed to distributional similarities (Geirhos et al., 2020; Koh et al., 2020). Regardless, our performance is close, despite using fewer training instances while avoiding the significant cost of human annotation.
Both NeuroCF variants have comparable performance, with NeuroCFs-np faring better on 4/8 benchmarks. Consistent with prior work (Wang and Culotta, 2021), we observe that training with CFs generally results in similar or slightly worse in-domain performance.

Table 4 compares our NeuroCFs, and CFs from other sources, to the original across three similarity metrics: BLEU (n-gram = 2) (Papineni et al., 2002), Levenshtein edit distance (Levenshtein et al., 1966), and MoverScore (Zhao et al., 2019). Additionally, Table 4 provides the mean perplexity of generated counterfactuals as measured by GPT-J (Wang and Komatsuzaki, 2021), as well as the Distinct-2 diversity measure (Li et al., 2015). NeuroCFs are loose counterfactuals by design, and are therefore farther from the original sentence; NeuroCFs-np are tighter CFs compared to NeuroCFs-1g. However, NeuroCFs have greater fluency (as evidenced by lower mean perplexity) and offer performance benefits over more similar CFs produced via minimal edit approaches (Table 2). Moreover, more dissimilar variants, generated without constraints (§4.3) or with alternative concepts (§4.4), also hurt performance.
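Two of the reported metrics are straightforward to sketch; MoverScore and GPT-J perplexity require pretrained models, but token-level Levenshtein edit distance and Distinct-2 (the ratio of unique bigrams to total bigrams) can be computed directly:

```python
# Token-level Levenshtein edit distance via the standard two-row DP, plus the
# Distinct-2 diversity measure (unique bigrams / total bigrams).

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # delete x
                           cur[j - 1] + 1,     # insert y
                           prev[j - 1] + (x != y)))  # substitute / match
        prev = cur
    return prev[-1]

def distinct_2(tokens):
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

orig = "the film was dull".split()
cf = "the film was a delight".split()
print(levenshtein(orig, cf))   # one substitution plus one insertion -> 2
print(distinct_2("the film the film was good".split()))
```

A loose CF will show a larger edit distance to its original than a minimal-edit CF, which is exactly the trade-off Table 4 quantifies.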

Impact of NeuroCF quantity
In contrast to minimal edit approaches, our approach has the added advantage of producing more than a single NeuroCF for each original example, via NeuroLogic hyperparameter variation. We seek to investigate how the quantity of NeuroCFs used for training data augmentation impacts OOD generalization. Results in Figure 2 show a monotonic increase in accuracy on IMDB contrast sets (Gardner et al., 2020) as the number of NeuroCFs-np grows. However, performance on the Amazon OOD set plateaus, suggesting overfitting to the IMDB domain; this echoes the findings of prior work on the efficacy of counterfactuals (Khashabi et al., 2020; Huang et al., 2020; Joshi and He, 2022).

Impact of NeuroCF Similarity
We investigate the impact of the similarity of NeuroCFs to the original example on sentiment classification performance after augmentation. From the NeuroCFs candidate set described in §4.1, we create two sets of alternative NeuroCFs for each instance: one with the lowest MoverScore (most dissimilar) w.r.t. the original (NeuroCFs-loose) and the other with the highest MoverScore (most similar; NeuroCFs-tight).
Table 5 compares these two alternatives via classifier performance across our in-domain and out-of-domain tests. In general, we observe that tighter (i.e., more similar to the original sentence) counterfactuals improve generalization more when evaluated on counterfactual and contrast sets. They also improve out-of-domain generalization, with the exception of the Yelp dataset, where both variants result in similar performance. Tighter counterfactuals are more likely to break the spurious correlations that help classifiers perform better on in-domain test sets, which may explain why NeuroCFs-loose performs better on the IMDB and SST test sets. While NeuroCFs are designed to be loose CFs, these results suggest that higher similarity between the original and its NeuroCF is still important for generalization.
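The loose/tight split can be sketched as follows, with token-level Jaccard overlap standing in for MoverScore (which requires contextual embeddings); both the similarity proxy and the function names are our assumptions.

```python
# Sketch of the loose/tight split: from a candidate set of NeuroCFs for one
# original, keep the least- and most-similar candidate. Token Jaccard overlap
# is a crude stand-in for MoverScore, used here only for illustration.

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def split_loose_tight(original, candidates, sim=jaccard):
    scored = sorted(candidates, key=lambda c: sim(original, c))
    return scored[0], scored[-1]  # (most dissimilar, most similar)

orig = "the film wastes the talents of its cast"
cands = ["the film showcases the talents of its cast",
         "a triumphant showcase for everyone involved"]
loose, tight = split_loose_tight(orig, cands)
print(loose)
```

Applying this per instance yields the NeuroCFs-loose and NeuroCFs-tight training sets compared in Table 5.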

Impact of Constrained Decoding
Our approach uses a sentiment-steered LM to control the sentiment of the NeuroCFs, and constraint-based decoding to encourage similarity to the original example. To investigate the impact of constrained decoding, we run an ablation without the use of NeuroLogic, i.e., using only the sentiment steer. Specifically, we use the first token of each original sentence as a prompt and decode from our sentiment experts using beam search with the same hyperparameters as for NeuroCFs.
Table 4 compares both variants by their similarity to the original, and Figure 3 compares the performance of training data augmentation with both variants. The use of constraint-based decoding results in substantial performance improvements over the no-constraint baseline across all evaluation sets except the in-domain IMDB test set. This highlights the value of using constraints to encourage similarity to the original, thus resulting in a NeuroCF, as opposed to simply a new example of the opposite polarity. These results, along with those from §4.2, indicate the existence of an optimal degree of similarity, which is not as high as minimal edit counterfactuals, and not as low as constraint-free counterexamples. Initial experiments further point to the value of ConceptNet constraints over nominal constraints; the former result in more similar NeuroCFs (see App. §C.2 for details).

Leveraging ConceptNet for alternative constraint sets
Our use of COCO-EX for identifying concept constraint sets provides a link between each of our constraints and a node in ConceptNet.We explore whether the structured knowledge contained in ConceptNet can provide alternative constraint sets for NeuroCFs.
For each concept in our original constraint sets, we query ConceptNet for its most similar English-language node in the graph (via similarity scores calculated over pre-computed ConceptNet embeddings) and use the label of this nearest neighbor to replace our original concept constraint (see Appendix C.3 for examples). Table 6 compares the performance of a RoBERTa-base classifier trained on NeuroCFs-np, their counterparts produced by alternative conceptual constraints derived from ConceptNet, and a combined set of NeuroCFs produced by both the original and concept-altered constraints. We observe that further increasing the size of our CF set using concept-altered NeuroCFs increases performance on in-domain CF test sets while retaining performance on OOD test sets. While this pilot shows promising results, we leave a systematic investigation of using ConceptNet knowledge to create counterexamples for data augmentation to future work.
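A sketch of the nearest-neighbor concept replacement step: the 3-d embedding table below is made up for illustration, whereas the paper queries similarities over pre-computed ConceptNet embeddings.

```python
import math

# Sketch of concept alteration: replace each constraint with its nearest
# neighbor under cosine similarity. The vectors here are invented; the paper
# uses similarity scores over ConceptNet's pre-computed embeddings.

EMB = {
    "movie": (0.90, 0.10, 0.00),
    "film":  (0.88, 0.12, 0.05),
    "plot":  (0.20, 0.90, 0.10),
    "story": (0.25, 0.85, 0.15),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_concept(concept, emb=EMB):
    # Most similar node other than the concept itself.
    others = [(c, cosine(emb[concept], v)) for c, v in emb.items() if c != concept]
    return max(others, key=lambda t: t[1])[0]

print(nearest_concept("movie"))
print(nearest_concept("plot"))
```

Running NeuroLogic with the swapped labels then yields the concept-altered NeuroCFs evaluated in Table 6.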

Can NeuroCFs be used for evaluation?
Inspired by the success of NeuroCFs for training data augmentation, we further investigate whether these can be used as a challenge set for evaluation (Rudinger et al., 2018). However, before deploying them as test sets, we need to first verify that NeuroCFs indeed alter the target label, as intended by the sentiment steering process (§2.2). We randomly select 50 NeuroCFs, as well as CFs from baseline approaches, and evaluate whether they successfully steered the sentiment of the original example. Results show that NeuroCFs-np and NeuroCFs-1g are more successful in steering sentiment than the baseline approaches; however, only about 50% of the resulting NeuroCFs-np actually result in a sentiment change; see further discussion in App. C.4. Hence, we cannot reliably use generated counterfactuals for evaluation. Future work might investigate manually labeling NeuroCFs for use as challenge sets, following Wu et al. (2021).

Related Work
Counterfactual data augmentation is emerging as a viable solution for improving model robustness towards spurious correlations (Geirhos et al., 2020).
In previous sections, we presented comparisons to various minimal edit approaches for producing counterfactuals, either manually or automatically (Kaushik et al., 2019; Wang and Culotta, 2021; Yang et al., 2021; Wu et al., 2021; Gardner et al., 2020). Our approach steers away from minimal edits, as well as from manual intervention, for creating counterfactuals.
Beyond sentiment classification, counterfactual data augmentation has been employed for tasks such as question answering (Paranjape et al., 2022), fairness in social computing (Sen et al., 2021), and natural language inference (Glockner et al., 2018). Most work focuses on minimal edits of training instances via small perturbations to the causal features, typically by manually editing instances. Madaan et al. (2021) introduce a controlled text generation approach to create counterfactuals containing specific attributes, but focus on applications to debiasing and evaluation rather than our objective of training data augmentation. Hu and Li (2021) propose a structural causal model for combining attribute-conditional generation and text attribute transfer (i.e., minimal edits), but similarly produce counterfactuals for different purposes than ours. Ross et al. (2022) automate the creation of contrast sets (Gardner et al., 2020).

Discussion
We presented an approach to generate NeuroCFs via sentiment steering and concept-constrained decoding. Training data augmentation with NeuroCFs results in improved sentiment classification performance over existing minimal-edit methods, both in and out of domain, even matching human-curated counterfactuals in some cases. We presented several analyses of NeuroCFs, and ablations showing the effectiveness of our approach. While NeuroCFs are loose by design, our analyses indicate the existence of an optimal degree of similarity, which is not as high as minimal edit counterfactuals, and not as low as constraint-free counterexamples.
While this work focused on NeuroCFs for movie reviews only, our results show that training on them transfers to other domains, such as product reviews and social media posts, for the same sentiment analysis task. Future research might investigate generating NeuroCFs for evaluation, and for tasks beyond sentiment classification. Our approach is broadly compatible with any task for which a language model steer can be trained; future applications could therefore include other NLP tasks where global attributes are available, such as toxicity removal or style transfer. Further, we could consider generating a neighborhood of NeuroCFs around individual instances, similar to contrast sets (Gardner et al., 2020).

Limitations
Our approach to generating NeuroCFs is designed specifically for binary sentiment classification in English only. For generating NeuroCFs, we needed knowledge of the original example's sentiment polarity; however, it is possible to produce NeuroCFs for both polarities without knowledge of the original label. Applications to other classification settings might require training multiple language model steers, which can be challenging in the absence of global labels (e.g., with instance-specific labels in multiple-choice question answering). NeuroCFs might need to be filtered for grammaticality and for steering accuracy before use beyond training data augmentation. Our approach produces loose counterfactuals at the sentence level; efficient extensions to paragraph-level transformations were not explored in this work. Throughout this work, we use RoBERTa-base and GPT2-Large architectures; however, more powerful architectures could potentially improve our results.
It is possible that language generated and labeled through automatic approaches contains its own annotation artifacts (Gururangan et al., 2018), leading to a different set of spurious biases. Potential harms of generated language include harmful social biases (Bender et al., 2021), which were not investigated in this work. Approaches that involve a human validation phase after data collection (Liu et al., 2022) might be explored in future work to mitigate such harms.

Ethical Considerations
We acknowledge that generated language is susceptible to harmful social biases (Bender et al., 2021) and toxicity (Gehman et al., 2020). We caution practitioners against training models solely on model-generated data. We do not filter our training data or our generations for toxicity, bias, or offensiveness. Hence, we recommend that practitioners interested in using our generations and replicating this work carefully check the generated content before deployment in any real-world application.
Our work uses only publicly available datasets.To the best of our knowledge, these do not contain any explicit information about a user's identity, health, negative financial status, racial or ethnic origin, religious or philosophical affiliation or beliefs, beyond their reviews on movies and products.

A Extended Qualitative Analysis
A larger qualitative analysis is provided in Table 14, which further highlights how NeuroCFs result in more complex changes to the original sentence and are more successful in sentiment steering than minimal-edit counterfactuals. Minimal edits are at times unable to produce meaningful sentiment flips, and can reduce grammaticality and pragmatic coherence, producing phrases such as "racism was best" (W&C) and "part in the game" (Y. et al.).
A.1 Examples of cases where a counterfactual was not generated

Our experiments were conducted on a Slurm Linux cluster with Nvidia RTX 3090 GPUs. We parallelized the generation of NeuroCFs across 32 GPUs in this environment, resulting in a total running time of 75 minutes. Table 8 reports the mean time to train our RoBERTa-base classifier on the various sets of counterfactuals, measured across 30 different random seeds. Each training run for a given source of counterfactuals and seed was conducted on a single GPU. RoBERTa-base has 125M parameters.

C.1 Evaluating on sentence-level test sets
Table 12 shows the results of all our approaches and baselines on sentence-level test sets.

C.2 Noun Chunk Concepts as Constraints
As detailed in Section 2.1, we form our constraint sets by using COCO-EX to identify meaningful concepts in the original example. To investigate how our concept-constrained generations differ from those produced by constraint sets derived from nouns, we generated an alternative set of NeuroCFs using constraints consisting of noun chunks identified by spaCy (https://spacy.io/). We found that these alternative noun-chunk NeuroCFs had an average MoverScore of 0.15 w.r.t. their corresponding COCO-EX concept-constrained NeuroCFs, indicating that the use of concepts for constraint formulation produces substantially different counterfactuals than the use of noun chunks. Moreover, based on the evidence from Table 5, we hypothesize that these alternative constraints might not result in a performance boost.
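The comparison above can be sketched in a self-contained way. MoverScore itself requires pretrained embeddings, so as an illustrative stand-in we score pairs with `difflib`'s `SequenceMatcher`; the sentences and variable names below are hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Illustrative stand-in for MoverScore: token-level surface
    # similarity in [0, 1]; the paper's comparison uses MoverScore.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def mean_pairwise_similarity(gens_a, gens_b):
    """Average similarity between corresponding generations from two
    constraint-formulation strategies (e.g., concepts vs. noun chunks)."""
    assert len(gens_a) == len(gens_b)
    return sum(similarity(a, b) for a, b in zip(gens_a, gens_b)) / len(gens_a)

# Hypothetical generations under the two constraint sets.
concept_cfs = ["This is a good, dark film that I highly recommend."]
noun_chunk_cfs = ["This film has a good plot and great acting."]
score = mean_pairwise_similarity(concept_cfs, noun_chunk_cfs)
assert 0.0 <= score <= 1.0  # a low score indicates dissimilar counterfactuals
```

A low mean score, as in the 0.15 MoverScore reported above, indicates that the two constraint-formulation strategies yield substantially different counterfactuals.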

C.3 Examples of concept-altered constraint sets derived from ConceptNet
Table 13 provides examples of our original NeuroCFs-np and their concept-altered versions after replacing constraints with similar nodes from ConceptNet.

We report the performance of a RoBERTa-base classifier finetuned on the Yelp dataset, using the original IMDB dataset and various CF test sets, in Table 11.
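The constraint-altering step can be sketched as follows; the neighbor map here is a hypothetical stub standing in for actual ConceptNet queries (e.g., over Synonym or RelatedTo edges):

```python
import random

# Hypothetical neighbor map; a real implementation would query ConceptNet
# for nodes similar to each extracted concept.
NEIGHBORS = {
    "film": ["movie", "cinema"],
    "plot": ["storyline", "narrative"],
}

def alter_constraints(constraints, neighbors, seed=0):
    """Replace each constraint with a similar node when one is available;
    constraints without a known neighbor are kept unchanged."""
    rng = random.Random(seed)
    return [rng.choice(neighbors[c]) if c in neighbors else c
            for c in constraints]

altered = alter_constraints(["film", "plot", "acting"], NEIGHBORS)
assert altered[0] in {"movie", "cinema"}
assert altered[2] == "acting"  # no neighbor found: original kept
```

The altered constraint set is then passed to constrained decoding in place of the original concepts.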

Figure 2: Increasing NeuroCF quantity for training data augmentation improves in-domain performance, while OOD generalization plateaus.

Figure 3: Conceptual constraint-based decoding with NeuroLogic improves performance, as seen by comparing training data augmentation with NeuroCFs-np against their counterparts generated without any constraints. Reported RoBERTa-base accuracy is averaged over 30 random seeds.
for question answering, dependency parsing, and relation extraction, via training a generator with semantic control codes; however, their method requires the user to specify what changes to the original sentence are desired. Beyond Counterfactuals: Srivastava et al. (2020) collect human annotations of the commonsense reasoning behind examples, in a robust optimization setting that minimizes worst-case loss, without explicitly collecting counterfactuals. Ribeiro et al. (2018) demonstrate how state-of-the-art models are vulnerable to semantically-equivalent adversarial examples constructed by a rule-based method. Ribeiro et al. (2020) propose CheckList, which contains heuristic edits of the evaluation data instances. Other approaches employ perturbations without creating actual data instances (Veitch et al., 2021).

Table 1: NeuroCFs result in more complex changes to the original, and are more successful in steering the sentiment for label flipping; minimal edits are at times unable to produce meaningful changes to the sentiment, and result in reduced grammaticality. Concepts in the original sentence that were used as constraints to generate NeuroCFs are in blue italics. Also see App. §A; Tab. 14.

Table 2: Accuracies comparing IMDB-S training data augmentation with NeuroCFs vs. other sources of counterfactuals. IMDB CF (K. et al.) and Cont. Sets refer to the human-authored counterfactuals (Kaushik et al., 2019) and contrast sets (Gardner et al., 2020), respectively. |Dtrain| shows the total number of training instances, including 8,173 original IMDB-S training examples. Results report the mean over 30 different random seeds, with s.d. as a subscript. All models are based on the RoBERTa-base architecture. Best results using auto-generated CFs for training are in boldface. Results for NeuroCFs-1g and NeuroCFs-np are underlined when a one-tailed t-test indicates that their improvements over both Yang et al. (2021) and Wang and Culotta (2021) are statistically significant (p ≤ 0.05). Manually created counterfactuals are marked.

Table 3: Results controlling for training data quantity (|Dtrain|), comparing different counterfactual data augmentation approaches. The first row shows a baseline trained without CFs. All other settings are identical to Table 2.

Each source of CFs evaluated in Table 2 produces a different amount of training data, Dtrain. To control for training data quantity, we present results with the training data downsampled for uniformity across settings in Table 3. Surprisingly, even smaller amounts of NeuroCFs achieve the best performance compared to other methods of auto-generating CFs. Notably, NeuroCFs-np achieves statistically significant improvements over both Yang et al. (2021) and Wang and Culotta (2021) on every evaluated dataset. These results demonstrate that the performance improvements achieved on OOD sets can be attributed to the quality of the NeuroCFs. App. C.1 provides further results on sentence-level tests.
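The quantity control described here can be sketched as simple random downsampling to a common budget; the dataset names and sizes below are illustrative, not the paper's actual splits:

```python
import random

def downsample(dataset, budget, seed=0):
    """Randomly subsample a training set to a fixed budget so that all
    augmentation sources contribute equal amounts of data."""
    if len(dataset) <= budget:
        return list(dataset)
    rng = random.Random(seed)
    return rng.sample(list(dataset), budget)

# Illustrative: original examples plus CFs from two hypothetical sources.
original = [("review %d" % i, "pos") for i in range(8173)]
cfs_a = [("cf-a %d" % i, "neg") for i in range(5000)]
cfs_b = [("cf-b %d" % i, "neg") for i in range(12000)]

budget = 4000  # common CF budget across sources (hypothetical value)
train_a = original + downsample(cfs_a, budget)
train_b = original + downsample(cfs_b, budget)
assert len(train_a) == len(train_b)  # |Dtrain| is now uniform across settings
```

With |Dtrain| held fixed this way, remaining performance differences can be attributed to the quality of the counterfactuals rather than their quantity.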


Table 4: Comparing fluency, diversity, and similarity of generated and human-authored CFs to the original, across various metrics. NeuroCFs are loose counterfactuals by design, and are therefore farther from the original sentence.

Table 5: Impact of the similarity of a NeuroCF to the original. NeuroCFs-loose are more dissimilar to the original than NeuroCFs-tight, as measured by mean MoverScore. Tighter NeuroCFs result in better performance.

Table 7: Size of datasets used in experiments.

Table 9 provides examples of sentences for which a NeuroCFs-np was not generated. In these cases, no prefix of the original sentence at least 4 tokens in length was predicted to be neutral. This can be attributed to sentiment-bearing words being present at the start of the sentence.
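One plausible reading of this prefix-selection step can be sketched as follows; whether the shortest or longest qualifying prefix is used is our assumption, and the lexicon-based neutrality check is a toy stand-in for the actual sentiment classifier:

```python
def find_neutral_prefix(sentence, is_neutral, min_tokens=4):
    """Return the shortest prefix of at least `min_tokens` tokens that the
    classifier predicts to be neutral, or None if no such prefix exists."""
    tokens = sentence.split()
    for end in range(min_tokens, len(tokens) + 1):
        prefix = " ".join(tokens[:end])
        if is_neutral(prefix):
            return prefix
    return None  # e.g., sentiment-bearing words at the start of the sentence

# Toy neutrality check; the actual system would use a trained classifier.
NEGATIVE = {"awful", "despicable", "unpleasant", "loser"}
def lexicon_neutral(text):
    return not any(w.strip(",.").lower() in NEGATIVE for w in text.split())

assert find_neutral_prefix("The plot of this film was awful",
                           lexicon_neutral) == "The plot of this"
# No neutral prefix: the sentence opens with sentiment-bearing words.
assert find_neutral_prefix("Awful, despicable, unpleasant saga of a Loser",
                           lexicon_neutral) is None
```

When `None` is returned, no NeuroCFs-np is generated for that sentence, matching the failure cases in Table 9.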

Table 7 provides details on the size of the datasets used in our experiments. All datasets consist of English-language text, which we used without modification.

Table 9: Examples of cases where a NeuroCFs-np was not generated.

Table 10: Accuracy of sentiment steering, based on manual evaluation by the authors of this work, on 50 randomly sampled IMDB-S train instances for which CFs were available from all approaches. Many generations from each approach were ungrammatical and unpragmatic (see examples in Table 1 and Table 14); we considered these incorrectly sentiment-steered.

Table 10 shows the steering accuracy of NeuroCFs as well as CFs from baseline approaches, as evaluated by the authors of this work on a sample of 50 randomly selected examples from each. Some examples of this annotation can be seen in Table 14 in Appendix A and in Table 1.

Table 11: Classification accuracy of an off-the-shelf sentiment classifier from the Huggingface Transformers library (RoBERTa-base finetuned on the Yelp dataset). Each row indicates an evaluation set comprised of counterfactuals of the original IMDB-S test set (top row), from different sources. |DCF test| indicates the size of the counterfactual test set. Manually created counterfactuals are marked. The greater the ∆, the more challenging the CF test set. However, NeuroCFs-1g and NeuroCFs-np do not possess human-annotated target labels; also see §4.5.

Table 12: Evaluation of counterfactual data augmentation on sentence-level test sets; other settings are similar to Table 2.