NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?

While a substantial body of prior work has explored adversarial example generation for natural language understanding tasks, the resulting examples are often unrealistic and diverge from real-world data distributions. In this work, we introduce NaturalAdversaries, a two-stage adversarial example generation framework for designing adversaries that are effective at fooling a given classifier while demonstrating natural-looking failure cases that could plausibly occur during in-the-wild deployment of the models. In the first stage, a token attribution method summarizes a given classifier's behavior as a function of the key tokens in the input. In the second stage, a generative model is conditioned on the key tokens from the first stage. NaturalAdversaries is adaptable to both black-box and white-box adversarial attacks, depending on the level of access to the model parameters. Our results indicate these adversaries generalize across domains and offer insights for future research on improving the robustness of neural text classification models.


Introduction
Transformer models have gained prominence in NLP research due to their strong performance on leaderboards. However, numerous studies have shown that these neural models are brittle, frequently taking shortcuts to reach decisions rather than correctly reasoning about the underlying semantics (Geirhos et al., 2020; Bras et al., 2020), or failing when exposed to adversarial perturbations of inputs (e.g., Goodfellow et al., 2015; Szegedy et al., 2015; Jia and Liang, 2017; Glockner et al., 2018; Dinan et al., 2019). Due to the opaque nature of neural modeling, methods for adversarial example generation may also steer algorithms toward generating unlikely examples that exhibit unrealistic properties (Zhao et al., 2018).
In this work, we pose the question: what does it really mean for an adversarial attack to be effective, and can naturalistic adversaries match artificial ones? We argue that effectiveness should depend not only on attack accuracy, but also on the usefulness of adversaries for improving robustness under realistic conditions (e.g., identifying social biases learned by neural models; Buolamwini and Gebru, 2018; Sheng et al., 2019; Stanovsky et al., 2019; Sap et al., 2019; Ross et al., 2021).
We propose NaturalAdversaries, a framework for generating convincingly naturalistic adversaries. We first approximate the behavior of a given classifier decision function F_c(x) and then train a generative model F_g(x) to mimic this behavior. As shown in Figure 2, we condition generative models on influential tokens extracted using black-box or white-box explainability methods (Ribeiro et al., 2016; Sundararajan et al., 2017), and on a desired label (e.g., "entailment" or "contradiction"), to produce new examples through sampling that match a distribution learned from F_c(x).
Our results on two different tasks (hate speech detection and natural language inference) show that our approach leads to adversaries that annotators perceive as considerably more natural. While naturalistic adversaries are often less adversarial than artificial ones (consistent with prior literature, e.g., Morris et al., 2020a), we find that this depends on the evaluation setting, and that naturalistic adversaries can serve as better defenses.

Defining Naturalness
We first define what it means for machine-generated adversaries to have the quality of "naturalness." Prior work on text generation has defined this property in terms of linguistic competence (Novikova et al., 2018; Lau et al., 2020), as well as by enumerating undesirable characteristics that lower perceived naturalness, such as self-contradiction (Dou et al., 2022). In our work, we ask human evaluators to judge naturalness in terms of whether generated text fragments are coherent, well-formed, and likely to be human-written.

Figure 1: Averaged naturalness scores from human evaluation. The adversarial examples were generated from the DynaHate test set (Vidgen et al., 2021a) using two common baselines (Ebrahimi et al., 2018; Jin et al., 2020) as well as NaturalAdversaries (ours). We show the distribution of scores for examples that are effective or ineffective, respectively, at fooling a RoBERTa toxicity classification model (Zhou et al., 2021). Not only does NaturalAdversaries generate more natural examples, but naturalistic examples can also be effective at demonstrating adversarial behavior.

Figure 2: Our proposed framework for testing the robustness of models using machine-generated adversarial examples (NaturalAdversaries). In the initial step, we probe the behavior of a black-box classifier (e.g., RoBERTa) using an explainability method such as integrated gradients to find tokens with high attribution weights (influential tokens). We then use a generative model (e.g., GPT-2) to produce new adversarial examples conditioned on these tokens. The figure illustrates an NLI example with premise "The politicians filed in with gloomy faces." and hypothesis "Everyone looked overjoyed."; given the influential tokens and the target label y' = entailment, the generative model F_g(x) produces "They did not seem happy."
Description of NaturalAdversaries

Table 1: Description of the considered datasets. For ANLI, each example comes with a premise (P) providing context and a hypothesis statement (H).

Adversarial Generation
Given a generative autoregressive model F_g and a training set D_1 with triples (S, y, F_attr(S)), we construct the following input sequence:

x = ⟨attr⟩ z ⟨label⟩ y' ⟨text⟩ S ⟨eos⟩

where z is a sequence of influential tokens sampled from S using the attribution weights defined by F_attr(S), ⟨attr⟩ is a special token indicating the start of this sequence, y' is the desired classification label, ⟨label⟩ and ⟨text⟩ are special tokens indicating the start of y' and S respectively, and ⟨eos⟩ is a special token indicating where the full input sequence ends. At training time, y' = y, as the generative model is trained to mimic the behavior of the classifier and generate examples with a given label y' based on the classifier's observed behavior. At decoding time, we encourage adversarial behavior by reversing the label (e.g., setting y' = 1 − y); the model is prompted using only the influential tokens z and y'. For example, given a natural language inference (NLI) premise and hypothesis pair ("It was sunny outside", "it was too dark to see anything outside") where the gold label is contradiction, at training time we use (y' = "contradiction"; z = "influentialWord1", "influentialWord2", "influentialWord3"; S = "It was sunny outside. It was too dark to see anything outside."). At decoding time we would use (y' = "entailment"; z = "influentialWord1", "influentialWord2", "influentialWord3") and predict S.
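To make the construction concrete, the following is a minimal Python sketch of the training sequence and decoding prompt. The literal special-token strings (<attr>, <label>, <text>, <eos>) and the helper names are illustrative assumptions; the paper specifies only the roles these tokens play.

# Sketch of the input-sequence construction described above. The literal
# special-token strings are assumptions for illustration.
def build_training_sequence(influential_tokens, label, text):
    # Full sequence used to fine-tune the generator (y' = y).
    return ("<attr> " + " ".join(influential_tokens)
            + " <label> " + label
            + " <text> " + text + " <eos>")

def build_decoding_prompt(influential_tokens, flipped_label):
    # Decoding-time prompt: only z and the reversed label y'; the
    # generator is asked to continue with an adversarial S.
    return ("<attr> " + " ".join(influential_tokens)
            + " <label> " + flipped_label + " <text>")

# Example (NLI pair with gold label "contradiction"):
train_seq = build_training_sequence(
    ["sunny", "dark", "outside"], "contradiction",
    "It was sunny outside. It was too dark to see anything outside.")
prompt = build_decoding_prompt(["sunny", "dark", "outside"], "entailment")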
We minimize the standard autoregressive cross-entropy loss during training:

L(θ) = − Σ_{t=1}^{T} log P_θ(x_t | x_{<t}),

where x_1, …, x_T are the tokens of the input sequence x defined above.

Experimental Setup
In this section we first introduce the domains we test on (§3.1) and then the methods used for baseline comparison (§4.1). All adversarial generators are based on the GPT-2 124M parameter model. We describe the evaluation setups (§4.2.1), as well as out-of-distribution evaluation (§4.2.2).

An advantage of the proposed generative method is that we can automatically extend human-in-the-loop adversarial generation methods like Adversarial NLI (ANLI; Nie et al., 2020), which are costly and time-consuming to curate. Given this motivation, we focus on particularly challenging human-in-the-loop examples rather than cases that are already well solved by existing benchmarks. To study the effectiveness of the proposed approach, we conducted experiments on hate speech detection (DynaHate; Vidgen et al., 2021b) and natural language inference (NLI). For DynaHate we use a RoBERTa classifier trained on tweets (Founta et al., 2018; Zhou et al., 2021). We test generalization across both model architectures and (non-adversarial) data domains using a BERT model (Devlin et al., 2019) trained on the HateXplain dataset (Mathew et al., 2021). For NLI we use DeBERTa (He et al., 2021) trained on MNLI (Williams et al., 2018). We test generalization using BERT trained on the QNLI dataset (Wang et al., 2019). Further details are provided in Table 1 and Appendix B.

Table 2: Human evaluation (Natural_H) of naturalness, along with adversarial performance against the original target classifier (Adv1) and an unseen classifier (Adv2). In the last top-right column we show macro-averaged F1 performance on HateCheck (Röttger et al., 2021) after finetuning RoBERTa on 150 adversarial examples, compared to the original performance. We conduct a similar experiment for NLI using the SNLI-Hard evaluation set (Gururangan et al., 2018), with results in the last bottom-right column. We bold the best-performing model and underline the second-best model.

Baselines
For automatic adversarial example construction, we compare against several common adversarial example generation approaches designed for either black-box (model-agnostic) or white-box (model-dependent) attacks. Baselines are implemented using TextAttack (Morris et al., 2020b).

Black-Box Baselines
We use the TextFooler (Jin et al., 2020) algorithm for generating coherent adversaries by replacing high-importance words in original examples with words that preserve semantic similarity.
White-Box Baselines

Since in the white-box setting our approach has the advantage over other baselines of utilizing knowledge about model parameters, we also compare against a word-level version of the widely used gradient-based HotFlip approach (Ebrahimi et al., 2018).
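As a rough illustration, both baselines can be instantiated from TextAttack's attack recipes; the sketch below shows the generic usage pattern rather than our exact experimental configuration, and the checkpoint name and toy dataset are placeholders.

# Sketch: running the TextFooler and HotFlip baselines with TextAttack.
# The checkpoint and dataset are placeholders, not the paper's setup.
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019, HotFlipEbrahimi2017
from textattack.datasets import Dataset
from textattack.models.wrappers import HuggingFaceModelWrapper

name = "textattack/bert-base-uncased-SST-2"  # placeholder classifier
model = transformers.AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
wrapper = HuggingFaceModelWrapper(model, tokenizer)

textfooler = TextFoolerJin2019.build(wrapper)   # black-box word substitution
hotflip = HotFlipEbrahimi2017.build(wrapper)    # white-box gradient flips

data = Dataset([("the politicians filed in with gloomy faces", 0)])
Attacker(textfooler, data, AttackArgs(num_examples=1)).attack_dataset()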

Human Evaluation
To compare the effectiveness of automatic methods, adversarial examples are manually validated to determine the true label. We also assess the naturalness of examples, i.e., whether they are perceived as realistic examples that could have been written by humans. We use 156 crowdsource workers from Amazon Mechanical Turk (MTurk) with prior experience validating hate speech (Sap et al., 2020) and 79 workers with experience validating NLI data (Liu et al., 2022). We sample 150 examples for each approach (some approaches impose constraints that not all candidate transformed sentences satisfy; we also filter out examples that are already adversarial, to avoid conflating the adversarialness of original examples with the effects of the transformation). Each example is judged by 3 different workers. For hate speech, we classify an example as toxic if at least one annotator considers it so. We achieve moderate inter-annotator agreement of Fleiss' κ = .51 for hate speech and κ = .52 for NLI.
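For reference, the reported agreement statistic can be computed from the per-example annotations as in the following hedged sketch (the ratings array is toy data, not our annotations):

# Sketch: Fleiss' kappa over 3 workers per example.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = examples, columns = the 3 workers; 1 = toxic, 0 = not toxic
ratings = np.array([[1, 1, 0],
                    [0, 0, 0],
                    [1, 1, 1],
                    [0, 1, 0]])
counts, _ = aggregate_raters(ratings)  # examples x categories count table
print(fleiss_kappa(counts))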

Out-of-Distribution Performance
Here we frame domain adaptation as a few-shot learning problem, where the adversarial evaluation set represents training examples from outside the seen domain of the classifier. To test out-of-distribution (OOD) performance on hate speech data, we use the HateCheck test suite (Röttger et al., 2021), which consists of test cases for 29 model functionalities relating to real-world concerns of stakeholders. For NLI, we check OOD performance on the SNLI-Hard dataset (Gururangan et al., 2018), which assesses common model vulnerabilities.

Results
We discuss results for the TextFooler (TF) and HotFlip (HF) baselines along with our two model variations (NA-LIME and NA-IG).

Quality and effectiveness of adversarial generations. Results are summarized in Table 2.

Conclusion
We introduce a framework for generating naturalistic adversaries that is effective for multiple neural classifiers and across domains. We encourage further work on how naturalistic adversaries may improve robustness in real-world settings.

Limitations
While it is well known that transformer-based language models suffer from lexical biases (Gururangan et al., 2018), it may be an oversimplification to say that a single keyword is independently the cause of a particular classification decision. It has been shown that language models may account for compositionality to some degree (Shwartz and Dagan, 2019; Baroni, 2019), and future work may explore explainability methods that take this into consideration (e.g., Ye et al., 2021). Another limitation is that generative approaches are highly dependent on the decoding method of choice (Holtzman et al., 2020); while this provides more flexibility, it also leads to more variability in performance.

Ethics & Broader Impact Statement
General Statement. While there is a risk of any technology aimed at mimicking natural language being used for malicious purposes, our work has wide-ranging potential societal benefit by improving the fairness and real-world robustness of neural classifiers. It has become increasingly clear that pretrained neural language models do not operate from a neutral perspective, and implicitly learn behaviors from training data that pose real harm to users (Jernite et al., 2022). We demonstrate that our framework is effective at generating adversaries that uncover model vulnerabilities in two well-studied domains (hate speech and NLI), and it is hypothetically extensible to other domains such as automated fact-checking. Given the sensitive nature of toxic language and hate speech detection in particular, we strongly emphasize that this work is intended only for research purposes and for improving the robustness of automated systems. For the data and code release, we include detailed model and data cards (Bender and Friedman, 2018; Mitchell et al., 2019).
Annotation. Based on time estimates, the annotator wage is approximately $10-$16 per hour. All annotators were required to click a consent button before working on the tasks. For hate speech annotation, annotators were cautioned about the possibly disturbing nature of the content before being shown examples. We also provided crisis hotline information in case of emotional distress.

A Explainability Method Details
A.1 Black-Box Setting

Given a classifier F_c(x) and a set of initial seed examples D with ground-truth labels y, we predict classifier labels ŷ and measure the contribution of each token s_i in a given sequence S ∈ D to the classifier's prediction using LIME (Ribeiro et al., 2016). The LIME algorithm defines a local neighborhood N(x) around a point x representing S, using a proximity measure π_x, and optimizes linear models g ∈ G to jointly minimize the distance between the decision functions of g and F_c over N(x) and the complexity of g. In this case:

explanation(x) = argmin_{g ∈ G} L(F_c, g, π_x) + Ω(g),

where L(F_c, g, π_x) is the locality-aware loss, Ω(g) is the model complexity, and π_x is an exponential kernel defined using cosine distance. We also tested Shapley additive explanation values (Lundberg and Lee, 2017) in early experimentation, but found the results less promising than LIME.
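A minimal sketch of this attribution step with the lime package is shown below, using the num_features = 20 and 2,000-sample settings from Appendix C; the classifier function is a placeholder for F_c.

# Sketch: black-box token attribution with LIME.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Placeholder for the real classifier F_c: must map a list of strings
    # to an (n, num_classes) array of class probabilities.
    return np.full((len(texts), 2), 0.5)

explainer = LimeTextExplainer(class_names=["benign", "harmful"])
exp = explainer.explain_instance(
    "all I want is to not be treated like a second-class citizen",
    predict_proba, num_features=20, num_samples=2000)
print(exp.as_list())  # (token, weight) pairs; high-|weight| tokens form z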

A.2 White-Box Setting
Given a classifier F_c(x) and a set of initial seed examples D with labels y, we predict classifier labels ŷ and measure the contribution of each token s_i in an example sequence S ∈ D to this final output decision using integrated gradients (Sundararajan et al., 2017):

IG_i(x) = (x_i − x′_i) × ∫_{α=0}^{1} ∂F_c(x′ + α(x − x′)) / ∂x_i dα

Here x_i is the embedding of s_i. Following Mudrakarta et al. (2018), the baseline input embedding x′ is defined as a sequence of pad tokens of the same length as the input, since the embedded pad tokens should not be informative; ∂F_c(x)/∂x_i is the gradient of F_c(x) with respect to x_i. After identifying the contribution of each token, we partition D into two sets D_1 and D_2 based on model behavior, where D_1 is the subset of correctly predicted examples and D_2 consists of incorrect predictions. We use the examples and attribution weights from D_1 as training data for the generative model F_g(x).
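The integral is approximated in practice with a Riemann sum; below is a minimal, self-contained sketch in which forward_fn is a placeholder for the classifier's forward pass from input embeddings to class scores.

# Sketch: integrated gradients over token embeddings (Riemann sum).
import torch

def integrated_gradients(forward_fn, x, x_baseline, target, steps=50):
    grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = (x_baseline + alpha * (x - x_baseline)).detach()
        point.requires_grad_(True)
        score = forward_fn(point)[target]            # scalar class score
        grads += torch.autograd.grad(score, point)[0]
    attributions = (x - x_baseline) * grads / steps  # (seq_len, dim)
    return attributions.sum(dim=-1)                  # one score per token

# Toy usage: a linear "classifier" over mean-pooled embeddings; a zero
# baseline stands in for the pad-token embeddings used in the paper.
torch.manual_seed(0)
W = torch.randn(8, 2)
forward_fn = lambda e: e.mean(dim=0) @ W
x = torch.randn(5, 8)   # embeddings of a 5-token input
print(integrated_gradients(forward_fn, x, torch.zeros_like(x), target=1))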

B Domains

B.1 Hate Speech Detection
For hate speech detection, we train the AdversarialGen generation model on the DynaHate benchmark (Vidgen et al., 2021b). We use a RoBERTa classifier trained on the Twitter hate speech dataset (Founta et al., 2018; Zhou et al., 2021). The DynaHate benchmark was chosen for training in our experiments because it was constructed using a human-and-machine-in-the-loop setup designed to reduce dataset biases and improve model generalizability. It also includes examples of implicit hate, which rely less on lexical cues. For this task, the input to the classifier model is a text document such as the one shown below:

x = [CLS] all I want is to not be treated like a second-class citizen [SEP]

Here [CLS] and [SEP] denote classifier-specific special tokens. The output is a binary label (benign or harmful).
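As a hedged illustration of this classification interface (the checkpoint below is a generic placeholder, not the paper's trained toxicity model):

# Sketch: binary classification with a BERT-style model; the tokenizer
# inserts the [CLS] ... [SEP] special tokens automatically.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer(
    "all I want is to not be treated like a second-class citizen",
    return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(-1))  # probabilities over (benign, harmful)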

B.2 Natural Language Inference
For natural language inference, we train the generation model on the ANLI dataset (Nie et al., 2020), which was constructed similarly to DynaHate with an iterative human-and-machine-in-the-loop process. We test a DeBERTa-base classifier (He et al., 2021) trained on the Multi-Genre NLI (MNLI) corpus (Williams et al., 2018), which is a featured task in the CLUES and GLUE evaluation benchmarks (Mukherjee et al., 2021; Wang et al., 2019).
The MNLI corpus consists of ~433k diverse sentence pairs; however, recent work has shown that MNLI-trained models are highly susceptible to adversarial attacks from crowdsource workers (Nie et al., 2020). For this task, the input to the classifier model is a premise sentence s_p and a hypothesis sentence s_h (e.g., "she walks in front of me"). The output is a label specifying whether the premise contradicts the hypothesis, entails the hypothesis, or is neutral (the hypothesis could be either true or false given the premise).

C Additional Implementation Details
For all explainability methods, we take the top 20% of tokens with the highest attribution scores. For LIME, we set the maximum number of features to 20 and generate 2,000 samples for training. All models are trained on a single Quadro RTX 8000 GPU. Training time averages 1 hour per epoch with a batch size of 32 for NLI data and 1 minute per epoch with a batch size of 16 for hate speech data. Inference time is approximately 10 minutes. We use the 124M parameter GPT-2 model for all generators. Generations are sampled after one training and evaluation run. We used a learning rate of 2e-4 for all models, determined by a hyperparameter search tuned on the dev loss with the DynaHate-trained model, with candidate learning rates in the range lr ∈ [.2, 2e-6]. For the hate speech domain we use top-k decoding with k = 10, while for natural language inference (NLI) we use beam search decoding with a beam size of 10 (we tested k and beam-size values from 3 to 10). We also use a maximum context length of 150 tokens for hate speech and 500 for NLI. For both domains, we use a learning rate of lr = .0004 and train until early stopping on the dev set.
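The two decoding configurations can be reproduced with Hugging Face transformers as in the hedged sketch below, using the stock 124M-parameter GPT-2 rather than our fine-tuned generator; the prompt reuses the assumed special-token format sketched earlier.

# Sketch: top-k sampling (hate speech) vs. beam search (NLI) decoding.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
prompt = tokenizer("<attr> sunny dark outside <label> entailment <text>",
                   return_tensors="pt")

# Hate speech domain: top-k sampling with k = 10, 150-token context.
sampled = model.generate(**prompt, do_sample=True, top_k=10, max_length=150,
                         pad_token_id=tokenizer.eos_token_id)

# NLI domain: beam search with beam size 10, 500-token context.
beamed = model.generate(**prompt, do_sample=False, num_beams=10,
                        max_length=500, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))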
We train the hate speech domain model using round 1 of DynaHate, and combine test sets from all rounds.We use round 3 for the NLI model and sample from all rounds for the test set.

C.1 Data Processing
For human evaluation, we remove non-ASCII characters and special tokens like "<|endoftext|>". We also noticed that the HotFlip baseline tends to generate extra spaces, so we removed unnecessary spaces. Newline characters were replaced with "<linebreak>". All data is in English.
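A minimal sketch of this cleanup is shown below; beyond the three steps listed above, the exact rules are assumptions.

# Sketch of the post-processing for human evaluation.
import re

def clean_for_eval(text: str) -> str:
    text = text.replace("<|endoftext|>", "")        # strip special tokens
    text = text.replace("\n", " <linebreak> ")      # mark line breaks
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII
    text = re.sub(r" {2,}", " ", text)              # extra HotFlip spaces
    return text.strip()

print(clean_for_eval("so long ,  and thanks\nfor all the fish<|endoftext|>"))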

C.2 Annotator Instructions
Hate speech. Read a text fragment (potentially generated by an AI system) and tell us whether it is potentially toxic/harmful to any reader. Note that there is no right or wrong answer, and we are interested in your opinion. We use a broad definition of harmful speech, including but not limited to use of inappropriate language, insults, threats, stereotyping, hate speech, microaggressions, derogatory speech, and obscene content.
Please look carefully at the text fragment and consider whether or not it is harmful.During each HIT you will be asked to answer several questions, listed and explained below.
1) Is the text fragment harmful/toxic? Our purpose is to understand how harm/toxicity can show up in language; we are not making statements about the actual content of the posts. Try to answer this question while considering a broad set of people from different backgrounds, not just your own.
2) Does the text fragment read naturally (Is it coherent, well-formed and likely human-written)?
NLI. Read a premise text fragment and a hypothesis text fragment (potentially generated by an AI system) and tell us what type of relationship holds between the two text fragments (entailment, contradiction, or neutral). We provide detailed definitions and examples for these three types of relationships below.
If you do not understand the text fragments, please indicate so using the given checkbox, but still answer the questions to the best of your abilities.

Entailment: The two text fragments have this type of relationship if the information in the hypothesis follows from the premise (and no information contradicts the premise), e.g., the hypothesis "the woman is playing basketball" is entailed by the premise "the woman and the man are playing basketball in the park."

Contradiction: The two text fragments have this type of relationship if at least some information in the hypothesis is contradicted by the premise, e.g., "the woman is playing soccer" is contradicted by the premise "the woman and the man are playing basketball in the park."

Neutral: The two text fragments have this type of relationship if the information in the hypothesis is neither entailed nor contradicted by the premise (the two text fragments may be completely unrelated), e.g., the relationship between "the woman and the man are playing basketball in the park" and "the woman loves basketball" is neutral.

Questions: 1) What is the relationship (entailment, contradiction, or neutral) between the two text fragments?
2) Do the premise and hypothesis text fragments read naturally (are they coherent, well-formed and likely human-written)?

D Generation Examples
We provide generated examples in Table 3.
