Achieving Model Robustness through Discrete Adversarial Training

Discrete adversarial attacks are symbolic perturbations to a language input that preserve the output label but lead to a prediction error. While such attacks have been extensively explored for the purpose of evaluating model robustness, their utility for improving robustness has been limited to offline augmentation only. Concretely, given a trained model, attacks are used to generate perturbed (adversarial) examples, and the model is re-trained exactly once. In this work, we address this gap and leverage discrete attacks for online augmentation, where adversarial examples are generated at every training step, adapting to the changing nature of the model. We propose (i) a new discrete attack, based on best-first search, and (ii) random sampling attacks that unlike prior work are not based on expensive search-based procedures. Surprisingly, we find that random sampling leads to impressive gains in robustness, outperforming the commonly-used offline augmentation, while leading to a speedup at training time of ~10x. Furthermore, online augmentation with search-based attacks justifies the higher training cost, significantly improving robustness on three datasets. Last, we show that our new attack substantially improves robustness compared to prior methods.


Introduction
Adversarial examples are inputs that are slightly, but intentionally, perturbed to create a new example that is misclassified by a model (Szegedy et al., 2014). Adversarial examples have attracted immense attention in machine learning (Goodfellow et al., 2015; Carlini and Wagner, 2017; Papernot et al., 2017) for two important, but separate, reasons. First, they are useful for evaluating model robustness, and have revealed that current models are over-sensitive to minor perturbations. Second, adversarial examples can improve robustness: training on adversarial examples reduces the brittleness and over-sensitivity of deep learning models to such perturbations (Alzantot et al., 2018; Jin et al., 2020; Li et al., 2020; Lei et al., 2019; Wallace et al., 2019; Garg and Ramakrishnan, 2020; Si et al., 2020a; Goel et al., 2021).

Figure 1: Robust accuracy vs. slowdown in training time, comparing different methods to BASELINE (purple pentagon); x-axis in logarithmic scale. The popular ADVOFF (blue squares, offline augmentation with adversarial examples) is 10x slower than our simple augmentation with 4 (8) random samples (triangles, RANDOFF-4, RANDOFF-8) and achieves similar or worse robust accuracy. Our online augmentation with adversarial examples (ADVON, yellow circles) significantly improves robust accuracy, but is expensive to train.
Training and evaluating models with adversarial examples has had considerable success in computer vision, with gradient-based techniques like FGSM (Goodfellow et al., 2015) and PGD (Madry et al., 2018). In computer vision, adversarial examples can be constructed by considering a continuous space of imperceptible perturbations around image pixels. In contrast, language is discrete, and any perturbation is perceptible. Thus, robust models must be invariant to input modifications that preserve semantics, such as synonym substitutions (Alzantot et al., 2018; Jin et al., 2020), paraphrasing (Tan et al., 2020), or typos (Huang et al., 2019).
Due to this property of language, ample work has been dedicated to developing discrete attacks that generate adversarial examples through combinatorial optimization (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020; Zhou et al., 2020; Zang et al., 2020). For example, in sentiment analysis, it is common to consider the space of all synonym substitutions, where an adversarial example for the input "Such an amazing movie!" might be "Such an extraordinary film" (Fig. 2). This body of work has mostly focused on evaluating robustness, rather than improving it, which naturally led to the development of complex combinatorial search algorithms, whose goal is to find adversarial examples in the exponential space of perturbations.
In this work, we address a major research gap in the current literature around improving robustness with discrete attacks. Specifically, past work (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020) only considered offline augmentation, where a discrete attack is used to generate adversarial examples and the model is re-trained exactly once with those examples. This ignores online augmentation, which had success in computer vision (Kurakin et al., 2017; Perez and Wang, 2017; Madry et al., 2018), where adversarial examples are generated in each training step, adapting to the changing model. Moreover, simple data augmentation techniques, such as randomly sampling from the space of synonym substitutions and adding the generated samples to the training data, have not been investigated and compared to offline adversarial augmentation. We address this lacuna and systematically compare online augmentation to offline augmentation, as well as to simple random sampling techniques. To our knowledge, we are the first to evaluate online augmentation with discrete attacks on a wide range of NLP tasks. Our results show that online augmentation leads to significant improvements in robustness compared to prior work, and that simple random augmentation achieves comparable results to the common offline augmentation at a fraction of the complexity and training time.
Moreover, we present a new search algorithm for finding adversarial examples, Best-First search over a Factorized graph (BFF), which alleviates the greedy nature of previously-proposed algorithms. BFF improves search by incorporating backtracking, allowing it to re-visit previously-discarded search paths once the current one is revealed to be sub-optimal. We evaluate model robustness on three datasets: BoolQ (Clark et al., 2019), IMDB (Maas et al., 2011), and SST-2 (Socher et al., 2013), which vary in terms of the target task (question answering and sentiment analysis) and input length. Surprisingly, we find across different tasks (Fig. 1) that augmenting each training example with 4-8 random samples from the synonym substitution space performs as well as (or better than) the commonly used offline augmentation, while being simpler and 10x faster to train. Conversely, online augmentation makes better use of the extra computational cost, and substantially improves robust accuracy compared to offline augmentation. Additionally, our proposed discrete attack algorithm, BFF, outperforms prior work by a wide margin. Our data and code are available at https://github.com/Mivg/robust_transformers.

Problem Setup and Background
Problem setup We focus in this work on the supervised classification setup, where given a training set {(x_j, y_j)}_{j=1}^N sampled from X × Y, our goal is to learn a mapping A : X → Y that achieves high accuracy on held-out data sampled from the same distribution. Moreover, we want the model A to be robust, i.e., invariant to a set of pre-defined label-preserving perturbations to x, such as synonym substitutions. Formally, for any natural language input x, a discrete attack space of label-preserving perturbations S(x) ⊂ X is defined. Given a labeled example (x, y), a model A is robust w.r.t. x if A(x) = y and, for any x̃ ∈ S(x), the output A(x̃) = A(x). An example x̃ ∈ S(x) such that A(x̃) ≠ A(x) is called an adversarial example. We assume A provides not only a prediction but a distribution p_A(x) ∈ Δ^{|Y|} over the possible classes, where Δ is the simplex, and denote the probability A assigns to the gold label by [p_A(x)]_y. Fig. 2 shows an example from sentiment analysis. Robustness is evaluated with robust accuracy (Tsipras et al., 2019), i.e., the fraction of examples in some held-out data that a model is robust to. Typically, the size of the attack space S(x) is exponential in the size of x, and it is not feasible to enumerate all perturbations. Instead, an upper bound is estimated by searching for a set of adversarial attacks, i.e., "hard" examples in S(x) for every x, and estimating robust accuracy w.r.t. that set.
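As a minimal sketch, robust accuracy over a set of found attacks can be computed as follows; `model` and the per-example attack sets are placeholders for a trained classifier and the output of some search procedure, not the paper's actual implementation:

```python
from typing import Callable, Iterable, List, Tuple

def robust_accuracy(model: Callable[[str], int],
                    examples: List[Tuple[str, int]],
                    attack_sets: List[Iterable[str]]) -> float:
    """Fraction of examples the model is robust to: the prediction on the
    original input is correct AND unchanged on every found attack."""
    robust = 0
    for (x, y), attacks in zip(examples, attack_sets):
        if model(x) != y:
            continue  # already misclassified, hence not robust
        if all(model(x_adv) == y for x_adv in attacks):
            robust += 1
    return robust / len(examples)
```

Because the attack sets are only the attacks a search procedure managed to find, this estimate is an upper bound on true robust accuracy.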
Improving robustness with discrete attacks Since language is discrete, a typical approach for evaluating robustness is to use combinatorial optimization methods to search for adversarial examples in the attack space S(x). This has been repeatedly shown to be an effective attack method on pre-trained models (Alzantot et al., 2018; Lei et al., 2019; Ren et al., 2019; Li et al., 2020; Jin et al., 2020; Zang et al., 2020). However, in terms of improving robustness, discrete attacks have thus far been mostly used with offline augmentation (defined below) and have led to limited robustness gains. In this work, we examine the more costly but potentially more beneficial online augmentation.
Offline vs. online augmentation Data augmentation is a common approach for improving generalization and robustness, where variants of training examples are automatically generated and added to the training data (Simard et al., 1998). Here, discrete attacks can be used to generate these examples. We consider both offline and online data augmentation and focus on improving robustness with adversarial examples.
Given a training set {(x_j, y_j)}_{j=1}^N, offline data augmentation involves (a) training a model A over the training data, (b) for each training example (x_j, y_j), generating a perturbation w.r.t. A (using some discrete attack) and labeling it with y_j, and (c) training a new model over the union of the original training set and the generated examples. This is termed offline augmentation because examples are generated with respect to a fixed model A.
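The three offline steps can be sketched as follows; `train_fn` and `attack_fn` are hypothetical stand-ins for the actual training and attack procedures:

```python
def offline_augmentation(train_fn, attack_fn, data):
    """Offline augmentation: (a) train, (b) attack each example once
    w.r.t. the fixed model, (c) retrain on the union."""
    model = train_fn(data)                  # (a) train on the original data
    augmented = list(data)
    for x, y in data:                       # (b) perturb w.r.t. the fixed model
        x_adv = attack_fn(model, x, y)
        if x_adv is not None:
            augmented.append((x_adv, y))    # perturbation is label-preserving
    return train_fn(augmented)              # (c) retrain exactly once
```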
In online data augmentation, examples are generated at training time w.r.t. the current model A. This is more computationally expensive, as examples must be generated during training rather than as pre-processing, but the examples can adapt to the model over time. In each step, half the batch contains examples from the training set, and half are adversarial examples generated by some discrete attack w.r.t. the model's current state.
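A minimal sketch of constructing one such half-and-half batch, assuming a hypothetical `attack_fn` that attacks the model's current state:

```python
import random

def online_batch(model, train_data, attack_fn, batch_size=8):
    """Build one training batch: half original examples, half adversarial
    examples generated against the model's *current* state."""
    half = batch_size // 2
    originals = random.sample(train_data, half)
    adversarial = []
    for x, y in originals:
        x_adv = attack_fn(model, x, y)      # attack adapts to the current model
        adversarial.append((x_adv if x_adv is not None else x, y))
    return originals + adversarial
```

In an actual training loop this function would be called at every step, which is what makes online augmentation expensive relative to the one-time offline pass.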
Online augmentation has been used to improve robustness in NLP with gradient-based approaches (Jia et al., 2019; Shi et al., 2020; Zhou et al., 2020), but to the best of our knowledge it has been overlooked in the context of discrete attacks. In this work, we are the first to propose model-agnostic online augmentation training, which uses automatically generated discrete adversarial attacks to boost the overall robustness of NLP models.

The Attack Space
An attack space for an input with respect to a classification task can be intuitively defined as the set of label-preserving perturbations over the input. A popular attack space S(x), which we adopt, is the space of synonym substitutions (Alzantot et al., 2018; Ren et al., 2019). Given a synonym dictionary that provides a set of synonyms Syn(w) for any word w, the attack space S_syn(x) for an utterance x = (w_1, . . . , w_n) contains all utterances that can be obtained by replacing a word w_i (and possibly multiple words) with one of its synonyms. Typically, the number of words from x allowed to be substituted is limited to no more than D = d · |x|, where d ∈ {0.1, 0.2} is a common choice.
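A minimal sketch of drawing one utterance from S_syn(x) under the budget D; the synonym dictionary here is an illustrative toy, not the one used in the paper:

```python
import random

def sample_perturbation(words, synonyms, d=0.1, rng=random):
    """Sample one utterance from the synonym-substitution attack space:
    substitute at most D = int(d * |x|) words, each with one of its synonyms."""
    max_subs = max(1, int(d * len(words)))
    # positions that actually have synonyms available
    candidates = [i for i, w in enumerate(words) if synonyms.get(w)]
    positions = rng.sample(candidates, min(max_subs, len(candidates)))
    out = list(words)
    for i in positions:
        out[i] = rng.choice(synonyms[out[i]])
    return out
```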
Synonym substitutions are context-sensitive, i.e., a substitution might only be appropriate in certain contexts. For example, in Fig. 3, replacing the word "like" with its synonym "similar" (red box) is invalid, since "like" is a verb in this context. Consequently, past work (Ren et al., 2019; Jin et al., 2020) filtered S_syn(x) using a context-sensitive filtering function Φ_x(w_i, w̃_i) ∈ {0, 1}, which determines whether substituting a word w_i from the original utterance x with its synonym w̃_i is valid in a particular context. For instance, an external model can check whether the substitution maintains the part of speech, and whether the overall semantics is maintained. We define the filtered synonym substitution space S_Φ(x) as the set of all utterances x̃ that can be generated through a sequence of no more than D single-word substitutions from the original utterance that are valid according to Φ(·, ·). In §5.2, we describe the details of the synonym dictionary and the function Φ.

Best-first Search Over a Factorized Graph
Searching over the attack space S_Φ(x) can be naturally viewed as a search problem over a directed acyclic graph (DAG), G = (U, E), where each node u_x̃ ∈ U is labeled by an utterance x̃, and edges E correspond to single-word substitutions that are valid according to Φ(·). The graph is directed and acyclic, since only substitutions of words from the original utterance x are allowed (see Fig. 3). Because there is a one-to-one mapping from the node u_x̃ to the utterance x̃, we will use the latter to denote both the node and the utterance. Discrete attacks use search algorithms to find an adversarial example in S(x). The search is guided by a heuristic scoring function s_A(x̃) := [p_A(x̃)]_y, where the underlying assumption is that utterances that give lower probability to the gold label are closer to an adversarial example. A popular choice of search algorithm in NLP is greedy search, illustrated in Fig. 3. Specifically, at step t one holds the current node x̃_t, where t words have been substituted relative to the source node x̃_0 = x. Then, the model A(·) is run on the frontier, that is, all out-neighbor nodes N(x̃_t) = {x̃_{t+1} | (x̃_t, x̃_{t+1}) ∈ E}, and the node that minimizes the heuristic scoring function is selected: x̃_{t+1} = argmin_{x̃ ∈ N(x̃_t)} s_A(x̃). While greedy search has been used for character flipping (Ebrahimi et al., 2018), it is ill-suited to the space of synonym substitutions. The degree of nodes is high: assuming n_rep words can be replaced in the text, each with K possible synonyms, the out-degree is O(n_rep · K). This results in an infeasible number of forward passes through the attacked model, even for a small number of search iterations.
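For reference, the greedy baseline can be sketched as follows; `score` plays the role of s_A, and the stopping test (gold-label probability below 0.5) is a simplification of checking whether the gold label is still the argmax:

```python
def greedy_attack(score, neighbors, x0, max_steps, threshold=0.5):
    """Greedy descent on s_A (the gold-label probability): at each step, move
    to the frontier node with the lowest score; stop once the score drops
    below the threshold, i.e., an adversarial example is found."""
    x = x0
    for _ in range(max_steps):
        frontier = neighbors(x)
        if not frontier:
            break
        x = min(frontier, key=score)  # x_{t+1} = argmin over N(x_t) of s_A
        if score(x) < threshold:
            return x                  # adversarial example found
    return None
```

Note that every step evaluates the full frontier, which is exactly the O(n_rep · K) cost the factorization below is designed to avoid.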
To enable effective search through the search space, we (a) factorize the graph such that the outdegree of nodes is lower, and (b) use a best-first search algorithm. We describe those next.
Graph factorization To reduce the out-degree of a node in the search space and thus improve its efficiency, we can split each step into two: first, choose a position to substitute in the utterance; second, choose a substitution for that position. This reduces the number of evaluations of A per step from O(n_rep · K) to O(n_rep + K). To estimate the score of a position i, one can mask the word w_i with a mask token τ and measure s_A(x_{w_i→τ}), where x_{w_i→τ} is the utterance x with the word in position i replaced by the mask τ.
We can describe this approach as search over a bipartite DAG Ĝ = (U ∪ W, Ê). The nodes U are utterances as in G, and the new nodes are utterances with a single mask token, W = {x̃_{w_i→τ} | x̃ ∈ S(x) ∧ w_i is a word in x̃}. The edges comprise two types: Ê = E_1 ∪ E_2. The edges E_1 go from utterances to their masked variants, and the edges E_2 go from a masked utterance to every utterance obtained by replacing the mask with a synonym of the masked word. In Figure 3, the two rightmost nodes in each row would be factorized together, as they substitute the same word, and the algorithm will evaluate only one of them to estimate the potential benefit of substituting "movie".
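One factorized step can be sketched as follows; the mask token and toy scores are illustrative, and the real method runs the attacked model on the masked utterances to score positions:

```python
MASK = "[MASK]"

def factorized_step(score, words, synonyms):
    """One step over the factorized graph: first score positions by masking
    them, then score only the synonyms of the best position. This costs
    O(n_rep + K) model calls instead of O(n_rep * K)."""
    def masked(i):
        return words[:i] + [MASK] + words[i + 1:]
    # score each replaceable position by masking it (edges E_1)
    candidates = [i for i, w in enumerate(words) if synonyms.get(w)]
    best_pos = min(candidates, key=lambda i: score(masked(i)))
    # score only that position's synonyms (edges E_2)
    best_syn = min(
        synonyms[words[best_pos]],
        key=lambda s: score(words[:best_pos] + [s] + words[best_pos + 1:]))
    return words[:best_pos] + [best_syn] + words[best_pos + 1:]
```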
Best-first search A factorized graph makes search feasible by reducing the out-degree of nodes. However, greedy search is still sub-optimal, because it relies on the heuristic scoring function being a good estimate of the distance to an adversarial example, which is often not the case. Consider the example in Fig. 3: the two adversarial examples (with p = 0.4 and p = 0.45) are not reachable from the best node after the first step (p = 0.6), only from the second-best (p = 0.65).
Best-first search (Pearl, 1984) overcomes this at a negligible cost, by holding a min-heap over the nodes of the frontier of the search space (Alg. 1). In each step, we pop the next utterance, the one that assigns the lowest probability to the gold label, and push all its neighbors into the heap. When a promising branch turns out to be sub-optimal, search can resume from an earlier node to find a better solution, as shown in the blue path in Figure 3. To bound the cost of finding a single adversarial example, we bound the number of forward passes through the model A with a budget parameter B. To further reduce greediness, search can use a beam by popping more than one node in each step, expanding all their neighbors, and pushing the results back into the heap. Our final approach uses Best-First search over a Factorized graph, and is termed BFF.
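A minimal sketch of the best-first loop with a min-heap and a budget B; nodes are strings here, and the real implementation additionally supports beams and runs over the factorized graph:

```python
import heapq

def bff_attack(score, neighbors, x0, budget, threshold=0.5):
    """Best-first search: a min-heap over the frontier, keyed by the
    gold-label probability, lets search backtrack to earlier second-best
    nodes when a branch turns out to be sub-optimal. `budget` bounds the
    number of model forward passes (score calls)."""
    heap = [(score(x0), x0)]
    visited = {x0}
    calls = 1
    while heap and calls < budget:
        s, x = heapq.heappop(heap)
        if s < threshold:
            return x                       # adversarial example found
        for nxt in neighbors(x):
            if nxt not in visited:
                visited.add(nxt)
                calls += 1
                heapq.heappush(heap, (score(nxt), nxt))
    return None
```

On a graph where the greedy branch dead-ends (as in Fig. 3), the heap naturally resumes from the second-best earlier node, which is exactly the backtracking behavior BFF relies on.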

Experiments
We conduct a thorough empirical evaluation of model robustness across a wide range of attacks and training procedures.

Experimental Setup
To evaluate our approach over diverse settings, we consider three different tasks (text classification, sentiment analysis, and question answering), two of which contain long passages that result in a large attack space (see Table 1).
1. SST-2: Based on the Stanford sentiment treebank (Socher et al., 2013), SST-2 is a binary (positive/negative) classification task containing 11,855 sentences describing movie reviews. SST-2 has been frequently used for evaluating robustness.
2. IMDB (Maas et al., 2011): A binary (positive/negative) text classification task, containing 50K reviews from IMDB. Here, passages are long and thus the attack space is large (Table 1).
3. BoolQ (Clark et al., 2019): contains 16,000 yes/no questions over Wikipedia paragraphs. This task is perhaps the most interesting, because the attack space is large and answering requires global passage understanding. We allow word substitutions in the paragraph only, and do not substitute nouns, verbs, or adjectives that appear in the question, to avoid non-label-preserving perturbations. Further details can be found in App. A.2.

Models We consider a wide array of models and evaluate both their downstream accuracy and robustness. In all models, we define a budget of B = 1000, which specifies the maximal number of allowed forward passes through the model for finding an adversarial example. All results are an average of 3 runs.
To demonstrate the effectiveness of BFF for both robustness evaluation and adversarial training, we compare it to a recent state-of-the-art discrete attack, TEXTFOOLER (Jin et al., 2020), which we denote in model names below by the prefix TXF. The models compared are listed below; these baselines are on par with the current state of the art, demonstrating the efficacy of our method.
• RANDOFF-L: We compare search-based algorithms to a simple and efficient approach that does not require any forward passes through the model A. Specifically, we randomly sample L utterances from the attack space for each example (without executing A) and add them to the training data.
• RANDON: A random sampling approach that does use the model A. Here, we sample B random utterances, pass them through A, and return the attack that resulted in the lowest model probability for the gold label.
• FREELB: For completeness, we also consider FREELB (Zhu et al., 2020), a popular gradient-based approach for improving robustness, which employs virtual adversarial training (see §6). This approach uses online augmentation, where examples are created by taking gradient steps w.r.t. the input embeddings to maximize the model's loss.
Other gradient-based approaches (e.g., certified robustness) are not suitable when using pre-trained transformers, which we further discuss in §6.
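The RANDON baseline described above is simple enough to sketch in a few lines; `gold_prob` and `sample_fn` are placeholders for the model's gold-label probability and a sampler over the attack space:

```python
import random

def rand_on_attack(gold_prob, sample_fn, x, budget, rng=random):
    """RANDON: draw `budget` random utterances from the attack space, run the
    model on each, and return the one with the lowest gold-label probability."""
    best, best_score = x, gold_prob(x)
    for _ in range(budget):
        cand = sample_fn(x, rng)
        s = gold_prob(cand)
        if s < best_score:
            best, best_score = cand, s
    return best
```

Unlike search-based attacks, this ignores the graph structure entirely, which is why it is trivial to implement but spends the full budget B on every example.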
In a parallel line of work, Garg and Ramakrishnan (2020) and Li et al. (2020) used pre-trained language models both to define an attack space and to generate high-fidelity attacks in that space. While successful, these approaches are not suitable for our setting, due to the strong coupling between the attack strategy and the attack space itself. We further discuss this in §6.

Evaluation We evaluate models on their downstream accuracy, as well as on robust accuracy, i.e., the fraction of examples against which the model is robust. Since exact robust accuracy is intractable to compute due to the exponential size of the attack space, we compute an upper bound by attacking each example with both BFF and TEXTFOOLER (TXF) with a budget of B = 2000. An example is robust if we cannot find an utterance for which the prediction differs from the gold label. We evaluate robust accuracy on 1000/1000/872 samples from the development sets of BoolQ/IMDB/SST-2.

Attack Space
Despite the myriad works on discrete attacks, an attack space for synonym substitutions has not been standardized. While all past work employed a synonym dictionary combined with a filtering function Φ(·, ·) (see §3), the particular filtering functions vary. When examining the attack space proposed in TXF, we observed that attacks result in examples that are difficult to understand or are not label-preserving. Table 6 in App. A.4 shows several examples. For instance, in sentiment classification, the attack replaced "compelling" with "unconvincing" in the sentence "it proves quite unconvincing as an intense, brooding character study", which alters the meaning and the sentiment of the sentence. Therefore, we use a stricter definition of the filtering function and conduct a user study to verify it is label-preserving.
Concretely, we use the synonym dictionary from Alzantot et al. (2018). We determine whether a word substitution is context-appropriate by computing all single-word substitutions (n_rep · K of them) and disallowing those that change the POS tag according to spaCy (Honnibal et al.) or increase perplexity according to GPT-2 (Radford et al., 2019) by more than 25%. Similar to Jin et al. (2020), we also filter out synonyms that are not semantics-preserving according to the USE (Cer et al., 2018) model. The attack space includes any combination of allowed single-word substitutions, where the fraction of allowed substitutions is d = 0.1. Implementation details are in App. A.2. We find that this ensemble of models reduces the number of allowed substitutions that do not preserve semantics.
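The filtering ensemble can be sketched with the three checks injected as callables; in the actual pipeline these would be spaCy (POS), GPT-2 (perplexity), and USE (similarity), and the similarity threshold below is an illustrative value, not taken from the paper:

```python
def make_filter(pos_tag, perplexity, similarity,
                ppl_ratio=1.25, sim_threshold=0.8):
    """Build the context-sensitive filter Phi. `pos_tag(words, i)` tags the
    word at position i in context, `perplexity(words)` scores the utterance,
    and `similarity(a, b)` measures semantic similarity of two utterances."""
    def phi(words, i, w_sub):
        new = words[:i] + [w_sub] + words[i + 1:]
        if pos_tag(words, i) != pos_tag(new, i):
            return False                  # POS changed in this context
        if perplexity(new) > ppl_ratio * perplexity(words):
            return False                  # more than 25% perplexity increase
        if similarity(words, new) < sim_threshold:
            return False                  # semantics not preserved
        return True
    return phi
```

Stacking the checks this way means a substitution must pass every model in the ensemble, which is what makes the resulting attack space more conservative than prior work.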
We check the validity of our more restrictive attack space with a user study, where we verify that our attack space is indeed label-preserving. The details of the user study are in §5.6.

Comparing different attacks, online augmentation (BFFON), which has been overlooked in the context of discrete attacks, leads to dramatic robustness gains compared to other methods, but is slow to train: 20-270x slower than BASELINE. This shows the importance of continuous adaptation to the current vulnerabilities of the model.

Robustness Results
Interestingly, adding offline random samples (RANDOFF-L) consistently improves robust accuracy, and using L = 12 leads to impressive robustness gains without executing A at all, outperforming BFFOFF in robust accuracy while being ~5x faster on IMDB and BoolQ. Moreover, random sampling is trivial to implement and independent of the attack strategy. Hence, the common practice of using offline augmentation with search-based attacks, such as BFFOFF, seems misguided, and a better solution is to use random sampling. Online random augmentation obtains impressive results, not far from BFFON, without applying any search procedure, but is very slow, since it uses the entire budget B on every example.
Comparing BFF to TXF, we observe that BFF, which uses best-first search, outperforms TXF in both the online and offline settings. Last, FREELB, which is based on virtual adversarial training, improves robust accuracy at a low computational cost, but is dramatically outperformed by discrete search-based attacks, including BFF. To summarize, random sampling leads to significant robustness gains at a small cost, outperforming the commonly used offline augmentation. Online augmentation leads to the best robustness, but is more expensive to train.

Robustness across Attack Strategies
A natural question is whether a model trained for robustness with one attack (e.g., BFF) is robust w.r.t. examples generated by other attacks, which are potentially uncorrelated with it. To answer this, we evaluate the robustness of our models to attacks generated by BFF, TXF, and random sampling. Moreover, we evaluate robustness to a genetic attack, which should not be correlated with BFF and TXF: we re-implement the genetic attack algorithm from Alzantot et al. (2018) (details in App. A.3) and examine the robustness of our models to this attack. All attacks use a budget of B = 2000. Table 3 shows the results of this evaluation. We observe that BFFON obtains the highest robust accuracy w.r.t. all attacks: BFF, TXF, random sampling, and the genetic attack. In offline augmentation, we observe again that BFFOFF obtains good robust accuracy, higher than or comparable to all other offline models for any attack strategy. This result highlights the generality of BFF for improving model robustness.

Success Rate Results
To compare the different attacks proposed in §4, we analyze the success rate against BASELINE, i.e., the proportion of examples for which an attack finds an adversarial example, as a function of the budget B. Fig. 4 compares the success rate of different attacks. We observe that BFF-based attacks have the highest success rate after a few hundred executions. TEXTFOOLER performs well at first, finding adversarial examples for many inputs, but then its success plateaus. Similarly, a random approach, which ignores the graph structure, starts with a relatively high success rate, as it explores far regions of the graph, but fails to properly utilize its budget and then falls behind.
BFF combines backtracking with graph factorization. When removing backtracking, i.e., greedy search over the factorized graph, success rate decreases, especially in BoolQ. Greedy search without graph factorization leads to a low success rate due to the large number of neighbors of each node, which quickly exhausts the budget. Moreover, looking at BFF with beam size 2 (popping 2 items from the heap in each step) leads to lower performance when the budget B ≤ 2000, as executions are expended on less promising utterances, but could improve success rate given a larger budget.
Lastly, due to our stricter definition of the attack space, described in §5.2, the success rates of BFF and TXF are lower compared to Jin et al. (2020). To verify the correctness of our attacks, we run BFF and TXF in their attack space, which uses a larger synonym dictionary and a more permissive function Φ, and does not limit the number of substitutions D or the budget B. We obtain a similar success rate, close to 100%. Nevertheless, we argue that our attack space, validated by users to be label-preserving, is preferable, and leave standardization of attack spaces through a broad user study to future work.

User Study
Since a model is considered non-robust even if a single adversarial sample flips the output label, the validity of adversarial examples in the attack space is crucial. When we examined attacks generated based on prior work, we found many label-flipping attacks. This was especially noticeable when using BFF attacks on tasks not evaluated in prior work (see examples in App. A.4). In this work, our focus is on evaluating different methods for increasing model robustness, and thus over-constraining the attack space to guarantee its validity is acceptable. We stress that our attack search space is more conservative than prior work, and is a strict subset of prior attack spaces (see App. A.2), leading to higher validity of adversarial examples. We evaluate the validity of our attack space and the generated adversarial samples with a user study. We sample 100/100/50 examples from SST-2/BoolQ/IMDB, respectively, and for each example create two adversarial examples: (a) by random sampling, and (b) using a BFF attack. We ask 25 NLP graduate students to annotate both the original example and the two adversarial ones. Each example is annotated by two annotators, and each annotator sees only one version of an example. If human performance on random and adversarial examples is similar to the original task, this indicates the attack space is label-preserving. Table 4 shows the results. Human performance on random examples is similar to that on the original utterances. Human performance on examples generated with BFF is only mildly lower than on the original utterances, overall confirming that the attack space is label-preserving.
Ideally, the validity of adversarial examples should be as high as that of the original examples. However, a small degradation on random vs. original examples is expected, since the search space is not perfect, and similarly for BFF, since it is targeted at finding adversarial examples. Nevertheless, the observed drops are small, showing the advantage in validity compared to prior work. The minor irregularity in BoolQ between random and original examples is indicative of noise in the dataset.

Related Work
Adversarial attacks and robustness have attracted tremendous attention. We discuss work beyond improving robustness through adversarial attacks.
Certified Robustness is a class of methods that provide a mathematical certificate for robustness (Dvijotham et al., 2018;Gowal et al., 2018;Jia et al., 2019;Huang et al., 2019;Shi et al., 2020). The model is trained to minimize an upper bound on the loss of the worst-case attack. When this upper bound is low, we get a certificate for robustness against all attacks. While this approach has had success, it struggles when applied to transformers, since upper bounds are propagated through many layers, and become too loose to be practical.

Gradient-based methods
In a white-box setting, adversarial examples can be generated by performing gradient ascent with respect to the input representation. Gradient-based methods (Goodfellow et al., 2015; Madry et al., 2018) have been empirically successful (Gowal et al., 2018; Ebrahimi et al., 2018), but suffer from a few shortcomings: (a) they assume access to gradients, (b) they lose their effectiveness when combined with sub-word tokenization, since one cannot substitute words that have a different number of sub-words, and (c) they can generate noisy examples that do not preserve the output label. In parallel to our work, Guo et al. (2021) proposed a gradient-based approach that finds a distribution over the attack space at the token level, resulting in an efficient attack.
Virtual adversarial training In this approach, one does not generate explicit adversarial examples (Zhu et al., 2020; Jiang et al., 2020; Li and Qiu, 2020; Pereira et al., 2021). Instead, embeddings in an ε-sphere around the input (that do not correspond to words) are sampled, and continuous optimization approaches are used to train for robustness. These works were shown to improve downstream accuracy, but did not result in better robust accuracy. Recently, Zhou et al. (2020) proposed a method that does improve robustness, but like other gradient-based methods, it is white-box, does not work well with transformers over sub-words, and leads to noisy samples. A similar approach was taken by Si et al. (2020b), who generate virtual attacks during training by interpolating offline-generated attacks.
Defense layers This approach involves adding normalization layers to the input before propagating it to the model, so that different input variations are mapped to the same representation (Mozes et al., 2020; Jones et al., 2020). While successful, this approach requires manual engineering and reduces model expressivity, as the input space is significantly reduced. A similar approach (Zhou et al., 2019) has been to identify adversarial inputs and predict the original unperturbed input.
Pretrained language models as attacks In this work, we decouple the definition of the attack space from the attack strategy itself, which is cast as a search algorithm. This allows us to systematically compare different attack strategies and methods for improving robustness in the same setting. An orthogonal approach was proposed by Garg and Ramakrishnan (2020) and Li et al. (2020), who used the fact that BERT was trained with the masked language modeling objective to predict possible semantics-preserving adversarial perturbations of the input tokens, thereby coupling the definition of the attack space with the attack strategy. While this approach showed great promise in efficiently generating valid adversarial examples, it does not permit any external constraint on the attack space and thus is not comparable to the attacks in this work. Future work can test whether robustness transfers across attack spaces and attack strategies by either (a) evaluating the robustness of models trained in this work against the aforementioned attacks (in their attack space), or (b) combining such attacks with online augmentation to train robust models and comparing to the attacks proposed in our work.

Conclusions
We examine achieving robustness through discrete adversarial attacks. We find that the popular approach of offline augmentation is sub-optimal in both speed and accuracy compared to random sampling, and that online augmentation leads to impressive gains. Furthermore, we propose BFF, a new discrete attack based on best-first search, and show that it outperforms past work both in terms of robustness improvement and in terms of attack success rate.
Together, our contributions highlight the key factors for success in achieving robustness through adversarial attacks, and open the door to future work on better and more efficient methods for achieving robustness in natural language understanding.

A.1 Experimental Details

For the results reported in Table 2, we ran three experiments with different random initializations and report the mean results. The respective standard deviations are given in Table 5. To finetune the models using the FreeLB (Zhu et al., 2020) method, we adapted the implementation from https://github.com/zhuchen03/FreeLB and used the following parameters for SST-2: init-magnitude = 0.6, adversarial-steps = 2, adversarial-learning-rate = 0.1, and the l2 norm with no limit on the norm.
BFF implementation For the factorization phase of BFF, we sample τ ∼ Syn(w) uniformly. We find that while using an out-of-vocabulary masking token is useful for computing word salience, it is less suitable here, as we are interested in the model's over-sensitivity to perturbations in the exact phrasing of the word. Also, in contrast to TXF, which is optimistic and factorizes the attack space only once, BFF factorizes the space after every step. Namely, optimistic greedy search plans the entire search path by evaluating all permissible single-word substitutions. Let x_{w_i→w} denote the utterance x where the word w_i is replaced with a synonym w ∈ Syn(w_i). The optimistic greedy algorithm scores each word w_i in the utterance with s(w_i) := min_{w ∈ Syn(w_i)} s_A(x_{w_i→w}); that is, the score of a word is the score of its best substitution, and the algorithm also stores this substitution. It then sorts utterance positions by s(w_i) in ascending order, which defines the entire search path: at each step, the algorithm moves to the next position in the sorted list and applies the best substitution stored for that position. Fig. 5 shows the benefit of each of these modifications.
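The optimistic planning step can be sketched as follows. This is a simplified Python illustration with assumed `syn` and `score` callables (lower scores are better for the attacker); it is not the released implementation.

```python
def plan_optimistic_path(x, syn, score):
    """Score every permissible single-word substitution once, keep the best
    (lowest-scoring) substitution per position, then sort positions by that
    score to fix the entire search path up front."""
    best = {}  # position -> (best score, best substitution)
    for i, w in enumerate(x):
        candidates = syn(w)
        if not candidates:
            continue
        best[i] = min((score(x[:i] + [c] + x[i + 1:]), c) for c in candidates)
    # Ascending order: positions whose best substitution hurts the model most come first.
    return sorted(best.items(), key=lambda item: item[1][0])
```

BFF, by contrast, would re-run this scoring after every committed substitution, paying more model queries for a less stale view of the search space.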
Budget Effect Intuitively, higher budgets better approximate an exhaustive search, and thus the robustness evaluation, as an upper bound, should approach its true value. However, due to the lack of backtracking in some of the attack strategies, they may plateau early on. In this work, we used B = 1000 for all training phases and B = 2000 for the robustness evaluation. Empirically, this gives a good estimate of the upper bound on the model's robust accuracy, while constraining the computational power needed for the experiments. A natural question is how much tighter the bounds may be with a larger budget. Fig. 6 depicts an evaluation of the strategies' success rates over the same models as in Fig. 4 with a larger budget. As can be seen, while the RANDOM attack and TXF plateau, the BFF variants as well as GENATTACK are able to exploit the larger budget to fool the model in more cases. This is especially true on IMDB, where the search space is considerably larger. We expect this trend of tighter bounds to continue with ever larger budgets, though we note that the rate of improvement decreases with budget and that the ranking between strategies remains unchanged. We therefore conclude that a budget of 2,000 suffices for comparing strategies and measuring robustness improvements.

A.2 Attack Space Implementation Details
As described in §5.2, we use the synonym dictionary defined by Alzantot et al. (2018). In particular, we use the pre-computed set of synonyms given by Jia et al. (2019) as the basis for Syn(w). We pre-process the entire development and training data and store, for each utterance, the set Syn_Φ(w), avoiding the need to run large language models during training and robustness evaluation. For every word w_i ∈ x in an utterance, and for every w̄_i ∈ Syn(w_i), we evaluate Φ(w_i, w̄_i) as follows:
1. With the same sequences as above, we validate that POS(w_i) ≡ POS(w̄_i) according to spaCy's (Honnibal et al.) POS tagger.
2. With a window of size 101, we validate that PPL(x)/PPL(x̄) ≥ 0.8, where PPL(·) is the perplexity of the sequence as given by a pre-trained GPT-2 model (Radford et al., 2019).
3. For BoolQ only, we also use spaCy's POS tagger to tag all content words (namely NOUN, PROPN, ADV, and ADJ) in the question, and restrict these words from being perturbed in the passage.
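The filter Φ can be sketched as below. Here `pos_tag` and `ppl` are stand-ins for spaCy's tagger and the GPT-2 perplexity scorer; the function signatures are our own assumption, not the paper's code.

```python
def make_phi(pos_tag, ppl, min_ratio=0.8):
    """Phi(w_i, w_bar_i): accept a substitution only if (1) the POS tag at
    position i is unchanged and (2) the perplexity of the perturbed sentence
    does not degrade too much, i.e. PPL(x) / PPL(x_bar) >= min_ratio."""
    def phi(x, i, w_bar):
        x_bar = x[:i] + [w_bar] + x[i + 1:]
        if pos_tag(x, i) != pos_tag(x_bar, i):
            return False
        return ppl(x) / ppl(x_bar) >= min_ratio
    return phi
```

Because Φ depends only on the input (not on the attacked model), its results can be pre-computed once per utterance, as described above.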

A.3 Genetic Attack Implementation Details
Our implementation of Gen-Attack, presented by Alzantot et al. (2018), was based on https://github.com/nesl/nlp_adversarial_examples/blob/master/attacks.py, but used our attack space rather than the original attack space presented there. For evaluation, we used the distribution hyperparameters as defined in the paper, namely population size p := 20, maximum generations g := 100, and softmax-temperature = 0.3. Note that we did not need to limit the number of candidate synonyms considered, as this was already done in the attack space construction. However, we made two modifications to the original algorithm in order to adapt it to our setting.
Maximal modification constraints While the original algorithm presented by Alzantot et al. (2019) contained a clipping phase where mutated samples were clipped to match a maximal norm constraint, the adapted version for discrete attacks presented in Alzantot et al. (2018) did not. As we wish to limit the allowed number of perturbations for any single input utterance, and the crossover phase followed by the perturb sub-routine can easily overstep this limit, we added a post-perturb phase. Namely, during the creation of each generation, after the crossover and mutation (i.e., perturb) sub-routines create a candidate child, if the total number of perturbed words exceeds the limit, we revert word perturbations chosen uniformly at random until the limit is reached. This step introduces another level of randomness into the process. We experimented with reverting based on the replacement probability used in the perturb sub-routine, but this resulted in sub-par results.
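The post-perturb reversion step can be sketched as follows (our own paraphrase, not the released code):

```python
import random

def revert_to_limit(original, child, limit, rng=None):
    """If a candidate child perturbs more than `limit` positions, revert
    uniformly-sampled perturbed positions back to the original words
    until the constraint is satisfied."""
    rng = rng or random.Random()
    child = list(child)
    perturbed = [i for i, (o, c) in enumerate(zip(original, child)) if o != c]
    while len(perturbed) > limit:
        i = perturbed.pop(rng.randrange(len(perturbed)))
        child[i] = original[i]
    return child
```

The uniform choice of which perturbations to undo is the extra source of randomness mentioned above.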
Improved Efficiency In addition to estimating the fitness function of each child in a generation, which requires a forward pass through the attacked model, Alzantot et al. (2018) also used a greedy step in the perturb sub-routine to estimate the fitness of each synonym mutation for a chosen position. This results in an extremely high number of forward passes through the model, specifically O(g·p·(k+1)), which is orders of magnitude larger than our allowed budget of 2000. However, many of these passes are redundant, so by caching previous results, the attack strategy can better utilize its allocated budget, resulting in a significantly better success rate with better efficiency.
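The caching amounts to memoizing model queries so that only unique candidates consume budget; a sketch (names and structure are ours):

```python
def make_cached_fitness(model, budget):
    """Wrap the attacked model's scoring function with a cache: repeated
    queries on the same candidate are free, and only unique candidates
    consume forward passes from the attack budget."""
    cache, counter = {}, {"used": 0}
    def fitness(candidate):
        key = tuple(candidate)
        if key not in cache:
            if counter["used"] >= budget:
                raise RuntimeError("attack budget exhausted")
            counter["used"] += 1
            cache[key] = model(candidate)
        return cache[key]
    return fitness, counter
```

Since the genetic attack repeatedly re-evaluates surviving population members and near-duplicate mutations across generations, the cache hit rate is high in practice.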

A.4 Attack Space in Prior Work
Examining the attack space proposed in Jin et al. (2020), which includes a larger synonym dictionary and a different filtering function Φ(·), we observe that many adversarial examples are difficult to understand or are not label-preserving. Table 6 shows examples from an implementation of the attack space of the recent TEXTFOOLER (Jin et al., 2020). We observe that while on IMDB the labels remain mostly unchanged, many passages are difficult to understand. Moreover, we observe frequent label flips in datasets such as SST-2, as well as perturbations in BoolQ that leave the question unanswerable.