Better Robustness by More Coverage: Adversarial and Mixup Data Augmentation for Robust Finetuning

Pretrained language models (PLMs) perform poorly under adversarial attacks. To improve adversarial robustness, adversarial data augmentation (ADA) has been widely adopted to cover more of the attack search space by adding textual adversarial examples during training. However, the number of adversarial examples used for augmentation remains extremely insufficient due to the exponentially large attack search space. In this work, we propose a simple and effective method, Adversarial and Mixup Data Augmentation (AMDA), to cover a much larger proportion of the attack search space. Specifically, AMDA linearly interpolates the representations of pairs of training samples to form new virtual samples, which are more abundant and diverse than the discrete textual adversarial examples in conventional ADA. Moreover, to fairly evaluate the robustness of different models, we adopt a challenging evaluation setup that generates a new set of adversarial examples targeting each model. In text classification experiments with BERT and RoBERTa, AMDA achieves significant robustness gains under two strong adversarial attacks and alleviates the performance degradation of ADA on clean data. Our code is available at: https://github.com/thunlp/MixADA .

To improve adversarial robustness, two types of defense strategies have been proposed. The first type targets specific attacks, such as spelling correction modules and pretraining tasks that defend against character-level attacks (Pruthi et al., 2019; Jones et al., 2020; Ma et al., 2020) and certified robustness for word-substitution attacks (Huang et al., 2019; Jia et al., 2019). However, these methods are limited in practice because they do not generalize to other types of attacks. The other type of defense is Adversarial Data Augmentation (ADA), which augments the training set with adversarial examples and is widely used in the training (finetuning) process to enhance model robustness (Alzantot et al.; Ren et al., 2019; Jin et al., 2020; Li et al., 2020; Tan et al., 2020; Yin et al., 2020; Zheng et al., 2020; Zou et al., 2020; Wang et al., 2020b). ADA is applicable to any type of adversarial attack but is not very effective at improving model performance under attacks. In this work, we aim to improve ADA and devise a general defense strategy that effectively improves model robustness during finetuning.

ADA has two major limitations for NLP models. First, unlike images, textual data are harder to augment with new examples due to their discrete nature. Moreover, for textual adversarial attacks, the attack search space is prohibitively large: for example, the search space of word-substitution attacks consists of all combinations of synonym replacement candidates, which is exponentially large. Consequently, the number of adversarial examples available for augmentation is far from sufficient. Second, ADA usually causes significant performance degradation on clean data because the distribution of adversarial examples differs substantially from that of the clean data (Ren et al., 2019).
To address these two limitations, we create additional training samples by interpolating existing samples (Figure 1). How to interpolate discrete textual inputs is non-trivial. We propose to convert the discrete textual inputs into continuous representations and then perform both ADA and mixup augmentation (Zhang et al., 2018; Guo et al., 2019), an augmentation technique proven particularly effective on continuous image data (Lamb et al., 2019; Pang et al., 2020). We name our method Adversarial and Mixup Data Augmentation (AMDA). With AMDA, we can create a much larger number of augmented training samples that cannot be obtained via discrete perturbations of textual data. Moreover, AMDA's interpolated virtual training samples are closer to the distribution of the original data, which alleviates the performance degradation problem of ADA.
We experiment with AMDA on three text classification datasets under two strong adversarial attacks and find that AMDA achieves significant robustness gains in all cases, notably restoring RoBERTa's after-attack accuracy from 6.35% to 51.84% on IMDB and outperforming all other baselines by large margins. Moreover, we also examine how adversarial robustness is evaluated. Specifically, we find that the widely adopted Static Attack Evaluation, in which a fixed set of adversarial examples is used to test all models, is not reliable. To test model robustness under targeted (i.e., not model-agnostic) attacks, we adopt the more challenging Targeted Attack Evaluation, in which we generate a new set of targeted adversarial examples for each model being evaluated. We encourage future defense works to also adopt this more reliable and challenging evaluation setting.

Method
In AMDA, we first augment training samples with ADA and then perform mixup during model training, where mixup augmentation is applied on the ADA-augmented training set.

Adversarial Data Augmentation
Given a victim model $f_v$ and the original training set $D_{ori}$, we employ an attacker to construct a set of label-preserving adversarial examples $D_{adv}$. We then train the model on the augmented training data $D_{ADA} = D_{ori} \cup D_{adv}$.
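As an illustration, the following is a minimal sketch of constructing the ADA training set. The `attacker.perturb` interface and the `(text, label)` pair format are assumptions for illustration, not the released implementation.

```python
def build_ada_training_set(attacker, victim_model, original_data):
    """original_data: list of (text, label) pairs; returns D_ADA = D_ori ∪ D_adv."""
    adversarial_data = []
    for text, label in original_data:
        # Hypothetical attacker interface: returns a label-preserving adversarial
        # rewrite of `text` against `victim_model`, or None if the attack fails.
        adv_text = attacker.perturb(text, label, victim_model)
        if adv_text is not None:
            adversarial_data.append((adv_text, label))
    return original_data + adversarial_data
```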

Mixup Data Augmentation
To better defend against the large number of possible adversarial examples, we perform additional mixup augmentation during training. Specifically, we linearly interpolate the representations and labels of pairs of training samples to create virtual training samples:

$\hat{x} = \lambda x_i + (1 - \lambda) x_j, \quad \hat{y} = \lambda y_i + (1 - \lambda) y_j,$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two labeled examples and the interpolation coefficient $\lambda \in [0, 1]$ is sampled from a beta distribution $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with hyperparameter $\alpha$. On textual data, we cannot directly mix discrete tokens; instead, we interpolate either the word embedding vectors or the models' hidden representations of the inputs. The labels, represented as one-hot vectors, are interpolated directly.
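A minimal sketch of this interpolation step in PyTorch, assuming `h_i` and `h_j` are continuous representations (word embeddings or hidden states) of two inputs and `y_i`, `y_j` are their one-hot label vectors:

```python
import torch

def mixup(h_i, h_j, y_i, y_j, alpha=0.4):
    # Sample the interpolation coefficient λ ~ Beta(α, α).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    h_mix = lam * h_i + (1.0 - lam) * h_j  # interpolate representations
    y_mix = lam * y_i + (1.0 - lam) * y_j  # interpolate one-hot labels
    return h_mix, y_mix
```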
When applied together with adversarial data augmentation, we allow the mixing of different types of training samples: pairs of original examples, pairs of adversarial examples, and mixed pairs of one original and one adversarial example.

AMDA
In our proposed Adversarial and Mixup Data Augmentation (AMDA), we train the new model $f$ on the augmented training data $D_{AMDA}$, obtained by performing both adversarial data augmentation and mixup data augmentation. We minimize the sum of the standard training loss and the mixup loss:

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\big(f(x_i), y_i\big) + \mathcal{L}_{\mathrm{KL}}\big(f(\hat{x}), \hat{y}\big),$

where $(x_i, y_i)$ is drawn from $D_{ADA}$ and $(\hat{x}, \hat{y})$ is the virtual example obtained by applying mixup to a random pair of training samples from $D_{ADA}$. We compute the loss on $(x_i, y_i)$ with cross-entropy and the loss on $(\hat{x}, \hat{y})$ with KL-divergence.
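A hedged sketch of this combined objective in PyTorch; variable names are illustrative and the reduction choices are assumptions. `labels` are class indices and `mix_targets` are the interpolated soft label vectors.

```python
import torch.nn.functional as F

def amda_loss(logits, labels, mix_logits, mix_targets):
    # Cross-entropy on real (original or adversarial) examples with hard labels.
    ce = F.cross_entropy(logits, labels)
    # KL-divergence on mixup virtual examples with interpolated soft labels.
    kl = F.kl_div(F.log_softmax(mix_logits, dim=-1), mix_targets, reduction="batchmean")
    return ce + kl
```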

Robustness Evaluation
Previous works use two different setups to evaluate robustness under adversarial attacks. In this work, we explicitly differentiate them as Static Attack Evaluation (SAE) and Targeted Attack Evaluation (TAE). SAE generates a fixed set of adversarial examples against the original model as the victim; this fixed adversarial test set is then used to evaluate all new models. This setup has been adopted in (Ren et al., 2019; Tan et al., 2020; Yin et al., 2020; Wang et al., 2020b; Zou et al., 2020; Wang et al., 2021, inter alia).
TAE re-generates a new set of adversarial examples targeting every model being evaluated. This setup is adopted in (Huang et al., 2019; Jia et al., 2019; Li et al., 2020; Zang et al., 2020; Zheng et al., 2020; Li and Qiu, 2021, inter alia). We observe that some authors did not explicitly specify the mode of evaluation in their papers, leading to confusion and even conflicting conclusions. Thus, we explicitly differentiate the two modes of evaluation and provide a comparison in our experiments.
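To make the difference concrete, here is a rough sketch of the two evaluation loops; the `attacker.attack` and `accuracy` helpers are assumed interfaces for illustration, not part of our released code.

```python
def static_attack_eval(models, attacker, victim, test_set):
    # SAE: attack the original victim once, then reuse the fixed adversarial set.
    adv_set = [(attacker.attack(victim, x, y), y) for x, y in test_set]
    return {name: accuracy(model, adv_set) for name, model in models.items()}

def targeted_attack_eval(models, attacker, test_set):
    # TAE: re-generate adversarial examples against each evaluated model.
    return {
        name: accuracy(model, [(attacker.attack(model, x, y), y) for x, y in test_set])
        for name, model in models.items()
    }
```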

Experiment Setups
Datasets. We evaluate our methods on three text classification datasets: two binary sentiment analysis datasets, SST-2 (Socher et al., 2013) and IMDB (Maas et al., 2011), and a four-class news classification dataset, AGNews (Zhang et al., 2015). For SST-2, we attack the entire test set (1,821 samples) and report accuracy under attacks. For IMDB, attacking the whole test set (25k samples) is prohibitively slow, so we use the subset of the original test set released in Gardner et al. (488 test instances) for faster evaluation. Similarly, on AGNews, we randomly sample 10% of the original test set and hold it out as the test set for attack evaluation. We include these data splits in our released code base for easy reproduction and fair comparison in future works.
Victim models and attack methods. We experiment with both BERT-base-uncased (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019) as the victim models. We use PWWS (Ren et al., 2019) and TextFooler (Jin et al., 2020) as our attack methods, which have been shown to effectively attack state-of-the-art NLP models, including PLMs such as BERT. Both attack algorithms have access to model predictions but not gradients, and they iteratively search for synonym substitutions that flip model predictions without drastically changing the original semantics or gold labels.
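For intuition, the following is a rough sketch of a greedy word-substitution attack in the spirit of PWWS and TextFooler; the `get_synonyms`, `predict_prob`, and `predict_label` helpers are hypothetical, and both real attacks use more sophisticated word ranking and semantic constraints.

```python
def greedy_substitution_attack(model, words, label):
    """words: list of tokens; label: the gold class. Returns adversarial tokens or None."""
    words = list(words)
    for i in range(len(words)):
        best_word = words[i]
        best_prob = predict_prob(model, words, label)  # model's probability of the gold label
        for candidate in get_synonyms(words[i]):       # label-preserving replacement candidates
            trial = words[:i] + [candidate] + words[i + 1:]
            prob = predict_prob(model, trial, label)   # uses predictions only, no gradients
            if prob < best_prob:
                best_word, best_prob = candidate, prob
        words[i] = best_word
        if predict_label(model, words) != label:       # prediction flipped: attack succeeds
            return words
    return None                                        # attack failed within the search budget
```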
Details of mixup. When performing mixup, we mix hidden representations from the upper layers of BERT. The vectors used for mixup are the hidden representations of the input examples at layer i of the Transformer encoder, where i is randomly sampled from {7, 9, 12}, which was found to be empirically effective. Furthermore, we explore two ways of obtaining the hidden representations of input examples from PLMs like BERT: (1) We use the vector of the [CLS] token at the i-th layer of BERT as the hidden representation for mixing. We name this approach SMix.
(2) We perform mixup on every token's vector representation at the i-th layer. We name this approach TMix, following the approach taken in prior work.
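A minimal sketch of the two variants, assuming `hidden_i` and `hidden_j` are the layer-i encoder outputs for two inputs with shape (seq_len, hidden_dim) and position 0 holds the [CLS] vector:

```python
def smix(hidden_i, hidden_j, lam):
    # SMix: interpolate only the [CLS] vectors.
    return lam * hidden_i[0] + (1.0 - lam) * hidden_j[0]

def tmix(hidden_i, hidden_j, lam):
    # TMix: interpolate every token's vector at layer i.
    return lam * hidden_i + (1.0 - lam) * hidden_j
```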
Details of ADA and AMDA. For both ADA and AMDA, we generate the corresponding PWWS and TextFooler adversarial examples and add them to the training set. For comparison, we also experiment with mixup alone, without adding adversarial examples; in this case, the model only interpolates pairs of original training examples. We perform a greedy hyper-parameter search over the amount of augmented adversarial training samples and the mixup parameter α, as described in the Appendix. We also report average word modification rates, i.e., the percentage of words the attacker replaces; a higher word modification rate indicates that the model is harder to attack and requires more words to be replaced.
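As a reference for how this metric can be computed, here is a minimal sketch for a single attacked example, under the assumption that the attack performs word-for-word substitutions so the two token sequences have equal length:

```python
def word_modification_rate(original_words, adversarial_words):
    # Fraction of words the attacker replaced; assumes length-preserving substitutions.
    assert len(original_words) == len(adversarial_words)
    changed = sum(o != a for o, a in zip(original_words, adversarial_words))
    return changed / len(original_words)
```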

Comparison of SAE and TAE
To compare SAE and TAE, we attack the finetuned models BERT$_v$ and RoBERTa$_v$ as the victims on SST-2 and IMDB, and use the generated adversarial test sets as the fixed test sets for SAE. We then change the random seeds and re-finetune the models on the same data (BERT$_{r1}$, BERT$_{r2}$, RoBERTa$_{r1}$, RoBERTa$_{r2}$) with all other hyperparameters unchanged. We evaluate all these models under both SAE and TAE. The results are shown in Table 1. We find that by simply changing the random seeds, models achieve significant improvement under SAE. However, when we re-generate the adversarial test set for each model, their performance under TAE stays consistently poor. Moreover, we train BERT and RoBERTa with ADA and find that although BERT$_{ADA}$ and RoBERTa$_{ADA}$ perform well under SAE, they still perform poorly under TAE. This shows that conventional ADA is ineffective at improving model robustness under the challenging TAE setting. We conclude that the adversarial examples found by the attackers specifically target the victim models and hence cannot fully reveal the weaknesses of new models, even if those models differ only in random seeds. We believe that TAE is the more challenging and meaningful way to measure model robustness under targeted attacks. We adopt TAE for the rest of the experiments in this paper and encourage future works to do so for fair comparison.

Mixup Improves Robustness
The comparison of AMDA and baseline methods under attacks on SST-2 and IMDB is shown in Table 2, and the results on AGNews with RoBERTa are shown in Table 3. We observe that: (1) Mixup alone (both TMix and SMix) can often improve model robustness. For example, TMix and SMix significantly improve robust accuracy under both attacks when using RoBERTa on IMDB.
(2) AMDA (both AMDA-TMix and AMDA-SMix) achieves further robustness improvement over ADA and mixup in all cases, showing that mixup and ADA complement each other in improving model robustness under adversarial attacks. (3) Compared to ADA, AMDA does not incur significant performance degradation on the original test sets while improving robustness. In some cases, for example, BERT+TMix and BERT+AMDA-TMix even improve model performance on the original test sets, likely because mixup creates virtual examples that are closer to the empirical data distribution. (4) Models trained with AMDA also incur higher word modification rates under both attacks. For example, RoBERTa+AMDA-TMix incurs a 59.68% word modification rate under the PWWS attack, while the RoBERTa baseline only requires 37.48% of words to be replaced. This further demonstrates that our proposed method improves robustness.

Conclusion
In this work, we propose AMDA as a generally applicable defense strategy that combines adversarial and mixup data augmentation to cover more of the attack search space. We show that AMDA greatly improves PLMs' robustness under the challenging TAE evaluation setting and two strong adversarial attacks. We leave a more thorough theoretical analysis of AMDA's effectiveness on textual data as future work. We believe that our work establishes an appropriate evaluation protocol and offers a competitive baseline for future works on improving the robustness of PLMs.

Hyper-parameter Analysis
In this section, we perform further analysis to examine the effects of different hyper-parameters. There are two hyper-parameters involved in MixADA: the amount of adversarial data added for training, and the α parameter of the beta distribution from which the mixup coefficient is sampled. We also experiment with an alternative ADA strategy, iterative ADA.

Amount of Adversarial Training Data
We vary the ratio of the training set for which we generate adversarial training samples and add them to MixADA fine-tuning. We experiment with SMixADA with the mixup hyper-parameter fixed. On SST-2, we vary the ratio in {25%, 50%, 75%, 100%}. On IMDB, since the average sequence length is significantly longer and the adversarial example generation process becomes much slower, we experiment with a set of smaller ratios: {5%, 10%, 15%, 20%}. The results are plotted in Figure 2. Interestingly, we find that a higher ratio of adversarial training samples does not necessarily bring additional robustness gains.

Interpolation Coefficient in Mixup
We also analyse the mixup hyper-parameter: the α parameter of the beta distribution from which the interpolation coefficient is sampled. We fix the ratio of adversarial training data and vary α over {0.2, 0.4, 2.0, 4.0, 8.0}. The results are plotted in Figure 3. We find no consistent pattern across datasets for the optimal α. Hence, for our main experiments, we perform a greedy hyper-parameter search: we first tune the ratio of adversarial training samples, then fix that ratio and tune the α parameter for mixup. A more exhaustive hyper-parameter search might bring additional performance gains but would also incur extra computation costs.

Iterative ADA
For our MixADA experiments in the paper, we generate all adversarial training samples in one shot and mix them with the original examples before fine-tuning. An alternative is to generate a new batch of adversarial training samples dynamically with the current model at each epoch. We compare this iterative approach with our MixADA using the same ratio of adversarial training samples and the same mixup parameter α, evaluating RoBERTa on the SST-2 dataset. The results are shown in Table 4. We find that the iterative approach is far worse than our one-shot approach. We hypothesize that the one-shot approach generates adversarial examples against a fully fine-tuned model, whereas the iterative approach generates them against a model that is not yet well fine-tuned during the first few epochs; hence, the adversarial examples generated in the iterative approach are not as challenging and useful as those in our one-shot approach.
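The following is a rough sketch contrasting the two schedules; `finetune`, `generate_adv`, and `train_epoch` are hypothetical helpers standing in for standard fine-tuning, adversarial example generation, and one training pass over a dataset.

```python
import copy

def one_shot_ada(model, attacker, train_set, epochs):
    # Attack a fully fine-tuned victim once, then train on D_ori ∪ D_adv.
    victim = finetune(copy.deepcopy(model), train_set)
    adv = generate_adv(attacker, victim, train_set)
    for _ in range(epochs):
        train_epoch(model, train_set + adv)
    return model

def iterative_ada(model, attacker, train_set, epochs):
    # Re-generate adversarial examples against the current model at every epoch.
    for _ in range(epochs):
        adv = generate_adv(attacker, model, train_set)
        train_epoch(model, train_set + adv)
    return model
```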