Impact of Adversarial Training on Robustness and Generalizability of Language Models

Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that achieving both robustness and generalization in adversarially trained models involves a trade-off. The goal of this work is to provide an in-depth comparison of different approaches to adversarial training in language models. Specifically, we study the effect of pre-training data augmentation as well as training-time input-space perturbations versus embedding-space perturbations on the robustness and generalization of transformer-based language models. Our findings suggest that better robustness can be achieved by pre-training data augmentation or by training with input-space perturbations. However, training with embedding-space perturbations significantly improves generalization. A linguistic correlation analysis of the neurons of the learned models reveals that the improved generalization is due to 'more specialized' neurons. To the best of our knowledge, this is the first work to carry out a deep qualitative analysis of different methods of generating adversarial examples in the adversarial training of language models.


Introduction
Language Models (LMs) have emerged as the backbone of many tasks in AI and have extended their reach beyond NLP applications into vision and even reinforcement learning (Brown et al., 2020; Reed et al., 2022; Ramesh et al., 2022). Thus, it is imperative that the generalizability and robustness of LMs be carefully assessed and evaluated.
Generalizability is the ability of a model to perform well on unseen data. Transformer-based models that are pre-trained on large unlabeled text have shown remarkable generalization ability. However, when confronted with carefully designed adversarial samples, their robustness, i.e., the ability to gracefully deal with small perturbations, suffers significantly. For example, recent studies have shown that, on a classification task on the YELP dataset, accuracy dropped by almost 90% when a standard test set was replaced by an adversarial counterpart (Jin et al., 2020; Yoo and Qi, 2021; Yuan et al., 2021).
Adversarial training is a pragmatic approach to attain both generalizability and robustness. The idea is straightforward. For a given model M, generate adversarial samples that target M and then use the samples to incrementally re-train the model. This can be done either at the pre-training or the fine-tuning stage (Liu et al., 2020).
Adversarial samples can be generated both in the input space and in the embedding space. The original work on the creation of adversarial samples for computer vision was in the input space. For example, the fast gradient sign method (FGSM) (Goodfellow et al., 2014), which perturbs a data point x along the direction of the sign of the gradient of the loss function with respect to the input, operates in the input space. In the context of natural language inputs, perturbing text is challenging due to its discrete nature. Unlike continuous data, there is no systematic way to guarantee an increase in the loss function when perturbing text. For instance, if we aim to make a small modification to the word "robust", we can choose to replace a single letter within the word or substitute it with a near synonym. However, both of these perturbations may seem ad-hoc and not sufficiently principled to intentionally increase the loss function. Therefore, in language settings, it is often more appropriate to perform perturbations in the embedding space, where continuous representations can be manipulated in a more structured manner.
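To make the input-space formulation concrete, the sketch below applies a single FGSM step to a continuous input tensor. It is a minimal PyTorch-style illustration; `model` and `loss_fn` are assumed placeholders for an arbitrary differentiable classifier and loss, not a specific library API.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon=0.01):
    """One FGSM step: move x along the sign of the input gradient.

    Assumes `model` maps a float tensor to logits and `loss_fn` is a
    differentiable loss (e.g., cross-entropy); both are placeholders.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Perturb in the direction that locally increases the loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.detach()
```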
Furthermore, despite the widespread use of adversarial training to increase the robustness of models, it is not clear what its impact is on downstream tasks beyond the model's overall accuracy. For example, a deeper analysis of language models has shown that different parts of the network are responsible for different parts of speech (Belinkov et al., 2017; Conneau et al., 2018; Liu et al., 2019; Dalvi et al., 2022; Durrani et al., 2020). In this regard, the change in the network due to adversarial training has not yet been investigated.
Overall, our contributions in this paper are threefold. Firstly, we introduce two techniques in the context of adversarial training in the embedding space, representing the regularization- and gradient-based approaches commonly used by latent space techniques. We compare these techniques using a simple one-dimensional model and hypothesize their behavior in adversarial scenarios. Secondly, we evaluate the effectiveness of input- and embedding-space adversarial training methods in terms of their generalization ability and robustness against various types of adversarial attacks in sentiment analysis. Lastly, we conduct a thorough linguistic analysis of an adversarially trained model and demonstrate that incorporating robustness through adversarial training leads to more "focused" neurons that are associated with distinct Part of Speech (POS) tags.
The rest of the paper is organized as follows. In Section 2, we discuss adversarial attacks and defenses, with a specific focus on the NLP domain. Section 3 provides a detailed explanation of embedding-space adversarial techniques. In Section 4, we conduct experiments to analyze the trade-off between robustness and generalization achieved by data augmentation, input-space training, and embedding-space training approaches, considering various well-known adversarial attacks. Additionally, we present our findings from the linguistic correlation analysis of neurons in robust models within the same section. Finally, we conclude the paper in the last section.

Related Work
Adversarial Attacks: The purpose of an adversarial attack is to cause a model to output conflicting decisions for an input and its 'imperceptibly' modified version. An adversarial sample is defined as

x′ = x + δ, such that f (x′, θ) ≠ f (x, θ) and ||δ|| ≤ ϵ,

where x′ is the adversarial sample, δ is the perturbation added to the original data x, ||δ|| is a generic norm, ϵ is the limit of the maximum norm of the perturbation, and f (x, θ) is the output of the model parameterized by θ for input x. The quality of an adversarial sample is typically evaluated by how well δ is minimized, i.e., the minimum distortion that changes the prediction of the model on a sample.
Obtaining an exact solution for the perturbation δ is a very challenging problem. Further, even when close approximations are considered, the solution becomes computationally very expensive (Szegedy et al., 2013). To solve this problem more efficiently, gradient-based methods were introduced. Accordingly, the perturbation δ is computed by taking one step (Goodfellow et al., 2014) or more steps iteratively (Madry et al., 2017; Dong et al., 2018) in the direction of the gradient to maximize the loss function. Then, this high-loss point is projected back onto the input space to determine the norm-bounded perturbation. In practice, projected gradient descent (PGD) approaches, which take several small steps in the direction of the gradient, are used most frequently to create strong adversarial samples (Madry et al., 2017; Papernot et al., 2016).
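The following PyTorch-style sketch illustrates the PGD idea described above: ascend the loss for several small steps and project the perturbation back into the allowed ball after each step (with a single step it reduces to an FGSM-style update). `model` and `loss_fn` are assumed placeholders, and the L-infinity projection and hyperparameters are illustrative choices, not those of any cited paper.

```python
import torch

def pgd_perturb(model, loss_fn, x, y, epsilon=0.03, step_size=0.01, steps=10):
    """Iterative gradient ascent on the loss, projected back into an
    L-infinity ball of radius `epsilon` around x (a PGD-style sketch)."""
    delta = torch.zeros_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()   # ascend the loss
            delta.clamp_(-epsilon, epsilon)          # project onto the ball
        delta.grad.zero_()
    return (x + delta).detach()
```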
Apart from gradient-based approaches, the Jacobian-based Saliency Map Attack (JSMA) (Papernot et al., 2016) uses the Jacobian matrix, built from forward derivatives of the input, to identify the importance of each input component for the targeted attack. DeepFool (Moosavi-Dezfooli et al., 2016), alternatively, iteratively linearizes the classifier to identify the minimum perturbation that causes a change in the classification label. The Carlini & Wagner (C&W) attack, in turn, was proposed as an attack capable of defeating the defensive distillation strategy built on distillation (Hinton et al., 2015).
Adversarial Attacks in NLP: Running adversarial attacks against natural language processing (NLP) models is more challenging than against widely used vision models. The discrete nature of word representations, combined with the tokenization of words into word pieces, effectively invalidates any algorithm that applies differential changes to the model input when generating an adversarial sample. Moreover, quantifying the extent to which semantic similarity and contextual relations are preserved between a text input and its modified version is not trivial.
To circumvent these limitations, many adversarial sample generation algorithms adopted the approach of substituting one or more words in the input until a misprediction occurs. The crux of these attacks lies in the identification of alternative words or phrases that retain the semantic intactness of the original input. For this, several methods based on word-embedding similarity (Jin et al., 2020), word synonymity (Ren et al., 2019; Zang et al., 2019), and masked language model predictions (Li et al., 2020) have been proposed. However, finding appropriate word candidates may become computationally very intensive. For a sentence consisting of m words with n candidates to substitute each word, there are (n + 1)^m possible combinations to test. To perform this search efficiently, greedy search (Ren et al., 2019), genetic algorithm (Alzantot et al., 2018), and particle swarm optimization-based (PSO) (Zang et al., 2019) approaches have been proposed and combined with word importance as determined by gradient measurements (Yoo and Qi, 2021) and word deletion (Ren et al., 2019).
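To make the search concrete, below is a minimal sketch of the greedy variant: rank positions by the score drop when a word is deleted, then try candidate substitutions one position at a time. `predict_prob` and `candidates_for` are hypothetical helper callables, and the stopping threshold assumes a binary task; this is an illustration of the general strategy, not a specific published attack.

```python
def greedy_substitution_attack(words, true_label, predict_prob, candidates_for):
    """Greedy word-substitution search (sketch).

    predict_prob(words, label) -> probability of `label` for the joined text.
    candidates_for(word)       -> list of candidate replacements (synonyms, etc.).
    """
    base = predict_prob(words, true_label)
    # Rank positions by how much deleting the word lowers the true-label score.
    importance = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        importance.append((base - predict_prob(reduced, true_label), i))
    adv = list(words)
    for _, i in sorted(importance, reverse=True):
        best_word, best_score = adv[i], predict_prob(adv, true_label)
        for cand in candidates_for(adv[i]):
            trial = adv[:i] + [cand] + adv[i + 1:]
            score = predict_prob(trial, true_label)
            if score < best_score:
                best_word, best_score = cand, score
        adv[i] = best_word
        if best_score < 0.5:   # prediction flipped for a binary task
            break
    return adv
```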
An alternative to the above substitution-based approaches is to apply perturbations in the embedding space, directly to word embeddings. This approach avoids the expensive search step needed to identify the best word substitution configuration, but it requires devising a mapping from perturbed embeddings back to the text domain in order to create an adversarial sample. To realize this, recent work (Yuan et al., 2021) adapted a gradient-based adversarial sample generation method to compute perturbations associated with each word embedding. Perturbed embeddings are then translated to the input domain using a pre-trained masked language modeling (MLM) head, as in (Li et al., 2020; Garg and Ramakrishnan, 2020), to create an adversarial sample that is semantically similar to the original input.
Adversarial Defence in NLP: The most commonly deployed method for attaining robustness against an adversarial attack is the addition of adversarial samples to the training set (Szegedy et al., 2013). This approach is known to increase model robustness in both the computer vision and NLP domains. Further, it is also reported that this defence approach decreases the generalization error of a model in the absence of any attack (Yuan et al., 2021), which contradicts the commonly held opinion that there is a trade-off between generalization and robustness (Tsipras et al., 2019). This finding can essentially be attributed to the use of a larger training set enhanced with adversarial samples. The second approach augments the training set with newly constructed, synthetic samples. While this may seem equivalent to adding adversarial samples to the training set, data augmentation methods do not need to have an adversarial nature. Common data augmentation methods include word replacement, i.e., substituting words with their synonyms or inserting random words, random word deletions, and swapping of words between sentences (Wei and Zou, 2019). Rather than using manually designed heuristics, the power of existing NLP models can also be harnessed for data augmentation. Reverse translation, which involves re-translation of samples from a target language back to their source language, constitutes one such method that ideally preserves the semantic similarity of original and augmented samples (Edunov et al., 2018; Xie et al., 2020). The use of MLM, via masking words in a sentence and replacing them with model predictions (Ng et al., 2020), is another augmentation method.
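As a minimal sketch of the heuristic augmentations mentioned above (synonym replacement, random deletion, and word swap), the function below randomly applies these operations to a tokenized sentence. The `synonyms_for` helper is a hypothetical placeholder, and the probabilities are illustrative; this is not the implementation of any specific cited method.

```python
import random

def augment(words, synonyms_for, p=0.1):
    """Apply simple heuristic augmentations to a list of words (sketch)."""
    out = []
    for w in words:
        r = random.random()
        syns = synonyms_for(w)
        if r < p and syns:
            out.append(random.choice(syns))   # synonym replacement
        elif r < 2 * p:
            continue                          # random deletion
        else:
            out.append(w)
    if len(out) > 1 and random.random() < p:  # random swap of two positions
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out or words
```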
The third approach to adversarial training involves applying perturbations in the latent space (Zhu et al., 2019; Liu et al., 2020; Li and Qiu, 2021; Pan et al., 2022). This yields a simpler training procedure, as it removes the need for generating adversarial samples in the input space. In (Zhu et al., 2019), a model is incrementally fine-tuned on sets of adversarially perturbed word embeddings computed after each fine-tuning step. Li et al. (2021) demonstrate that this method performs better when no constraint on the amount of perturbation is imposed. In Li and Qiu (2021), it is observed that, rather than initializing the PGD step with random noise when computing perturbations for each token, using a token-dependent random noise that is fixed across all inputs is more effective. Recently, Pan et al. (2022) proposed the use of a contrastive objective (Oord et al., 2018) to ensure invariant representations by forcing the model to learn the differences between the normal input and its adversarial version.
In addition to empirical methods, certified defense methods have been proposed to identify and eliminate adversarial samples. These techniques minimize misclassification within an l∞-ball bound, particularly in the vision domain (Raghunathan et al., 2018; Wong and Kolter, 2018). In the NLP domain, two main categories of certified defense methods have emerged: Interval Bound Propagation (IBP) (Jia et al., 2019; Huang et al., 2019; Shi et al., 2020) and randomized smoothing (Ye et al., 2020; Zeng et al., 2021). IBP techniques estimate the output range by iteratively applying interval constraints from the input layer to subsequent layers. However, the requirement to modify the model structure poses challenges in incorporating these methods into pre-trained models.
Randomized smoothing-based methods offer an alternative approach that is independent of the model structure. These methods utilize stochastic ensembles of input texts and leverage the statistical properties of these ensembles to offer provable robustness certification. A common approach is to generate a number of randomly modified versions of the original sample. This can be done through techniques such as random word substitution using synonyms, as demonstrated in SAFER (Ye et al., 2020), or by employing a masked language model to substitute words, as shown in RanMASK (Zeng et al., 2021). The final prediction is then made based on the predictions for these randomly generated samples.
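Schematically, prediction with such a stochastic ensemble reduces to a vote over randomly perturbed copies of the input, as in the sketch below. `perturb_text` and `predict_label` are hypothetical placeholders for the random word-substitution scheme and the underlying classifier; the certification machinery of the cited methods is omitted.

```python
from collections import Counter

def smoothed_predict(text, perturb_text, predict_label, num_samples=100):
    """Majority vote over randomly perturbed copies of the input
    (the ensemble idea behind SAFER/RanMASK, sketched with placeholders)."""
    votes = Counter(predict_label(perturb_text(text)) for _ in range(num_samples))
    return votes.most_common(1)[0][0]
```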
Throughout the rest of the paper, we do not delve into a detailed discussion of these techniques for several reasons. Firstly, the main focus of this paper is on empirical methods and evaluating their impact. Secondly, randomized smoothing methods can be integrated into various techniques, making them applicable in different contexts. Lastly, previous findings suggest that while randomized smoothing methods demonstrate strong defense performance, they tend to underperform compared to latent space adversarial training (Li et al., 2021).

AT with Embedding Space Perturbations
Among all adversarial defenses developed for language processing models, moving the adversarial training from the input space to the embedding space offers the most advantage. It allows the adoption of gradient-based adversarial training approaches that are computationally less demanding than input-space methods. Although a plethora of such adversarial training methods exists, they are all guided by two main principles. The first sets the training objective to minimize the loss due to the worst-case perturbation induced on the training samples, instead of the average loss computed from training samples by standard training. Methods in this group differ in the way they approximate the worst-case perturbation (Madry et al., 2017; Miyato et al., 2018; Zhang et al., 2019) as well as in the extent and nature of the perturbation applied during generation of adversarial samples (Ding et al., 2018; Wang et al., 2019; Liu et al., 2020).
The second approach primarily relies on the premise that smoothness is an important requirement of a robust model. To this end, these methods focus on minimizing a regularized version of the loss instead of optimizing only the standard training loss. The regularization term ensures a wide enough margin between each training data point and the decision boundary of the model by minimizing the difference between the predictions for natural and adversarial samples. Methods following this approach are distinguished by their formulation of the regularization (Szegedy et al., 2016; Zhang et al., 2019) and its coupling with the training loss described above (Gan et al., 2020; Pan et al., 2022).
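Schematically, and in our own notation, the two families of objectives can be written as follows, with ℓ the task loss, D a divergence such as KL, and λ a trade-off weight; the exact instantiations differ across the cited methods.

```latex
% Worst-case (min--max) adversarial training objective
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\|\delta\|\le\epsilon} \ell\big(f(x+\delta;\theta),\, y\big) \Big]

% Smoothness-regularized objective
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \ell\big(f(x;\theta),\, y\big)
  \;+\; \lambda \max_{\|\delta\|\le\epsilon}
        D\big(f(x;\theta) \,\|\, f(x+\delta;\theta)\big) \Big]
```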
In our analysis, we consider two representative methods that most effectively exemplify each approach. In practice, due to its computational efficiency, the PGD attack is most frequently used for the creation of adversarial samples. We will refer to this generic adversarial training approach as PGD-AT. The latter approach is also best characterized by the use of PGD in ensuring local distributional smoothness around natural samples. This alternative method will be referred to as LDS. We must note that improved variants of the two base methods should be expected to perform better. In this regard, the robustness-generalization performance of PGD-AT and LDS can be interpreted as lower bounds.
The steps of both methods are presented in Algorithm 1, where the lines that differ between the two methods are highlighted in pink for PGD-AT and blue for LDS. Both methods start by randomly initializing δ from a normal distribution with mean zero and standard deviation σ. The loss is then calculated on the model's output for the perturbed input, in the manner dictated by the method (PGD-AT or LDS). The δ value is then updated by the gradient and clipped to within ±ϵ by the projection function Π. These steps are repeated S times. The loss value is then updated by combining the standard loss with the loss associated with each method. A gradient update is then applied to the model parameters.
To better examine the behavior of the two methods, we analyze a simple one-dimensional linear regression model. Assuming a fixed perturbation δ, we determine how the two loss functions given in Algorithm 1 estimate the model parameter θ under noisy observations. Table 1 presents the loss functions corresponding to PGD-AT and LDS as well as the one corresponding to the standard ordinary least squares (OLS) estimation in the absence of δ. The estimates of the parameter θ under the three loss functions are also given in the table (third column).

Algorithm 1 PGD-AT and LDS based adversarial training

Input: E: the number of epochs, D = {(x(i), y(i))} for i = 1, ..., n: the dataset, f (x, θ): the machine learning model parameterized by θ, δ: the perturbation initialized with σ and limited by ϵ, τ: the global learning rate, µ: the adversarial learning rate, S: the number of PGD steps, and Π: the projection function.
for e = 1, ..., E do
    for (x, y) ∈ D do
        δ ∼ N (0, σ²)
        for s = 1, ..., S do
            g_adv = ∇_δ ℓ(f (x + δ, θ), y)   %PGD-AT
            δ = Π_±ϵ(δ + µ · g_adv)
        ℓ_total = ℓ(f (x, θ), y) + ℓ_adv, where ℓ_adv is the method-specific loss above
        θ = θ − τ · ∇_θ ℓ_total

Table 1 :
Closed-form solutions of the model parameter of a one-dimensional linear regression model under various loss functions.
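For concreteness, the PyTorch-style sketch below follows the steps described above for one training batch. It is our reading of Algorithm 1, not the authors' released code: `model` is assumed to map input embeddings directly to logits, the KL term is our choice of discrepancy for the LDS branch, and the sign-of-gradient ascent step and loss combination are common conventions rather than confirmed details.

```python
import torch
import torch.nn.functional as F

def adversarial_step(model, loss_fn, x_emb, y, optimizer,
                     sigma=0.01, epsilon=0.003, mu=0.01, steps=3, mode="pgd-at"):
    """One embedding-space adversarial training step (sketch of Algorithm 1).

    mode="pgd-at": perturbation maximizes the task loss on the perturbed input.
    mode="lds":    perturbation maximizes the discrepancy between clean and
                   perturbed predictions (local distributional smoothness).
    """
    delta = sigma * torch.randn_like(x_emb)          # random initialization
    clean_logits = model(x_emb).detach()
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        if mode == "pgd-at":
            adv_loss = loss_fn(model(x_emb + delta), y)
        else:  # "lds"
            adv_loss = F.kl_div(F.log_softmax(model(x_emb + delta), dim=-1),
                                F.softmax(clean_logits, dim=-1),
                                reduction="batchmean")
        grad = torch.autograd.grad(adv_loss, delta)[0]
        # Ascend the adversarial loss and project back to within +/- epsilon.
        delta = (delta + mu * grad.sign()).clamp(-epsilon, epsilon)
    delta = delta.detach()
    # Combine the standard loss with the method-specific adversarial loss.
    if mode == "pgd-at":
        adv_term = loss_fn(model(x_emb + delta), y)
    else:
        adv_term = F.kl_div(F.log_softmax(model(x_emb + delta), dim=-1),
                            F.softmax(clean_logits, dim=-1),
                            reduction="batchmean")
    total = loss_fn(model(x_emb), y) + adv_term
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```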
Comparing PGD-AT and LDS, it can be deduced that LDS will converge to OLS only as the noise gets severe, suppressing the effect of δ in the denominator. PGD-AT, on the other hand, can be expected to follow OLS more closely at all noise levels, as δ appears both in the numerator and the denominator, thereby absorbing its effect on the estimate. We also designed an experimental setup to test these hypotheses. A single neuron is trained on randomly generated (x, y) pairs as defined above, assuming θ = 1/2 and two different noise distributions (σ = 0.01 and σ = 0.1), for each loss function. The models are trained for 2K epochs at a learning rate of 0.005, starting with the OLS loss. For the PGD-AT and LDS models, the OLS loss is substituted by their respective loss functions after epoch 1750, and δ values are computed as defined in Algorithm 1.
The distributions of the estimated scalar model parameter θ obtained after 25 runs are displayed in Fig. 1. Essentially, the spread of the distribution signifies the robustness of a model against adversarial samples, and the distribution mean relates to the generalizability of the model. In this regard, PGD-AT is seen to perform better than LDS, as it yields a tighter spread in both cases. However, at higher noise levels, it can be seen that LDS provides a more accurate estimate of θ. Overall, we can expect a model trained with PGD-AT to be more robust while exhibiting generalizability behavior closer to that of LDS.

Experiments
We first compare the robustness, generalization, and run-time complexity of different AT strategies, following the pipeline in Fig. 2. Then, we perform a Linguistic Correlation Analysis (LCA, Dalvi et al., 2019) as implemented in the NeuroX toolkit (Dalvi et al., 2023) to gain better insights into the dynamics of the learned models, as illustrated in Fig. 3.
Baselines: We compare standard BERT (Devlin et al., 2018) with seven versions of adversarially trained BERT models using methods from three families of AT approaches: (1) AT with pre-training data augmentation (AT-DA), (2) AT with input-space perturbations (AT-IP), and (3) AT with embedding-space perturbations (AT-EP), on the task of sentiment classification. Specifically, for AT-DA, we experiment with SSMBA (Ng et al., 2020) and BackTranslation (Xie et al., 2020). For AT-IP, we use A2T, A2T_MLM (Yoo and Qi, 2021), and BERT_attack (Li et al., 2020). For AT-EP, we report results on LDS (Szegedy et al., 2016; Zhang et al., 2019) and PGD-AT (Gan et al., 2020; Pan et al., 2022).
Datasets: We fine-tune all models on the Internet Movie Database (IMDB, Maas et al., 2011) and Movie Reviews (MR, Pang and Lee, 2005) datasets and test on the corresponding testing splits, as well as on the YELP dataset (Zhang et al., 2015) for out-of-distribution assessment of the models.
Attack methods: We assess the robustness of the models under four different attacks which replace words in the input space using different strategies.
(1) TextFooler (Jin et al., 2020) first searches for the word that, when removed, results in the highest change in the sentiment score, and then replaces it with the nearest neighbouring word in the embedding space.
(2) BAE (Garg and Ramakrishnan, 2020) masks a portion of the text and uses a BERT masked language model to generate alternatives for the masked words.
(3) A2T (Yoo and Qi, 2021) selects the word with the largest loss gradient w.r.t. its embedding and replaces it with a synonym generated from a counter-fitted word embedding (Mrkšić et al., 2016). (4) PSO (Zang et al., 2019) uses sememe-based word substitution and a particle swarm optimization-based search algorithm to find good adversarial examples for a given input text.
Evaluation metrics: We assess (1) generalization via the accuracy values on in-distribution and out-of-distribution datasets, (2) robustness using the Attack Success Rate (ASR), representing the ratio of the number of successful attacks to the number of samples, and (3) time complexity, measured via the fine-tuning run-time of the BERT model over 4 epochs.
Implementation details: For AT-DA and AT-IP methods, we use the parameters proposed in the corresponding papers. For our PGD-AT and LDS approaches, we limit the number of PGD steps to 3 and the perturbations' L2-norm to 0.003. All experiments are conducted on an Nvidia V100 Tensor Core GPU.
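As a small illustration of the robustness metric, the sketch below computes ASR following the definition above (successful attacks divided by the number of samples). `model_predict` and `attack` are hypothetical placeholders for the fine-tuned classifier and an attack method.

```python
def attack_success_rate(model_predict, attack, dataset):
    """ASR = number of successful attacks / number of samples (sketch).

    `dataset` is an iterable of (text, label) pairs; an attack is counted as
    successful when it flips an originally correct prediction.
    """
    successes = 0
    for text, label in dataset:
        adv_text = attack(text, label)
        if model_predict(text) == label and model_predict(adv_text) != label:
            successes += 1
    return successes / len(dataset)
```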
Run-time results: We report the time for fine-tuning the models over 4 epochs in Tab. 3. The AT-DA approaches result in the shortest fine-tuning time, as adversarial examples are generated once for every sample before training, unlike in AT-IP and AT-EP, where adversarial examples are generated at every training iteration. AT-EP methods are around 1.5 times slower to fine-tune than the standard BERT model, as generating the adversarial examples requires an additional backward pass to compute the gradient of the loss at every training iteration. As expected, AT-IP methods are the most time-consuming, as they involve a combinatorial search over a large number of input-space configurations. For example, the fastest approach in this class, A2T, needs 6 seconds to generate a single adversarial example, which is around 10 times slower than the other approaches.
Robustness results are shown in Tab. 2. The lower the ASR, the better the model withstands the attack. As expected, the most effective methods against adversarial attacks are the AT-IP ones. This is because AT-IP is the only class of approaches where it is possible to match the attack and the defense strategies, i.e., to train on perturbations generated by the attack strategies, since attacks on language models operate in the input space. Among AT-DA methods, BackTranslation is the most robust on the IMDB dataset. We found that this is because IMDB has, on average, long sentences, which makes it easier to generate good and diverse adversarial examples to train on via back translation. Our results show that AT-EP methods are the least robust. In particular, LDS-AT struggles in the sentiment classification task due to noisy ground-truth labels, i.e., sentiments are mostly not binary but the ground-truth labels are. To the best of our knowledge, we are the first to report this for language models trained with embedding-space perturbations. In order to gain a better understanding of the reasons behind this phenomenon, we investigate the learned dynamics of networks trained with AT-EP methods using Linguistic Correlation Analysis (next paragraph). Specifically, we want to validate that the achieved accuracy was due to better learning of the task at hand and not just due to memorizing the training data.
Linguistic Correlation Analysis (LCA, Dalvi et al., 2019) is used to identify the most salient neurons for a given linguistic property, such as a Part-of-Speech (POS) tag (Sajjad et al., 2022). To achieve this, we first match words to neurons and then assess whether the matched words have the linguistic property of interest. As the sentiment prediction task is not appropriate for word-level analysis, i.e., the same words can be part of different sentiment classes, we focus on the POS tagging task. We fine-tune BERT models using AT-EP methods on the publicly available Penn Treebank dataset (Marcinkiewicz, 1994). We use LCA to generate a list of the top-5 firing neurons for the POS tags obtained from the models trained with the BERT_B, LDS, and PGD-AT methods. For the second analysis, i.e., the neuron ablation study, we create a linear regression model using only the activations of the top-10 ranked neurons. Results are shown in Tab. 7. PGD-AT and LDS achieve significantly higher performance than BERT, which further supports the observation that AT helps to better learn the intricacies of the task and explains the improvement in the generalization abilities of the AT-EP approaches (e.g., in Tab. 4).
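To illustrate the ablation-style probing described above, the sketch below trains a simple probe on the activations of only the top-ranked neurons and reports its accuracy. It is a simplified stand-in for the NeuroX-based analysis: the logistic-regression probe, the evaluation on the training split, and the array shapes are our assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ablation_probe(activations, pos_labels, top_neurons):
    """Probe POS tags from a restricted set of neurons (sketch).

    activations: (num_tokens, num_neurons) array of neuron activations.
    pos_labels:  (num_tokens,) array of POS tag labels.
    top_neurons: indices of, e.g., the 10 highest-ranked neurons.
    In practice one would report accuracy on a held-out split.
    """
    X = activations[:, top_neurons]
    probe = LogisticRegression(max_iter=1000).fit(X, pos_labels)
    return probe.score(X, pos_labels)   # accuracy of the restricted probe
```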

Conclusions
In this paper we have carried out an extensive study of adversarial training methods (ATMs) to understand their impact on robustness and generalizability for transformer-based deep language models. We can draw the following conclusions from our study. First, non-adversarial data augmentation improves both generalization and robustness over the baseline BERT model. Adversarial training in the input space yields better robustness than both non-adversarial data augmentation and embedding-space adversarial training. In contrast, adversarial training in the embedding space exhibits the best generalization behavior. Between the PGD-AT and LDS methods, our results show that PGD-AT is consistently more robust and generalizable. Overall, our results show that, unlike in the computer vision domain where gradient-based adversarial training yields the best robustness-generalization trade-off, for language processing models input-space training methods are indispensable.
For future work we will consider combining data augmentation, input-space training, and embedding-space training approaches. We would also like to extend our theoretical understanding of the trade-off between robustness and generalizability for language models. Relatedly, the impact of ATMs on other downstream applications needs to be studied.

Limitations
All our experiments are performed using the BERT-small language model due to the computational requirements of generating and testing models over many configurations of adversarial training and attack methods. Although using larger language models might have produced different performance measurements, our findings comparing input- and embedding-space adversarial training methods are expected to remain unchanged. Another limitation of our work is that the semantic gap between attacks in the input and embedding spaces needs further research. Specifically, how do perturbations in the embedding space get translated into the input space? Finally, other forms of robustness techniques, besides adversarial training, in the context of large language models require examination.

Figure 1 :
Figure 1: The resulting distribution of θ values for three different models, trained using the OLS, LDS, and PGD-AT methods, when σ is set to (a) 0.01 and (b) 0.1. A small standard deviation indicates the model's robustness, and clustering around 0.5 implies better generalizability.

Figure 2 :
Figure 2: Evaluation pipeline of models learned using different adversarial training approaches.

Figure 3 :
Figure 3: LCA pipeline of models learned using different adversarial training approaches.

Table 2 :
Robustness results. Models are evaluated using ASR (lower is better) on the MR and IMDB datasets.

Table 3 :
Run-time results. We report the fine-tuning run-time over 4 epochs on the MR and IMDB datasets.
Generalization results are reported in Tab. 4.

Table 4 :
Generalization results. We report the accuracy values on IMDB/MR (in-distribution) and YELP (out-of-distribution) datasets for BERT models fine-tuned on IMDB/MR for the task of sentiment classification.

Table 5 :
LCA results. The association strength between POS tags and neurons.

Table 6 :
Examples of the most related words for different POS tags for models trained with the BERT_B, LDS, and PGD-AT methods. The words are bolded when their actual tags match the associated tag, where the actual tags correspond to the most frequent tags of the words based on the POS-tagged training data.