Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models

Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier's prediction but changes the true label of an input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine-learned models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, with experiments on both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.


Introduction
Interpreted broadly, an adversarial example is an input crafted intentionally to confuse a model. Most research on adversarial examples, however, focuses on a definition of an adversarial example as an input that is constructed by making minimal perturbations to a normal input that change the model's output, assuming that the small perturbations preserve the original true label (Goodfellow et al., 2015). Such adversarial examples occur when a model is overly influenced by small changes in the input. Attackers can also target the opposite objective: to find inputs with minimal changes that change the ground truth label but for which the model retains its prior prediction (Jacobsen et al., 2019b).
Various names have been used in the research literature for these two types of adversarial examples, including perturbation or sensitivity-based and invariance-based examples (Jacobsen et al., 2019b,a), and over-sensitive and over-stable examples (Niu and Bansal, 2018; Kumar and Boulanger, 2020). To avoid confusion associated with these names, we refer to them as fickle adversarial examples (the model changes its output too easily) and obstinate adversarial examples (the model doesn't change its output even though the input has changed in a way that it should).
In NLP, synonym-based word substitution is a common method for constructing fickle adversarial examples (Alzantot et al., 2018; Jin et al., 2020), since synonym substitutions are assumed not to change the true label of an input. Conversely, attacks based on antonyms and negation have been proposed to create obstinate adversarial examples for dialogue models (Niu and Bansal, 2018). These attacks target a model's weakness of being overly invariant to certain types of changes, which makes its predictions insufficiently responsive to small input changes that do alter the meaning.
Adversarial training is considered the most effective defense strategy yet found against adversarial examples (Madry et al., 2018; Goodfellow et al., 2016). It aims to improve robustness by augmenting the original training set with generated adversarial examples, in a way that results in decision boundaries that correctly classify inputs that otherwise would have been fickle adversarial examples. Adversarial training has been shown to improve robustness for NLP models (Yoo and Qi, 2021). Recent works have also studied certified robust training, which gives a stronger guarantee that the model is robust to all possible perturbations of a given input (Jia et al., 2019; Ye et al., 2020).
While prior work on NLP robustness focuses on fickle adversarial examples, we consider both fickle and obstinate adversarial examples. We then further examine the impact of methods designed to improve robustness to fickle adversarial examples on a model's vulnerability to obstinate adversarial examples. Recent work in the vision domain demonstrated that increasing the adversarial robustness of image classification models by training with fickle adversarial examples may increase vulnerability to obstinate adversarial examples (Tramer et al., 2020). Even in cases where the model certifiably guarantees that no adversarial examples can be found within an L_p-bounded distance, the norm-bounded perturbation does not align with the ground truth decision boundary. This distance-oracle misalignment makes it possible to have obstinate adversarial examples located within the same perturbation distance, as depicted in Figure 1. In text, fickle examples are usually generated with a cosine similarity constraint to encourage the representations of the original and the perturbed sentence to be close in the embedding space. However, this similarity measurement may not preserve the actual semantics (Morris et al., 2020), and the model may learn poor representations during adversarial training.

Figure 1: Distance-oracle misalignment (Tramer et al., 2020). While the model is trained to be robust to ϵ-bounded perturbations, it becomes too invariant to small changes in the example (obstinate example x̃) that lie on the other side of the oracle decision boundary.
Contributions. We study fickle and obstinate adversarial robustness in NLP models with a focus on synonym- and antonym-based adversarial examples (Figure 2 shows a few examples). We evaluate both kinds of adversarial robustness on natural language inference and paraphrase identification tasks with BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models. We find that there appears to be a tradeoff between robustness to synonym-based and antonym-based attacks. We show that while certified robust training increases robustness against synonym-based adversarial examples, it increases vulnerability to antonym-based attacks (Section 3). We propose a modification to robust training, Balanced Adversarial Training (BAT), which uses a contrastive learning objective to help mitigate the distance misalignment problem by learning from both fickle and obstinate examples (Section 4). We implement two versions of BAT with different contrastive learning objectives, and show their effectiveness in improving both fickleness and obstinacy robustness (Section 4.2).

Constructing Adversarial Examples
We consider a classification task where the goal of the model f is to learn to map a textual input x, a sequence of words x_1, x_2, ..., x_L, to its ground truth label y ∈ {1, ..., c}. We assume there is a labeling oracle O that corresponds to ground truth and outputs the true label of a given input. We focus on word-level perturbations where the attacker substitutes words in the original input x with words from a known perturbation set (we describe how this set is constructed in the following sections). The goal of the attacker is to find an adversarial example x̃ for input x such that the output of the model differs from what a human would assign, i.e., f(x̃) ≠ O(x̃).

Fickle Adversarial Examples
For a given input (x, y) correctly classified by model f and a set of allowed perturbed sentences S_x, a fickle adversarial example is defined as an input x̃_f such that:

    x̃_f ∈ S_x,  f(x̃_f) ≠ y,  O(x̃_f) = y.

There are many different methods for finding fickle adversarial examples. The most common is synonym word substitution, where the target words are replaced with similar words found in the word embedding space (Alzantot et al., 2018; Jin et al., 2020) or with known synonyms from WordNet (Ren et al., 2019). Recent work has also explored using masked language models to generate word replacements (Li et al., 2020; Garg and Ramakrishnan, 2020; Li et al., 2021).
We adopt the synonym word substitution method of Ye et al. (2020). For each word x_i in an input x, we create a synonym set S_xi containing the synonyms of x_i, including x_i itself. S_x is then constructed as the set of sentences where each word x_i in x is replaced by a word in S_xi. We consider the case where the attacker has no constraint on the number of words that can be perturbed for each input, meaning the attacker can perturb up to L words, the length of x. The underlying assumption for fickle examples is that a perturbed sentence x̃_f ∈ S_x has the same ground truth label as the original input x, i.e., O(x̃_f) = O(x) = f(x). However, common practice for constructing fickle examples does not guarantee this is true. Swapping a word with its synonym may change the semantic meaning of the example, since even subtle changes in words can have a big impact on meaning, and a word can have different meanings in different contexts. For instance, "the whole race of human kind" and "the whole competition of human kind" describe different things. Nonetheless, previous human evaluations have shown that synonym-based adversarial examples retain the same semantic meaning and label as the original texts most of the time (Jin et al., 2020; Li et al., 2020).
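To make the construction of S_x concrete, the following minimal Python sketch enumerates the perturbation set for a toy sentence. The synonym lexicon here is a hypothetical stand-in for the embedding-based synonym sets described above, not the one used in the paper.

```python
from itertools import product

# Hypothetical synonym lexicon: a stand-in for the embedding-based
# synonym sets S_xi described in the text, for illustration only.
SYNONYMS = {
    "whole": ["whole", "entire"],
    "race": ["race", "competition"],
}

def perturbation_set(sentence):
    """Enumerate S_x: every sentence obtained by replacing each word
    x_i with some word from its synonym set S_xi (which always
    contains x_i itself, so the original sentence is included)."""
    words = sentence.split()
    sets = [SYNONYMS.get(w, [w]) for w in words]
    return [" ".join(candidate) for candidate in product(*sets)]

S_x = perturbation_set("the whole race of human kind")
```

Because every S_xi contains the original word, the unperturbed sentence is always a member of S_x, and the size of S_x grows as the product of the synonym-set sizes.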

Obstinate Adversarial Examples
For a given input (x, y) correctly classified by model f and a set of allowable perturbed sentences A_x, an obstinate adversarial example is defined as an input x̃_o such that:

    x̃_o ∈ A_x,  f(x̃_o) = y,  O(x̃_o) ≠ y.

While it is challenging to construct obstinate adversarial examples automatically for image classifiers (Tramer et al., 2020), we are able to automate the process for NLP models. We use an antonym word substitution strategy similar to the one proposed by Niu and Bansal (2018). As with synonym word substitutions, for each word x_i in an input x, we construct an antonym set A_xi that consists of the antonyms of x_i. Since we would like to change the semantic meaning of the input in a way that is likely to flip its label for the task, the attacker is only allowed to perturb one word with its antonym in each sentence.
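A minimal sketch of the one-word antonym substitution, with a hypothetical antonym lexicon standing in for the WordNet lookups used in the paper:

```python
# Hypothetical antonym lexicon: a stand-in for WordNet antonym
# lookups (A_xi), for illustration only.
ANTONYMS = {
    "great": ["bad"],
    "duplicate": ["distinct"],
}

def obstinate_candidates(sentence):
    """Enumerate A_x under the one-word constraint: each candidate
    replaces exactly one word x_i with an antonym from A_xi."""
    words = sentence.split()
    candidates = []
    for i, word in enumerate(words):
        for antonym in ANTONYMS.get(word, []):
            candidates.append(" ".join(words[:i] + [antonym] + words[i + 1:]))
    return candidates

A_x = obstinate_candidates("the weather is great")
```

Unlike the synonym case, each candidate differs from x in exactly one position, which keeps the perturbation small while aiming to flip the ground truth label.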
The way we construct obstinate adversarial examples may not always satisfy the assumption that the ground truth label of the obstinate example differs from that of the original input. The substituted word may not affect the semantic meaning of the input, depending on the task. For example, in natural language inference, changing "the weather is great, we should go out and have fun" to "the weather is bad, ..." does not affect the entailment relationship with "we should have some outdoor activities", since the main argument is in the second part of the sentence. However, we find that antonym substitutions change the semantic meaning of the text most of the time, and we choose two tasks whose labels are most likely to change under an antonym-based attack.

Robustness Tradeoffs
Normally, adversarial defense methods only target fickle adversarial examples, so there is a risk that such methods increase vulnerability to obstinate adversarial examples. According to the distance-oracle misalignment assumption (Tramer et al., 2020), depicted in Figure 1, the distance measure d used for finding adversarial examples and the labeling oracle O are misaligned if there exist perturbed inputs x̃_1 and x̃_2 such that

    d(x, x̃_1) < d(x, x̃_2),  O(x̃_1) ≠ O(x),  O(x̃_2) = O(x),

that is, an input closer to x under d crosses the oracle's decision boundary while a farther input does not.

Setup
Our experiments are designed to test our hypothesis that optimizing the adversarial robustness of NLP models using only fickle examples deteriorates the models' robustness to obstinate adversarial examples. We use the SAFER certified robust training method proposed by Ye et al. (2020). The idea is to train a smoothed model by randomly perturbing sentences with words from the synonym substitution set at each training iteration. While common IBP-based certified robust training methods do not scale well to large pre-trained language models (Jia et al., 2019; Huang et al., 2019), SAFER is a structure-free approach that can be applied to any model architecture. In addition, it gives stronger robustness than traditional adversarial training methods (Yoo and Qi, 2021).
We train BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models on two different tasks with SAFER training for 15 epochs. We then test the attack success rate for both fickleness and obstinacy attacks at each training epoch. We use the same perturbation method as described in Section 2.1 for both training and attack. For each word, the synonym perturbation set is constructed by selecting the top k nearest neighbors with a cosine similarity constraint of 0.8 in GloVe embeddings (Pennington et al., 2014), and the antonym perturbation set consists of antonyms found in WordNet (Miller, 1995). We follow the method of Jin et al. (2020) for finding fickle adversarial examples, using word importance ranking with Part-of-Speech (PoS) and sentence semantic similarity constraints as the search criteria. We replace words in order from the highest word importance score to the lowest and ensure that the substituted words have the same PoS tags as the original words. For the antonym attack, we also use word importance ranking and PoS constraints to search for word substitutions. For comparison, we set up baseline models with normal training on the original training sets.
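The synonym-set construction from embedding neighbors can be sketched as follows. The tiny two-dimensional vectors are made-up stand-ins for GloVe embeddings; the function only illustrates the top-k selection under a cosine-similarity threshold.

```python
import numpy as np

def synonym_set(word, vocab, emb, k=8, min_cos=0.8):
    """Select the top-k nearest neighbors of `word` under cosine
    similarity, keeping only neighbors with similarity >= min_cos.
    Assumes `vocab` maps each word to its row index in `emb`, with
    rows ordered by the insertion order of `vocab`."""
    words = list(vocab)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[vocab[word]]          # cosine similarity to `word`
    order = np.argsort(-sims)                # most similar first
    neighbors = [words[i] for i in order
                 if words[i] != word and sims[i] >= min_cos][:k]
    return [word] + neighbors                # S_xi includes the word itself

# Made-up 2-D vectors standing in for GloVe embeddings:
vocab = {"good": 0, "nice": 1, "bad": 2}
emb = np.array([[1.0, 0.0], [0.95, 0.3], [-1.0, 0.1]])
S_good = synonym_set("good", vocab, emb)
```

In this toy vocabulary, "nice" passes the 0.8 cosine threshold while "bad" points in the opposite direction and is excluded, mirroring how the threshold filters embedding neighbors that are near but not synonymous.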

Tasks
We choose two tasks from the GLUE benchmark (Wang et al., 2018) that are good candidates for the antonym attack. Antonym-based attacks work well on these tasks since both consist of sentence pairs, and changing a word to an opposite meaning is likely to break the relationship between the pair.

Natural Language Inference. We experiment with the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which contains a premise-hypothesis pair for each example. The task is to identify the relation between the sentences in a premise-hypothesis pair and determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise. We consider the case where both premise and hypothesis can be perturbed, but only one word from either premise or hypothesis can be substituted for the antonym attack. We exclude examples with a neutral label when constructing obstinate adversarial examples since antonym word substitutions may not change their label to a different class.
Paraphrase Identification. We use Quora Question Pairs (QQP) (Iyer et al., 2017), which consists of questions extracted from Quora. The goal of the task is to identify duplicate questions. Each question pair is labeled as duplicate or non-duplicate. For our antonym attack strategy, we only target the duplicate class, since antonym word substitutions are unlikely to flip an initially non-duplicate pair into a duplicate.
We also conducted experiments on toxicity detection using the Wiki Talk Comments dataset (Wulczyn et al., 2017), adding or removing toxic words to create obstinate examples. However, we found that adding toxic words reaches an attack success rate of almost 100%, so there did not seem to be an interesting tradeoff to explore for available models on this task, and we do not include it in our results.

Results
We visualize the attack success rates for fickleness (synonym) and obstinacy (antonym) attacks in Figure 3. The results are consistent with our hypothesis that optimizing the adversarial robustness of NLP models using only fickle examples can result in models that are more vulnerable to obstinacy attacks. Robust training for the BERT model on MNLI improves fickleness robustness, reducing the synonym attack success rate from 36% to 11% (a 69% decrease) after training for 15 epochs (Figure 3a), but the antonym attack success rate increases from 56% to 63% (a 13% increase). The antonym attack success rate increases even more for the RoBERTa model (Figure 3b), from 56% to 67% (a 20% increase), while the synonym attack success rate decreases from 31.2% to 10% (a 68% decrease). The RoBERTa model is pre-trained with dynamic masking to be more robust than the BERT model, which perhaps explains the difference. We observe a robustness tradeoff for the QQP dataset as well (see Appendix A.1). In addition, fickle adversarial training does not sacrifice performance on the original examples; accuracy increases consistently throughout training (see Figure 9 in the appendix).
Impact of Batch Size.We experiment with different batch sizes for fickle-based robust training.
Figure 4 shows the results on the MNLI dataset. When the model is trained with a smaller batch size, the synonym attack success rate becomes lower, but the antonym attack success rate gets higher. This suggests that the model may overfit to the fickle examples with a smaller training batch size, exacerbating the impact of the unbalanced adversarial training. A similar observation holds on the QQP task (see Figure 8 in the appendix). We find similar evidence in the evaluation accuracy on the original validation set (see Figure 9 in the appendix): while models with smaller batch sizes converge faster, they reach lower performance and poorer generalization. In Appendix C.4, we show that our proposed method is not affected by the training batch size.

Balanced Adversarial Training
In the previous section, we argued that the tradeoff between fickleness and obstinacy can be attributed to distance-oracle misalignment. This section proposes and evaluates a modification to adversarial training that balances both kinds of adversarial examples.

Approach
The most intuitive way to make the semantic distance in the representation space align better with human perception is to move the fickle example closer to the original input and push the original input apart from the obstinate example in the representation space.The idea is to minimize the distance between the positive pairs and maximize the distance between the negative pairs.We construct positive pairs by pairing the original input with a corresponding fickle example, and negative pairs as the original input paired with an obstinate example.We generate fickle examples by applying synonym transformations, and obstinate examples by applying antonym transformations.
We combine normal training with a contrastive learning objective and experiment with two different contrastive losses: pairwise and triplet. While recent contrastive learning methods incorporate multiple positive and negative examples for each input, we use these two losses as they cover the simplest case, where only one positive and one negative example are needed per input. As with SAFER certified robust training, we use an augmentation-based approach that does not query the model to check whether an attack succeeds. We choose this approach over traditional adversarial training since it is computationally less expensive.
Given an input (x, y), we generate an example x̃_f by applying synonym perturbations and an example x̃_o by applying antonym perturbations. Let d(x_1, x_2) denote the distance between x_1 and x_2 in the representation space.
BAT-Pairwise. For the pairwise approach, we independently optimize the distance for the fickle pair (x, x̃_f) and the obstinate pair (x, x̃_o):

    L_pair = α · d(x, x̃_f) + β · max(0, m − d(x, x̃_o))

The hyperparameters α and β control the weighting of the fickle and obstinate pairs, and m is the margin. The L_pair loss term is designed to minimize the distance to the fickle adversarial example and maximize the distance to the obstinate adversarial example. The margin m penalizes the model when the obstinate example is within m distance of the original input (d(x, x̃_o) < m). We use cosine similarity as the distance measure (ranging from 0 to 1), and set the margin to 1 as we find it gives the best performance (see Appendix C.2.2). When we are unable to find a valid fickle or obstinate adversarial example, we set the corresponding term, d(x, x̃_f) or m − d(x, x̃_o), to 0.

BAT-Triplet. For the triplet approach, the original input x acts as an anchor and a triplet (x, x̃_f, x̃_o) is considered instead of pairs. The triplet loss aims to make the distance between the obstinate pair larger than the distance between the fickle pair by at least a margin m, i.e., d(x, x̃_o) > d(x, x̃_f) + m:

    L_triplet = max(0, d(x, x̃_f) − d(x, x̃_o) + m)
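The two contrastive objectives can be sketched numerically as below. This minimal NumPy sketch assumes the distance d is one minus the cosine similarity of the sentence representations, one plausible reading of the cosine-similarity distance described above; it is an illustration, not the paper's implementation.

```python
import numpy as np

def cos_dist(a, b):
    # Assumed distance: one minus cosine similarity of representations.
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bat_pairwise(h_x, h_f, h_o, alpha=1.0, beta=1.0, m=1.0):
    """L_pair: pull the fickle pair (x, x_f) together and push the
    obstinate pair (x, x_o) at least margin m apart; the hinge means
    only obstinate examples closer than m are penalized."""
    return alpha * cos_dist(h_x, h_f) + beta * max(0.0, m - cos_dist(h_x, h_o))

def bat_triplet(h_x, h_f, h_o, m=1.0):
    """L_triplet: require d(x, x_o) > d(x, x_f) + m, with x as anchor."""
    return max(0.0, cos_dist(h_x, h_f) - cos_dist(h_x, h_o) + m)

# Representations that already satisfy both objectives incur zero loss:
h_x, h_f, h_o = np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([-1.0, 0.0])
```

With the fickle representation aligned with the anchor and the obstinate representation pointing away, both losses vanish; if the obstinate representation instead coincided with the anchor, both hinges would activate.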
The training loss can be formalized as:

    L = L_CE + λ · L_cl

where L_CE is the normal classification loss, L_cl is the contrastive loss (L_pair or L_triplet), and the hyperparameter λ controls the weight of the contrastive loss term. We give the training details and the hyperparameter search in Appendix C.2.

Results
Table 1 shows BAT training results on the MNLI validation sets. We use normal training as the non-robust baseline, and include two robust baselines: certified robust training (SAFER) and traditional adversarial training (A2T) (Yoo and Qi, 2021). Balanced Adversarial Training increases the model's adversarial robustness against both antonym and synonym attacks, while preserving its performance on the original validation set.
While both robust baselines (SAFER and A2T), which only consider fickle adversarial examples, perform best when evaluated solely on fickleness robustness, they are more vulnerable to obstinate adversarial examples. We find that BAT-Triplet performs better than BAT-Pairwise at improving robustness against antonym attacks. With BAT-Triplet, the antonym attack success rate on BERT decreases from 57% to 34% (a 40% decrease) compared to normal training, and the synonym attack success rate decreases from 36% to 26% (a 28% decrease).
Results for the QQP dataset are shown in Table 2. While the antonym attack success rates drop by more than half (around a 67% decrease) after BAT training, the synonym attack success rate decreases by 24% on BERT and only 10% on RoBERTa, as the synonym attack success rate is already low on the model with normal training.

Representation Analysis
We compare the learned representations of models trained with BAT to those from normal training and SAFER. We sample 500 examples from the MNLI dataset (excluding the neutral class) and apply synonym and antonym perturbations to each input. We then project the model representations before the last classification layer into two-dimensional space with t-SNE (van der Maaten and Hinton, 2008) and visualize the results in Figure 5.
With normal training or SAFER, both fickle and obstinate adversarial examples are fairly close to the original examples. With BAT-Pairwise or BAT-Triplet, however, obstinate examples are pushed further away from both the original and fickle examples. This matches BAT's training goal, where the distance between obstinate and original examples is maximized and the distance between fickle and original examples is minimized. It also shows how BAT mitigates the distance-oracle misalignment, making the semantic distance in the representation space align better with human perception and further improving robustness against both types of adversarial examples.

Conclusion
We demonstrate the tradeoff between vulnerability to synonym-based (fickle) and antonym-based (obstinate) adversarial examples for NLP models, showing that increasing robustness against synonym-based attacks can increase vulnerability to antonym-based attacks. To manage this tension, we introduce a new adversarial training method, BAT, which targets the distance-oracle misalignment problem and helps balance fickleness and obstinacy in adversarial training.

Limitations
We showed that robustness tradeoffs exist between synonym- and antonym-based adversarial examples. Since there are numerous ways to construct adversarial examples for NLP models, further investigation is needed to determine whether this holds for other kinds of fickleness and obstinacy attacks; we leave this for future work. To launch antonym attacks automatically, we are also limited to sentence-pair tasks, which are more prone to changing the ground truth label when a word is replaced with its antonym. In addition, compared to fickle adversarial training methods, BAT sacrifices some robustness against synonym-based attacks for robustness against antonym-based attacks; there is a tradeoff between the two, and our goal is to achieve a better balance between them.

A.2 Synonym and Negation Attack Robustness Tradeoffs
We test how the negation attack success rate changes as the model's robustness against synonym attacks increases. For the negation attack, we add negation to a verb in the sentence (e.g., "I can do it" to "I can't do it") or remove negation from a sentence (e.g., "I am not going" to "I am going"). We follow a setup similar to that in Section 3. We found a tradeoff between synonym-based and negation-based adversarial examples on the QQP task, but no significant tradeoff on the MNLI task, as shown in Figure 7.

We implement BAT similarly to the SAFER training method described in Section 3.1, randomly perturbing the inputs with words from the synonym/antonym substitution sets. We train the BERT and RoBERTa models for 2 or 3 epochs with a learning rate of 2 × 10^−5 or 3 × 10^−5 and a batch size of 32. For the contrastive loss weights and margin in BAT-Pairwise and BAT-Triplet, we perform a hyperparameter search and choose the values with the best performance (see Appendix C.2).
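The negation attack described here can be sketched with a small lookup over auxiliary verbs. A real implementation would locate verbs with a parser; the NEGATE table below is an illustrative assumption, not the paper's rule set.

```python
# Illustrative negation table (an assumption, not the paper's rule set).
NEGATE = {"can": "can't", "will": "won't", "is": "isn't", "am": "am not"}
UNNEGATE = {v: k for k, v in NEGATE.items()}

def toggle_negation(sentence):
    """Toggle negation on each negatable auxiliary: add it where it is
    absent ("can" -> "can't") or remove an existing negation, whether
    contracted ("can't" -> "can") or two words ("am not" -> "am")."""
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        two = " ".join(words[i:i + 2])
        if two in UNNEGATE:          # remove two-word negation
            out.append(UNNEGATE[two]); i += 2
        elif words[i] in UNNEGATE:   # remove contracted negation
            out.append(UNNEGATE[words[i]]); i += 1
        elif words[i] in NEGATE:     # add negation
            out.append(NEGATE[words[i]]); i += 1
        else:
            out.append(words[i]); i += 1
    return " ".join(out)
```

The two examples from the text round-trip as expected: adding a contraction to an unnegated auxiliary and stripping an uncontracted "not".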

C.2.1 Contrastive Loss Weights
In Figure 10, we show results with varying fickle loss weights (α) with the obstinate loss weight fixed (β = 1.0), and vice versa, when training with BAT-Pairwise. As the value of α increases, the antonym attack success rate increases. On the other hand, as the value of β increases, the synonym attack success rate increases slightly as well. We found that α = 1.0 and β = 1.2 give the best performance for the BERT MNLI model. We also test different contrastive loss weights (λ) when training with BAT-Triplet and show the results in Figure 11. We found that as we increase λ, model accuracy on the validation set decreases.

Figure 2 :
Figure 2: Fickle and obstinate adversarial examples for BERT models fine-tuned on natural language inference (left) and paraphrase identification (right) tasks. Words in red are substituted with their synonyms and words in blue are replaced by their antonyms.

Figure 3: Fickleness and obstinacy tradeoff, where the obstinacy attack success rate increases as the fickleness attack success rate decreases. The figure shows the results on the MNLI matched validation set with the average and standard deviation across three different runs. Dashed lines show the synonym/antonym attack success rates on the baseline model with normal training.

Figure 4: Synonym and antonym attack success rates at each training epoch with varying batch size. When the model is trained with a smaller batch size, the synonym attack success rate is lower and the antonym attack success rate is higher.

Figure 5 :
Figure 5: 2D projection of model representation for RoBERTa MNLI models trained with normal training, certified robust training with fickle adversarial examples (SAFER), BAT-Pairwise, and BAT-Triplet.

Figure 6 :
Figure 6: Robustness tradeoffs between synonym and antonym based attacks on QQP and MRPC dataset.The figure shows the average and standard deviation across 3 different runs.

Figure 7 :
Figure 7: Negation attack success rate on models at each epoch when training with SAFER.

Figure 9 :
Figure 9: The evaluation accuracy on original validation set at each SAFER training epoch with varying batch size.

Table 1 :
Balanced Adversarial Training evaluation results on the MNLI matched validation set. Results shown with standard deviations are averaged across three different runs.
Contrastive learning is a type of self-supervised learning that learns representations with positive (similar) examples close together and negative (dissimilar) examples far apart (Hadsell et al., 2006).

Table 2 :
Balanced Adversarial Training evaluation results on the QQP validation set.

Similarly to the pairwise loss, if no fickle or obstinate example is available, we mask out d(x, x̃_f) or m − d(x, x̃_o) in L_triplet.
B Fickleness Robust Training with Varying Batch Size

B.1 Synonym and Antonym Robustness Tradeoffs on QQP Task

Figure 8: The synonym and antonym attack success rate at each SAFER training epoch with varying batch size.