[MASK] Insertion: a robust method against adversarial attacks

Adversarial attacks aim to perturb input sequences and mislead a trained model into false predictions. To enhance model robustness, defense methods are accordingly employed, based on either data augmentation (involving adversarial samples) or model enhancement (modifying the training loss and/or model architecture). In contrast to previous work, this paper revisits masked language modeling (MLM) and presents a simple yet efficient algorithm against adversarial attacks, termed [MASK] insertion for defense (MI4D). Specifically, MI4D simply inserts [MASK] tokens into input sequences during training and inference, maximizing the intersection of the new convex hull (which MI4D creates) with the original one (which the clean input forms). As neither additional adversarial samples nor model modification is required, MI4D is as computationally efficient as traditional fine-tuning. Comprehensive experiments have been conducted using three benchmark datasets and four attacking methods. MI4D yields a significant average improvement in accuracy of between 3.2 and 11.1 absolute points when compared with six state-of-the-art defense baselines.


Introduction
Pretrained Language Models (PLMs) have rapidly advanced the performance of natural language processing (NLP) tasks, such as text/document classification. Yet, abundant evidence also indicates that PLMs are vulnerable to adversarial attacks, and model performance can be dramatically degraded by even small perturbations to the model input (Gao et al., 2018; Li et al., 2019; Li et al., 2020). As a result, adversarial defenses have received significant attention, with the ultimate goal of achieving robust model accuracy on both clean (original) and polluted (adversarial) inputs.
A large amount of research effort has been dedicated to adversarial defenses, ranging from data augmentation and model enhancement to randomized smoothing. Among data-augmentation studies, recent works introduce small but controllable perturbations to pollute clean data and produce adversarial samples (Yoo and Qi, 2021; Zhou et al., 2021; Meng et al., 2022), while the model is later trained on both the clean and polluted inputs. However, due to the additional adversarial samples, data-augmentation methods suffer from requiring enormous computational resources for training. Alternatively, model-enhancement approaches focus on polishing the vanilla model by manipulating the training loss or network architecture, without acquiring additional adversarial data (Le et al., 2022; Liu et al., 2022). Yet, those methods often require an extensive search among numerous candidates to determine optimal hyperparameters. Another line of work applies ensemble-based randomized smoothing techniques (Ye et al., 2020). Unfortunately, they induce substantial overhead due to the ensemble classification; more importantly, their performance is unstable across different types of attacks (Xu et al., 2022).
Our aim is then to explore a robust adversarial defense algorithm that neither relies on additional adversarial data (as in data augmentation), nor adjusts the training loss and network architecture (as in model enhancement), nor requires ensemble-based inference (as in randomized smoothing). Instead, this paper revisits masked language modeling (MLM) and proposes a compact and performance-preserving algorithm, termed [MASK] insertion for defense (MI4D). Specifically, MI4D only requires inserting [MASK] tokens at the beginning of input sequences to produce masked inputs. During training, only masked inputs are employed for model fine-tuning, while polluted samples are later masked in the same manner for inference. In contrast to traditional MLM, the prediction task for [MASK] tokens is de-emphasized in the proposed MI4D. Instead, the injected [MASK] tokens play the role of maximizing the intersection between the convex hull after attacking and that of the original (clean) input, thereby enhancing the defense performance.
The main contributions of this work are summarized as follows:
• A novel [MASK] insertion for defense (MI4D) algorithm is proposed, neither relying on additional adversarial data, nor modifying the training model, nor requiring ensemble-based classification;
• MI4D is characterized by simply injecting [MASK] tokens at the beginning of input sequences during training and inference. Accordingly, the span of the convex hull (after injection) is critical to retaining a solution space close to that of the clean input, enhancing successful defense;
• Empirically, the proposed method outperforms six recent baselines on a combination of three standard benchmarks and four attacking methods, advancing the best state of the art by 3.2-11.1 absolute points in accuracy on average.

Related work
Adversarial attacks in the text domain construct misleading samples to fool trained neural-network models, and can mainly be classified into two categories: character- and word-level perturbations. The work of (Gao et al., 2018; Li et al., 2019) belongs to the character-level attacks, in which the input is polluted by removing, substituting, or inserting letters. On the other hand, word-based attacks usually involve determining word importance and replacing important words with their synonyms to maximize the prediction error of the model (Li et al., 2020). Adversarial defense, by contrast, aims to form a robust model with high accuracy on both clean (original) and polluted (adversarial) samples. One of the most effective approaches is data augmentation (as shown in Fig. 1(a)), where adversarial samples are produced and fed into the model training. Specifically, A2T (Yoo and Qi, 2021) generates adversarial samples by employing a gradient-based method to identify important words, and iteratively substitutes them with synonyms using a DistilBERT similarity. FreeLB and its variants (Zhu et al., 2020) impose norm-bounded noise on the embeddings of input sentences to produce adversarial samples. ADFAR (Bao et al., 2021) applies a frequency-aware randomization on both original and adversarial samples (produced by other attacking methods) to form a randomized adversarial set, which is then combined with the original and adversarial samples to train the model. More recently, ADCL (Meng et al., 2022) generates adversarial examples using a geometry attack, which are later utilized as hard positive samples to train the model via self-supervised contrastive learning. Xu et al. (2022) propose WETAR-D, a sample-reweighting method in which sample weights are adjusted by minimizing the loss on a validation set mixed from both original and adversarial examples.
Besides data augmentation, another line of studies pursues model enhancement, refining the model architecture and/or training loss without acquiring additional adversarial samples (as shown in Fig. 1(b)). Among them, SHIELD (Le et al., 2022) modifies the last layer of a trained model and transforms it into an ensemble of multiple expert predictors with stochastic weights. Flooding-X (Liu et al., 2022) introduces a regularization technique to prevent overfitting to the training samples. InfoBERT employs two mutual-information-based regularizers, one suppressing noisy information between the input and the latent representation, the other increasing the correlation between local and global features. In similar work, an information bottleneck (IB) layer is inserted between the encoder and the final classifier and utilized to extract robust, task-related features.
Additionally, a set of ensemble-based randomized smoothing methods has been proposed, as shown in Fig. 1(c). SAFER (Ye et al., 2020), for instance, constructs stochastic input ensembles and leverages their statistical properties to classify testing samples. In RanMASK, a few input tokens are randomly substituted with [MASK] for fine-tuning, while testing samples are also masked (at different locations) to form several masked versions. The final prediction is then made by a majority vote over the ensemble of these masked versions.
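The RanMASK procedure described above can be sketched in a few lines. This is a hedged illustration on plain token lists: the helper names (`random_mask_versions`, `majority_vote`) and the masking rate are our own, and the classification of each masked version is left abstract.

```python
import random

def random_mask_versions(tokens, rate=0.3, n_versions=5, mask="[MASK]", seed=0):
    """RanMASK-style: build several copies of the input, each with a random
    subset of tokens *substituted* by [MASK] (positions differ per copy)."""
    rng = random.Random(seed)
    versions = []
    for _ in range(n_versions):
        k = max(1, round(rate * len(tokens)))       # how many tokens to mask
        idx = set(rng.sample(range(len(tokens)), k))
        versions.append([mask if i in idx else t for i, t in enumerate(tokens)])
    return versions

def majority_vote(labels):
    """Final prediction: the most common label among the ensemble outputs."""
    return max(set(labels), key=labels.count)
```

In practice each masked version would be classified by the fine-tuned model and `majority_vote` applied to the resulting labels.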
The proposed method differs from existing approaches: (1) compared to adversarial data augmentation, no additional samples are required, and significant computational overhead is accordingly avoided; (2) compared to model-enhancement methods, the proposed method is hyperparameter insensitive, maintaining the vanilla model training (loss and architecture) and only changing the input format; (3) compared to randomized smoothing, ensemble-based inference is no longer required.

Proposed method
Adversarial attacks in the text domain perturb input sequences to maximally mislead the classification model. This section presents a simple yet effective algorithm to reduce model vulnerability, termed [MASK] insertion for defense (MI4D).

[MASK] insertion for defense
The proposed method MI4D is characterized by the normal fine-tuning process (the same network architecture and training loss as the vanilla model); the only difference lies in the [MASK] tokens inserted at the beginning of input sequences during training and inference. Specifically, consider the tokenized input sequence x (i.e., x = [CLS] x_1 ⋯ x_{|x|} [SEP]), where x_i represents the i-th token of x. For the text classification task, we aim to optimize an encoder Enc(·) and a multilayer perceptron (MLP) layer f(·) to map x to the desired label y, i.e., f(Enc(x)) → y.
Let b_M be the pre-defined masking budget (the fraction of masked tokens) and M = ⌈b_M · |x|⌉. Then, MI4D injects M consecutive masks after [CLS] within x to form a masked sequence, that is, x′ = [CLS] [MASK]_1 ⋯ [MASK]_M x_1 ⋯ x_{|x|} [SEP]. Next, only x′ (instead of x) is utilized for training: Enc(·) extracts the latent representation of x′, and the normal loss function L(x′, y) (such as cross-entropy) is adopted. During inference, given an unseen sequence x̃, the insertion procedure is repeated to inject M consecutive masks into x̃, i.e., x̃′ = [CLS] [MASK]_1 ⋯ [MASK]_M x̃_1 ⋯ x̃_{|x̃|} [SEP]. The label of x̃ is finally produced by f(Enc(x̃′)).
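The insertion step above can be sketched directly on token lists. This is a hedged illustration: the helper name `insert_masks` is hypothetical, and a real implementation would operate on tokenizer ids rather than strings.

```python
import math

def insert_masks(tokens, budget=0.3, mask_token="[MASK]"):
    """Insert M = ceil(budget * |x|) consecutive [MASK] tokens right after
    [CLS], keeping every original token intact (MI4D-style insertion).

    `tokens` is assumed to be ["[CLS]", t1, ..., tn, "[SEP]"].
    """
    body = tokens[1:-1]                    # tokens between [CLS] and [SEP]
    m = math.ceil(budget * len(body))      # masking budget -> mask count M
    return [tokens[0]] + [mask_token] * m + body + [tokens[-1]]

# Example: a 4-token input with budget 0.3 gets ceil(1.2) = 2 masks.
x = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
x_prime = insert_masks(x, budget=0.3)
```

The same function is applied unchanged at inference time, so training and testing inputs share one format.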

Analysis
Notably, RanMASK substitutes input tokens with [MASK], while MI4D injects [MASK] tokens. Despite its conceptual and computational simplicity, MI4D enjoys a strong theoretical result, stated as the following claim: RanMASK is a lower bound of MI4D in terms of adversarial defense performance.
To prove the claim, given the tokenized input x with latent representation X, the output Y of a self-attention module is derived by Y = softmax(XW_1 (XW_2)^T / √d) XW_3, where d is the hidden dimension and W_k (k ∈ {1, 2, 3}) are projection matrices with compatible dimensions. The property of softmax dictates that each row of Y (written as y_i) is a convex combination of the rows of XW_3 (written as X̄), i.e., y_i ∈ C(X̄), where C(X̄) stands for the convex hull of X̄ (see Fig. 2 for the area enclosed by thick dashed lines). The same process happens in multi-head attention modules: they operate in different projected spaces, but the observation of convex combination still holds.
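This convexity property is easy to verify numerically. The sketch below builds a single, randomly initialized self-attention map (a toy setup, not the paper's trained model) and checks that the softmax weights are nonnegative and sum to one, so each output row stays inside the convex hull of the rows of XW_3.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                      # 5 token representations
W1, W2, W3 = (rng.normal(size=(d, d)) for _ in range(3))

scores = (X @ W1) @ (X @ W2).T / np.sqrt(d)      # scaled dot-product scores
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)             # row-wise softmax
Xbar = X @ W3
Y = A @ Xbar                                     # each y_i mixes rows of Xbar

# Softmax guarantees nonnegative weights summing to 1, i.e. every y_i is a
# convex combination of the rows of Xbar, hence y_i lies in C(Xbar).
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
```

A direct consequence is that each coordinate of y_i is bounded by the column-wise minimum and maximum of X̄.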
We hypothesize that the successful defense rate (against attacks) is determined by the intersection of the new convex hull (created by a defense method) with the original convex hull (formed by the clean data): the larger the intersection, the better the defense performance. This leads to the following assumption.

Assumption 1. Given the latent representations of the clean input and its adversarial version, X and X′, for a very small ϵ ≈ 0, the successful defense probability is determined by Vol(C(X′_ϵ) ∩ C(X_ϵ)), where Vol(·) is a function estimating the volume of a geometric object, and X_ϵ is the ϵ-ball centred at X.
The ϵ-ball spans the convex hulls to the dimension of the ambient space so that the volume always exists. More importantly, it also reflects the model's tolerance to variation in vector representations, indicating that a small disturbance will not affect the model output. To ease notation, we omit ϵ in later development.
MI4D differs from RanMASK in that it performs no random elimination of input tokens, but a simple insertion of [MASK] while keeping both clean and polluted tokens. This choice leads to the fact that the convex hull formed by MI4D always contains those formed by RanMASK, as guaranteed by the following lemma.
Lemma 1. Given a set X, we have C(S) ⊆ C(X) for any subset S ⊆ X. Equality holds when S = X trivially, or otherwise when S contains all the anchor points of C(X), i.e., the convex hull vertices.
Proof. Let 𝒳 be the index set for X, and let a subset 𝒮 ⊆ 𝒳 give the indices for S. For any point p ∈ C(S), p = Σ_{i∈𝒮} λ_i x_i with λ_i ≥ 0 and Σ_{i∈𝒮} λ_i = 1, i.e., the convex condition. Apparently p ∈ C(X) as well, by setting λ_j = 0 for j ∈ 𝒳 \ 𝒮.
For any point x_i ∈ X, it is either an anchor point or an internal point with respect to C(X). If S contains all the anchor points, then C(S) = C(X), as the internal points can be "absorbed". To see this, assume x_1 is an internal point; then x_1 = Σ_{i>1} β_i x_i, where the β_i (i > 1) satisfy the convex condition. Any convex combination Σ_i λ_i x_i can then be rewritten as Σ_{i>1} (λ_i + λ_1 β_i) x_i, whose coefficients again satisfy the convex condition. Therefore, C(X) = C(X_{−1}), where X_{−1} is the set of vectors after removing x_1. After eliminating all internal points in this way, the convex hull remains the same.
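The "absorption" step in the proof can be checked numerically. The sketch below uses made-up 2D points: a convex combination that uses an internal point is rewritten using the anchors alone, without moving the point.

```python
import numpy as np

# Three anchor points in 2D, and one internal point x1 = mean of the anchors.
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x1 = anchors.mean(axis=0)          # internal: convex combo with beta_i = 1/3

# A convex combination that uses the internal point ...
lam = np.array([0.4, 0.3, 0.2, 0.1])       # weights for [x1, a0, a1, a2]
p = lam[0] * x1 + lam[1:] @ anchors

# ... can be rewritten over anchors only: fold x1's weight into the betas,
# giving new coefficients lam_i + lam_1 * beta_i that still sum to 1.
beta = np.full(3, 1.0 / 3.0)
lam_new = lam[1:] + lam[0] * beta
assert np.allclose(p, lam_new @ anchors)
```

Removing the internal point therefore leaves the convex hull unchanged, exactly as the lemma requires.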
The immediate result from the above lemma is the following corollary, stating the relation between the convex hulls generated by MI4D and RanMASK.
Corollary 1. The convex hull generated by MI4D always contains those generated by RanMASK.
Proof. Let X be the latent representation of input tokens for MI4D (including [MASK] tokens), and 𝒳 the corresponding index set. RanMASK runs several, say n, rounds of random elimination while keeping [MASK] tokens, leading to index subsets 𝒮_i ⊂ 𝒳 (i = 1, …, n). Clearly, from Lemma 1, we have C(S_i) ⊆ C(X) for all i, where S_i is the set of representations in X indexed by 𝒮_i. Equality holds only when S_i contains all the anchor points of C(X).
Next, we are ready to formalize and prove the claim as the following proposition.
Proposition 1. Under Assumption 1, MI4D has at least the same successful defense rate as RanMASK; in other words, MI4D's adversarial defense performance is at least as good as that of RanMASK.
Proof. Let X′ (X) be the adversarial (original) representations in latent space, and S_i be the i-th subset of X′. Denote the successful defense probability of MI4D, and of RanMASK's i-th run, as p^m and p^r_i, respectively; by Assumption 1, each is determined by the volume of the intersection of the corresponding convex hull with C(X). For RanMASK to succeed, the successful S_i's have to be chosen and form a majority, so its final success probability is p^r = P(∃i, S_i succeeds ∧ the successful sets form a majority). Clearly, p^r ≤ Σ_{k>n/2} B(k; n, max_i p^r_i) ≤ max_i(p^r_i) ≤ p^m, where B(k; n, p) = (n choose k) p^k (1 − p)^{n−k} is the probability of k successes out of n trials, each with probability p. The last inequality comes from Corollary 1, as p^m ≥ p^r_i (∀i) and hence p^m ≥ max_i(p^r_i).
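The binomial term B(k; n, p) in the proof can be computed directly. The sketch below (with made-up numbers) evaluates the majority-vote success probability Σ_{k>n/2} B(k; n, p) and illustrates that when the per-run success rate is below one half, voting over runs only degrades it.

```python
from math import comb

def majority_prob(n, p):
    """P(more than half of n independent runs succeed), each with prob p:
    the sum of B(k; n, p) = C(n, k) p^k (1-p)^(n-k) over k > n/2."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With a per-run success rate below 1/2, the ensemble success probability
# of a majority vote drops below the single-run rate.
p_single = 0.4
assert majority_prob(11, p_single) < p_single
```

Conversely, for per-run rates above one half the vote amplifies success, which is the regime randomized smoothing relies on.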
Overall, an illustration is shown in Fig. 2, with the convex hull of MI4D (C_m) and those of RanMASK (C_r) under two different random eliminations. The gray area shows the intersection of the MI4D convex hull with the original convex hull C, i.e., C ∩ C_m, while the striped areas are the intersections for RanMASK, i.e., C ∩ C_r. We know that C ∩ C_m always contains C ∩ C_r. As such, the span of the convex hull after [MASK] insertion is critical to retaining more solution space and enhancing successful defense. Additionally, we also infer that the position of the inserted [MASK] tokens and the number of insertions are less important (multiple insertions differ only in their positional encoding), as they may well lie within the ϵ-ball of the same [MASK] token itself. These inferences are verified in the ablation study.

Experimental setup

Attacking algorithms. Four adversarial attacking methods are implemented using TextAttack (Morris et al., 2020) to pollute input sequences:
• DeepWordBug (Gao et al., 2018) deletes, replaces, and inserts characters in inputs;
• TextBugger (Li et al., 2019) performs perturbations of space insertion, character deletion/swapping, and synonym substitution;
• BERT-Attack (Li et al., 2020) substitutes key words using a pre-trained masked language model;
• TextFooler replaces important words with their synonyms.
Evaluation metrics. Three measurements are considered to evaluate model robustness against adversarial attacks. Specifically, Cln% refers to the model's classification accuracy on the original clean data. Aua% is the classification accuracy under a given adversarial attack; higher Aua% means better defense performance. Suc% is defined as the number of examples successfully fooled over the number of all attempted attacks; accordingly, lower Suc% indicates higher model robustness.
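For concreteness, the three metrics can be computed from per-example outcomes as sketched below. The helper name is hypothetical, and the exact accounting (attacks being attempted only on examples the model classifies correctly when clean) is our assumption about the standard protocol.

```python
def robustness_metrics(clean_correct, attacked_correct):
    """Compute Cln%, Aua%, and Suc% from per-example outcomes.

    clean_correct[i]    -- 1 if the model is correct on clean example i
    attacked_correct[i] -- 1 if it is still correct after attacking example i
    Assumes attacks are attempted only on examples correct when clean.
    """
    n = len(clean_correct)
    cln = 100.0 * sum(clean_correct) / n
    aua = 100.0 * sum(attacked_correct) / n
    attempted = sum(1 for c in clean_correct if c)
    fooled = sum(1 for c, a in zip(clean_correct, attacked_correct)
                 if c and not a)
    suc = 100.0 * fooled / attempted if attempted else 0.0
    return cln, aua, suc

# 4 of 5 clean examples correct; attacks fool 2 of the 4 attempted.
cln, aua, suc = robustness_metrics([1, 1, 1, 1, 0], [1, 1, 0, 0, 0])
```

Under this accounting, Aua% can never exceed Cln%, and Suc% is the fraction of attempted attacks that succeed.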
All experiments are repeated in five trials with different random seeds for each dataset. For each trial, training is performed with batches of 32 sequences of length 512, for a maximum of 10 epochs. Meanwhile, 10% of the samples are randomly selected from the training set to form the validation set, and training stops if the validation accuracy fails to improve for one epoch. For evaluation, 1,000 testing examples are randomly selected, following the typical experimental setting in prior work. More details are provided in Appendix A.1.

Main results
The following state-of-the-art defense methods are employed for comparison with the proposed MI4D: WETAR-D (Xu et al., 2022), FreeLB++, IB, InfoBERT, Flooding-X (Liu et al., 2022), and RanMASK. Among them, the first two methods are based on adversarial data augmentation, while IB, InfoBERT, and Flooding-X are model-enhancement methods; the last represents randomized smoothing. The RoBERTa-base model (Liu et al., 2019) is employed as the Baseline. All contender methods are re-implemented using their released code, with key configurations summarized in Appendix A.1; their results are comparable to those reported. Additionally, for MI4D most hyperparameters, such as the learning rate, are kept consistent with the vanilla RoBERTa-base, while the masking budget b_M is set to 30%. The comparison results over five trials are shown in Table 1.
To begin with, the proposed MI4D achieves competitive averaged Cln% (94.35%) across all three clean testing datasets. This performance is second only to that of Flooding-X (94.40% on average), while a consistent improvement is observed in comparison with the other existing methods. Importantly, MI4D achieves state-of-the-art defense accuracy in terms of Aua% (60.18%) and Suc% (36.40%), outperforming all contenders. Notably, all methods seemingly perform better against character-level attacks (DeepWordBug and TextBugger), which demonstrates the difficulty of defending against word-based attacks. Yet, MI4D still achieves the largest improvement over the Baseline, securing average gains of 43.85 and 46.85 absolute points on Aua% and Suc% for TextFooler and BERT-Attack, respectively.
By contrast, the other [MASK]-based approach (i.e., RanMASK) scores the worst performance across the three datasets. The main difference between RanMASK and our method lies in the usage of [MASK] tokens (substitution versus insertion). By replacing residual tokens after attacking, RanMASK can further destroy the original semantics of input sequences. MI4D, however, spans the semantic convex hull to increase the chance of including the original anchor points, as Lemma 1 and Corollary 1 indicate, thereby enhancing the defense performance.

Ablation study
To better understand the effectiveness of the proposed method, a series of careful analyses is carried out. Again, all results are reported as accuracy averaged over five trials. On the masking location. To start with, an ablation experiment is performed to understand the impact of the location of the inserted [MASK] tokens. In addition to inserting them after the [CLS] token (labeled Head hereafter), a Random variant inserts [MASK] at positions drawn by uniform sampling until b_M is met (where b_M is the masking budget). Similarly, we also consider inserting at the end of the input sequence (labeled Tail).
With b_M = 0.3, the comparison results on three datasets under the TextFooler attack are shown in Fig. 3 (results for other attack methods can be found in Appendix A.2). Clearly, the proposed method is insensitive to the masking location, given the similar performance achieved by Head, Random, and Tail insertion. This shows that positional encoding has a negligible effect, as asserted in the Analysis: the position embedding is less important than the token embedding. In the MI4D context, exactly the same [MASK] tokens are inserted, and they do not change the relative order of the existing tokens. Position embeddings can therefore be seen as a small disturbance "tagged" onto the token embeddings, and have little impact on MI4D.
On the masking budget. The following experiments evaluate the impact of the masking budget (b_M) on the proposed method. Obviously, a higher value of b_M means more [MASK] tokens are inserted, which could lead to more heavily perturbed samples. Specifically, experiments are conducted by varying b_M from 0.1 to 0.9. We point out that for the SST2 dataset, b_M = 0.1 is equivalent to injecting only one [MASK] token, given the average length of its input sequences. As such, we are particularly interested in the model performance with and without [MASK].
The comparison of MI4D's performance against the TextFooler attack on the three datasets is shown in Fig. 4 (results for other attack methods can be found in Appendix A.3). Notably, the results demonstrate the robustness of the proposed method to different masking budgets: MI4D exhibits a stable defense performance across all three evaluation metrics. As it is the span of the convex hull that matters, rather than its multiplicity, this experiment once again confirms our inference in the Analysis.

Discussion
In this section, we investigate different strategies for utilizing [MASK] tokens and seek a reasonable explanation for the results. Again, experiments are conducted with [MASK] inserted after [CLS] and b_M = 0.3. When to insert. First, we discuss whether to insert [MASK] during training and/or inference. That is, three scenarios are considered: (1) insertion only during training, (2) insertion only during inference, and (3) insertion during both training and testing (equivalent to MI4D).
The comparison is shown in Table 2. The "Train only" variant exhibits the worst performance due to its mostly collapsed convex hull, while the others have a more developed convex hull that embraces the original solution space. We highlight that including [MASK] in training fine-tunes its token embedding into a semantic placeholder, and hence a "wild card". Accordingly, the capacity to span the convex hull so as to intersect the original one is further enhanced, even though [MASK] was already employed in the extensive pre-training of PLMs. What to substitute. Next, the impact of substituting/masking different types of tokens is discussed, where tokens are cast as polluted (being attacked) and normal (remaining unchanged).
The following experiment then compares MI4D with three variants: (1) only substitute polluted tokens (labeled Mask_Pol), (2) only substitute normal tokens (labeled Mask_Normal), and (3) substitute randomly (explicitly, RanMASK). Fig. 5 illustrates the defense accuracy of masking different types of tokens on the SST2 dataset (results for other datasets can be found in Appendix A.4). The comparison clearly implies that the best performance is achieved by substituting/masking polluted tokens (i.e., Mask_Pol), while Mask_Normal is the worst. Our hypothesis is that when polluted tokens are replaced by [MASK], the resulting convex hull has the maximum overlap with the original one, and hence the best chance of defense. By contrast, Mask_Normal introduces more noise by maintaining perturbed tokens while removing normal ones. Notably, as there is in reality no clue as to which specific tokens are adversarially attacked, Mask_Pol and Mask_Normal reveal the theoretically best and worst defense outcomes (the upper and lower bounds), respectively.
RanMASK is then a special combination of Mask_Pol and Mask_Normal, as both polluted and normal tokens are randomly masked out with a predefined probability. The proposed MI4D, on the other hand, is an effective solution for masking inputs (given the uncertainty about which tokens are polluted during testing) that is consistently better than RanMASK (which randomly masks tokens). Again, the reason is that MI4D keeps all tokens, exploiting the fact that polluted tokens are still somewhat informative when combined with the residual ones, to create a larger convex hull overlap with the original one than RanMASK does, as shown in Proposition 1.
[MASK] or others. The last experiment investigates the possibility of injecting tokens other than [MASK]. Specifically, the [PAD] token is selected and inserted into the original input sequence. Note that all other configurations (such as the masking budget and the random insertion) remain the same; only [MASK] is replaced with [PAD] for the injection. Table 3 reports the average performance on the SST2 dataset under the four attacks. As observed, the performance using [PAD] is similar to that of [MASK], indicating that we can insert [MASK] (or a similar token) as a "wild card" to increase the span of the convex hull.

Conclusion
We propose a novel adversarial defense algorithm (MI4D) that is hyperparameter insensitive and structure free. The proposed method simply inserts [MASK] tokens at the beginning of input sequences and follows normal fine-tuning to train the model. Theoretically, we have argued that adding extra [MASK] tokens, while retaining all other tokens, creates a larger convex hull overlap with that of the clean input and thus increases the defense probability. Empirically, the proposed algorithm exhibits superior performance over existing state-of-the-art methods on three benchmark datasets under four attack methods.
In future work, we could incorporate external knowledge for more strategic masking. More importantly, MI4D is agnostic to the downstream task, i.e., it could be incorporated into other applications.

Limitations
Our theoretical analysis rests on a crucial assumption: that the successful defense probability is determined by the volume of the convex hull formed by the input. Although our empirical results repeatedly confirm the inferences based on this assumption (shown in Section 4.4), we are still seeking the direct dynamics linking the convex hull to the prediction/classification probability, from which a more rigorous result may be derived. We envisage that understanding the current model behavior can lead to more robust models against adversarial attacks, and hence further improvements in text classification.

A.1 Training Details
The RoBERTa-base model (Liu et al., 2019) is employed as the contextual encoder. The dropout rate across all layers is set to 0.1. The Adam optimizer with a dynamic learning rate is adopted: the learning rate is warmed up over 10 thousand steps to a maximum value of 2 × 10⁻⁵ before decaying to a minimum value of 1 × 10⁻⁶ by cosine annealing, with gradients clipped to (−1, 1). Additionally, for WETAR-D, 50% of the samples in the validation set (of size 256) are polluted; for FreeLB++, the number of search steps (for adversarial samples) is 30; for IB, the hidden dimension of the IB layer is set to 150 and the penalty of the IB loss to 0.1; for InfoBERT, the penalty of the mutual-information loss is 5 × 10⁻²; for RanMASK, the masking budget is set to 30%, and a majority vote is adopted for the final classification. Finally, all models are run on a machine with an NVIDIA Tesla V100 PCIe GPU with 32 GB of memory.
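The learning-rate schedule above (linear warmup to the maximum, then cosine annealing down to the minimum) can be sketched as follows. The total step count is a hypothetical placeholder, since the true value depends on the dataset size and batch count.

```python
import math

def lr_at(step, warmup=10_000, total=100_000, lr_max=2e-5, lr_min=1e-6):
    """Linear warmup to lr_max over `warmup` steps, then cosine annealing
    down to lr_min by `total` steps (step counts are illustrative)."""
    if step < warmup:
        return lr_max * step / warmup          # linear warmup phase
    t = min(1.0, (step - warmup) / (total - warmup))
    # half-cosine from lr_max (t = 0) down to lr_min (t = 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

In a real training loop the same curve would typically be handed to the optimizer via a framework scheduler rather than computed by hand.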

A.2 Impact from the location of inserting [MASK]
The model accuracy is evaluated by adding the [MASK] tokens at different locations, i.e., either after [CLS] (termed Head), randomly across the input sequence (termed Random), or at the end (termed Tail). The comparison is shown in Fig. 6, and the results illustrate that the proposed method achieves similar performance regardless of the [MASK] insertion location.

A.3 Impact from the [MASK] budget
The model accuracy is also evaluated as a function of the masking budget. The comparison is shown in Fig. 7.