Model-tuning Via Prompts Makes NLP Models Adversarially Robust

In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than appending an MLP head to make the output prediction, MVP appends a prompt template to the input and makes its prediction via text infilling/completion. Across 5 NLP datasets, 4 adversarial attacks, and 3 different models, MVP improves performance against adversarial substitutions by an average of 8% over standard methods and even outperforms adversarial-training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in adversarial robustness while maintaining performance on unperturbed examples. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and to the randomly initialized MLP parameters.


Introduction
Pre-trained NLP models (Devlin et al., 2019; Liu et al., 2019) are typically adapted to downstream tasks by (i) appending a randomly initialized multilayer perceptron to their topmost representation layer; and then (ii) fine-tuning the resulting model on downstream data (MLP-FT). More recently, work on large language models has demonstrated comparable performance without fine-tuning, by just prompting the model with a prefix containing several examples of inputs and corresponding target values (Brown et al., 2020). More broadly, prompting approaches recast classification problems as sequence completion (or mask infilling) tasks by embedding the example of interest into a prompt template. The model's output is then mapped to a set of candidate answers to make the final prediction. Prompting has emerged as an effective strategy for large-scale language models (Lester et al., 2021), and its utility has also been demonstrated for masked language models (Gao et al., 2021).
While fine-tuned models perform well on in-distribution data, a growing body of work demonstrates that they remain brittle to adversarial perturbations (Jin et al., 2020; Li et al., 2020; Morris et al., 2020a). Even small changes to the input text, such as synonym replacements (Ebrahimi et al., 2018b) and adversarial misspellings (Ebrahimi et al., 2018a; Pruthi et al., 2019), drastically degrade the accuracy of text classification models. While prompting has become a popular approach for adapting pretrained models to downstream data, little work has considered interactions between adaptation strategies and adversarial robustness.
In this work, we first demonstrate surprising benefits of Model-tuning Via Prompts (MVP) in terms of robustness to adversarial substitutions, as compared to the standard approach of fine-tuning models with an MLP head (MLP-FT). Notably, MVP, which does not utilize any sort of adversarial training or prompt optimization/engineering, already yields higher adversarial robustness than state-of-the-art methods utilizing adversarial training, by an average of 3.5% across five datasets (classification, boolean question answering, and paraphrase detection), three models (BERT, RoBERTa, and GPT-2), and four attacks (word- and character-level substitutions) (§5). Moreover, we find that combining MVP with single-step adversarial training can further boost adversarial robustness, resulting in combined robustness gains of more than 10% over the baselines. This happens without any loss in accuracy on unperturbed inputs, indicating how well the objective of adversarial training couples with MVP.
So far, prior works have not explored the idea of fine-tuning all the parameters of a model via prompts (we call this setup full-model full-data fine-tuning). We only see instances of (i) fine-tuning the full model via prompts in a few-shot setting (Gao et al., 2021), or (ii) fine-tuning additional tunable parameters using prompts on top of a frozen model while utilizing the complete training set (Li and Liang, 2021). We believe full-model full-data fine-tuning via prompts has gone unexplored because the clean accuracy improvements of MVP over MLP-FT are negligible, and the robustness advantages of MVP were previously undiscovered.
Second, we show that MVP as a method for classification (i) is more sample efficient, and (ii) has higher effective robustness than MLP-FT (§5.1). That is, MVP requires fewer training samples to achieve the same clean accuracy; and for any given clean accuracy, the robust accuracy of MVP is higher than that of MLP-FT. Through ablation studies (§5.3), we find that (i) adding multiple prompt templates makes it harder to fool the model; and (ii) having multiple candidate answers has a small but positive impact on robustness.
Third, to explain our observations, we test a set of hypotheses (§6), including (i) random parameter vulnerability: is adding a randomly initialized linear head the source of adversarial vulnerability for MLP-FT?; (ii) pretraining task alignment: can the gains in robustness be attributed to the alignment between the fine-tuning and pretraining tasks in MVP?; and (iii) semantically similar candidates: are predictions by MVP more robust because the candidate answers are semantically similar to the class labels?
Through experiments designed to test these hypotheses, we find that (i) in the absence of pretraining, MVP and MLP-FT have similar robustness, supporting the pretraining task alignment hypothesis; (ii) adding extra randomly initialized parameters to MVP leads to a sharp drop in robustness, whereas removing the dense (768, 768) randomly initialized weight matrix from MLP-FT improves the robustness of the model significantly; and (iii) even random candidate answers such as 'jack' and 'jill' result in similar robustness gains, suggesting that when fine-tuning through prompts, the choice of candidate answers is inconsequential (in contrast, this choice is known to be crucial for few-shot classification).
Fourth, we perform a user study (§7) to assess the validity of adversarial examples. We find that human annotators were 23% more likely to judge adversarial examples as perturbed than clean examples. Moreover, humans achieved 11% lower accuracy on adversarial examples than on clean examples, with average confidence in the label of perturbed examples being 15% lower. This highlights that a large fraction of adversarial examples are already detectable by humans, and often change the true label of the input, signifying that MVP is more robust than the raw statistics in §5 suggest. Future work will benefit from developing better evaluation strategies for the robustness of NLP models.
Fifth, going beyond adversarial robustness, we investigate the robustness gains of MVP over MLP-FT on out-of-distribution (OOD) tasks. We find that MVP improves robustness by 2% across 5 different OOD sentiment analysis tasks (§5.2).
In summary, we demonstrate that models tuned via prompts (MVP) are considerably more robust than models adapted through the conventional approach of fine-tuning with an MLP head. Our findings suggest that practitioners should adopt MVP as a means of fine-tuning, regardless of the training data size (few-shot or full data) and model capacity.
Related Work

Adversarial examples have been discovered across several tasks, such as classification (Zhang et al., 2015b; Alzantot et al., 2018), NMT (Belinkov and Bisk, 2018), and question answering (Jia and Liang, 2017), but these studies were restricted to small models such as LSTMs and RNNs. Among others, Jin et al. (2020) and Li et al. (2020) show that despite producing massive gains on standard NLP benchmarks, BERT-style pretrained models are susceptible to adversarial attacks when fine-tuned on downstream tasks. Subsequently, multiple works have attempted to develop fast and semantically meaningful attacks (Li et al., 2018) and scalable defenses (Wang and Bansal, 2018; Jia et al., 2019; Wang et al., 2021b; Si et al., 2021b; Zhu et al., 2020) for masked language models. Yang et al. (2022) leverage prompts to generate adversarial examples on which they train their model using MLP-FT. Despite these efforts, NLP models suffer a significant drop in robust accuracy when compared to clean accuracy on the same task.
Prompting NLP Models Prompting gained traction with GPT-3 (Brown et al., 2020), where it was primarily used in the zero-shot and few-shot settings and required manual trials to increase performance. In the zero-shot setting, no labeled examples are provided and the language model is kept frozen; the model must output its prediction using only the prompt provided. In the few-shot setting, a few task-specific labeled examples are also provided to the frozen model in addition to the prompt (also known as in-context learning) (Rubin et al., 2022; Levine et al., 2022). A lot of work has gone into improving the prompts used in the zero-shot and few-shot settings, including mining-based methods to automatically augment prompts (Jiang et al., 2020), gradient-based search (Shin et al., 2020), using generative language models (Gao et al., 2021), and others (Hu et al., 2022; Schick and Schütze, 2021b,a). In the full-data setting, previous works have explored prompting via prompt tuning (Liu et al., 2022; Li and Liang, 2021; Qin and Eisner, 2021; Lester et al., 2021), where the model is injected with additional tunable parameters. None of these works discuss the robustness advantages of prompting (especially in the adversarial context) when compared to standard fine-tuning approaches.
Robust Fine-tuning and Adaptation In the vision literature, prior works have used prompting to improve out-of-distribution robustness in the zero-shot and few-shot settings (Zhou et al., 2022a,b). Kumar et al. (2022) observed that fine-tuning worsens the out-of-distribution (OOD) performance of models due to the bias introduced by a randomly-initialized head on top of the CLIP model, and instead suggest a procedure (LPFT) that first fits the linear head and then fine-tunes the model. Later works have shown that this ID/OOD performance trade-off can be mitigated by averaging model weights between the original zero-shot and fine-tuned models (Wortsman et al., 2022) and/or by fine-tuning with an objective similar to that used for pretraining (Goyal et al., 2022). However, these approaches have been applied only to vision-language models, and they deal only with "natural" robustness evaluations rather than the adversarial manipulations we consider here.

Method
We consider the task of supervised text classification, where we have a dataset S = {(x^(i), y^(i))}_{i=1}^{n}, with x^(i) ∈ X and y^(i) ∈ {1, ..., k} for a k-class classification problem. We train a classifier f to predict y based on input x. We follow the terminology of Schick and Schütze (2021a). The input x can be decomposed as a sequence of words {x_1, x_2, ..., x_l}, and the output y is a positive integer, with each value corresponding to a particular class. The prompt template t is the input string we append at the beginning or end of the input. For example, we may append the following template at the end of a movie review: "This movie is [MASK]". The candidate answers A are a set of words corresponding to each class. For example, the positive sentiment class can have the following candidate answers: {great, good, amazing}.
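As a concrete illustration of this terminology, consider the following minimal sketch for binary sentiment classification; the template string and candidate sets are hypothetical examples, not the exact choices used in our experiments:

```python
# Minimal sketch of the terminology above for binary sentiment
# classification. Template and candidate words are illustrative only.

def build_prompt_input(x, template="This movie is [MASK]"):
    """Construct the prompt input x_t by appending the template t."""
    return f"{x} {template}"

# Candidate answers A: a set of words for each class label y.
candidate_answers = {
    0: {"terrible", "bad", "awful"},   # negative sentiment
    1: {"great", "good", "amazing"},   # positive sentiment
}

x_t = build_prompt_input("A gripping, beautifully shot film.")
```

The language model then scores candidate words at the [MASK] position of x_t, and those scores are mapped back to class labels.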

Adversarial Attacks
We are concerned with perturbations to the input x that change the model prediction. In the case of adversarial attacks confined to synonym substitutions, the adversary searches for a replacement x̃_i within the synonym set of each word x_i in the input. In the case of character-level substitutions, we instead consider substitutions of the characters that compose each x_i in the input.
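To make the synonym-substitution search space concrete, here is a toy greedy attack sketch. The `synonyms` table and `score` function are hypothetical stand-ins for an attack's real neighbor sets and the victim model's true-class score; real attacks such as TextFooler additionally enforce semantic-similarity and part-of-speech constraints that we omit here:

```python
# Toy greedy synonym-substitution attack: for each word, try its
# synonyms and keep the substitution that lowers the model's score
# for the true class the most.

def greedy_substitute(words, synonyms, score):
    words = list(words)
    for i, w in enumerate(words):
        best, best_score = w, score(words)
        for s in synonyms.get(w, []):
            trial = words[:i] + [s] + words[i + 1:]
            if score(trial) < best_score:   # substitution hurts the model more
                best, best_score = s, score(trial)
        words[i] = best
    return words
```

With a toy scorer that counts occurrences of "good" and a synonym table mapping "good" to "fine", the attack rewrites ["a", "good", "movie"] to ["a", "fine", "movie"].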

Model-tuning Via Prompts (MVP)
We present the overall pipeline of MVP below.

Input Modification Consider a prompt template t = t_1, t_2, ..., [MASK], ..., t_m. For any input x, the prompt input x_t can be constructed by appending the template at the beginning or end of the input. The final output is based on the most likely substitution for the [MASK] token, as given by the language model. Typically, we use a set of prompt templates, denoted by T.
Inference For every class label, we have a set of candidate answers associated with it. During inference, we do the following: (i) for every class label, select the candidate corresponding to the largest logit value among its candidate set; (ii) take the mean of the logits corresponding to the selected candidates over all the templates to compute the final logit of the given class label; and (iii) predict the class having the highest final logit.
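The three inference steps can be sketched as follows, assuming `logits_per_template` holds the language model's [MASK]-token logit for each candidate word under each template (a simplified stand-in for the actual model outputs):

```python
# Sketch of MVP inference: logits_per_template is a list (one entry
# per template) of dicts mapping candidate word -> [MASK] logit.

def mvp_predict(logits_per_template, candidate_answers):
    class_scores = {}
    for label, candidates in candidate_answers.items():
        per_template = []
        for logits in logits_per_template:
            # (i) best candidate for this class under this template
            per_template.append(max(logits[w] for w in candidates))
        # (ii) average the selected logits across templates
        class_scores[label] = sum(per_template) / len(per_template)
    # (iii) predict the class with the highest averaged logit
    return max(class_scores, key=class_scores.get)
```

For example, with candidates {0: {"bad"}, 1: {"good", "great"}} and two templates whose best positive-candidate logits average higher than the negative ones, the function returns class 1.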

MVP + Single-step Adv
Based on the Fast Gradient Sign Method (FGSM) of Goodfellow et al. (2014), we perform single-step adversarial training. Note that the input tokens are discrete, and hence it is not possible to perturb the inputs directly. Instead, we pass the inputs through the embedding layer of the model and then perform adversarial perturbations in the embedding space. We do not perturb the embeddings corresponding to the prompt tokens. We find that performing single-step perturbations under an ℓ2 constraint leads to more stable training than in the ℓ∞ norm ball, and we use this setting for all our experiments. Similar (but not equivalent) approaches have also been studied in the literature (Si et al., 2021a).
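The ℓ2-constrained perturbation step can be sketched as follows; this is a pure-Python stand-in for the batched tensor computation, and the step size `eps` is a hypothetical hyperparameter:

```python
import math

def l2_adv_step(embedding, grad, eps=1.0):
    """Single-step l2 adversarial perturbation: move a (non-prompt)
    token embedding along the loss gradient, with the step scaled to
    l2 norm eps. This is the l2 analogue of FGSM's sign step; prompt
    tokens are simply never passed through this function."""
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid div by zero
    return [e + eps * g / norm for e, g in zip(embedding, grad)]
```

During training, the model is then optimized on these perturbed embeddings rather than the clean ones.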

Experimental Setup
Datasets and Models We perform our experiments on five different datasets: AG News (Zhang et al., 2015b) (4-class topic classification), SST-2 (Socher et al., 2013) (binary sentiment classification), BoolQ (Clark et al., 2019) (boolean question answering), DBPedia14 (Zhang et al., 2015a) (14-class topic classification), and MRPC (Dolan and Brockett, 2005) (paraphrase detection). Results on DBPedia14 and MRPC are presented in Appendix C.1. All models are trained with the RoBERTa-Base (Zhuang et al., 2021) backbone; experiments on GPT-2 and BERT-Base (Devlin et al., 2019) are reported in the appendix.

Baseline Methods We now describe the terminology used to denote training schemes corresponding to various fine-tuning strategies. MLP-FT is the "base" model for classification via standard non-adversarial training, and is utilized by all the baselines. Given a pre-trained model, we perform downstream fine-tuning by adding an MLP layer to the output corresponding to the [CLS] token, as illustrated in Figure 1(a). This hidden representation is of size 768 × 1. In the case of the BERT model, there is a single dense layer of dimension 768 × 2, whereas in the case of the RoBERTa model, a two-layer MLP is used to make the final prediction. MLP-FT + Adv is identical to the method used for adversarial training in Section 3.2, except that we perform adversarial perturbations in the embedding space of the MLP-FT model, rather than MVP. To compare with state-of-the-art adversarial-training-based defenses, we consider FreeLB++ (Li et al., 2021) (free large-batch adversarial training using projected gradient descent), InfoBERT (Wang et al., 2021a) (an information bottleneck regularizer to suppress noisy information), and AMDA (Si et al., 2021b) (adversarial and mixup data augmentation for creating new training examples via interpolation). We provide complete details pertaining to each baseline method in Appendix B.1.

Results
We first evaluate the impact of MVP on the adversarial robustness of NLP models. For the task of boolean question answering (BoolQ), we find that fine-tuning a RoBERTa model with an MLP head (MLP-FT) achieves an accuracy of 28.2% on adversarial examples obtained through the TextFooler attack strategy (Table 1). The corresponding accuracy when tuning the model via prompts (MVP) is 42.9%, a considerable improvement over MLP-FT. Additionally, MVP leads to more robust models than adversarial training baselines like MLP-FT + Adv and InfoBERT, which attain accuracies of 39.0% and 38.1% respectively. Further, MVP can be combined with adversarial training (MVP + Adv), which leads to an accuracy of 52.2%, about a 10% improvement over MVP, without any loss in clean performance. As with boolean question answering, the robustness advantages of MVP hold across the three tasks we examine. The individual performance statistics are detailed in Table 1. Overall, across four attack strategies and three datasets, MVP improves over MLP-FT by 8%. Remarkably, even in the absence of any adversarial training, MVP achieves state-of-the-art adversarial performance, improving over baseline adversarial training methods by 3.5%. Moreover, it can be coupled with single-step adversarial training, resulting in an overall 7% improvement over state-of-the-art methods. Lastly, these robustness benefits come at only a 2x computation cost over standard training, as opposed to past works, which need 5-10x the computation cost of standard training due to additional adversarial training steps. Results on BERT-Base are in Table 7.

Sample Efficiency & Effective Robustness
We investigate the sample efficiency and effective robustness of MVP through experiments on the BoolQ and AG News datasets using the RoBERTa-base model. We train models on randomly sampled fractions of the dataset, ranging from 5×10^-4 to 0.1.

Sample Efficiency
We compare the performance of MVP and MLP-FT in low-data regimes. We find that MVP yields models that are consistently more robust than those trained through MLP-FT in low-data setups (see Figure 2a). In fact, we observe that in the extremely low-resource case (only 60 examples), it is hard to learn using MLP-FT, but the model trained through MVP performs exceedingly well. We note that the relative benefit of MVP over MLP-FT peaks at around 5-10% of the data. Interestingly, the model trained through MVP requires only 5% of the samples to achieve robustness similar to models trained with MLP-FT on the full dataset. In addition to robustness benefits, we find that MVP achieves considerably higher clean accuracy in low-data regimes (i.e., with < 200 examples). Results on BoolQ are in Appendix C.3.
Effective Robustness Effective robustness (Taori et al., 2021) measures the robust accuracy of models that have the same clean accuracy. This can help determine which training-time design decisions will be valuable when scaled up. We measure the effective robustness of models trained through MVP and MLP-FT by training them on different data sizes. We find that even when MLP-FT and MVP achieve the same clean accuracy, models trained through MVP are more robust (Figure 2b). Results on AG News are presented in Appendix C.3.

Out of Distribution Robustness
Going beyond adversarial robustness, we now perform experiments to assess the out-of-distribution robustness of MVP, MLP-FT, and LPFT. We use 5 sentiment classification datasets, namely SST-2, Amazon Polarity (Zhang et al., 2016), IMDb (Maas et al., 2011), Movie Rationales (Zaidan et al., 2008), and Rotten Tomatoes (Pang and Lee, 2005). We fine-tune a RoBERTa model on 1000 examples from each of these datasets and evaluate on all of them. Since all of these are binary sentiment analysis datasets, we use the same template and candidate words across all the models (for both training and evaluation). Based on our investigation, we see that across 5 different models (and 20 evaluations), the average accuracy of MVP (89.65%) is 2% higher than that of MLP-FT and 1.3% higher than that of LPFT. These results, in Table 3, show that MVP is superior to MLP-FT and LPFT for both adversarial and OOD robustness. In summary, LPFT helps reduce the impact of random parameter vulnerability, but MVP additionally allows pre-training task alignment (the second hypothesis), hence resulting in superior performance and no fundamental trade-off, whether for OOD or adversarial robustness.

Ablation Studies
Number of Candidate Answers A larger candidate answer set has been shown to improve clean performance in the few-shot setting (Hu et al., 2022). Here, we investigate the impact of the size of the candidate answer set on the adversarial performance of models tuned via prompts. The adversarial accuracy of the model with a single candidate answer is 42.9%, and it increases to 46.2% upon using an answer set comprising 4 candidates. These results correspond to the RoBERTa-base model on the BoolQ dataset against adversarial perturbations from the TextFooler attack. Overall, we observe an improvement of 1.0-3.5% in adversarial accuracy when we use a larger candidate set across different settings (Table 2). A more detailed analysis with a single prompt template is provided in Appendix D.4.

Table 3: OOD Robustness: The results report the standard accuracy (in %) of a model trained on the dataset in the leftmost column and evaluated on 5 different OOD datasets. We see that across 5 different models (and 20 evaluations), the average accuracy of MVP (89.65%) on OOD tasks is 2% higher than that of MLP-FT and 1.3% higher than that of LPFT.

Number of Prompt Templates
Another design choice we consider is the number of prompt templates used for prediction. We conjecture that the adversary may find it difficult to flip the model prediction when we average logits across multiple templates. To evaluate this, we train MVP with different numbers of prompt templates (ranging from 1 to 4) and compare the adversarial robustness. We notice a steady improvement in adversarial accuracy as we increase the number of templates, which supports our initial conjecture (see Table 2). While increasing the number of templates improves the robustness of the downstream model, MVP achieves large robustness gains even with a single template (compared to MLP-FT). Hence, using multiple prompt templates is not the fundamental reason for the improved robustness of MVP. Further, to assess the impact of the choice of prompt templates used, we perform a more detailed analysis of the impact of prompt tuning on the adversarial robustness of MVP in Appendix D.2. We find that even empty or random templates perform nearly as well as well-crafted prompts, and retain the robustness advantages of MVP over MLP-FT.
Why Does MVP Improve Robustness?
We test three hypotheses to explain the robustness gains achieved by MVP compared to MLP-FT in the context of adversarial attacks.
³ Details about candidates and templates are in Appendix A.

Random Parameter Vulnerability One plausible explanation for the observed adversarial vulnerability of MLP-FT is the randomly-initialized linear head used for downstream classification. The intuition is that fine-tuning a set of randomly-initialized parameters may lead to feature distortion of the pretrained model, as demonstrated by Kumar et al. (2022). This phenomenon has also been observed in CLIP models (Radford et al., 2021), where the authors found that fine-tuning the model with a randomly initialized linear prediction head reduces its out-of-distribution robustness. The phenomenon is unexplored in the context of adversarial robustness.
We study this effect through three experiments.
1. ProjectCLS: First, we reduce the number of random parameters by removing the dense layer of weights (768 × 768 parameters) from the standard MLP architecture. We call this ProjectCLS, and use only a projection layer of 768 × C parameters, with C being the number of classes (see Figure 3(a)). We find that ProjectCLS is on average ∼8% more robust than MLP-FT, which suggests that reducing the number of randomly initialized parameters helps to increase model robustness (see Table 4).
2. CLSPrompt: Second, we train another model, CLSPrompt, where instead of using the probabilities corresponding to the [MASK] token as in MVP, we use the probabilities of the candidate answers corresponding to the [CLS] token (see Figure 3(b)). The key difference between CLSPrompt and MLP-FT is that there are no randomly initialized MLP parameters in CLSPrompt; we use the probabilities corresponding to the candidate answers instead of projecting the representations with new parameters. From Table 4, we observe that CLSPrompt is once again on average ∼8% more robust than MLP-FT, which provides strong evidence in favor of our hypothesis of random parameter vulnerability.
3. LPFT (linear probe, then fine-tune): For our third experiment, we train two new models, LPFT and DenseLPFT (see Figure 3(c,d)). In both models, we do the following: (i) fit a logistic regression to the hidden states corresponding to the [CLS] token (linear probing); (ii) initialize the final layer of the classification head with the learned 768 × C matrix (where C is the number of classes) of the fitted logistic regression model; and (iii) fine-tune the whole model as in MLP-FT. The only difference between LPFT and DenseLPFT is that DenseLPFT has an additional randomly initialized dense layer of dimensions 768 × 768, unlike LPFT. In contrast to Kumar et al. (2022), we test LPFT against adversarial manipulations. We note from Table 4 that DenseLPFT is more robust than MLP-FT (by over 10%) but demonstrates lower robustness than LPFT (by over 2%). This provides further evidence that randomly initialized parameters add to the vulnerability.
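The LPFT recipe can be sketched as follows; the toy two-feature logistic probe fitted by gradient descent is an illustrative stand-in for the actual linear probing on 768-dimensional [CLS] features, and all names are hypothetical:

```python
import math

def fit_linear_probe(feats, labels, lr=0.5, steps=200):
    """Step (i) of LPFT: fit a logistic-regression probe on frozen
    features. The returned (w, b) would then initialize the
    classification head (step (ii)) before the whole model is
    fine-tuned (step (iii)), instead of a random initialization."""
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(steps):
        for x, y in zip(feats, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-logit))
            g = p - y  # gradient of the logistic loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b
```

On linearly separable toy features, the fitted probe separates the two classes, and its weights provide a non-random starting point for the head.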
Pretraining Task Alignment The task of mask infilling aligns more naturally with the pretraining objective of the language model, and we posit that fine-tuning via mask infilling, as in MVP, results in robustness gains. To test this hypothesis, we use an untrained RoBERTa model and measure the clean accuracy and robustness of MVP and MLP-FT models. We observe that in the absence of pretraining, MVP trained with a single template does not achieve any additional robustness over the baseline, and in fact MLP-FT performs better than MVP (Table 4), whereas in the presence of pretraining, MVP outperforms MLP-FT (Table 2) in all settings. Note that this does not contradict the hypothesis about vulnerability due to randomly-initialized parameters, as that hypothesis applies only to pretrained models. This suggests that the alignment of MVP with the pretraining task is crucial for adversarial robustness on downstream tasks.
Semantically Similar Candidates We ask whether the improvement in robustness can also be attributed to the semantic relatedness between candidate answers and the class labels. To answer this question, we change the candidate answers to random proper nouns ('jack', 'john', 'ann', 'ruby') for the 4-class classification problem of AG News and ('jack', 'john') for the 2-class classification task of BoolQ. All of these words are unrelated to the class labels. We find that irrespective of whether we use semantically related candidates, the robust accuracies of the models are within 1% of each other, implying that semantically similar candidates are not a factor behind the robustness gains of MVP (Table 4). While the choice of candidate answers is crucial in the pre-train, prompt, and predict paradigm (Hu et al., 2022), it is irrelevant in the pre-train, prompt, and fine-tune paradigm. With sufficient fine-tuning over the downstream corpus, a model can learn to associate any candidate word with any class, irrespective of its semantics.
However, one may wonder why using 'random' candidate words does not hurt the model's robustness, since this also modifies a 'parameter' in the model's embedding space that was initially uncorrelated with the class label. We analyze this question in detail in Appendix D.3 and conclude that the main reasons for the preserved robustness are the pretraining task alignment hypothesis and the fact that the modified word embeddings have a much smaller dimension of 768 × C (where C is the number of candidate words), as opposed to modifying a dense layer.

Human Study
We conduct a human study to assess the validity of the adversarial attacks. More specifically, we provide machine learning graduate students with 250 input examples and ask the following questions: (a) What is the perceived label of the sentence? (b) How confident are they about this label? and (c) Was this sentence adversarially manipulated? We use the BoolQ dataset and strictly instruct our annotators not to use any external knowledge, relying on the context of the given passage only. We use samples that were successfully attacked by TextFooler for the MVP + Adv model with a RoBERTa backbone. As a control for the study, we provide the original sentence rather than the adversarially perturbed one 33% of the time. The underlying model achieves a clean accuracy of 81.7% and a robust accuracy of 54.0%.
We find that human annotators identify 29% of adversarial examples as perturbed, as opposed to only 6% of clean examples. Moreover, humans achieved 11% lower accuracy on adversarial examples than on clean examples (85% → 74%), with average confidence in the label of perturbed examples being 15% lower (90% → 75%). This study highlights that a fraction of adversarial attacks either manipulate the input so significantly that it is easily detectable, or change the true label, signifying that MVP is more robust than the raw statistics in §5 suggest. Details related to the human study are available in Appendix F.1.

Conclusion
In this work, we benchmark the robustness of language models when adapted to downstream classification tasks through prompting. Remarkably, model tuning via prompts, which does not utilize any sort of adversarial training or prompt engineering, already outperforms state-of-the-art methods in adversarially robust text classification by over 3.5% on average. Moreover, we find that MVP is sample efficient and also exhibits higher effective robustness than the conventional approach of fine-tuning with an MLP head (MLP-FT). We find that the lack of robustness in baseline methods can largely be attributed to the misalignment between pre-training and fine-tuning tasks, and to the introduction of new randomly-initialized parameters.

Limitations
This work considers models that are under 1B parameters in size. While larger models are becoming popular in the NLP community, developing practical attacks that scale to such large models is extremely challenging. For instance, for the evaluation considered in this paper, each attack takes approximately a day on a single A6000 GPU to run (across multiple seeds of the model). Furthermore, the scope of our work is limited to tasks where fine-tuning with an MLP head is commonplace; this includes boolean question answering, sentence classification, and paraphrase detection. Finally, using multiple templates for MVP comes with a latency trade-off, which is discussed in Appendix D.1.
Broader Impact Our work does not pose any immediate negative impacts to society, except for the carbon emissions owing to the training and evaluation of large models. We emphasize that the adversarial robustness conferred by MVP is a desirable property for deployed systems, and our work contributes towards making NLP models more reliable and safe when deployed in real-world settings.

B.1 Baseline Details

We describe the training schemes corresponding to various fine-tuning strategies below.

MLP-FT: This is the "base" model for classification via standard non-adversarial training and is utilized by all the baselines discussed in this section. Given a pre-trained model, we perform downstream fine-tuning by adding an MLP layer to the output corresponding to the [CLS] token, as illustrated in Figure 1(a). This hidden representation is of size 768 × 1. In the case of the BERT model, there is a single dense layer of dimension 768 × 2, whereas in the case of the RoBERTa model, a two-layer MLP is used to make the final prediction.
MLP-FT + Adv: This is identical to the adversarial training method used in Section 3.2, except that we perform adversarial perturbations in the embedding space of the MLP-FT model rather than MVP.
FreeLB++ (Li et al., 2021): Free Large-Batch (FreeLB) adversarial training (Zhu et al., 2020) performs multiple Projected Gradient Descent (PGD) steps to create adversarial examples, while simultaneously accumulating parameter gradients, which are then used to update the model parameters all at once. FreeLB++ improves upon FreeLB by increasing the number of adversarial training steps to 10 and the maximum adversarial norm to 1.
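The embedding-space perturbation at the core of FreeLB-style training can be sketched as multi-step PGD with an ℓ2 projection. This is a minimal illustration, not the authors' implementation: the gradient is replaced by a random placeholder, whereas in real training it would be the loss gradient with respect to the input embeddings.

```python
import math
import random

random.seed(0)

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def project(delta, max_norm):
    """Project delta back onto the l2 ball of radius max_norm."""
    n = l2_norm(delta)
    if n > max_norm:
        return [x * max_norm / n for x in delta]
    return delta

def pgd_perturb(embedding, grad_fn, steps=10, alpha=0.1, max_norm=1.0):
    """Multi-step PGD in embedding space (FreeLB++ uses 10 steps, max norm 1)."""
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        g = grad_fn([e + d for e, d in zip(embedding, delta)])
        gn = l2_norm(g) or 1.0  # normalized-gradient ascent step
        delta = [d + alpha * gi / gn for d, gi in zip(delta, g)]
        delta = project(delta, max_norm)
    return delta

# Placeholder gradient; in real training this is d(loss)/d(embedding).
fake_grad = lambda x: [random.gauss(0, 1) for _ in x]
delta = pgd_perturb([0.0] * 16, fake_grad)
print(l2_norm(delta) <= 1.0 + 1e-9)  # True
```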
InfoBERT (Wang et al., 2021a): InfoBERT uses an Information Bottleneck regularizer to suppress noisy information that may occur in adversarial attacks. Alongside, an 'anchored feature regularizer' aligns local stable features to the global sentence vector. Together, this leads to improved generalization and robustness. InfoBERT can additionally be combined with adversarial training (we use FreeLB++ for this purpose).
AMDA (Si et al., 2021b): Adversarial and Mixup Data Augmentation (AMDA) improves robustness to adversarial attacks by increasing the number of adversarial samples seen during training. This method interpolates training examples in their embedding space to create new training examples; the label assigned to a new example is the linear interpolation of the one-hot encodings of the original labels.
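The interpolation step of AMDA can be sketched as standard mixup on embeddings and one-hot labels (a minimal illustration with made-up vectors and an arbitrary mixing coefficient, not the authors' code):

```python
import random

def mixup(emb_a, emb_b, label_a, label_b, num_classes, lam):
    """Interpolate two examples in embedding space and their one-hot labels."""
    one_hot = lambda c: [1.0 if i == c else 0.0 for i in range(num_classes)]
    x = [lam * a + (1 - lam) * b for a, b in zip(emb_a, emb_b)]
    y = [lam * a + (1 - lam) * b
         for a, b in zip(one_hot(label_a), one_hot(label_b))]
    return x, y

random.seed(0)
a = [random.gauss(0, 1) for _ in range(8)]  # stand-in embeddings
b = [random.gauss(0, 1) for _ in range(8)]
x, y = mixup(a, b, label_a=0, label_b=1, num_classes=2, lam=0.7)
print([round(v, 2) for v in y])  # [0.7, 0.3]
```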

B.2 Attack Details
In the main paper, we evaluated our method on three popular word substitution attacks and one character-level attack. These included the TextFooler, TextBugger, and BertAttack attack strategies. TextFooler and TextBugger are word substitution attacks that replace words with "similar" neighboring words (where similarity is based on counter-fitted GloVe embeddings). TextFooler greedily searches a large set of neighbors (in the embedding space) for each word, so long as they satisfy constraints on embedding similarity and sentence quality; an additional constraint requires the substituted word to match the POS of the original word. TextBugger, on the other hand, restricts the search space to a small subset of neighboring words and only uses sentence quality as a constraint. To control the amount of change made by an attack, we limit the adversary to perturbing a maximum of 30% of the words in the AG News dataset and 10% in all other datasets. We do not modify any other constraints (such as the query budget) and run the attacks on 1000 examples from the test set. We also evaluate one character-level attack and another word substitution attack: for the character-level attack, we use the adversarial misspellings attack introduced by Pruthi et al. (2019), and we additionally evaluate the popular BertAttack (Li et al., 2020).

We notice that for MLP-FT and MLP-FT + Adv, it is difficult to achieve good clean generalization performance, whereas MVP and MVP + Adv perform much better on the clean test set. These observations are in line with the results in our main paper. On the AG News dataset, MVP performs significantly better than MLP-FT, and MVP + Adv performs better than MLP-FT + Adv. These results show that MVP is not only a good way of fine-tuning BERT-like MLMs but can also improve causal language models, both in terms of clean accuracy and robustness to adversarial perturbations.
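The perturbation budget above (a maximum of 30% of words for AG News, 10% elsewhere) amounts to a simple check on the fraction of substituted words. A minimal sketch, assuming a word-for-word substitution attack that preserves sentence length (the example sentences are hypothetical):

```python
def within_budget(original, perturbed, rho_max):
    """Check the fraction of substituted words against the attack budget."""
    orig, pert = original.split(), perturbed.split()
    assert len(orig) == len(pert), "word substitutions preserve sentence length"
    changed = sum(o != p for o, p in zip(orig, pert))
    return changed / len(orig) <= rho_max

original  = "the plot of this movie was really quite great overall"
perturbed = "the plot of this movie was truly quite great overall"
print(within_budget(original, perturbed, rho_max=0.1))   # True: 1 of 10 words
print(within_budget(original, perturbed, rho_max=0.05))  # False: over budget
```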

C.3 Sample Efficiency and Effective Robustness
We demonstrate the sample efficiency of MVP on the BoolQ dataset (Figure 4a), in addition to the discussion about AG News in §5.1. Interestingly, we find that with 200 examples, MLP-FT is unable to achieve better accuracy than even a random classifier, whereas MVP performs much better in the low-data regime (< 200 examples). We also provide more evidence of the effective robustness of MVP by presenting effective robustness results on AG News (Figure 4b). Even for AG News, we notice that the curve is much steeper for MVP than for MLP-FT.

D Extended Analysis

D.1 Latency of Using Multiple Templates
We present the latency numbers and compare them with the latency of the standard MLP-FT approach. Specifically, the time required for 2000 forward passes of data from the IMDb dataset with a batch size of 1 is T = 24.45 ± 0.25 seconds.
The results are presented in Table 8.
In summary, using multiple templates makes predictions about 1.45× slower; however, this leads to improved robustness.
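The slowdown factor quoted above is just a ratio of wall-clock times. As a sanity check (the multi-template time here is a hypothetical value chosen only to illustrate the reported ~1.45× figure; Table 8 contains the measured numbers):

```python
# Reported time for 2000 single-template forward passes (batch size 1, IMDb).
t_single = 24.45  # seconds, from the text

# Hypothetical multi-template time, chosen only to illustrate the ~1.45x
# slowdown reported above; see Table 8 for the measured values.
t_multi = 35.45

slowdown = t_multi / t_single
print(round(slowdown, 2))  # 1.45
```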

D.2 Benefits from Prompt Tuning
To assess the benefit of prompt tuning, we conducted a series of experiments. Interestingly, even an empty template with just a [MASK] token, which would be considered a weak prompt, showed significant performance improvements over the standard technique of MLP-FT. We present these results for 4 different prompt choices in Table 9. The choice of prompts has very little effect on model robustness in the fine-tuning regime. We tabulate the robustness results corresponding to different prompts below (for the BoolQ dataset); the first four prompts are the prompts we used, and "Ruby emerald [MASK]" is a random prompt made of vocabulary words.
We did not perform any dedicated prompt tuning when selecting the prompts. Instead, prompts were either chosen directly from, or inspired by, the OpenPrompt repository. The selected prompts led to a marginal (2%) increase in model robustness during fine-tuning. Unlike typical few-shot or in-context learning methods, our approach aligns more closely with the idea of prompt tuning. For more advanced techniques and further potential improvements in prompt tuning, readers are referred to Hu et al. (2022).
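The prediction mechanism underlying all of these prompt variants, scoring candidate words at the [MASK] position and restricting the distribution to the candidate set, can be sketched as follows (the candidate words and logit values are hypothetical, for a BoolQ-style yes/no task):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def predict_from_mask(mask_logits, candidates, labels):
    """Restrict the [MASK] distribution to candidate words and pick a class."""
    scores = softmax([mask_logits[w] for w in candidates])
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return labels[best], scores

# Hypothetical logits at the [MASK] position for a BoolQ-style prompt.
mask_logits = {"yes": 3.1, "no": 1.2, "maybe": 0.4}
label, scores = predict_from_mask(mask_logits, ["yes", "no"], ["True", "False"])
print(label)  # True
```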

D.3 Why does using "dummy candidate words" not hurt model robustness?
In our paper, we note that using dummy candidate words like Jack and Ann, instead of class labels, does not hurt model robustness. However, this is very similar to using random projection layers, so why does it not impact model robustness similarly? We note that using dummy candidate words modifies an embedding of size 768 × C (where C is the number of candidate words) so that the words take on a new "meaning". The effective number of new parameters is much lower than the number of parameters in the dense 768 × 768 layer of the RoBERTa head. In terms of new-parameter complexity, this is instead similar to our ablation ProjectCLS; as one may note, using ProjectCLS also improves robustness over MLP-FT, because it too avoids the dense 768 × 768 layer. Additionally, we conducted a new experiment using empty (untrained) slots in the vocabulary of RoBERTa and compared it with using dummy candidate words and class labels. For the BoolQ dataset, using a RoBERTa model, we summarize the results in Table 10.
Based on these accuracies, we find that: 1. Using class labels is better than using dummy (untrained) words for both clean and robust accuracy, which supports the random parameter vulnerability hypothesis.
2. The robustness achieved when using completely untrained slots is similar to that when using dummy candidate words. This suggests that, compared to class labels, modifying dummy words incurs a similar loss in robustness as modifying untrained words; both nevertheless remain more robust than MLP-FT, much like ProjectCLS (which already bridges part of the robustness gap from MLP-FT). These gains are explained by the pre-training task alignment hypothesis, whereby pre-training (and fine-tuning) the model with the task of [MASK] infilling helps make the downstream model robust.
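The parameter-count argument above can be made concrete with simple arithmetic (C = 2 candidate words for a binary task, and the hidden size of 768 from the text):

```python
HIDDEN = 768

def new_params_mlp_head():
    """Randomly initialized dense 768 x 768 layer in the RoBERTa MLP head
    (ignoring the small output layer and biases)."""
    return HIDDEN * HIDDEN

def new_params_candidate_words(num_candidates):
    """Re-purposing C candidate-word embeddings only modifies 768 x C entries,
    which are pre-trained rather than randomly initialized."""
    return HIDDEN * num_candidates

print(new_params_candidate_words(2), new_params_mlp_head())  # 1536 589824
```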

D.4 Impact of Ensembling the Candidates
Recall that in the main paper, we ensemble multiple templates and aggregate their predictions.In this subsection, we also investigate the impact of ensembling candidate words rather than templates.
Based on the results in Table 11, we find that this is not as helpful as ensembling multiple templates.
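Ensembling over templates, as used in the main paper, is just an average of per-template class distributions for the same input. A minimal sketch with hypothetical probabilities (ensembling candidate words would analogously average within a single template):

```python
def average_predictions(per_template_probs):
    """Ensemble by averaging class probabilities across prompt templates."""
    n = len(per_template_probs)
    num_classes = len(per_template_probs[0])
    return [sum(p[c] for p in per_template_probs) / n for c in range(num_classes)]

# Hypothetical per-template distributions for one input (4 templates, 2 classes).
probs = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
ens = average_predictions(probs)
print([round(v, 2) for v in ens])  # [0.75, 0.25]
```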

E Hyperparameter Details
Attack Hyperparameters TextFooler and TextBugger are word substitution attacks that search for viable substitutions of a word from a set of synonyms. We restrict the size of the synonym set to 50 for TextFooler, which is the default value used by Jin et al. (2020), and to 5 for TextBugger, which is the default value used by Li et al. (2018). Both TextFooler and TextBugger use a Universal Sentence Encoder (USE) that imposes a semantic similarity constraint on the perturbed sentence; we use the default value of 0.84 as the minimum semantic similarity. Another important attack parameter is the maximum percentage of modified words (ρ_max). As discussed in Li et al. (2021), we use ρ_max = 0.3 for AG News and ρ_max = 0.1 for BoolQ and SST2 in all our experiments. We use a query budget of 100 for BERT-Attack and a query budget of 300 for adversarial misspellings, as these attacks are very slow.
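For reference, the attack hyperparameters listed above can be collected into a single configuration (the dictionary keys are our own shorthand, not identifiers from any attack library):

```python
# Attack hyperparameters as stated in the text.
ATTACK_CONFIG = {
    "textfooler":   {"synonym_set_size": 50, "min_use_similarity": 0.84},
    "textbugger":   {"synonym_set_size": 5,  "min_use_similarity": 0.84},
    "bert_attack":  {"query_budget": 100},
    "misspellings": {"query_budget": 300},
}

# Maximum fraction of modified words per dataset.
RHO_MAX = {"ag_news": 0.3, "boolq": 0.1, "sst2": 0.1}

print(ATTACK_CONFIG["textfooler"]["synonym_set_size"], RHO_MAX["ag_news"])  # 50 0.3
```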

Training Hyperparameters & Model Selection
We train all models, including the baselines, with a patience of 10 epochs for a maximum of 20 epochs, and choose the best model based on validation accuracy. For the datasets that do not contain a publicly available validation set, we set aside 10% of the training set for validation. In the case of baseline defenses that use adversarial training, we perform model selection based on adversarial accuracy rather than clean accuracy. We use a candidate answer set containing only the class label names and average over 4 prompt templates in all the MVP models. We use a batch size of 32 for MLP-FT and a batch size of 8 for MVP models. The learning rate is set to 1e−5 for all models. We use the AdamW optimizer along with the default linear scheduler (Wolf et al., 2020). In all the MVP + Adv and MLP-FT + Adv models, we use 1-step adversarial training with a maximum ℓ2 norm of 1.0. For the state-of-the-art baselines, we use the same hyperparameters as prescribed by the original papers.
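The model-selection scheme above (patience of 10, maximum of 20 epochs, best validation score kept) can be sketched as a standard early-stopping loop; the validation scores here are hypothetical:

```python
def select_best_epoch(val_scores, patience=10, max_epochs=20):
    """Early stopping: keep the best validation score, stop after `patience`
    epochs without improvement. (For the adversarially trained baselines the
    score would be adversarial accuracy rather than clean accuracy.)"""
    best, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores[:max_epochs]):
        if score > best:
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best

epoch, score = select_best_epoch([0.70, 0.74, 0.73, 0.76, 0.75])
print(epoch, score)  # 3 0.76
```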
We request that the annotators only use the given context and refrain from using any external knowledge.

2. How confident are you about the label above?
Once the annotator has answered question 1, we ask them to rate how confident they feel about the label they assigned to the input.The options provided are "Uncertain", "Somewhat Certain" and "Certain".
Based on their response, we assign a confidence of 1 if the annotator was certain, 0.5 if the annotator was somewhat certain, and 0 if the annotator was uncertain, and use these values to calculate the average confidence.
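The aggregation above is a simple mapping and average; a minimal sketch with hypothetical responses:

```python
# Confidence mapping from the text.
CONFIDENCE_SCORE = {"Certain": 1.0, "Somewhat Certain": 0.5, "Uncertain": 0.0}

def average_confidence(responses):
    """Average annotator confidence over a list of responses."""
    return sum(CONFIDENCE_SCORE[r] for r in responses) / len(responses)

# Hypothetical annotator responses for four inputs.
avg = average_confidence(["Certain", "Uncertain", "Somewhat Certain", "Certain"])
print(avg)  # 0.625
```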
3. Do you think that the sentence is adversarially perturbed (using word substitutions)? Do not use your own knowledge of the world to answer this question. We also ask the annotators whether the input was adversarially perturbed. The options provided are "No", "Unsure", and "Yes".
The annotators helped us annotate 250 such examples out of which 167 were adversarially perturbed and 83 were clean.An overview of the responses from this study is presented in Table 12, and the key takeaways are discussed in Section F.

Figure 1: An illustration of (a) standard fine-tuning, and (b) Model-tuning via Prompts. The adjoining accuracy metrics correspond to a RoBERTa model trained on the BoolQ dataset.
Figure 1(b), and describe individual components below.

Figure 2: (a) Sample Efficiency: Clean and robust accuracy of the RoBERTa-base model when trained using different data sizes of the AG News dataset. (b) Effective Robustness: Robust vs. clean accuracy of the RoBERTa-base model on the BoolQ dataset. We find that (a) MVP is more sample efficient as compared to MLP-FT, and (b) MVP yields more robustness compared to MLP-FT for the same clean accuracy (see §5.1 for details).

Figure 3: Various model tuning strategies for a RoBERTa model trained on the BoolQ dataset. The corresponding clean and robust accuracies (under the TextFooler attack) are shown above each model paradigm. The left-most diagram shows the standard fine-tuning paradigm of MLP-FT, and each subsequent column modifies the architecture, helping us confirm the hypothesis that randomly initialized parameters are a cause of vulnerability.

MRPC
The prompt templates used for MLMs:
1. The two sentences are [MASK]
2. [SEP] First sentence is [MASK] to second sentence
3. Two [MASK] sentences
4. [SEP] The two sentences have [MASK]

Clean vs. adversarial performance of the RoBERTa-base model for the AG News dataset. We find that models tuned via prompts (MVP) yield more robust models compared to fine-tuning MLP heads for the same clean accuracy.

Figure 4: (a) Models trained with MVP are significantly more sample efficient as compared to those with MLP-FT. (b) We find that models tuned via prompts (MVP) yield more robust models compared to fine-tuning MLP heads for the same clean accuracy (details in §5.1).

Table 1: Adversarial Robustness: Performance of the RoBERTa-base model on 3 different datasets, averaged over 3 different seeds on a fixed test set of size 1000. The highest accuracies are bolded, and the second-best are underlined. We observe that models tuned via prompts (MVP) are the most robust while preserving (or improving) the clean accuracy.

Table 4: Adversarial performance of RoBERTa for experiments corresponding to the random parameter vulnerability and task alignment hypotheses, averaged over 3 seeds (§6). 'TFooler' and 'TBugger' represent model robustness under the TextFooler and TextBugger attacks respectively; 'Clean' represents model accuracy on the original test data.

Table 7: Adversarial performance of the BERT-base model on 3 different datasets. All accuracy values are reported on a fixed test set of size 1000 and are averaged over 3 different seeds. The highest accuracies are bolded, and the second-best are underlined. MVP is the most robust, and preserves (or improves) the clean accuracy.

Table 8: Inference latency comparison across different configurations.

Table 9: Model robustness per template chosen for the BoolQ dataset.

Table 10: Comparison of different choices of candidate words and their accuracies when training a RoBERTa model on the BoolQ dataset.