WARP: Word-level Adversarial ReProgramming

Transfer learning from pretrained language models recently became the dominant approach for solving many NLP tasks. A common approach to transfer learning for multiple tasks that maximizes parameter sharing trains one or more task-specific layers on top of the language model. In this paper, we present an alternative approach based on adversarial reprogramming, which extends earlier work on automatic prompt generation. Adversarial reprogramming attempts to learn task-specific word embeddings that, when concatenated to the input text, instruct the language model to solve the specified task. Using up to 25K trainable parameters per task, this approach outperforms all existing methods with up to 25M trainable parameters on the public leaderboard of the GLUE benchmark. Our method, initialized with task-specific human-readable prompts, also works in a few-shot setting, outperforming GPT-3 on two SuperGLUE tasks with just 32 training samples.


Introduction
Language model pretraining has had a tremendous impact on solving many natural language processing tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019). The two most popular approaches take a pretrained model and use a straightforward supervised learning objective. In the first approach, the parameters of the language model are frozen and a task-specific head is trained on top of them (Peters et al., 2018). The second approach fine-tunes all model parameters (Radford et al., 2018). The latter can sometimes yield better results (Peters et al., 2019), while the former usually offers better stability for smaller datasets. The approach based on frozen features also does not require storing task-specific language models.
A recent alternative is based on so called adapters (Houlsby et al., 2019;Pfeiffer et al., 2021), a technique that adds new weights at every layer of the pretrained language model while the original parameters are kept frozen. This enables a smaller set of task-specific parameters while achieving results comparable to the fine-tuning approach.
Another approach to leveraging pretrained language models for downstream tasks, introduced by Radford et al. (2019), provides "task descriptions" without using any labeled examples. GPT-3 (Brown et al., 2020) demonstrates impressive few-shot learning performance with priming: providing the language model with a few inputs and outputs ("analogies") as context. The language model contextually "learns" from these examples and outputs the answer with a single forward pass, without any trainable parameters. These methods, however, require huge language models (1.5B and 175B parameters, respectively).
The success of task reformulation-based approaches suggests that language models are capable of solving various natural language processing tasks given a well-crafted prompt. We hypothesize that it is possible to find such prompts automatically. In other words, we can discover extra tokens that, when added to the input, exploit language model capabilities better than manually-designed prompts do.
In this paper, we introduce a novel technique to find optimal prompts. We call our method WARP: Word-level Adversarial ReProgramming. The method is inspired by adversarial reprogramming (Elsayed et al., 2019), a method of adding adversarial perturbations to an input image that reprograms a pretrained neural network to perform classification on a task other than the one it was originally trained for. We show that our method, using up to 25K trainable parameters per task, achieves a test score of 81.6 on the GLUE Leaderboard, outperforming all other submissions that use up to three orders of magnitude more trainable parameters. We show that it is possible to inject knowledge into WARP models through manually designed initialization of the prompt, which is especially useful on tasks with a small number of examples. Moreover, WARP shows impressive few-shot performance on two tasks from the SuperGLUE benchmark with just 32 examples, outperforming GPT-3 results. Finally, we discuss the advantages of our method in real-life applications.
Related Work

Towards Fewer Trainable Parameters
Jiao et al. (2020) show that knowledge distillation can reduce the size of a model 7.5 times while almost preserving its performance, but fine-tuning such models still requires storing separate task-specific models. As seen in Section 6, this approach does not scale when we want to apply it to many tasks at once.
Another approach, called Adapters (Houlsby et al., 2019; Pfeiffer et al., 2021), introduces new task-specific parameters that are added at every layer of the Transformer network. Only these newly initialized weights are trained, which allows separation of general and task-specific knowledge. In contrast, our method does not inject task-specific knowledge inside the body of the pretrained language model. Instead, it focuses on learning task-specific input-level prompts.

Figure 2: WARP adds a few trainable embeddings around the input, which causes the masked language model to predict the sentiment of the sentence.

Task Reformulation
In GPT-2, Radford et al. (2019) introduce a completely unsupervised way of transferring knowledge to downstream tasks by reformulating various natural language understanding tasks into language modeling problems. This approach does not make use of the available training examples. Brown et al. (2020) demonstrate effective few-shot transfer by reformulating downstream tasks into input-output analogies in the context, without a need for further fine-tuning. Nonetheless, the number of training examples is limited by the context size, so the approach does not scale to a traditional supervised learning scenario. Schick and Schütze (2021b) show the effectiveness of reformulating a number of tasks into Cloze-style tasks by fine-tuning masked language models (Devlin et al., 2019).
The method, called Pattern-Exploiting Training (PET), additionally uses training samples and performs few-shot learning even without huge models such as GPT-3.
Our method is also based on masked language models, but unlike PET, we focus on finding the best prompt using the training examples. This eliminates the need for manually-designed prompts; however, our method can still benefit from similar prior knowledge about the task through careful initialization of the prompts.

Adversarial Reprogramming
Adversarial Reprogramming (Elsayed et al., 2019) demonstrates the reprogramming of pretrained ImageNet classifiers by adding input-level adversarial perturbations to make them perform well on the MNIST and CIFAR-10 image classification tasks. The adversarial perturbation is designed as image padding added to the original input, as illustrated in Figure 1. The perturbation parameters are then trained to optimize the target classification objective using the annotated image data.
While in the case of image classification it is not obvious why adversarial reprogramming should work at all (e.g. why a network trained on ImageNet should have the capacity to solve MNIST when the input is surrounded by a particular bitmap), for NLP tasks there is more intuition: many NLP tasks can be reformulated as language modeling problems, so the model input serves as a shared space for both program and data.
Adversarial reprogramming has been adapted to text classification tasks with LSTM networks by Neekhara et al. (2019). They operate in the vocabulary space and reprogram a model trained for one task to perform another task. More recently, AutoPrompt (Shin et al., 2020a) attempts to find prompts for large language models automatically, without adding any parameters to the model. Unlike AutoPrompt, we perform gradient-based optimization in the space of word embeddings, which gives our model more degrees of freedom and ultimately better performance on the downstream tasks (Section 6.2).
In a more general sense, guiding an NLP model with special tokens appended to the input is an even older idea. In particular, multilingual neural machine translation models use special tokens in the input to control the target language (Ha et al., 2016;Johnson et al., 2017) or politeness of the translation (Sennrich et al., 2016). Another method to reprogram a BERT-based model is proposed by Artetxe et al. (2020), where a model tuned on an English version of a particular task is transformed to work in another language by changing only the embedding matrices.
In parallel work, Li and Liang (2021) propose a similar method and successfully apply it to two text generation tasks. Apart from the different types of tasks and our characterization of the task as a form of Adversarial Reprogramming, the main difference between their approach and ours is that they use an additional parameterization trick to stabilize training.

WARP
We follow a setup similar to Elsayed et al. (2019) with some NLP-specific modifications depicted in Figure 2.
Our goal is to find the best prompt that will make a pretrained masked language model predict the desired answer (verbalizer token) at a training example's masked position. We search for such prompts in the (continuous) embedding space. In other words, we want to find parameters Θ = {Θ_P, Θ_V} for the prompt and verbalizer embeddings, respectively, such that:

    Θ* = arg min_Θ Σ_{(x,y)∈D} −log p(y | T_{Θ_P}(x))

and the probabilities are given by:

    p(y | T_{Θ_P}(x)) = exp(Θ_V^y · f(T_{Θ_P}(x))) / Σ_{c∈C} exp(Θ_V^c · f(T_{Θ_P}(x)))

where T_{Θ_P}(x) is the template that inserts the prompt embeddings Θ_P into predefined positions, C is the set of classes, Θ_V^c is the verbalizer embedding of class c, and f(x) is the masked language model output at the masked position (without the last decoder layer, which is simply the transposed word embedding matrix). Both Θ_P and Θ_V are vectors in the same embedding space as the word embeddings.
In Figure 2, the template T_{Θ_P}(x) prepends Θ_{P_1} and appends Θ_{P_2}, Θ_{P_3}, Θ_{P_4} to the word embeddings, and uses Θ_{V+} and Θ_{V−} to calculate the probabilities at the masked token position for the positive and negative classes.
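To make the scoring rule above concrete, here is a toy numpy sketch (not the paper's implementation: the real f is a frozen transformer, and all shapes and names here are illustrative):

```python
import numpy as np

EMB = 8          # embedding dimension (1024 for roberta-large)
C = 2            # number of classes

rng = np.random.default_rng(0)

def f(hidden):
    """Stand-in for the frozen MLM body: returns the hidden vector at
    the [MASK] position. A real implementation runs the transformer."""
    return hidden.mean(axis=0)  # toy pooling over the sequence

def warp_probs(word_embs, theta_p, theta_v):
    """Insert prompt embeddings around the input, run the frozen model,
    and score class c as softmax over theta_v[c] . f(T(x))."""
    # T_{Theta_P}(x): one prompt vector before, the rest after the text
    x = np.concatenate([theta_p[:1], word_embs, theta_p[1:]], axis=0)
    h = f(x)
    logits = theta_v @ h                  # one logit per verbalizer token
    z = np.exp(logits - logits.max())     # numerically stable softmax
    return z / z.sum()

theta_p = rng.normal(size=(4, EMB))   # trainable prompt embeddings
theta_v = rng.normal(size=(C, EMB))   # trainable verbalizer embeddings
sent = rng.normal(size=(5, EMB))      # word embeddings of the input text
probs = warp_probs(sent, theta_p, theta_v)
```

Only theta_p and theta_v would receive gradients; everything inside f stays frozen.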

Method
Similar to Elsayed et al. (2019), we employ stochastic gradient descent to find the best adversarial perturbation on the text that will minimize the task objective. First, we insert special prompt tokens [P_1], [P_2], ..., [P_K] and an additional [MASK] token into the input sequence. These tokens may be placed before or after the sentences, depending on the prompt template.
We set the optimization objective to a cross-entropy loss between the output of the masked language model head and the verbalizer tokens. The only trainable parameters are the word embeddings for [P_1], ..., [P_K] and [V_1], ..., [V_C].
In case we want to train models for multiple tasks, these are the only task-specific parameters we need to store. The entire "body" of the large language model (all attention layers, feedforward layers, and all other word embeddings) remains untouched.
Note that, unlike most adversarial attacks, we do not update the embeddings of the original tokens of the input. This follows the intuition from Elsayed et al. (2019), where the pixels of the MNIST or CIFAR images are left untouched and only the padding pixels are updated.
We train these parameters by minimizing the loss on the training set of the downstream task.
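A single training step of the method can be sketched in toy numpy form. Here only the verbalizer embeddings Θ_V receive gradients (in full WARP, the prompt embeddings Θ_P are updated the same way through the frozen model); the frozen model output h is held fixed, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
EMB, C = 8, 2

h = rng.normal(size=EMB)          # frozen f(T(x)) at the [MASK] position
y = 1                             # gold class of this training example
theta_v = rng.normal(size=(C, EMB)) * 0.1   # trainable verbalizers

def loss_and_grad(theta_v, h, y):
    """Cross-entropy loss and its exact gradient w.r.t. Theta_V."""
    logits = theta_v @ h
    z = np.exp(logits - logits.max())
    p = z / z.sum()
    loss = -np.log(p[y])
    # dL/dTheta_V[c] = (p[c] - 1{c == y}) * h
    grad = (p - np.eye(C)[y])[:, None] * h[None, :]
    return loss, grad

losses = []
for _ in range(50):               # plain SGD on the tiny objective
    loss, grad = loss_and_grad(theta_v, h, y)
    theta_v -= 0.1 * grad
    losses.append(loss)
```

The loss decreases monotonically on this convex toy objective; in WARP the same cross-entropy gradient is backpropagated through the frozen transformer to the prompt embeddings.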

Implementation Details
WARP is implemented in the AllenNLP framework. For all the GLUE benchmark tasks we use the roberta-large (Liu et al., 2019) model from the PyTorch implementation of the huggingface transformers (Wolf et al., 2020) library. For the few-shot experiments, we use albert-xxlarge-v2 in order to directly compare with iPET (Schick and Schütze, 2021b). For the GLUE and SuperGLUE tasks we use the dataset loaders and metric implementations from the huggingface datasets library.
The prompt tokens are initialized either with the word embedding of [MASK] or randomly, similar to the vectors of the word embedding layer. For the verbalizer tokens, we use the masked language model head, which usually consists of a feedforward network with a decoder on top, where the decoder weights are shared with the input word embeddings. We calculate the softmax only over the verbalizer tokens. We choose the Adam optimizer with a slanted triangular learning rate schedule with 6% warm-up steps and train for 10-20 epochs on each task. Each batch consists of examples containing at most 1024 tokens and 8 examples.
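The slanted triangular schedule mentioned above (linear warm-up to a peak, then linear decay) can be sketched as follows; the exact shape used by the AllenNLP scheduler may differ slightly, so treat this as an illustrative approximation:

```python
def slanted_triangular(step, total_steps, max_lr, warmup_frac=0.06):
    """Linear warm-up for warmup_frac of training, then linear decay.

    step is 0-based; returns the learning rate for that step.
    """
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return max_lr * step / warmup            # rising edge
    # falling edge: decays to 0 at the final step
    return max_lr * max(0.0, (total_steps - step) / (total_steps - warmup))

# Example: peak at the end of warm-up, zero at the end of training
schedule = [slanted_triangular(s, 100, 1e-3) for s in range(101)]
```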
In order to speed up training, we disable the dropout of the pretrained language model. All experiments are performed on two Titan V and two RTX 3080 GPUs with mixed precision training. In practice, WARP is 2.5-3 times faster than regular fine-tuning and 2 times slower than frozen-features experiments in terms of epoch duration with the same batch sizes.
Details about the hyperparameters can be found in the Supplementary material.

Experiments on GLUE
Following prior work, we evaluate our method on the GLUE Benchmark (Wang et al., 2019b), which consists of 9 natural language understanding tasks. Generally, we perform single-task WARP training, with early stopping and model selection using the original validation sets, if not stated otherwise.

Tasks
Almost all the tasks from the GLUE Benchmark are either sentence classification or sentence pair classification tasks, so WARP requires very few modifications to adapt to each of the tasks.

Table 1: Test set results on GLUE tasks. The last column # shows the number of trainable parameters. WARP's average performance is higher than that of all models with up to three orders of magnitude more trainable parameters. Fully fine-tuned RoBERTa and the current state-of-the-art method (DeBERTa) score higher by 6.5 and 9.2 points, respectively.
SST-2 (Stanford Sentiment Treebank, Socher et al., 2013) is a single sentence binary classification task. For the prompt, we put a [MASK] token after the sentence, and the trainable prompt tokens are both appended and prepended to the sentence.
CoLA (Corpus of Linguistic Acceptability, Warstadt et al., 2019) is a single sentence classification task as well, so we treat both the same way, with the only difference that we use accuracy as the validation metric for SST-2 and Matthews correlation for CoLA.
MNLI (MultiNLI, Multi-Genre Natural Language Inference, Williams et al., 2018), QNLI (Question Natural Language Inference, Rajpurkar et al., 2016) and RTE (Recognizing Textual Entailment, Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) are sentence pair classification tasks. Similar to Schick and Schütze (2021a), we may have prompt tokens before, after, and between the two sentences, but the [MASK] token is always placed between the sentences. For MNLI, we use matched accuracy as the validation metric and use the same model for the mismatched version. In our few-shot attempt at the RTE task, we use a different training and evaluation setup, discussed in Section 5.2. QQP (Quora Question Pairs) and MRPC (Microsoft Research Paraphrase Corpus, Dolan and Brockett, 2005) follow the same prompt pattern as the NLI tasks. The F1 score is used as the validation metric. We follow Liu et al. (2019) and train the models for MRPC, STS-B, and RTE initialized with the parameters of the best MNLI model, but do not apply any task-specific tricks to WNLI (Winograd Schema Challenge NLI, Levesque et al., 2011), where we always predict the majority label.

Results
Table 1 presents the results on the test set obtained from the GLUE evaluation server. Besides our best WARP models, we include the human baselines, the current state-of-the-art model (He et al., 2020), the regular fine-tuned pretrained model we use, and relatively small language models (Jiao et al., 2020; Clark et al., 2020; Houlsby et al., 2019).
With this GLUE Score, WARP outperforms all models on the leaderboard that train fewer than 25 million parameters. We explain the relatively strong WARP results on textual entailment tasks by the easier reformulation of such tasks, and the relatively weak performance on CoLA by the difficulty of reformulating that task into a Cloze task.

Table 2: Dev set results on GLUE tasks. The last column shows the number of trainable parameters only. WARP_i corresponds to WARP training with a prompt consisting of i prompt tokens. WARP_MNLI corresponds to WARP training initialized with the best MNLI parameters. All the models are based on pretrained roberta-large; Adapters and the WARP-based approaches additionally require storing 355·10^6 frozen parameters shared across all the GLUE tasks. We show the primary validation metric for each task, described in Subsection 4.1. The AVG column shows the average of the shown metrics and is not comparable to the Test server GLUE Score. The number of parameters for WARP methods may vary because of differences in the number of classes. Underlined numbers correspond to our GLUE submission.
To further analyze WARP, we conduct several experiments and focus on dev set results. In order to directly compare WARP with existing methods, we report in Table 2 different methods that use RoBERTa, including fine-tuning, linear classifiers on top, AutoPrompt, and Adapters. For the WARP experiments, we compare performance with different numbers of prompt tokens. The WARP_0 model does not introduce any prompt parameters. The only difference between WARP_0 and the Linear Classifier is that for WARP_0, [MASK] is added to the input of each sample and we take the sentence representation from the MLM head at the masked position, whereas for the Linear Classifier we use the average of the non-special token embeddings as the sentence representation. As we can see, pooling with the MLM head is significantly better. Table 2 shows that as we decrease the number of trainable prompt parameters, performance decreases, but the model still works. Similar behavior was observed by Elsayed et al. (2019) in experiments with different padding parameter sizes. However, in contrast to WARP, the number of trainable parameters in that work is much greater than the size of the input.

Few-Shot Experiments
The fact that WARP can be initialized using manually designed natural prompts suggests that, similar to iPET (Schick and Schütze, 2021b), we can benefit from such human prior knowledge, especially in scenarios with limited training data.

Setup
For our few-shot experiments we build WARP on top of ALBERT (Lan et al., 2020), the same pretrained model used by PET and iPET.

Tasks
In order to compare with GPT-3, PET, and iPET, we use two tasks from FewGLUE (Schick and Schütze, 2021b), a few-shot subset of the SuperGLUE benchmark (Wang et al., 2019a) consisting of 32 examples for each task. The dataset also provides 20,000 additional unlabeled examples; however, we do not make use of them and work in a purely supervised setup.
CB: CommitmentBank (de Marneffe et al., 2019) is a textual entailment task which we treat like the other sentence pair classification tasks.
RTE: Unlike the experiments on the RTE task for full-sized training in the GLUE benchmark, we do not initialize the model with vectors from MNLI. Instead, the prompt is initialized exactly the same way as in the CB task. The only difference is that we have only the two tokens [V_1] and [V_2], initialized with "yes" and "instead" (for entailment and not entailment, respectively).

Model Selection
Although all trainable parameters are manually initialized in this setup, different random seeds can yield different results because of the order in which the training examples appear during an epoch.
In the few-shot setup we cannot access the original validation set. Thus, we disable early stopping and simply pick the last checkpoint.
In order to find the best initial learning rate, we conduct 20 runs of WARP with the same learning rate, each time randomly choosing 16 training examples and taking the rest as a development set. We repeat this for all candidate learning rates and choose the one with the best average validation performance across all the random seeds.
Finally, in order to eliminate the effect of different random seeds, we build an ensemble model from 20 WARP runs using simple majority vote.
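The majority-vote ensemble over the 20 runs can be sketched as follows (the predictions shown are hypothetical; in CPython, ties are broken in favor of the first label counted):

```python
from collections import Counter

def majority_vote(per_run_predictions):
    """per_run_predictions: one list of label predictions per run.
    Returns, for each example, the label predicted by most runs."""
    n_examples = len(per_run_predictions[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(run[i] for run in per_run_predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Hypothetical predictions from three runs on three examples
runs = [["yes", "no", "yes"],
        ["yes", "yes", "no"],
        ["no", "yes", "yes"]]
result = majority_vote(runs)  # -> ["yes", "yes", "yes"]
```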

Results
As seen in Table 3, WARP outperforms the PET and GPT-3 baselines but stays behind iPET on both tasks. GPT-3 has 175B parameters, but none of them are trained for the given tasks. PET and iPET have 255M parameters, all of which are trained for these tasks; additionally, they leverage unlabeled examples using distillation. WARP has roughly the same 255M parameters, but only 1024 of them are trained for any single model. An ensemble of 20 WARP models has slightly more than 20K trainable parameters.

Discussion

Interpreting Tokens Learned by WARP
WARP learns prompt embeddings in a continuous space. In this section, we explore those embeddings by looking at the nearby token vectors. Table 6 in the Supplementary material lists the closest tokens (in terms of cosine similarity) to the learned embeddings. All GLUE tasks are initialized with the [MASK] token, except for RTE, MRPC, and STS-B, which are initialized from the pretrained MNLI model. The prompt tokens of the solutions for those three tasks stay quite close to the ones from the MNLI solution. We have seen similar behavior in the SuperGLUE experiments with manual initializations. The solution for CoLA (one of the worst-performing tasks) stays close to its initialization point.
We do not see any prompt tokens that are meaningful in the context of the tasks. As expected, the verbalizer tokens are more interpretable. For example, the embedding for the "contradiction" class of MNLI is close to the token "Unless". The embeddings for the "negative" and "positive" classes of the SST-2 task are close to "defective" and "important", respectively. Other verbalizer tokens are not interpretable (e.g. "470" or word pieces with non-Latin characters).

Comparison with AutoPrompt
AutoPrompt (Shin et al., 2020b) learns a prompt for the given task in the finite space of vocabulary tokens. Their best version uses 3 or 6 prompt tokens and reaches 91.2% accuracy on the development set of SST-2. The search space of WARP is significantly larger, which allows WARP to get better performance with just a single prompt token (93.8%).
AutoPrompt does not achieve meaningful results on the RTE or CB tasks. WARP succeeds on both without manual initialization. Moreover, with manual initialization, WARP achieves good performance on both tasks even with just 32 examples (Table 3).

Figure 4 shows the dependence of the accuracy on the SST-2 development set on the number of training samples. Both WARP and AutoPrompt use 10 prompt tokens. With a few hundred training samples or fewer, the difference between the two algorithms is not significant. WARP starts to perform better with more training samples.

Approach             # of parameters to store
Linear probing       M + ECN
Full fine-tuning     MN
TinyBERT             M′N
WARP                 M + NE(C + K)

Table 4: The number of parameters to be stored to serve N text classification tasks with at most C classes each, using a pretrained language model with M parameters. E is the dimension of the embeddings (1024 in the case of RoBERTa). In TinyBERT, M′ can be up to 10 times smaller than M. In Adapters, E′ is roughly equal to E, as the number of layers to which adapters are attached roughly compensates for the smaller size of the bottleneck layer. In WARP, K is the number of prompt tokens (usually fewer than 10).

Shin et al. (2020b) include results with a manually designed prompt, which performs quite well (shown as a dashed line). We also compare with the manually initialized version of WARP, which performs very well with just 100 examples.

Real-world applications
The importance of NLP systems like WARP can be demonstrated by the following application. Suppose we want to build a system that needs to serve N >> 1 classification tasks simultaneously, where the number of classes for each task is bounded by C. The system can be based on a large pretrained language model with M parameters and word embedding size E. How many parameters must the system store in device memory to serve all N tasks?
If we take the approach with frozen features, we can reuse the M parameters for all tasks and store ECN additional task-specific parameters. This is optimal in terms of storage but does not perform well. The other extreme is to fine-tune the whole model for each task and store at least MN parameters. Table 4 shows the trade-offs offered by other solutions. Methods like TinyBERT shrink the per-task model size M, but the total storage still grows with the full model count N. WARP, on the other hand, needs to store only M + NE(C + K) parameters, where K is the number of trainable prompt tokens.
In practice, WARP additionally allows performing inference on inputs for different tasks in parallel, using samples of multiple tasks in the same batch. Every input sentence can be concatenated with task-specific pretrained prompts in advance. Then, the forward pass of the network is identical for all tasks. The final task-specific linear layers can be concatenated to form a single large linear layer with at most N C output neurons.
This approach can be especially useful in the systems that provide machine learning models as a service. By storing one copy of a pretrained language model, it is possible to serve a large number of user-specific models in parallel with little overhead.

Conclusion
In this paper we have proposed an alternative way to transfer knowledge from large pretrained language models to downstream tasks by appending carefully optimized embeddings to the input text. The method outperforms existing methods that use significantly more trainable parameters on the GLUE benchmark and shows impressive performance in a few-shot setting on two SuperGLUE tasks. On the sentiment analysis task, its performance is comparable to that of fully fine-tuned language models. The method can save a lot of storage in software applications designed to serve large numbers of sentence classification tasks.

Acknowledgments
This work is based in part on research sponsored by the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, DARPA, or the U.S. Government.
The work was supported by the RA Science Committee, in the frames of the research project No. 20TTAT-AIa024. Most experiments were performed on GPUs donated by NVIDIA.
A Hyperparameters
• Initialization is performed either with the embedding of the [MASK] token, or randomly from a normal distribution with the mean and variance taken from the matrix of RoBERTa's word embeddings.
The hyperparameter search took roughly 4 days on two Titan V GPUs. The final choices for each task are shown in Table 5. Initialization with [MASK] performed better than the random initialization.
We disable all dropouts inside the Transformer. We use the huggingface implementation of the AdamW optimizer with weight decay disabled. The gradient norm is clipped to 1.0. For batch sampling we use bucketing with padding noise of 0.1. To use device memory more effectively, we also set the maximum number of tokens per batch to 2048. The maximum sequence length is truncated to 512 tokens. We enable mixed precision and pad all sequence lengths to multiples of 8 for effective usage of Tensor Cores.

Table 5: Hyperparameters of our best-performing models.
[MASK] means the prompts are initialized with the word embedding of that token, and MNLI means the prompt is initialized with the prompts of our best MNLI run.

Table 6 lists the closest vocabulary words to the learned embeddings. Most tasks have two input sentences, so the prompts consist of three parts: one is added before the first sentence, the second is inserted between the sentences, and the third is appended after the second sentence. For single-sentence tasks, the second and third parts of the prompt are simply concatenated. Each task has trainable verbalizer tokens, one per output class.

B Learned Tokens
The prompts of RTE, MRPC and STS-B are quite similar to MNLI's prompts, as the models for these tasks were initialized from pretrained MNLI models. The other tasks were initialized with [MASK] tokens. The final model for CoLA did not move far from its initialization.

Table 6: The closest words to the prompt and verbalizer token embeddings for the best model of each task, measured by cosine distance.
[MASK] tokens highlighted in bold indicate the positions we use to output the prediction.