AdaPrompt: Adaptive Model Training for Prompt-based NLP

Prompt-based learning, with its capability to tackle zero-shot and few-shot NLP tasks, has gained much attention in the community. The main idea is to bridge the gap between NLP downstream tasks and language modeling (LM) by mapping these tasks into natural language prompts, which are then filled by pre-trained language models (PLMs). However, for prompt learning, there are still two salient gaps between NLP tasks and pretraining. First, prompt information is not necessarily sufficiently present during LM pretraining. Second, task-specific data are not necessarily well represented during pretraining. We address these two issues by proposing AdaPrompt, which adaptively retrieves external data for continual pretraining of PLMs by making use of both task and prompt characteristics. In addition, we make use of knowledge in Natural Language Inference models to derive adaptive verbalizers. Experimental results on five NLP benchmarks show that AdaPrompt can improve over standard PLMs in few-shot settings. In addition, in zero-shot settings, our method outperforms standard prompt-based methods by up to 26.35% relative error reduction.


Introduction
Prompt-based methods (Brown et al., 2020; Liu et al., 2021; Schick and Schütze, 2021a; Li and Liang, 2021) have received increasing attention in Natural Language Processing (NLP) recently. The main idea is to make the most of pretrained language models (PLMs) by adapting an NLP task into a natural language prompt, which can then be filled by PLMs. Take sentiment classification (Socher et al., 2013; Bai et al., 2021) for example. Given the sentence "I love the movie.", the standard task is to make a binary classification of its sentiment polarity (i.e., positive or negative). Prompt-based methods first transform the sentence into "I love the movie. The movie is ⟨mask⟩." (the underlined text is called the prompt), and then identify its polarity by checking whether PLMs tend to predict "good" or "bad" for the ⟨mask⟩ token (where the predicted words are then verbalized into class labels). The prompt-based task formulation is close to masked language modeling (Schick and Schütze, 2021a,b), the mainstream pretraining strategy, allowing PLMs to provide rich language knowledge seamlessly. Prompt-based methods have been shown to be particularly useful in zero-shot and few-shot settings (Petroni et al., 2019; Yin et al., 2019; Min et al., 2022), where, with limited direct task data, prompt-based inference benefits more from large-scale pretraining than task-oriented fine-tuning. Existing methods, however, still suffer from several potential limitations. First, the large raw text data used for pretraining do not necessarily contain sufficient patterns that are directly related to task-specific prompts (illustrated in Figure 1). For instance, the prompt for a question classification task is "Can you tell me the ⟨mask⟩: What are the twin cities?", where ⟨mask⟩ should be a class label word, e.g., location or person (the correct label for this sample is definition). However, LM pretraining data are typically the BOOKCORPUS (Zhu et al., 2015) plus the WIKIPEDIA corpus, where such prompts occur scarcely in literal or paraphrased form. As a result, directly using PLMs to fill such handcrafted prompts across domains can lead to poor performance. Second, to project label words to task labels, most existing work (Schick and Schütze, 2021a,b; Cui et al., 2021) uses a pre-defined verbalizer. However, it often requires expert knowledge to build a verbalizer that thoroughly covers candidate words, and a poorly designed verbalizer limits the accuracy of predictions. These problems become even more serious under zero-shot or very-few-shot settings, where prompt-based models rely heavily on the generalization ability of PLMs to new tasks and domains.
We propose AdaPrompt, a framework that adapts PLMs for end tasks considering both the prompts and the verbalizer. We are interested in addressing the above issues under a zero-shot setting, where little or no labeled training data are available for a particular task. The main idea is to adapt a PLM into a strong prompt-based model for an end task by exploring knowledge from its raw input data. In particular, as shown in Figure 2, given a raw test set without labels, we first ask a PLM to fill a prompt template for each input (e.g., "In summary, the movie is great.", where "great" is filled in by the PLM). Then, we use the resulting text (input text + prompt + PLM output) as a prompt-aware query to retrieve relevant data from a large unlabeled corpus. In this manner, we obtain a large dataset that contains both task and prompt characteristics, and we adaptively continue pretraining (Gururangan et al., 2020) the PLM on the retrieved data, which can substantially benefit prompt-based methods on downstream NLP tasks.
Meanwhile, we find that the current way of building verbalizers is also not optimal. Given a specific task, different words can be verbalized into the same class label. For example, a large number of adjectives can express positive sentiment, and the best-performing candidates depend on the domain, the PLM and the context. In AdaPrompt, we propose to adaptively augment verbalizers by making use of knowledge from PLMs and Natural Language Inference (NLI) models. Take sentiment analysis for example: given "good" and "bad" as seed verbalizers, we first let PLMs predict more candidate words, such as "amazing" and "great". Then, to identify whether these candidates are suitable for the verbalizer, we use an NLI model to predict whether "This movie is amazing." entails the meaning of "This movie is good.". In this way, we can automatically expand the verbalizers.
Experiments on five text classification tasks show that AdaPrompt outperforms baseline prompt-based methods by 2.29%-5.79% accuracy in the very-few-shot setting and 2.46%-15.00% in the zero-shot setting. To our knowledge, we are the first to consider how to bridge the gap between LM pretraining and NLP downstream tasks for prompt-based NLP. We release our code and data at https://github.com/cylnlp/AdaPrompt.

Related work
2.1 Zero/Few-shot Prompt-based NLP Although prompt-based methods have been used for multiple NLP tasks (Brown et al., 2020; Raffel et al., 2020; Cui et al., 2021), most existing work focuses on text classification (Shin et al., 2020; Gao et al., 2021; Min et al., 2022; Hu et al., 2022). A typical related work is PET (Schick and Schütze, 2021a), which formally defines the pattern-verbalizer pairs that have been widely adopted by subsequent work. Using such pairs, Schick and Schütze (2021a,b) develop a series of works to explore the potential of PLMs, including annotating soft labels for raw training data and iterative data augmentation. However, different from PET, which assumes the availability of a large silver training set for downstream tasks, we focus on zero- and very-few-shot settings, where even unannotated task-relevant data are limited (Perez et al., 2021). Therefore, following Hu et al. (2022), we simply focus on standard pattern-verbalizer pairs for text classification.
Prompt engineering (Jiang et al., 2020; Gao et al., 2021) focuses on how to create prompts that better induce PLMs to make correct predictions. Discrete prompt engineering works by replacing, deleting, inserting or paraphrasing parts of the prompt (Wallace et al., 2019; Yuan et al., 2021). Such methods can efficiently adapt PLMs to end tasks, but they rely heavily on annotated data for tuning parameters. Different from the above studies, we are interested in narrowing the gap between LM pretraining and NLP tasks for prompt learning in zero- or very-few-shot settings.
It has been shown that the choice of verbalizer can also be a key factor for prompt learning (Hu et al., 2022; Cui et al., 2021). However, manually exploring label words is time-consuming and may neglect potential candidates. Recently, Hu et al. (2022) used multiple external knowledge bases, such as related words and sentiment dictionaries, to augment verbalizers for corresponding tasks. Different from them, we focus on exploring knowledge in PLMs themselves. By making use of external NLI models, AdaPrompt can select verbalizers automatically without the need for labeled task data, which is useful in zero-shot settings.

Continual Pretraining for Domain Adaptation
Continual pretraining (Gururangan et al., 2020) has shown the benefit of adapting a PLM to a target domain before further fine-tuning. It can be categorised into domain adaptive and task adaptive continual pretraining: domain adaptive pretraining (DAPT) uses domain-relevant data, while task adaptive pretraining (TAPT) uses task-specific data. Similar to continual pretraining, many recent methods highlight the merits of relying on language modeling objectives for domain adaptation. Chronopoulou et al. (2019) and Radford et al. (2018) propose to train task-specific parameters for PLMs by using an auxiliary LM loss on target domains. Models like SciBERT (Beltagy et al., 2019), DialogLM (Zhong et al., 2021), AMRBART (Bai et al., 2022a), SARA-BERT (Bai et al., 2022b) and Dict-BERT (Yu et al., 2022) are PLMs continually pretrained on large amounts of domain/task-specific corpora.
Data selection is a common practice in domain adaptation for NLP models (Moore and Lewis, 2010; Ruder and Plank, 2017; van der Wees et al., 2017). It has been used in machine translation (van der Wees et al., 2017; Wang et al., 2018), parsing (Plank and van Noord, 2011; Ruder and Plank, 2017) and sentiment analysis (Ruder et al., 2017). The main idea is to have a selection model that can distinguish in-domain from out-of-domain data. The selection model can be a supervised classifier (Aharoni and Goldberg, 2020), a similarity-based metric (Plank and van Noord, 2011) or language model perplexity (Moore and Lewis, 2010). Very recently, Yao et al. (2021) proposed to retrieve a small set of training data from general corpora with labeled task data as queries, finding that using an LM objective on these data as an auxiliary loss can help train task-specific NLP models without pretraining.

Method
Our method is based on prompt-based text classification (Section 3.1). The overall procedure of AdaPrompt is shown in Figure 2 and can be divided into two parts: PLM adaptation (Section 3.2) and verbalizer adaptation (Section 3.4). In Section 3.3, we introduce a method that adapts both PLMs and verbalizers in an iterative way for continual improvement.

Prompt-based Text Classification
Given an input text x = (x_0, x_1, ..., x_n), we consider various tasks that classify the sentence into a class label l ∈ L. As mentioned in Section 1, the standard prompt-based method reformulates the input into a cloze-style question and identifies its label by checking PLMs' predictions. Table 1 shows the prompt templates and verbalizer patterns for the SST-2 (Socher et al., 2013), Yelp (Zhang et al., 2015), AGNews (Zhang et al., 2015), TREC (Voorhees and Tice, 2000) and DBPedia (Lehmann et al., 2015) datasets, which cover sentiment classification, topic classification and question classification tasks. Formally, let M be a language model pretrained on large-scale general data, and ⟨mask⟩ be the mask token. The prompt-based method first defines a pattern function, Prompt, that converts x into a cloze-style question containing ⟨mask⟩. Then, it defines a verbalizer function v, which maps a small set of pre-defined verbalizer words Y, predicted at the position of ⟨mask⟩, into class labels, i.e., v : Y → L.
Take sentiment classification of movie reviews for instance. The task is to classify the sentiment polarity, where L = {positive, negative}. For an input x, we choose the pattern: Prompt = "x. In summary, the movie is ⟨mask⟩." Then we define a verbalizer that maps Y = {"good", "bad"} into L: v("good") = positive; v("bad") = negative.
(Figure 2 walk-through: the test input "It's a charming and often affecting journey." is filled into the prompt "In summary, the movie is ⟨mask⟩."; the PLM predicts words such as "great" and "amazing" to form prompt-aware queries, which then retrieve related movie-review sentences from the general data.)
Given an example x = "It's a charming journey.", we can convert the input into a cloze-style question using Prompt: Prompt(x) = "It's a charming journey. In summary, the movie is ⟨mask⟩." Using such pattern-verbalizer pairs, we ask M to directly give a score s for each label l ∈ L as:

s(l | x) = P_M(⟨mask⟩ = y | Prompt(x)),    (1)

where l = v(y). The predicted label is:

l* = argmax_{l ∈ L} s(l | x).    (2)
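The scoring and argmax steps of Eqs. 1-2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_plm` is a hypothetical stand-in for a real masked language model, returning a probability for each candidate mask filler.

```python
# A minimal sketch of prompt-based classification (Eqs. 1-2). The function
# `mask_fill_probs` stands in for a real PLM and returns P(<mask> = word | text)
# for each requested word.
from typing import Callable, Dict, List

def prompt_template(x: str) -> str:
    """Convert an input sentence into a cloze-style question."""
    return f"{x} In summary, the movie is <mask>."

def classify(x: str,
             verbalizer: Dict[str, str],  # verbalizer word -> class label (v)
             mask_fill_probs: Callable[[str, List[str]], Dict[str, float]]) -> str:
    """Score each label by the PLM probability of its verbalizer word (Eq. 1)
    and return the argmax label (Eq. 2)."""
    probs = mask_fill_probs(prompt_template(x), list(verbalizer))
    best_word = max(probs, key=probs.get)
    return verbalizer[best_word]

# Toy stand-in for a masked LM: favours "good" for positive-sounding inputs.
def toy_plm(text: str, words: List[str]) -> Dict[str, float]:
    pos = sum(w in text.lower() for w in ("charming", "love", "great"))
    return {w: (0.5 + 0.2 * pos if w == "good" else 0.5 - 0.2 * pos) for w in words}

label = classify("It's a charming journey.",
                 {"good": "positive", "bad": "negative"}, toy_plm)
```

In practice the scoring function would query a PLM such as ROBERTA at the mask position; only the verbalizer mapping and argmax logic above follow the paper's formulation.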

Adaptively Retrieve Data for Continual Pretraining
As discussed in Section 1, the lack of domain adaptation can be a potential challenge for prompt-based NLP models, especially under zero-shot and very-few-shot settings. To tackle this problem, we propose to build a continual pretraining dataset by retrieving from general corpora, using unannotated test texts, designed prompts and label words as queries. In this way, we can obtain task-relevant data for any task or domain, using only the test input.
Meanwhile, prompt and verbalizer information is also considered during the retrieval process, leading to a more comprehensive dataset for prompt-aware continual pretraining. Formally, given a retrieval query q, a retrieval engine E_D indexed on a large general dataset D returns a set of similar texts d_q = E_D(q). To obtain prompt-aware data that not only adapt PLMs to target domains but also make PLMs more sensitive to prompts, we include both task and prompt characteristics when building queries. As shown in Figure 2, for a raw input text x, we first convert it into Prompt(x) and obtain a set of predicted label words using a PLM M:

O = top-|O| of P_M(⟨mask⟩ | Prompt(x)),    (3)

where O = {o_1, o_2, ..., o_|O|} are the top-|O| predictions. We replace the mask token in Prompt(x) with each o_i to form a list of queries:

Q = {q_1, q_2, ..., q_|O|},    (4)

where q_i = "x. In summary, the movie is o_i." With this set of prompt-based queries, we retrieve prompt-aware data D_p, a small subset of the general data. In this work, we use ElasticSearch indexed on a large general corpus as the search engine and ask it to return a list of the top-k texts matching each query. As shown in Figure 2, one test input can lead to multiple prompt-aware queries, because the masked token in the prompt can be replaced by any of the |O| predictions. In addition, given one query, ElasticSearch can return multiple results depending on the requested k.
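The query-construction step (Eqs. 3-4) can be sketched as below. This is an illustrative snippet under assumptions: `top_predictions` is a hypothetical stand-in for the PLM's top-|O| mask predictions, and the toy lambda simply returns fixed words.

```python
# Sketch of prompt-aware query construction: the mask in Prompt(x) is replaced
# by each of the PLM's top-|O| predictions to form the query list Q (Eq. 4).
from typing import Callable, List

def build_queries(x: str,
                  template: str,
                  top_predictions: Callable[[str, int], List[str]],
                  n: int) -> List[str]:
    prompt = template.format(x=x)        # e.g. "x. In summary, the movie is <mask>."
    preds = top_predictions(prompt, n)   # top-|O| label-word predictions O (Eq. 3)
    return [prompt.replace("<mask>", o) for o in preds]

queries = build_queries(
    "It's a charming journey.",
    "{x} In summary, the movie is <mask>.",
    lambda prompt, n: ["great", "amazing"][:n],  # toy stand-in for the PLM
    n=2,
)
# Each query is then sent to the search engine, which returns the top-k
# matching texts to form the prompt-aware pretraining data D_p.
```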
We continue to pretrain the PLM M on D_p with the masked language modeling loss and obtain an adapted PLM, M_Dp, which contains richer knowledge of both the target domain and the prompts and can replace M in Eq. 1 for zero-shot text classification.

Iterative Adaptation
After obtaining M_Dp, we can iterate the process by replacing M with M_Dp in Eq. 3, obtaining a new set of predicted words O′ and a new list of queries Q′. Given that O′ contains more in-domain knowledge, we can retrieve higher-quality pretraining data with more task-relevant information by using Q′ to query E_D. In this way, we obtain a new version of the data, D′_p, and a new continually pretrained PLM, M′_Dp, which can also be used for zero-shot predictions using Eq. 1. In this work, we conduct this procedure twice.
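The overall loop can be summarized in pseudocode-like Python. This is only a structural sketch: `predict_words`, `retrieve` and `continual_pretrain` are hypothetical stand-ins for the PLM prediction, search-engine retrieval and MLM pretraining stages described above.

```python
# Sketch of the iterative adaptation loop: each round uses the current model to
# predict label words, builds prompt-aware queries, retrieves data, and
# continues pretraining on it.
def adapt(model, test_inputs, rounds=2,
          predict_words=None, retrieve=None, continual_pretrain=None):
    for _ in range(rounds):                       # the paper runs this twice
        queries = []
        for x in test_inputs:
            for word in predict_words(model, x):  # O in round 1, O' in round 2
                queries.append((x, word))
        data = retrieve(queries)                  # D_p, then D'_p
        model = continual_pretrain(model, data)   # M_Dp, then M'_Dp
    return model

# Toy usage with mocked stages, to show the control flow only.
calls = []
model = adapt("M", ["a"], rounds=2,
              predict_words=lambda m, x: ["w"],
              retrieve=lambda qs: qs,
              continual_pretrain=lambda m, d: (calls.append(d), m + "+")[1])
```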

Adaptive Verbalizer Augmentation
As described in Section 3.1, the regular prompt-based method defines a verbalizer that maps predicted label words into task classes, such as "good" for positive and "bad" for negative. However, a pre-defined verbalizer can be limited. To expand the verbalizer, we first infer the top-|O| label words at the mask position over all inputs in the test set. We filter the predicted words and obtain a set of high-frequency words C as candidates for verbalizer augmentation. Then, we propose a new method for exploring useful verbalizer words by using knowledge from a Natural Language Inference model. Specifically, given a seed verbalizer word y_l ∈ Y_l for label l and a candidate word c ∈ C, we check whether a prompt filled by y_l entails the prompt filled by c (or vice versa). The pseudo code is shown in Algorithm 1. If the entailment relation holds for this pair, we add c to Y_l, and the new Y can be considered an augmented verbalizer.
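Algorithm 1 can be sketched as follows. This is an illustrative version under assumptions: `nli` is a hypothetical stand-in for the entailment model (the paper uses ROBERTA-large fine-tuned on MNLI), and the toy synonym check merely mimics its entailment decisions.

```python
# Sketch of Algorithm 1: a candidate word c joins the verbalizer Y_l if the NLI
# model predicts entailment in either direction between the prompt filled with
# a seed word and the prompt filled with c.
from typing import Callable, List, Set

def augment_verbalizer(template: str,
                       seeds: Set[str],
                       candidates: List[str],
                       nli: Callable[[str, str], str]) -> Set[str]:
    augmented = set(seeds)
    for c in candidates:
        for y in seeds:
            premise, hypothesis = template.format(w=y), template.format(w=c)
            if nli(premise, hypothesis) == "entail" or nli(hypothesis, premise) == "entail":
                augmented.add(c)
                break
    return augmented

# Toy NLI stand-in: treats known synonyms of "good" as mutually entailing.
synonyms = {"good", "great", "amazing"}
toy_nli = lambda p, h: ("entail"
                        if all(any(w in s for w in synonyms) for s in (p, h))
                        else "neutral")

new_set = augment_verbalizer("This movie is {w}.", {"good"},
                             ["amazing", "boring"], toy_nli)
```

With a real NLI model, "This movie is amazing." would be tested against "This movie is good." exactly as described in Section 1.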
After obtaining the augmented set of verbalizer words Y_l for each label, Eq. 1 can be rewritten as:

s(l | x) = (1 / |Y_l|) Σ_{y ∈ Y_l} P_M(⟨mask⟩ = y | Prompt(x)),    (5)

and we can still use Eq. 2 for prediction.

Datasets and Prompts
To evaluate our methods, we conduct experiments on five benchmarks: the SST-2 (Socher et al., 2013), Yelp (Zhang et al., 2015), AGNews (Zhang et al., 2015), TREC (Voorhees and Tice, 2000) and DBPedia (Lehmann et al., 2015) datasets. Table 1 shows the prompt templates and seed verbalizer words we use for each dataset. For AGNews and YELP, we adapt patterns and verbalizers from PET (Schick and Schütze, 2021a), since it is the basic prompt-based method that has been most widely used.
AGNews is a text classification dataset in the news domain. Given a headline and a main text body, the model is required to classify the news into one of four classes: (1) World, (2) Sports, (3) Business or (4) Science/Tech.
YELP is a sentiment analysis dataset. Given a restaurant review, the task is to predict whether the review is positive or negative.
SST-2 is a sentiment analysis dataset similar to YELP, but its domain is movie reviews. Thus, we use the same seed prompt and verbalizer words as for YELP, changing "restaurant" in the prompt template to "movie".
DBPedia 2014 is an ontology classification dataset, extracted from DBPedia 2014, with 14 non-overlapping classes such as Educational Institution and Office Holder. We define two patterns for this task: P1(x) = "Description to the ⟨mask⟩ x" and P2(x) = "Introduction to the ⟨mask⟩ x", and we use P2 as the seed pattern.
TREC-10 is a question classification dataset. Given a question, the task is to identify the objective the question asks about and classify it into one of six classes, such as a definition question or a numeric question. We define two patterns for this task: P1(x) = "Tell me the ⟨mask⟩ x" and P2(x) = "Can you tell me the ⟨mask⟩: x", and use P2 as the seed prompt.
We conduct experiments in zero-shot and few-shot settings. In the zero-shot setting, we directly use PLMs to infer label words at masked positions. Under the few-shot setting, we follow Schick and Schütze (2021a) and Hu et al. (2022) and use prompt-tuning, which directly fine-tunes an LM given a small set of annotated data and prompts.
For zero-shot settings, the choice of hyperparameters is based on previous work (Gao et al., 2021; Schick and Schütze, 2021a,b). For all continual pretraining, we use a learning rate of 1e-5 and a batch size of 96. We train each model for 3 epochs and use the checkpoint at 500 steps for evaluation.
For few-shot settings, we evaluate our models with 10, 50 and 100 training samples. We follow previous work (Hu et al., 2022; Schick and Schütze, 2021a; Gao et al., 2021), repeat training and evaluation five times using different seeds, and report the averaged scores for each dataset.
Prompt-Aware Data Retrieval We take the pretraining data of the ROBERTA model (BOOKCORPUS (Zhu et al., 2015), WIKIPEDIA, CC-NEWS (Nagel, 2016), STORIES (Trinh and Le, 2018) and OPENWEBTEXT (Gokaslan and Cohen, 2019)) as the general dataset to query from. We index it at the sentence level with ElasticSearch and use TF-IDF as the similarity metric.
Verbalizer Augmentation To obtain verbalizer words that better represent the classes, we first obtain the top-N predicted words for each test sample (N = 20 for SST-2 and TREC, N = 10 for AGNews and N = 5 for YELP and DBPedia, considering their test set sizes). We set the number of candidate words |C| = 20 × |L|, where |L| is the number of classes. We use a ROBERTA-large model fine-tuned on MNLI (Williams et al., 2018) as the entailment model for identifying potential verbalizer words for augmentation. Candidates with probability higher than a threshold t are added to the augmented verbalizer. We set t = 0.4 by experiments.
For comparison, we also use Word2Vec (Mikolov et al., 2013) to obtain word vectors and explore potential verbalizer words by their similarity with the seed verbalizer words.
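The word-similarity baseline can be sketched as below. The 2-d vectors are toy stand-ins for Word2Vec embeddings, and the 0.9 threshold is an illustrative value, not the paper's setting.

```python
# Sketch of the word-similarity baseline: a candidate word is selected when its
# embedding is cosine-close to a seed verbalizer word's embedding.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy 2-d "embeddings" standing in for real Word2Vec vectors.
vectors = {"good": (1.0, 0.2), "great": (0.9, 0.3), "table": (0.1, 1.0)}
seed = "good"
selected = [w for w in ("great", "table")
            if cosine(vectors[w], vectors[seed]) > 0.9]
```

With real Word2Vec vectors, a library similarity query (e.g. a nearest-neighbour lookup around each seed word) would replace this manual scan.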

Main Results
Zero-shot Performance In the zero-shot setting, we compare AdaPrompt with prompt-based methods using ROBERTA (Schick and Schütze, 2021a), GPT-2 (Gao et al., 2021) and GPT-3 (Zhao et al., 2021), respectively. Channel refers to the noisy channel model (Min et al., 2022) based on GPT-2.

Table 3: Zero-shot results. We report the average accuracy and standard deviation over different patterns; results of the best patterns are shown in brackets. Avg. reports the overall averaged results. R. stands for ROBERTA-large. Ada and iAda denote AdaPrompt and iterative AdaPrompt based on ROBERTA-large, respectively. The results of GPT-2 large and Channel are from Min et al. (2022), and Channel is based on GPT-2 large. GPT-3 results are reported by Zhao et al. (2021), who use only a fixed prompt format. NA denotes that results are not reported.

Following previous work (Schick and Schütze, 2021a,b), we report the average accuracy, the standard deviation, and the accuracy of the best pattern over different patterns. First, compared with our foundation model, ROBERTA-large, we see that AdaPrompt consistently outperforms regular prompt-based methods on all datasets in both average and best-pattern performance, bringing a 2.46 ∼ 14.63 improvement. Notably, AdaPrompt outperforms GPT-3 in the zero-shot setting, a huge model with 175B parameters pretrained on a gigantic corpus. This confirms the effectiveness of AdaPrompt in domain adaptation. We observe that iterative AdaPrompt brings further improvements on most datasets (SST-2, YELP and DBPedia). This directly demonstrates that PLMs continually pretrained on the retrieved data are more adaptive to downstream tasks, and thus generate more task-relevant label words, which in turn serve as a source for finding better texts. The performance of iterative AdaPrompt (iAda) decreases on AGNews; we believe this is because this news dataset is similar to the general data used for pretraining ROBERTA, so continual pretraining on such retrieved data is less useful. Finally, we see that AdaPrompt improves the overall performance by over 10.09 in accuracy.
Few-shot Performance Table 4 reports the experimental results in the few-shot setting. Each experiment is repeated 5 times using different seeds, and we report the average accuracy and standard deviation. To explore whether AdaPrompt consistently brings improvements to ROBERTA, we conduct experiments using 10, 50 and 100 samples, respectively.
Compared with the ROBERTA-large baseline, AdaPrompt still improves model performance in the few-shot setting. Although the relative improvement decreases as the size of the training set grows, AdaPrompt outperforms ROBERTA over all tasks in all few-shot settings. In particular, AdaPrompt outperforms standard ROBERTA models by 2.29 ∼ 5.79% in the 10-shot setting, showing that it is useful in the very-few-shot setting.

Ablation Study
To study the effectiveness of continual pretraining on prompt-aware data and of verbalizer augmentation, we conduct ablation experiments by removing continual pretraining (CP) or verbalizer augmentation (va), with results shown in Table 5. In addition, we investigate the influence of removing prompt-aware retrieval and retrieving with raw texts only. From the table we can see that on all datasets, using prompt-augmented queries (AdaPrompt) gives substantially stronger results. Take SST-2 for example: the accuracy is 71.22 (SST-2 -PR) given only raw input queries, but 75.92 with prompt-augmented queries, a 4.7 absolute improvement. This shows that continual pretraining on prompt-aware data is highly beneficial to zero-shot prompt-based NLP.

Analysis
Generalization Capability For the experiments in Section 4.3.1, we use the task test set as the source for building queries to retrieve pretraining data. However, in a more general setting, we want to know whether AdaPrompt can still generalize when the query data and the test set differ. To this end, we build an unseen test set from the original training sets of SST-2 and DBPedia. We then evaluate models (trained using queries from the original test set) on this unseen test set. As shown in Table 6, AdaPrompt achieves 73.05 and 70.97 accuracy on SST-2 and DBPedia, respectively. Compared with the performance on the original test set (Table 3), although AdaPrompt's performance slightly decreases on the SST-2 unseen test set, it still outperforms ROBERTA by a large margin (+8.23). This demonstrates that AdaPrompt generalizes well when the query data and the test set are different.
Size of Retrieved Data As stated, ElasticSearch returns the top-k texts in order of matching score. With a smaller k, the retrieved data are more textually related to the query, while with a larger k, the retrieved data can contain more noise. To compare the effects of different sizes of retrieved data for continual pretraining, we set k to 1, 10, 50 and 100 for SST-2 and to 1, 5, 25 and 50 for DBPedia, respectively. As shown in Table 7, accuracy rises at first as the retrieval size increases, but as the retrieval size grows further, accuracy starts to decrease slightly. This can be explained by the lower-ranked retrieved data having lower relevance to the target task, which introduces more noise into continual pretraining. We use a fixed k for the experiments in zero-shot settings (Section 4.2), due to the lack of a validation set. In few-shot settings, k can in practice be treated as a hyperparameter and tuned on validation data.
The Effect of Verbalizer Strategies Table 8 compares model performance when using different verbalizer augmentation strategies, namely the NLI model and word similarity (Section 4.2). Additionally, we compare AdaPrompt with a verbalizer augmentation method using knowledge bases (KBs) (Hu et al., 2022). For a fair comparison, we limit the verbalizer word set for each label to 5. We report the average accuracy and standard deviation. Results show that, compared with using word similarity to select candidate words or directly using KBs to augment verbalizer words, using NLI gives better performance on most tasks and is also more stable. We also find that using KBs gives better performance on the DBPedia task, but much worse performance on the TREC task. This can be because TREC is less close to topic classification (Min et al., 2022), so directly using the most related words can be noisy. This also suggests that more sophisticated strategies that take task and prompt information into account could be useful, which we leave for future work.

AdaPrompt with different PLMs
We apply AdaPrompt with different PLMs (BERT-large, ALBERT-large and ROBERTA-large) and report experimental results on the SST-2 dataset in Table 9. Although the performance of different models varies, we observe that AdaPrompt consistently brings large improvements for all models. We also find that model performance increases with model size. AdaPrompt with ROBERTA-large outperforms the other models in overall performance by a large margin (8.29 ∼ 18.67) and achieves 91.74 accuracy with the best pattern.

Conclusion
We investigated AdaPrompt, a zero-shot prompt-based method for NLP that makes use of test input data and prompts for adaptive continual pretraining and verbalizer selection. Results on five classification datasets show that AdaPrompt improves over a standard prompt method by large margins. In particular, retrieving relevant data for continual pretraining of a language model can serve to warm up the model for both domain adaptation and prompt filling. In addition, an NLI model allows effective selection of filled tokens to achieve improved performance.

Limitation
We acknowledge two major limitations of this work: 1. We only tested AdaPrompt on text classification tasks. The intention is to use this clear setting to compare with other prompt-based models. However, it is possible to extend AdaPrompt to other natural language understanding tasks or languages, which we leave for future exploration.
2. We only tested with ElasticSearch as the search method. However, there are signs that the quality of the retrieved text is constrained by the search engine. A better configuration or model for the search method might further improve AdaPrompt.

Figure 1 :
Figure 1: The distributions of data in prompt-based models. Task data, domain data, prompt data and general data (for LM pretraining) are usually sampled from different distributions while retaining a certain overlap (target data for prompt training). We aim to explore data from the overlapping area to bridge the gap between PLMs and downstream tasks in prompt-based systems.

Algorithm 1
Verbalizer Adaptation
Input: prompt P, seed verbalizer words y ∈ Y_l, candidate words c ∈ C, and an NLI system N
for c in C do
    if N(fill(P, y), fill(P, c)) = Entail or N(fill(P, c), fill(P, y)) = Entail then
        add c to Y_l
    end if
end for
Return Y_l

Table 1 :
Datasets used in this paper with seed prompts and verbalizer words.Each seed verbalizer word corresponds to a class label.

Table 2 :
Data statistics for the datasets. E space corresponds to the ElasticSearch space. Note that the resulting data size is calculated after data de-duplication.

Table 2
presents the statistics of the evaluation datasets used in this paper. TREC and SST-2 contain smaller test sets, while YELP and DBPedia contain much larger ones. To balance the retrieved data size, we set different top-|O| values for predicted words and different ElasticSearch spaces (k) for different datasets, based on practical experience. In other words, given one test input, we obtain |O| × k retrieved texts. After de-duplication, the resulting retrieved data sizes are shown in Table 2.
Table 3 presents the results under the zero-shot setting.

Table 5 :
Experimental results of the ablation study. "-" means "without" here. va: verbalizer augmentation, CP: Continual Pretraining, PR: Prompt-aware Retrieval. Note that -PR means we do not use prompt-aware retrieval, but simply use the raw test input data for retrieval and continual pretraining, referred to as in-domain adaptation.

Table 6 :
Model performance tested on the unseen test set. We report averaged accuracy and standard deviation.

Table 7 :
Analysis of retrieved data size. Data sizes are calculated after de-duplication.

Table 8 :
Model performance of AdaPrompt using different verbalizer augmentation strategies. va_w: using word2vec similarity. va_m: using ROBERTA trained on MNLI. va_k: using most related words/sentiment dictionary. Avg. refers to overall averaged results.

Table 9 :
We report average accuracy and standard deviation here. Results of the best patterns are shown in brackets.