Pre-trained Language Models Can be Fully Zero-Shot Learners

How can we extend a pre-trained model to many language understanding tasks, without labeled or additional unlabeled data? Pre-trained language models (PLMs) have been effective for a wide range of NLP tasks. However, existing approaches either require fine-tuning on downstream labeled datasets or manually constructing proper prompts. In this paper, we propose nonparametric prompting PLM (NPPrompt) for fully zero-shot language understanding. Unlike previous methods, NPPrompt uses only pre-trained language models and requires neither labeled data nor an additional raw corpus for further fine-tuning, nor does it rely on humans to construct a comprehensive set of prompt label words. We evaluate NPPrompt against previous major few-shot and zero-shot learning methods on diverse NLP tasks, including text classification, text entailment, similar text retrieval, paraphrasing, and multiple-choice question answering. Experimental results demonstrate that NPPrompt outperforms the previous best fully zero-shot method by large margins, with absolute gains of 12.8% in accuracy on text classification and 15.6% on the GLUE benchmark. Our source code is available at https://anonymous.4open.science/r/NPPrompt.


INTRODUCTION
Natural language understanding (NLU) is important in many applications such as intelligent dialog assistants, online search, and social media analysis. Recent advances in NLU have been driven by pre-trained language models (PLMs) including BERT (Devlin et al., 2019; Liu et al., 2019b), GPT (Radford et al., 2018; Brown et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Prior studies show that PLMs obtain substantial knowledge during pre-training on raw text corpora (Petroni et al., 2019; Feldman et al., 2019). By fine-tuning on task-specific labeled data, PLMs exploit such knowledge and achieve impressive accuracy on a wide range of NLP tasks, such as text classification (Kowsari et al., 2019), question answering (Rajpurkar et al., 2016), and machine reading comprehension (Campos et al., 2016).
However, fine-tuning approaches are expensive. They require labeled datasets, which are rarely available for many tasks, and significant computational effort is needed to update PLM parameters for multiple tasks. In addition, fine-tuning produces one distinct model per task to maintain.
How can we generalize a pre-trained model to many NLP tasks, without labeled or additional unlabeled data? Existing few-shot and zero-shot approaches construct prompts to elicit desired predictions from PLMs (Brown et al., 2020). The main idea of prompting PLMs is to convert an input utterance to one with masked templates. For example, in text classification an input such as "The Warriors won the NBA championship 2022" is converted to "A [MASK] news: The Warriors won the NBA championship 2022". A PLM (e.g. BERT) takes the converted text and produces predictions for the masked token, along with their probabilities. Ideally, the PLM will assign a higher probability to the word "sports" than to "politics" at the [MASK] token.
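The following is a minimal sketch of this prompting procedure, assuming a RoBERTa-style masked LM from the Hugging Face transformers library; the template and candidate words mirror the example above, and everything else (model choice, scoring one subword per candidate) is an illustrative assumption rather than the paper's exact setup.

```python
# Minimal sketch: score candidate label words at the [MASK] position.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

text = "The Warriors won the NBA championship 2022"
prompt = f"A {tokenizer.mask_token} news: {text}"  # the template T(x)

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary

for word in [" sports", " politics"]:  # leading space: full-word BPE tokens
    tok_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]
    print(word, logits[tok_id].item())  # "sports" should score higher
```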
Although these prompting-based methods are effective, they require unlabeled data for training or substantial human effort to construct prompts and to choose designated tokens to represent class labels (Schick & Schütze, 2021a; Gao et al., 2021). In addition, these manually constructed verbalizers, i.e. mappings from words (e.g. "basketball") to class labels (e.g. SPORTS), do not extend to new categories that emerge after PLMs are deployed.
In this paper, we investigate the fully zero-shot learning problem for NLU, where only the target label names are available and no extra raw text. We propose nonparametric prompting PLM (NPPrompt), a novel method to generate predictions for semantic labels without any fine-tuning. NPPrompt uses the PLM's own embeddings to automatically find words relevant to labels (e.g. "basketball" and "NBA" for SPORTS), so it does not need humans to construct verbalizers. Our key idea is to search for the top-k nearest neighbors of a label name in the embedding manifold and then generate and aggregate the PLM's predicted logits from masked prompts. In the above case, the predicted values for both "basketball" and "NBA" contribute to the final prediction for the SPORTS category. In this way, NPPrompt generalizes easily to any new categories as long as the category names are semantically meaningful.
The contributions of this paper are as follows. a) We develop NPPrompt, a novel method for fully zero-shot learning with PLMs. b) We conduct extensive experiments on diverse language understanding tasks including text classification, text entailment, similar text retrieval, and paraphrasing. Experimental results show that NPPrompt outperforms previous zero-shot methods by an absolute 12.8% in accuracy on text classification and 15.6% on the GLUE benchmark. Surprisingly, NPPrompt is on a par with the best prior method that is trained with manual verbalizers, an additional knowledge base, and extra unlabeled data.

RELATED WORK
Prompting The success of GPT-3 (Brown et al., 2020) has attracted much attention to prompt engineering, a new way to leverage pre-trained language models. Brown et al. (2020) concatenate a few input-output pairs and feed them to the large-scale GPT-3 language model, an intuitive in-context learning paradigm that lets the model generate answers for additional cases autoregressively. Recent works (Schick & Schütze, 2021a;b) show that smaller pre-trained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ALBERT (Lan et al., 2019) can also achieve decent performance using prompt-tuning. Prompting has been applied to a large variety of tasks such as text classification (Schick & Schütze, 2021a), natural language understanding (Schick & Schütze, 2021b), knowledge probing (Petroni et al., 2019), and relation extraction (Han et al., 2021). Typically, a prompt contains a template and a verbalizer: the language model predicts a probability distribution over the vocabulary given the template, and the verbalizer transforms it into a prediction over class labels. In this work, we focus on designing the verbalizer automatically.
Verbalizer Design The verbalizer is an important component in prompting; it bridges model outputs and labels and greatly impacts performance. Schick & Schütze (2021a) design human-written verbalizers for prompting; however, such verbalizers are highly biased towards personal vocabulary and have inadequate coverage. Apart from manually designed verbalizers, some recent studies explore automatic verbalizer construction. Auto-L (Gao et al., 2021) uses re-ranking to find the label word set by fine-tuning the model on candidates searched by RoBERTa; AutoPrompt (Shin et al., 2020) applies gradient-based search to create both prompts and label words automatically from a few trigger examples. But these approaches need to update parameters with gradient descent, which is infeasible without access to the model weights (e.g., GPT-3). KPT (Hu et al., 2022) incorporates external knowledge into the verbalizer but needs an unlabeled dataset to refine the label words, so it is not applicable to scenarios where only label names are known. In contrast, our approach NPPrompt directly finds, without any gradient update, words relevant to the label names using only the PLM's initial word embedding.
Zero-shot Text Classification General zero-shot text classification usually focuses on classifying texts into classes that are unseen during training. Transferring knowledge from seen classes to unseen ones requires accurate and discriminative descriptions of all classes (Liu et al., 2019a; Xia et al., 2018), joint embeddings of categories and documents (Nam et al., 2016), or semantic correlations among classes (Rios & Kavuluru, 2018; Zhang et al., 2019). However, these methods require supervised data for the known label set and thus cannot be extended to settings where no labeled pairs for any category are available. Meng et al. (2020) instead rely on an unlabeled corpus for extracting the topic-related words and performing self-training. In this work, NPPrompt achieves competitive and even better performance without using any unlabeled dataset.

Figure 1: The illustration of NPPrompt. We generate the label words by searching for related words in the initial word embedding of the pre-trained language model. By aggregating logits from the label words, we predict the category with the largest score (SPORTS).

BACKGROUND: PROMPT-BASED TUNING FOR PLMS
We first review the standard paradigm, prompt-based tuning, which performs well in few-shot scenarios, before introducing our approach for the zero-shot case. Take N-way text classification as an example: we aim to predict the label y ∈ Y for each sentence, where Y is the label set with N distinct classes.
Prompt-based tuning tunes a PLM using customized prompts (Brown et al., 2020). Regular prompt-based tuning converts a specific task to a cloze-style masked language modeling problem. For each input example x (a single sentence or a sentence pair), we first apply a task template T, converting the original input x to x_prompt. For instance, we concatenate the template T(·) = "A [MASK] news :" with the original input "The Warriors won the NBA championship 2022" and wrap it into:

x_prompt = "A [MASK] news : The Warriors won the NBA championship 2022".

The verbalizer f in vanilla prompt engineering maps a set of selected words V from the vocabulary to the original label space Y, i.e., f : V → Y. Inversely, we use M(y_j) to denote the label words in V that are mapped to a specific label y_j, with ∪_{y_j ∈ Y} M(y_j) = V. We then calculate the probability of label y_j as

P(y_j | x_prompt) = g({P([MASK] = v | x_prompt) : v ∈ M(y_j)}),   (1)

where g(·) aggregates the probabilities of the label words into the probability of the label. PLMs can then be fine-tuned by minimizing the cross-entropy loss on supervised examples.
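As a toy illustration of this notation, the snippet below builds a manual verbalizer f and its inverse image M; the label words here are illustrative, not the ones used in the paper.

```python
# A toy manual verbalizer: M maps each class y to its label words M(y),
# and f: V -> Y is recovered by inverting M.
verbalizer_M = {
    "SPORTS":   ["sports", "basketball", "NBA"],
    "POLITICS": ["politics", "government", "election"],
}
f = {word: label for label, words in verbalizer_M.items() for word in words}
assert f["NBA"] == "SPORTS"
```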
NPPROMPT

The key idea of NPPrompt is to use the PLM's word embeddings to automatically construct verbalizers, i.e. mappings from words to labels, in a fully zero-shot way. It does not need any additional raw text corpus for fine-tuning. NPPrompt consists of two steps to compute predictions for any labels in a nonparametric form (Figure 1). 1) We search for all label words closely related to each class y_j in the PLM's token embedding manifold. 2) We then use the PLM to predict values for [MASK], filter them using each class's set of label words, and aggregate the properly weighted outputs to produce the final prediction. In the following, we describe NPPrompt for text classification, but it generalizes to other language understanding tasks.
k-Nearest-Neighbor Verbalizer Construction For each class label (e.g. SPORTS), we search over the whole vocabulary V for the top-k words nearest to the label name in the PLM's embedding space. Here, the distance between words and label names is measured by the cosine similarity score; other distance metrics work as well and are examined in Section 5. We denote k as the neighborhood number. Assuming the embeddings of word v_i and label name y_j are emb(v_i) and emb(y_j) respectively, the label words of the verbalizer for y_j are selected by top-k ranking:

M(y_j) = top-k_{v_i ∈ V} S(emb(v_i), emb(y_j)),   (2)

where S(·,·) is the cosine similarity function:

S(emb(v_i), emb(y_j)) = (emb(v_i) · emb(y_j)) / (||emb(v_i)|| ||emb(y_j)||).   (3)

Since the PLM is already pre-trained on a raw text corpus, it has acquired sensible semantic knowledge about the relatedness of words in the vocabulary. We use the PLM's embedding to search for label words semantically relevant to the given label names. For illustration, Table 1 shows the label words found for two categories of the AG News dataset (Zhang et al., 2015) and the corresponding similarity scores. We also extend our verbalizer to support label names with longer expressions in Appendix A.4.

Table 1: Top-10 label words and cosine similarities for two AG News categories.

Word          Sim     Word             Sim
" sports"     1.00    " business"      1.00
" Sports"     0.77    " Business"      0.78
" sport"      0.75    " businesses"    0.74
" sporting"   0.68    "business"       0.72
" athletics"  0.65    "Business"       0.67
"sports"      0.65    " businessman"   0.59
"Sports"      0.65    " corporate"     0.58
" Sport"      0.62    " company"       0.56
" athletic"   0.61    " enterprise"    0.55
" athletes"   0.61    " businessmen"   0.55
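A small sketch of this construction, using RoBERTa's input embedding matrix as emb(·); the function name and the leading-space convention for full-word BPE tokens are our assumptions.

```python
# Sketch of the k-NN verbalizer (Eqs. 2-3): rank the whole vocabulary by
# cosine similarity to a label name in the PLM's input embedding space.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
emb = model.get_input_embeddings().weight.detach()  # |V| x d matrix

def knn_label_words(label_name, k=10):
    # Leading space so RoBERTa's BPE yields the full-word token (" sports").
    label_id = tokenizer.convert_tokens_to_ids(
        tokenizer.tokenize(" " + label_name))[0]
    sims = torch.nn.functional.cosine_similarity(
        emb, emb[label_id].unsqueeze(0), dim=-1)       # S(emb(v_i), emb(y_j))
    top = sims.topk(k)                                 # top-k ranking, Eq. 2
    return [(tokenizer.convert_ids_to_tokens(i.item()), round(s.item(), 2))
            for i, s in zip(top.indices, top.values)]

print(knn_label_words("sports", k=5))  # e.g. [(" sports", 1.0), ...]
```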
Nonparametric Aggregation of Prompted Predictions For each input text x, we construct a prompt-augmented sequence x_prompt = T(x) with a [MASK] token and use the PLM to predict tokens for [MASK]. In contrast to previous prompting methods, which directly calculate the probability of the surface labels, we use the nearest label words found above to compute the probability of each output label. Only the words in a label's top-k neighborhood contribute to the class prediction, and the contributions of different label words are not equal.
To be specific, given T(x), the PLM produces the logit vector Θ_[MASK] over all possible words at the [MASK] token; if the whole vocabulary is V, then Θ_[MASK] ∈ R^|V|. We then compute the class score for a label y_j by aggregating the logits filtered by the verbalizer's label words, using kernel smoothing:

Q(y_j | x_prompt) = Σ_{v_i ∈ M(y_j)} w(v_i, y_j) · Θ_{[MASK] = v_i},   (4)

where the weight between label word v_i and class name y_j is defined as

w(v_i, y_j) = exp(S(emb(v_i), emb(y_j))) / Σ_{v_k ∈ M(y_j)} exp(S(emb(v_k), emb(y_j))).   (5)

Finally, the best class prediction is selected as the maximum over all labels:

ŷ = argmax_{y_j ∈ Y} Q(y_j | x_prompt).   (6)

Note that since we apply kernel smoothing to logits instead of probabilities, Q is an unnormalized probability.
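Below is a minimal sketch of this aggregation. The exp-normalized weighting is our reading of the kernel-smoothing weights in Eq. 5 (it is consistent with the "Default" setting being contrasted with the "Average weight" ablation in Section 5); treat the exact form as an assumption, and the helper names as ours.

```python
# Sketch of the nonparametric aggregation (Eqs. 4-6): filter the [MASK]
# logits by each class's label-word set and combine with similarity weights.
import torch

def class_score(mask_logits, label_word_ids, sims):
    # mask_logits: |V| logits at [MASK]; label_word_ids: ids of M(y_j);
    # sims: cosine similarities S(emb(v_i), emb(y_j)) for those ids.
    w = torch.softmax(torch.tensor(sims), dim=0)           # Eq. 5 (assumed)
    return (w * mask_logits[label_word_ids]).sum().item()  # Eq. 4

def predict(mask_logits, verbalizer):
    # verbalizer: {class: (label_word_ids, sims)} built by the k-NN step
    scores = {y: class_score(mask_logits, ids, s)
              for y, (ids, s) in verbalizer.items()}
    return max(scores, key=scores.get)                     # Eq. 6
```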
There are cases where a class's label name carries little semantic meaning or where several keywords are needed to define a label. For instance, in the DBPedia dataset (Lehmann et al., 2015), one class corresponds to NATURALPLACE, and we can use the keywords {"river", "lake", "mountain"} to represent it. In this setting, we first pick the keyword with the maximum score calculated by Equation 4 to represent each label, and then choose the label with the largest score. Using Φ(y_j) to denote all keywords of class y_j, the final prediction is:

ŷ = argmax_{y_j ∈ Y} max_{y ∈ Φ(y_j)} Q(y | x_prompt).   (7)
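A short sketch of this keyword variant, reusing class_score from the previous sketch; the data layout is an assumption.

```python
# Sketch of the multi-keyword prediction (Eq. 7): each keyword of a class is
# scored separately and the strongest keyword represents the class.
def predict_with_keywords(mask_logits, keyword_verbalizers):
    # keyword_verbalizers: {class: {keyword: (label_word_ids, sims)}},
    # where each keyword has its own k-NN neighborhood; class_score is
    # defined in the aggregation sketch above.
    best_class, best_score = None, float("-inf")
    for label, keywords in keyword_verbalizers.items():
        score = max(class_score(mask_logits, ids, sims)
                    for ids, sims in keywords.values())
        if score > best_score:
            best_class, best_score = label, score
    return best_class
```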

EXPERIMENT
We conduct extensive zero-shot learning experiments to demonstrate the effectiveness of our method. In this section we present our implementation details together with the main results and address several research questions pertaining to NPPrompt.

DATASETS, PROMPT TEMPLATES, AND EXPERIMENTAL SETUP
We adopt sentiment classification tasks on two datasets, IMDB (Maas et al., 2011) and Amazon (McAuley & Leskovec, 2013), and topic classification tasks on another two datasets, AG News (Zhang et al., 2015) and DBPedia (Lehmann et al., 2015). All datasets are in English. For each task, we directly use the test set to assess model performance, without using validation or training sets for post-tuning or for cherry-picking hand-crafted prompts. The statistics of each dataset are shown in Table 2. To concentrate on the verbalizer and reduce the influence of templates, we adopt multiple fixed manual templates following Hu et al. (2022); we report the best template used for the RoBERTa-large model in Table 3.

We implement our experiments with the open-source toolkit OpenPrompt, which is designed to make prompt learning easy to conduct. We choose RoBERTa-large (Liu et al., 2019b) as our pre-trained language model and report the best classification accuracy across different neighborhood numbers for all experiments. Since we directly use the pre-trained models for testing, there is no randomness (random seed) in this process. All experiments are conducted on Nvidia A6000 GPUs; more details can be found in Appendix A.1.

BASELINES
We evaluate the following baseline methods.

LOTClass Meng et al. (2020) employ pre-trained neural language models with unlabeled data for category understanding, i.e., finding words similar to the label names. They then apply a self-training approach on the entire unlabeled corpus to generalize the model.
GPT-3 with descriptions Following Brown et al. (2020), we manually write descriptions for each class and query GPT-3, where the predicted token serves as the prediction. We show the descriptions in Appendix A.1.

KPT Hu et al. (2022) propose knowledgeable prompt-tuning, which expands the label word space using external knowledge bases (KBs) and refines the expanded label words based on unlabeled data. We show the best results of KPT in the zero-shot setting.

MAIN RESULTS
We present our experimental results in Table 4. Overall, NPPrompt outperforms Null Prompt and Multi-Null Prompt by more than 10 points in the fully zero-shot setting, achieving an accuracy of over 85% on AG News and DBPedia and over 90% on IMDB and Amazon.
We conjecture that topic classifications in AG News and DBPedia are more complicated than binary sentiment classifications in IMDB and Amazon, hence the higher accuracy on the latter.
NPPrompt is only slightly worse than KPT but outperforms most baseline methods, which strictly require human effort, external knowledge, or unlabeled data. It is worth noting that NPPrompt performs much better than ManualVerb, suggesting that the label words generated by our method are more comprehensive and less biased than human-designed ones. Besides, NPPrompt beats GPT-3 by 4% in average accuracy, a strong sign of the potential of RoBERTa-large with 355M parameters compared to the 175B-parameter GPT-3.
Results on GLUE To explore how NPPrompt performs on different kinds of tasks, we also conduct experiments on the GLUE benchmark (Wang et al., 2018), including Multi-Genre Natural Language Inference (MNLI) and other GLUE tasks. As shown in Table 5, NPPrompt outperforms all other methods in the fully zero-shot setting. Auto-L (Gao et al., 2021) and AMuLaP (Wang et al., 2022) rely on labeled examples to construct their verbalizers and are therefore not fully zero-shot.

Weight and similarity functions Both the weight and the similarity function play a critical role in the design of NPPrompt, and we test how NPPrompt performs on AG News under different configurations. The "Default" setting is as stated in Equations 3 and 5. Fixing the similarity function S(emb(v_i), emb(y_j)) = (emb(v_i)/||emb(v_i)||) · (emb(y_j)/||emb(y_j)||), we set w(v_i, y_j) = 1 for the "Same weight" setting and w(v_i, y_j) = S(emb(v_i), emb(y_j)) / Σ_{v_k ∈ M(y_j)} S(emb(v_k), emb(y_j)) for the "Average weight" setting. Besides cosine similarity, the Euclidean distance and the dot product are also common similarity measures for embeddings. Accordingly, we fix the weight w(v_i, y_j) = 1 and choose S(emb(v_i), emb(y_j)) = −||emb(v_i) − emb(y_j)|| for the "Euclidean distance" setting and S(emb(v_i), emb(y_j)) = emb(v_i) · emb(y_j) for the "Dot product" setting. Figure 2 shows that with a fixed similarity function, different weight calculations yield comparable results, whereas with a fixed weight, cosine similarity is the best similarity measure.

Summing logits vs. probabilities NPPrompt sums up all logits for a label word set, as shown in Equation 4. Another possible approach is to sum up the probabilities of the PLM's predictions for the label words and take the argmax over labels as the prediction (a code sketch follows at the end of this subsection):

P(y_j | x_prompt) = Σ_{v_i ∈ M(y_j)} w(v_i, y_j) · P([MASK] = v_i | x_prompt),  ŷ = argmax_{y_j} P(y_j | x_prompt).

We conduct experiments on AG News to compare the two approaches, one that sums up logits ("sum logit") and one that sums up probabilities ("sum prob"). Figure 3 presents the results: "sum logit" performs better at small k, but "sum prob" delivers better results when k exceeds 30. "sum logit" achieves the best result at k = 12 among all experiments.

Neighborhood number The number of label words also impacts the performance of NPPrompt. In Figure 4, we display the performance of different models with varied neighborhood numbers. In general, NPPrompt attains similar test accuracy across different neighborhood numbers. Regardless of the choice of neighborhood number, NPPrompt-RoBERTa-large achieves over 80% accuracy on the topic classification tasks AG News and DBPedia, and over 90% accuracy on the sentiment classification tasks IMDB and Amazon. In real-world applications, we can simply choose a fixed neighborhood number (e.g. 8-10) to achieve decent performance.

Table 6: The zero-shot results of different backbones. We also include the best neighborhood number k as the second column in each category. NPPrompt-RoBERTa-large performs the best on all datasets.
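The following sketch contrasts the two aggregation variants discussed above; the function names are ours, and w is the weight vector from Eq. 5.

```python
# "sum logit" vs. "sum prob": the variants differ only in whether the
# label-word scores are aggregated before or after the softmax over V.
import torch

def sum_logit(mask_logits, ids, w):
    return (w * mask_logits[ids]).sum()         # aggregate raw logits (Eq. 4)

def sum_prob(mask_logits, ids, w):
    probs = torch.softmax(mask_logits, dim=-1)  # P([MASK] = v | x_prompt)
    return (w * probs[ids]).sum()               # aggregate probabilities
```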

HOW DO DIFFERENT PRE-TRAINED MODELS AFFECT NPPROMPT?
NPPrompt depends heavily on the pre-trained language model: the label words found for the categories differ across PLMs, a result of their distinct initial word embeddings and vocabularies. To study the effect of different PLMs, we conduct extra experiments using BERT-base-cased, BERT-large-cased, and RoBERTa-base. The results are shown in Table 6. NPPrompt with RoBERTa-large performs best, which may result from the fact that RoBERTa-large has the largest number of parameters and is pre-trained on the largest corpus. In general, larger models (RoBERTa-large/BERT-large) achieve better performance than base models (RoBERTa-base/BERT-base), as expected, and RoBERTa shows better accuracy than BERT on average.

DISCUSSION
NPPrompt achieves superior results in zero-shot text classification. We attribute the good performance to two aspects. Firstly, compared to fixed words or human-designed label words, using the initial word embedding of the PLM enables us to find cognates of the label words; for example, we obtain {" school", " School", " schools", " SCHOOL", ...} for the SCHOOL category, as shown in Table 7. Secondly, we effectively elicit the potential of pre-trained language models. During pre-training, language models are required to predict masked tokens, and at inference time the prediction for the [MASK] token is not fixed: there is no single correct answer for the context, and multiple words sharing similar meanings can be predicted. Our approach reformulates zero-shot classification as a masked token prediction problem, which is well aligned with the pre-training process.
NPPrompt points out a promising way to deal with dynamic and open zero-shot classification problems, where new classes can emerge and old classes may be removed: efficient PLMs and category names are all we need. Together with the keyword design in Equation 7, NPPrompt also works in special scenarios where label names carry no semantic meaning (e.g. categories with label names {"A", "B", "C"}). This technique can be widely deployed in real-world applications.

CONCLUSION
In this paper, we propose NPPrompt, a novel and effective method for fully zero-shot learning with pre-trained language models. We use the initial word embedding of the PLM to automatically find related words for category names, which enables us to construct verbalizers without manual design or an unlabeled corpus. Experimental results show that NPPrompt outperforms previous zero-shot methods by large margins.

A APPENDIX
A.1 EXPERIMENTAL DETAILS

Table 8 shows all the manual templates of NSP-BERT. Table 9 summarizes the manually designed descriptions of each dataset for Semantic Retrieval. As for GPT-3, we query the OpenAI API and test with the Davinci model; the prompts for GPT-3 are shown in Table 10. We list all templates and label names used by NPPrompt across all experiments in Table 11. We also list the related-word results for the sentiment classification (GOOD/BAD) and NLI (YES/NO) tasks.

A.2 EXPERIMENTS OF NPPROMPT WITH T5 MODEL
NPPrompt also works on text-to-text pre-trained language models (e.g. T5 (Raffel et al., 2020)) with minor modifications. We use T5-base to generate the missing spans at the end of the prompt text, choose the first predicted token as the input to the verbalizer, and follow the nonparametric aggregation steps to decide the category. We report the corresponding results in the appendix tables.

A.3 EXPERIMENTS ON COMMONSENSEQA

For multiple-choice question answering, we append the template "The answer is [MASK]." to each question, e.g. "What do animals do when an enemy is approaching? The answer is [MASK].". We then search for the k nearest neighbors of each target answer with k = 15. Finally, we follow the same process as for text classification to obtain the prediction. The experimental results are listed in Table A.1 (few-shot results from Zelikman et al. (2022)). NPPrompt not only achieves satisfactory results on the CommonsenseQA dataset but even outperforms few-shot GPT-J (Wang, 2021).
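A minimal sketch of this QA conversion, reusing class_score from the method sketches; the helper names and data layout are assumptions.

```python
# Sketch of the CommonsenseQA setup: turn each question into a cloze and
# score every answer choice through its own k-NN neighborhood (k = 15).
def to_cloze(question, mask_token="<mask>"):
    # e.g. "What do animals do when an enemy is approaching? The answer is <mask>."
    return f"{question} The answer is {mask_token}."

def best_choice(mask_logits, choice_verbalizers):
    # choice_verbalizers: {choice: (label_word_ids, sims)} built with k = 15;
    # class_score is defined in the aggregation sketch above.
    return max(choice_verbalizers,
               key=lambda c: class_score(mask_logits, *choice_verbalizers[c]))
```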

A.4 EXTENSION TO MULTI-WORD EXPRESSIONS
Here we extend our method to support multi-word label names such as NATURALPLACE, MEANOFTRANSPORTATION, etc. The main step is obtaining words related to a multi-word label name.
Once we obtain the related words, the remaining nonparametric aggregation step is identical. We consider two scenarios.

The label name is a multi-word phrase and the related words are single words To model the phrase, we use the average contextualized embedding instead of the word embedding for both label names and related single words when computing cosine similarity. As suggested by Su et al. (2021), we whiten the contextualized output of RoBERTa with a linear transformation fitted on the contextualized embeddings of all words in the vocabulary. To obtain the best result, we select the output of layer 6 of RoBERTa. This extension achieves 61% accuracy on the DBpedia dataset using the original multi-word label names (available at https://rdrr.io/cran/textdata/man/dataset_dbpedia.html).
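A small sketch of the whitening step, assuming Su et al. (2021)-style whitening (zero mean, identity covariance) fitted on the vocabulary's contextualized embeddings; the layer choice follows the text, and everything else is an assumption.

```python
# Sketch of embedding whitening: fit mu and W on the contextualized
# embeddings of all vocabulary words, then map every vector through them
# before computing cosine similarity.
import torch

def fit_whitening(embs):                           # embs: |V| x d
    mu = embs.mean(dim=0, keepdim=True)
    cov = torch.cov((embs - mu).T)                 # d x d covariance
    U, S, _ = torch.linalg.svd(cov)
    W = U @ torch.diag(S.clamp(min=1e-8).rsqrt())  # whitening matrix
    return mu, W

def whiten(x, mu, W):
    return (x - mu) @ W                            # then compare with cosine
```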
Both the label name and the related words are phrases Since the search space of a related phrase is exponential in its length, we use another prompt to filter candidate words. The template is "[LABEL NAME] can also be called [MASK] * n.", where n is the length of the candidate. For example, if the label name is MEANOFTRANSPORTATION and n = 2, the template becomes "Mean of transportation can also be called [MASK] [MASK].". We feed it to RoBERTa and keep the top-k candidate phrases from the masked predictions. Since the masked predictions are conditionally independent across masks, we further re-rank the top-k candidate phrases either with the contextualized embedding method mentioned above or with another autoregressive LM; for the latter, we evaluate the perplexity of the template with the [MASK] tokens filled by each candidate phrase. This achieves 71% accuracy on DBpedia when the phrase length is two and the re-ranking is performed by GPT-2 (Radford et al., 2019).
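A sketch of the perplexity-based re-ranking with GPT-2; the candidate phrases are illustrative.

```python
# Re-rank candidate phrases with an autoregressive LM: fill the masks with
# each candidate and keep the phrase whose sentence has the lowest perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def perplexity(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

candidates = ["transport vehicle", "travel machine"]  # illustrative only
template = "Mean of transportation can also be called {}."
best = min(candidates, key=lambda c: perplexity(template.format(c)))
print(best)
```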