The Benefits of Label-Description Training for Zero-Shot Text Classification

Pretrained language models have improved zero-shot text classification by allowing the transfer of semantic knowledge from the training data in order to classify among specific label sets in downstream tasks. We propose a simple way to further improve zero-shot accuracies with minimal effort. We curate small finetuning datasets intended to describe the labels for a task. Unlike typical finetuning data, which has texts annotated with labels, our data simply describes the labels in language, e.g., using a few related terms, dictionary/encyclopedia entries, and short templates. Across a range of topic and sentiment datasets, our method is more accurate than zero-shot by 17-19% absolute. It is also more robust to choices required for zero-shot classification, such as patterns for prompting the model to classify and mappings from labels to tokens in the model's vocabulary. Furthermore, since our data merely describes the labels but does not use input texts, finetuning on it yields a model that performs strongly on multiple text domains for a given label set, even improving over few-shot out-of-domain classification in multiple settings.


Introduction
Pretrained language models (PLMs) (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2020) have produced strong results in zero-shot text classification for a range of topic and sentiment tasks, often using a pattern-verbalizer approach (Schick and Schütze, 2021). With this approach, to classify the restaurant review "Overpriced, salty and overrated!", a pattern like "the restaurant is [MASK]" is appended to the review and verbalizers are chosen for each label (e.g., "good" for positive sentiment and "bad" for negative). The text is classified by using the pretrained masked language modeling (MLM) head to choose the most probable verbalizer for the [MASK] position. Although effective, the approach is sensitive to the choice of specific pattern/verbalizer pairs: subtle changes in the pattern, the verbalizer, or both often have a large impact on performance (van de Kar et al., 2022; Perez et al., 2021).

† Co-senior authors.
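As an illustration, the scoring step of the pattern-verbalizer approach can be sketched as follows; the token ids and logit values are invented toy numbers, not outputs of an actual PLM:

```python
# Minimal sketch of the pattern-verbalizer scoring step: the MLM head
# assigns a score to every vocabulary token at the [MASK] position, and
# we classify by comparing only the verbalizer tokens' scores.
# Token ids and logits below are invented toy values, not from a real PLM.

def classify(mask_logits, verbalizer_ids):
    """Return the label whose verbalizer token has the highest MLM score."""
    best_label, best_score = None, float("-inf")
    for label, token_id in verbalizer_ids.items():
        score = mask_logits[token_id]
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# "Overpriced, salty and overrated! The restaurant is [MASK]."
verbalizers = {"positive": 205, "negative": 512}  # hypothetical token ids
logits = {205: -1.3, 512: 0.7}                    # hypothetical MLM logits
print(classify(logits, verbalizers))              # -> negative
```

In a real system the logits would come from the MLM head's distribution at the [MASK] position; the key point is that only the verbalizer tokens compete.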
To alleviate these issues, we propose a simple alternative approach of training on small curated datasets intended to describe the labels for a task. Unlike typical training datasets, which consist of input texts annotated by hand with labels, our data contains only descriptions of the labels. We refer to this data as LABELDESC data and show a few examples for topic and sentiment classification in Table 1. For topic classification, we include a few terms related to the label (e.g., "finance" for "Business", "racing" for "Sports"), a definition of the label from dictionary.com (e.g., "An athletic activity . . ." for "Sports"), and a sentence from the opening paragraph of the label's Wikipedia article (e.g., "Business is the activity of . . ." for "Business"). For sentiment classification, we simply use related terms that capture the specific sentiment (e.g., "terrible" for "Very Negative") as well as a few hand-crafted templates (e.g., "It was t.", where t is a related term).
Next, we finetune pretrained models using the pattern-verbalizer approach on LABELDESC data and evaluate them for text classification. For topic classification, we use patterns and verbalizers from Schick and Schütze (2022) to train on our LABELDESC examples by finetuning the model as well as the MLM head (see Section 3 for details). We refer to training on LABELDESC data as LABELDESCTRAINING. In experiments, we show that LABELDESCTRAINING consistently improves accuracy (average improvement of 17-19%) over zero-shot classification across multiple topic and sentiment datasets (Table 2). We also show that LABELDESCTRAINING can decrease accuracy variance across patterns compared to zero-shot classification (Table 3), thus being less sensitive to the choice of pattern.
We then conduct additional experiments to reveal the value of LABELDESCTRAINING under various circumstances. To study the impact of verbalizer choice, we experiment with uninformative (randomly initialized) and adversarial (intentionally mismatched) verbalizers (Section 4.2.1). While accuracy drops slightly, both settings are still much more accurate than zero-shot classification with its original verbalizers. That is, LABELDESCTRAINING is able to compensate for knowledge-free or even adversarial verbalizer choice. We also compare to finetuning a randomly initialized classifier head without any patterns or verbalizers, again finding accuracy to be higher than zero-shot (Section 4.2.2). Collectively, our results demonstrate that LABELDESCTRAINING leads to strong performance that is less sensitive than zero-shot classification to pattern/verbalizer choice, while also not requiring a pretrained MLM head.
Since LABELDESC data focuses entirely on the labels without seeking to capture the input text distribution, we would hope that it would exhibit stable performance across datasets with the same labels. So, we compare LABELDESCTRAINING to the approach of training on a small supervised training set from one domain and testing on another (Section 4.2.4). In multiple cases, LABELDESCTRAINING actually attains higher accuracy than few-shot supervised learning tested on out-of-domain test sets, even when hundreds of manually labeled training examples are used (albeit from a different input domain).
In summary, this paper shows several benefits of LABELDESCTRAINING. First, once a practitioner identifies a label set of interest for zero-shot classification, it only requires a few minutes to collect the kind of LABELDESC data shown in Table 1, and training on this data improves over zero-shot by 17-19% absolute. Second, LABELDESCTRAINING leads to greater robustness to pattern/verbalizer choice than zero-shot. Third, LABELDESC data are domain independent with regard to the distribution of the inputs; a single LABELDESC training set can be used for any text classification task as long as it contains the same labels. Our experiments show that this independence from the input distribution leads to stable accuracy across domains, even attaining higher accuracy than out-of-domain few-shot learning in a few cases.

Tasks and LABELDESC Datasets
We evaluate on two types of tasks: topic classification on AGNews, Yahoo Answers, and DBPedia (Zhang et al., 2015) and sentiment classification on the Stanford Sentiment Treebank (SST) (Socher et al., 2013), Yelp Reviews (Zhang et al., 2015), IMDB (Maas et al., 2011), and Amazon Reviews Polarity (Zhang et al., 2015). We consider both binary and 5-way classification for the SST and Yelp datasets (denoted SST-2, SST-5, Yelp-2, and Yelp-5 henceforth) and only binary classification for IMDB and Amazon (denoted IMDB and Amz-2 henceforth). Below we describe how we construct LABELDESC data for each label set. Dataset statistics as well as all LABELDESC data are in Section A.5 in the Appendix.
Topic Classification. Since labels in topic classification represent general concepts, we use both subjective descriptors of the labels (e.g., related terms) and objective sources of information (e.g., dictionary definitions and Wikipedia sentences) when selecting LABELDESC data. In particular, we create LABELDESC examples for the label term itself, three related terms, a selected definition from dictionary.com, and the leading sentence from the label's Wikipedia article. As there are typically multiple dictionary.com definitions for our labels, we select a single definition that best aligns with our understanding of the concept underlying the label. We use the leading Wikipedia sentence because it is typically a brief overview/definition of the concept. Most labels in the Yahoo dataset consist of two keywords (e.g., Society & Culture). For these, we use both label terms, definitions for each, and the leading Wikipedia sentences for each.
We did not tune any of these decisions experimentally, so these choices in defining LABELDESC data are almost certainly suboptimal. This suboptimality is especially likely for the "World" label in the AGNews label set. This label reflects international news, but the dictionary definition and Wikipedia article for the term "World" do not capture that sense of the word. Nonetheless, we did not change our procedure for this label because we wanted our results to reflect a real-world implementation of the idea, complete with its limitations for certain labels.
The LABELDESC instances we are using do not contain exhaustive information. We could easily extend the lists of related terms for each topic or use WordNet or other semantic knowledge resources (Zhang et al., 2019). However, one of the goals of this research is to demonstrate how simple it is to choose LABELDESC examples to improve zero-shot classification in very little time.
Sentiment Classification. We use a slightly different procedure for sentiment classification. For 5-way sentiment, we use the label verbalizer itself and four synonym terms. In addition, we write four simple templates: "It was t.", "A(n) t experience.", "Just t.", and "Overall, it was t.", where t is the label verbalizer or a synonym. For binary sentiment, we remove the neutral instances, combine the two positive labels ("Very Positive" and "Positive") into one, and combine the two negative labels ("Very Negative" and "Negative") into one. This procedure produces a total of 25 examples per label (5 terms + 5 terms × 4 templates) for 5-way sentiment and 50 examples per label for binary sentiment. Since these LABELDESC instances are domain-independent, we use the same data for both 5-way sentiment datasets (Yelp-5 and SST-5) and for all binary sentiment datasets (Yelp-2, SST-2, IMDB, Amz-2).
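The construction above can be sketched as follows; the synonym lists here are illustrative stand-ins, not the exact terms in our LABELDESC data:

```python
# Sketch of assembling sentiment LABELDESC examples from terms and the
# four templates: 25 examples per label for 5-way sentiment, and 50 per
# label for binary sentiment after merging the two positive (or negative)
# labels. The synonym lists are illustrative, not our exact term lists.

TEMPLATES = ["It was {t}.", "A(n) {t} experience.", "Just {t}.", "Overall, it was {t}."]

def build_examples(terms):
    """Each term appears on its own and inside each of the four templates."""
    examples = list(terms)
    for t in terms:
        examples += [tpl.format(t=t) for tpl in TEMPLATES]
    return examples

very_pos = build_examples(["great", "amazing", "wonderful", "fantastic", "excellent"])
pos = build_examples(["good", "nice", "pleasant", "decent", "enjoyable"])

print(len(very_pos))        # 5 terms + 5 * 4 templated examples = 25 (5-way)
print(len(very_pos + pos))  # merged positive label for binary sentiment = 50
```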
Hyperparameter Tuning. We adhere to the "true" zero-shot setting where hyperparameters cannot be tuned on a development set for the task of interest (Schick and Schütze, 2022). Therefore, we use a separate dataset for hyperparameter tuning: 20 Newsgroups (20NG, henceforth) (Lang, 1995), a topic classification dataset with twenty labels. We select only four labels from 20NG for our purposes: talk.religion.misc, rec.autos, sci.med, and talk.politics.guns. We chose these four labels because they are sufficiently distinct that we expect tuning to be informative for other real-world classification datasets; many of the other 20NG labels are highly technical or similar to one another, e.g., the pair comp.sys.ibm.pc.hardware and comp.sys.mac.hardware as well as the pair comp.os.ms-windows.misc and comp.windows.x. We follow the same strategy as for topic classification above when constructing LABELDESC data for 20NG. The selected hyperparameters are used for both topic and sentiment classification.

Experimental Settings
The following settings are used in our experiments. Unless stated otherwise, we use the pretrained RoBERTa-base (b) and RoBERTa-large (l) models (Liu et al., 2019) for all experiments since RoBERTa is the predominant choice in related zero-shot and dataless research (Schick and Schütze, 2021; van de Kar et al., 2022; Gera et al., 2022). Additionally, for every dataset, we use the entire available test set for evaluation.
Zero-shot Classification Baseline. We use the standard "pattern-verbalizer" approach for topic and sentiment classification. The set of verbalizers used can be found in Table 10 in the Appendix. For choosing verbalizers, we follow the choices of Schick and Schütze (2021) for AGNews, Yahoo, Yelp-5, and SST-5. We follow van de Kar et al. (2022) in choosing verbalizers for Yelp-2, SST-2, IMDB, and Amz-2, and we select verbalizers for DBPedia and 20NG ourselves.
Each pattern comprises a prompt including a [MASK] symbol placed before or after the text input, and we aim to predict the masked token. For example, a prompt is added after the input x to frame classification as a question answering task, e.g., "x Question: What is the topic of this newsgroup? Answer: [MASK]." We use RoBERTa-base/large with its MLM head for zero-shot experiments. Although the model is able to predict any token within its vocabulary, we choose only among the set of verbalizers, which are designed to be semantically coherent with the class labels and to be tokenized into a single token by the model's tokenizer.

[Figure caption: We present text inputs labeled as "Sports" from the topic classification task, and use one of our patterns (see Table 11) as an illustration. Note that all our LABELDESC datasets are balanced, with each pattern being associated with a unique finetuned model checkpoint.]
For topic classification tasks, we use the PROMPT and Q&A patterns from Schick and Schütze (2022), which amounts to 14 patterns. For AGNews, we use "news/article" in the pattern templates, while for Yahoo we replace this with "question", and for 20NG we use "newsgroup". For the sentiment classification tasks, we create new Q&A patterns such as "x Question: What is the sentiment of this text? Answer: [MASK]." and PROMPT patterns such as "x Sentiment: [MASK].", where x is the input text. There are 14 sentiment patterns in total, presented in the Appendix (Section A.2).
LABELDESCTRAINING. We use the same settings as the zero-shot baseline except that we finetune the models on LABELDESC data. We do not use any target task data for tuning or early stopping. Instead, we fix hyperparameter values, including the number of training steps, by tuning on 20NG following the process described below.
We used LABELDESC data for the four selected 20NG labels as our training data and the original 20NG data (training and test sets) as our dev set, restricted to the four selected labels shown in Section 2. We preprocessed the data by removing headers, quotes, and footers. We used a batch size of 1 and tuned over a set of five learning rates ({5e-7, 1e-6, 5e-6, 1e-5, 5e-5}). Models were trained for 3500 training steps, evaluating on the dev set after each epoch, i.e., every 24 training steps, since 24 is the size of the LABELDESC dataset for 20NG. Based on tuning accuracies, we chose learning rate 5e-7 and number of training steps 2160 for RoBERTa-base and 1920 for RoBERTa-large. Additionally, we explored variations of parameter freezing, such as freezing certain layers of RoBERTa. The best setting on 20NG was to freeze the lower half of the layers (excluding the embedding layer) during finetuning, so we used this for the experiments reported below.

Results and Analysis
In this section we first present the results obtained via LABELDESCTRAINING and then analyze the benefits of LABELDESC data with a range of additional experiments and analysis. Averaged across datasets, LABELDESCTRAINING improves over zero-shot classification by 17% absolute with RoBERTa-base and 19% with RoBERTa-large (Table 2). The results demonstrate that we can greatly improve the performance of zero-shot models with just a few training examples that provide a richer characterization of the label but still without requiring any textual inputs from the task datasets.

Results
Table 3 shows that accuracy variances across patterns using LABELDESCTRAINING are much lower than in the zero-shot setting, which is known to be unstable (Perez et al., 2021). Finetuning on LABELDESC data not only improves accuracy, but also mitigates sensitivity to pattern selection.
Comparisons to the State of the Art. We compare to state-of-the-art (SOTA) results from the literature in Table 4 (we show results using RoBERTa-base to better compare to other methods). For this comparison, we use only a single pattern with LABELDESCTRAINING, since doing so reflects more of a real-world use case than averaging over 14 patterns. We choose a single pattern for each of RoBERTa-base and RoBERTa-large by tuning on 20NG as we did for other hyperparameters (see Section A.3 and Table 14 in the Appendix for details; we use the same setting for Table 5). We use three random seeds and report average accuracies and standard deviations over seeds. Chu et al. (2021a) and Chu et al. (2021b) are dataless classification approaches (Chang et al., 2008) that include single-encoder and dual-encoder methods; the latter include the idea of embedding documents and labels and performing classification via semantic retrieval; we report their non-ensemble results in Table 4. Schick and Schütze (2022) use labeled training data (10 or 100 examples, see Table 4) for each task, which differs from our domain-independent LABELDESC examples, which are agnostic to the domain of the textual inputs. From van de Kar et al. (2022), we include the highest accuracies.
The results of LABELDESCTRAINING are comparable to those of other methods across datasets. For sentiment classification, LABELDESCTRAINING performs better than dataless classification (Chu et al., 2021a) by a large margin for all datasets and is competitive with van de Kar et al. (2022) and Schick and Schütze (2021). Our method is better than that of van de Kar et al. on the topic datasets (AGNews, Yahoo, and DBPedia) but not the sentiment datasets, except for SST-2. van de Kar et al. (2022) search for naturally occurring data in large corpora; texts expressing sentiment are well-represented in corpora, while texts for topics in a fixed label set may be rarer. LABELDESCTRAINING trains on balanced data from a fixed label set, leveraging available knowledge resources to inform about topics.
Although van de Kar et al. (2022) do not report 5-way classification results for Yelp or SST, we report results for both datasets (including base and large models) so that future work can compare to our results in this table. We recommend tuning zero-shot and few-shot methods on datasets that are excluded from the final comparison, like 20NG in this paper.
Comparisons Involving GPT-3.5.Our method not only works for MLM-style models like RoBERTa, but also for autoregressive models.
In Table 5, we show zero-shot and in-context learning (ICL) results with text-davinci-003 (GPT-3.5; OpenAI, 2022), where we use the entire LABELDESC data for the task as ICL demonstrations. Due to our restricted budget, we decided to use only 1,000 test instances for each test dataset in the GPT-3.5 experiments, while ensuring that the label distribution remains consistent with that of the full test dataset. It is well known that ICL is sensitive to a variety of design choices, including the order of the demonstrations (Fei et al., 2023; Lu et al., 2022). For ICL demonstrations, we included all LABELDESC data for a task to make predictions for each test instance. To avoid "recency bias" (i.e., the tendency to predict labels that occur towards the end of the prompt; Zhao et al., 2021a), we randomly shuffle the order of demonstrations. We left other parameters untouched. GPT-3.5 with ICL using LABELDESC data outperforms zero-shot GPT-3.5 on all datasets, showing the value of LABELDESC data even if in-domain inputs are unavailable. In comparison to the GPT-3.5 variants, LABELDESCTRAINING (RoBERTa-large) performs better on AGNews, DBPedia, Yelp-2, SST-5, and IMDB, and is competitive across the other datasets.
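A minimal sketch of how LABELDESC demonstrations might be assembled into a shuffled ICL prompt is shown below; the "Text:/Label:" format is an assumption for illustration, not the exact prompt used with text-davinci-003:

```python
# Sketch of building an ICL prompt from LABELDESC demonstrations with a
# shuffled demonstration order to counteract recency bias. The
# "Text:/Label:" format is an assumed illustration, not the exact prompt
# format used in our GPT-3.5 experiments.
import random

def build_icl_prompt(demonstrations, test_text, seed=0):
    """demonstrations: list of (labeldesc_example, label) pairs."""
    demos = list(demonstrations)
    random.Random(seed).shuffle(demos)  # randomize demonstration order
    lines = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    lines.append(f"Text: {test_text}\nLabel:")  # model completes the label
    return "\n\n".join(lines)

demos = [("terrible", "negative"), ("It was great.", "positive")]
prompt = build_icl_prompt(demos, "Overpriced, salty and overrated!")
print(prompt.endswith("Label:"))  # -> True
```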

Analysis and Discussion
One of the primary requirements of the zero-shot approach is the availability of pattern-verbalizer pairs (Schick and Schütze, 2021, 2022). Here, we study several variations of LABELDESCTRAINING to investigate whether we can simplify or remove components of these pattern-verbalizer pairs. We first experiment with changing verbalizers to gauge the impact of verbalizer choice for LABELDESCTRAINING (Section 4.2.1). Next, we conduct classification experiments that do not use patterns or verbalizers at all (Section 4.2.2). Furthermore, we include one more baseline, a model finetuned on the 20NG LABELDESC data and patterns, to analyze generalizability (Section 4.2.3). We also report additional experiments in which we measure the multi-domain robustness of LABELDESCTRAINING compared to the standard procedure of training on one domain and testing on an out-of-domain test set (Section 4.2.4). Finally, we take a closer look at label-wise performance to better understand how LABELDESCTRAINING outperforms zero-shot classification (Section 4.2.5).

Impact of Verbalizers
In this section we report experiments with LABELDESCTRAINING without meaningful verbalizers and even with adversarially chosen verbalizers. We explore two different verbalizer settings:
• RANDOM: We add c new words, i.e., RANDOM1, RANDOM2, . . ., RANDOMc, where c is the number of dataset labels, to the model's vocabulary and randomly initialize their embeddings. This setting prevents the use of any prior knowledge in the verbalizer embeddings.
• MISMATCHED: The original verbalizers are intentionally mismatched with the labels, i.e., each label is paired with a verbalizer chosen for a different label, making the verbalizer choice adversarial.
The results are shown in Table 6. Since we still use the MLM head for these results, we refer to them as "MLM, RANDOM" and "MLM, MISMATCHED". While LABELDESCTRAINING performs better than RANDOM, and RANDOM is better than MISMATCHED, both are better than zero-shot on average. These results suggest that LABELDESC data can partially compensate when the quality of the verbalizers is unknown or poor, at least to improve over zero-shot.
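The RANDOM setting can be sketched as follows; the embedding table is a toy list of vectors rather than RoBERTa's embedding matrix:

```python
# Sketch of the RANDOM verbalizer setting: c placeholder tokens
# (RANDOM1 ... RANDOMc) are appended to the vocabulary with randomly
# initialized embeddings, so the verbalizers carry no prior semantic
# knowledge. The embedding table is a toy list, not RoBERTa's matrix.
import random

def add_random_verbalizers(embeddings, c, dim, seed=0):
    """Append c randomly initialized rows; return the new token ids."""
    rng = random.Random(seed)
    first_new_id = len(embeddings)
    for _ in range(c):
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return list(range(first_new_id, first_new_id + c))

vocab = [[0.0] * 8 for _ in range(100)]  # toy 100-token embedding table
new_ids = add_random_verbalizers(vocab, c=4, dim=8)
print(new_ids)     # -> [100, 101, 102, 103]
print(len(vocab))  # -> 104
```

The new ids then play the role of verbalizer token ids during finetuning, and their embeddings are learned from LABELDESC data alone.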

Classifiers Without Patterns or Verbalizers
Since finetuning on LABELDESC data outperforms zero-shot results with RANDOM verbalizers, we also evaluate its performance without patterns, i.e., using a standard randomly initialized softmax classifier. The input is the original text without any patterns, and we use a two-layer classification head on top of the [CLS] token representation of the pretrained models.
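A minimal sketch of such a two-layer head is shown below; dimensions, weights, and the tanh nonlinearity are toy assumptions, and in practice the head sits on top of RoBERTa's [CLS] representation and is finetuned on LABELDESC data:

```python
# Minimal sketch of the pattern-free classifier: a randomly initialized
# two-layer head over the [CLS] vector, producing one score per label.
# Dimensions, weights, and the tanh activation are toy assumptions.
import math
import random

def two_layer_head(cls_vec, w1, w2):
    """Hidden layer with tanh, then a linear output layer (pre-softmax)."""
    hidden = [math.tanh(sum(x * w for x, w in zip(cls_vec, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

rng = random.Random(0)
dim, hid, num_labels = 8, 4, 3
w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hid)]
w2 = [[rng.gauss(0, 0.1) for _ in range(hid)] for _ in range(num_labels)]
scores = two_layer_head([0.5] * dim, w1, w2)
print(len(scores))  # one score per label
```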
The bottom two rows of Table 6 show the results. The classifier accuracies are close to those of the MLM/RANDOM setting and still much higher than zero-shot on average, suggesting that it is not necessary to use patterns, verbalizers, or even the pretrained MLM head in order to outperform zero-shot classifiers. If it is difficult to select verbalizers or design patterns for a particular classification task, a classifier finetuned on a small LABELDESC dataset may serve as a strong alternative to the pattern-verbalizer approach.

Cross-Task Generalizability
We report results for the model finetuned on the 20NG LABELDESC data and patterns, i.e., LABELDESCTRAINING on 20NG (LDT 20NG), in Table 6. While the patterns for the reported datasets are different from those used for 20NG, especially for the sentiment datasets, they have similar structures (see Section A.2). For RoBERTa-base, LDT 20NG often outperforms zero-shot results, except on AGNews and Yelp-5. For RoBERTa-large, while LDT 20NG outperforms the zero-shot results on all topic classification datasets, it is worse on sentiment classification except for SST-5.

Multi-Domain Evaluation
Since LABELDESC examples are domain-independent, they can be used for multiple datasets that share the same labels. To assess the multi-domain performance of LABELDESCTRAINING, we compare it to supervised few-shot learning in which a model is trained on data from one domain and then evaluated on a different domain with the same label set (e.g., training on SST-5 and evaluating on Yelp-5). To create multi-domain test sets for a single topic label set, we keep AGNews as it is and create a new subsampled version of Yahoo as follows: (1) "Politics & Government" and "Society & Culture" texts are assigned the label "World", (2) "Sports" texts are labeled "Sports", (3) "Business & Finance" texts are labeled "Business", and (4) "Science & Mathematics" and "Computers & Internet" texts are labeled "Sci/Tech". Other Yahoo texts are removed. We refer to this new version of the Yahoo dataset as Yahoo AG. For sentiment classification, we choose two dataset pairs that share label sets, i.e., SST-5 and Yelp-5.
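The Yahoo AG construction above amounts to a simple label mapping, sketched below; the example texts are invented:

```python
# Sketch of the Yahoo_AG construction: Yahoo labels are mapped onto the
# AGNews label set and texts with unmapped labels are dropped.
# The example texts are invented.
YAHOO_TO_AG = {
    "Politics & Government": "World",
    "Society & Culture": "World",
    "Sports": "Sports",
    "Business & Finance": "Business",
    "Science & Mathematics": "Sci/Tech",
    "Computers & Internet": "Sci/Tech",
}

def to_yahoo_ag(examples):
    """examples: list of (text, yahoo_label); keep only mappable labels."""
    return [(text, YAHOO_TO_AG[label])
            for text, label in examples if label in YAHOO_TO_AG]

data = [("Who won the match last night?", "Sports"),
        ("How do I lower my blood pressure?", "Health")]  # Health is dropped
print(to_yahoo_ag(data))  # -> [('Who won the match last night?', 'Sports')]
```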
We do not change anything about the LABELDESCTRAINING configuration for these experiments. We simply evaluate the same model on multiple test sets, reporting average accuracies over patterns.
For the few-shot setup, we create datasets with 10, 100, and 500 training examples per label. For in-domain experiments, train, dev, and test sets are drawn from the same domain/dataset, whereas for out-of-domain experiments, train and dev sets are drawn from one domain and the test set is drawn from another. We tune learning rates over the same ranges as mentioned earlier and use batch sizes 1, 2, and 4 for 10, 100, and 500 examples per label, respectively. We train for 15 epochs and select the checkpoint from the best epoch as selected by the dev set.
The results using RoBERTa-large are shown in Figure 2. For brevity, we only show a subset of results; Section A.4 in the Appendix shows additional results. As we would expect, testing on out-of-domain data leads to accuracy drops, but adding more out-of-domain training data reduces this gap. LABELDESCTRAINING, shown as an orange dotted line, outperforms supervised few-shot learning in some cases, such as training on AGNews and testing on Yahoo AG, even with 500 examples per label (upper-right plot in Figure 2). We see the same trend when the supervised model is trained on Yelp-5 and tested on SST-5 (lower-right plot in Figure 2). In 3 out of 4 cases, LABELDESCTRAINING outperforms supervised few-shot out-of-domain learning with 10 examples per label, and it outperforms 100 examples per label in 2 out of 4 cases.

Label-wise Investigation
To better understand why LABELDESCTRAINING outperforms zero-shot, we report label-specific F1 scores in Tables 8 and 9. For AGNews, the zero-shot classifiers have low F1 scores for the World label, probably because the verbalizer "World" is much less coherent and less representative of the actual label than verbalizers like "Sports". LABELDESCTRAINING improves F1 on the World label by roughly 20 points, while the improvement for Sports is only about 4 points. Likewise, the F1 scores for "Very Negative", "Very Positive", and "Neutral" are very low for the zero-shot models on SST-5, indicating that those labels are being largely ignored. Again, LABELDESCTRAINING shows large improvements in F1 for some of these labels, especially "Very Positive". These trends are likely due in part to differences in verbalizer probabilities, e.g., "good" and "bad" occur more frequently than "great" and "terrible". The LABELDESC data is balanced, which helps to mitigate the ignoring of labels, even though the task test sets are not all balanced. Table 7 shows examples that are incorrectly classified by zero-shot models but are correctly classified by the LABELDESCTRAINING models.
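For reference, the per-label F1 used in this analysis can be computed as follows; the gold labels and predictions below are invented toy values:

```python
# Sketch of the per-label F1 computation behind the label-wise analysis;
# the gold labels and predictions are invented toy values.

def per_label_f1(gold, pred, label):
    """Standard F1 for a single label, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["World", "Sports", "World", "Business"]
pred = ["Sports", "Sports", "World", "Business"]  # one "World" is missed
print(round(per_label_f1(gold, pred, "World"), 3))  # -> 0.667
```

A label that the classifier largely ignores has many false negatives and thus low recall, which drags its F1 down even when overall accuracy looks reasonable.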

Related Work
One common approach in zero-shot text classification is to transfer knowledge from seen labels (Dauphin et al., 2014), which requires observed labels and a notion of label similarity. Sources of semantic knowledge used for this purpose include multiple modalities (Lampert et al., 2009). Autoregressive language models have also been used for zero-shot text classification; we report zero-shot and ICL results with LABELDESC data using GPT-3.5 (OpenAI, 2022). Zhao et al. (2021b) found it beneficial to "calibrate" such models for this setting; this idea is not immediately applicable here due to our use of encoder-only models like RoBERTa, but calibration could be extended to encoder-only models, which we plan to explore in future work. Our work is closely related to dataless classification (Chang et al., 2008), which involves building classifiers by designing or learning a generic function that scores the compatibility of a document and a label defined in natural language. We compared empirically to the dataless classification approaches of Chu et al. (2021a) and Chu et al. (2021b), who used pretrained models, naturally annotated data like that from Wikipedia categories, and unsupervised clustering techniques. There is a wealth of prior work in semi-supervised text classification (Nigam et al., 2000; Xie et al., 2020; Howard and Ruder, 2018). There is also related work on generating label names (Schick et al., 2020) or label descriptions (Chai et al., 2020; Sun et al., 2019), but for supervised text classification.

Conclusions
We presented LABELDESCTRAINING, a method for improving the accuracy of zero-shot classification by using small, curated datasets that simply describe the labels for a task in natural language. Our method is 17-19% more accurate than zero-shot on average across a range of datasets. LABELDESCTRAINING is also more robust than zero-shot classification to choices such as patterns and verbalizers. Furthermore, LABELDESC data is domain agnostic and can therefore be used for any text classification task with the same set of labels. LABELDESCTRAINING can even outperform a supervised approach that uses training data from a different domain. One future direction would be to apply the idea to structured prediction, NLI, and natural language generation tasks. Another would be to investigate ways to reduce the dependence of pretrained models on patterns and verbalizers, such as directly calibrating the marginal probabilities of verbalizers with the goal of minimizing biases of pretrained models.

Limitations
We focus on a simple approach of curating small finetuning datasets that describe the labels for text classification tasks. Although this is beneficial when the task is specific, especially when data is difficult to obtain, the data curation process is intrinsically intuitive and relies on the practitioner's understanding of the labels and the usage situation. Moreover, since a pretrained model is necessary for this approach, a few curated examples may mitigate, but cannot detect or eliminate, potential biases of the pretrained model. If the labels of a certain classification task are dissimilar from the examples the model was trained on, and the model lacks the knowledge to differentiate among them, performance may be unsatisfactory even after finetuning on a few examples of label descriptions.

Ethics Statement
We use pretrained models for text classification and curate data with the assistance of sources such as Wikipedia and dictionary definitions. Large pretrained models are trained on massive amounts of data and have been shown to have issues with bias; however, this is a common challenge when working with pretrained models and would benefit from advances made by the community on this front. While both dictionary.com definitions and Wikipedia aim to provide accurate and neutral information for a word/concept, they can be affected by the biases and limitations of their editors, especially Wikipedia, which is an openly editable encyclopedia. Our method is not reliant on specific dictionaries or encyclopedias; others could be used. We chose these resources for simplicity as they are highly accessible and widely used. Since our LABELDESC data is very small in size, we manually examined the data as we selected it for any potential biases or other issues. Finally, we use standard topic and sentiment datasets for evaluation, which are used in a great deal of prior work.

A.2.1 Topic Classification
We use the patterns shown in Table 11 for AGNews and DBPedia, and replace "news/article" with "question" for Yahoo, following Schick and Schütze (2022). We use "newsgroup" instead of "question" for 20NG.

A.3 Hyperparameters and Best Pattern
We selected a training batch size of 1 for our experiments on LABELDESC data. After finetuning on 20NG, the hyperparameters are selected as shown in Table 13. With the selected hyperparameters, we further examine the dev accuracy on 20NG for all prompt patterns and select the tuned pattern that has the highest dev accuracy. The tuned patterns are listed in Table 14.

Table 11 (excerpt): Patterns for AGNews, where x refers to the given text.
x Theme: [MASK].
5 x Category: [MASK]
6 x Class: [MASK]
7 x Topic: [MASK]
8 x Theme: [MASK]
9 [MASK] News: x
10 [MASK] NEWS: x
In our experience, this method works well when adapting to other datasets. However, we also observe fluctuations in the dev accuracy curve for 20NG during training, and we select the number of training steps from the middle of the flatter part of the curve rather than the peak, for robustness. We suggest changing the number of training steps or increasing the batch size if this method does not work well.
The tuned pattern is not necessarily the best pattern on other datasets; its accuracy is sometimes even a little lower than the average over all 14 patterns.

A.4 Domain Transfer
All results on RoBERTa-base/large are shown in Figure 3.

A.5 LABELDESC Data
The statistics of the LABELDESC data are shown in Table 15. We use the same set of LABELDESC data for AGNews and YahooAG, for Yelp-5 and SST-5, and for Yelp-2 and SST-2, respectively. The data is listed in Table 16-Table 21. Each term/sentence separated by "|" in the tables is an independent LABELDESC example during training. For brevity, we list all hand-crafted templates instead of listing all data for sentiment classification.
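Concretely, the "|"-separated rows can be expanded into individual training examples; a minimal sketch (the row contents below are from the Animal/Plant rows, and the function name is illustrative):

```python
# Sketch: expand "|"-separated LABELDESC rows into (text, label) training
# examples, one example per term/sentence.

def expand_labeldesc(rows):
    """rows: {label: "term1 | term2 | ..."} -> list of (text, label)."""
    examples = []
    for label, row in rows.items():
        for text in row.split("|"):
            examples.append((text.strip(), label))
    return examples

data = expand_labeldesc({
    "Animal": "animal | insect | bird | fish",
    "Plant": "plant | flower | tree | grass",
})
# Yields a balanced set: four examples per label.
```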

A.6 Dataset Preprocessing
For 20NG, we remove headers, quotes, and footers.
For AGNews, we concatenate the headlines and the text body of the news articles. For the Yahoo dataset, we concatenate the title, the question, and the top answer. For the IMDB and Amazon Reviews Polarity datasets, we concatenate the title and the content.
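A sketch of these preprocessing steps (the function and field names are illustrative; for 20NG, the header/quote/footer removal corresponds to, e.g., scikit-learn's `fetch_20newsgroups(remove=("headers", "footers", "quotes"))` option):

```python
# Sketch of the dataset preprocessing described above. Function and field
# names are illustrative, not from the paper's codebase.

def preprocess_agnews(headline, body):
    # Concatenate the headline and the text body of the article.
    return f"{headline} {body}".strip()

def preprocess_yahoo(title, question, top_answer):
    # Concatenate the title, the question, and the top answer.
    return " ".join(part for part in (title, question, top_answer) if part)

def preprocess_review(title, content):
    # IMDB / Amazon Reviews Polarity: concatenate title and content.
    return " ".join(part for part in (title, content) if part)
```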

A.7 Label-wise Metrics
We list label-wise precision, recall, and F1 scores for part of our datasets in Table 22-Table 29.

Educational Institution
Wiki. An educational institution is a place where people of different ages gain an education, including preschools, childcare, primary-elementary schools, secondary-high schools, and universities.
dict. an institution for instruction in a particular skill or field.

Artist
terms: artist | writer | actor | singer
Wiki. An artist is a person engaged in an activity related to creating art, practicing the arts, or demonstrating an art.
dict. a person who produces works in any of the arts that are primarily subject to aesthetic criteria.

Building
Wiki. A building, or edifice, is an enclosed structure with a roof and walls standing more or less permanently in one place, such as a house or factory (although there are also portable buildings).
dict. a relatively permanent enclosed construction over a plot of land, having a roof and usually windows and often more than one level, used for any of a wide variety of activities, as living, entertaining, or manufacturing.

Village
Wiki. A village is a clustered human settlement or community, larger than a hamlet but smaller than a town (although the word is often used to describe both hamlets and smaller towns), with a population typically ranging from a few hundred to a few thousand.
dict. a small community or group of houses in a rural area, larger than a hamlet and usually smaller than a town, and sometimes (as in parts of the U.S.) incorporated as a municipality.

Animal
terms: animal | insect | bird | fish
Wiki. Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia.
dict. any member of the kingdom Animalia, comprising multicellular organisms that have a well-defined shape and usually limited growth, can move voluntarily, actively acquire food and digest it internally, and have sensory and nervous systems that allow them to respond rapidly to stimuli: some classification schemes also include protozoa and certain other single-celled eukaryotes that have motility and animallike nutritional modes.

Plant
terms: plant | flower | tree | grass
Wiki. Plants are predominantly photosynthetic eukaryotes, forming the kingdom Plantae.
dict. Botany. any member of the kingdom Plantae, comprising multicellular organisms that typically produce their own food from inorganic matter by the process of photosynthesis and that have more or less rigid cell walls containing cellulose, including vascular plants, mosses, liverworts, and hornworts: some classification schemes may include fungi, algae, bacteria, and certain single-celled eukaryotes that have plantlike qualities, as rigid cell walls or the use of photosynthesis.

Album
terms: album | soundtrack | mixtape | CD
Wiki. An album is a collection of audio recordings issued on compact disc (CD), vinyl, audio tape, or another medium such as digital distribution.
dict. a record or set of records containing several musical selections, a complete play or opera, etc.

Film
Wiki. A film, also called a movie, motion picture, moving picture, picture, photoplay or (slang) flick, is a work of visual art that simulates experiences and otherwise communicates ideas, stories, perceptions, feelings, beauty, or atmosphere through the use of moving images.
dict. a sequence of consecutive still images recorded in a series to be viewed on a screen in such rapid succession as to give the illusion of natural movement; motion picture.

Book
Wiki. A book is a medium for recording information in the form of writing or images, typically composed of many pages (made of papyrus, parchment, vellum, or paper) bound together and protected by a cover.
dict. a handwritten or printed work of fiction or nonfiction, usually on sheets of paper fastened or bound together within covers.
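The label-wise metrics reported in Section A.7 can be computed from gold and predicted labels; a minimal sketch:

```python
from collections import Counter

# Sketch: per-label precision, recall, and F1 from parallel lists of
# gold and predicted labels.

def label_wise_scores(gold, pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but gold was g
            fn[g] += 1  # missed an instance of g
    scores = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    return scores
```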

"Figure 1 :
Figure1: Overview of our proposed method, including the construction of LABELDESC data, the format of the text input, and the target used for both model finetuning and inference during test time.We present text inputs labeled as "Sports" from the topic classification task, and use one of our patterns (see Table11) here as an illustration.Note that all our LABELDESC datasets are balanced, with each pattern being associated with a unique finetuned model checkpoint.

Figure 2: Domain transfer results, where the X-axis shows the number of training examples per label.

Table 3: Standard deviations of test accuracy (%) across 14 patterns for each test dataset. For LABELDESCTRAINING (LDT in the table), three random seeds were used, so we show three standard deviations, one per random seed. All standard deviations over patterns are smaller for LDT than the corresponding values for zero-shot.

• MISMATCHED: We shuffle the original label-verbalizer mapping.

Table 6: Test accuracies (%) for several variations of LABELDESCTRAINING. The standard deviations are computed over 14 patterns for zero-shot; 3 random seeds for the classifier (no patterns); and both 14 patterns and 3 random seeds for LABELDESCTRAINING on 20NG, LABELDESCTRAINING, RANDOM, and MISMATCHED (LDT20NG, LDT, MLMr, and MLMm in the table).

Table 7: AGNews/SST-5 data that are correctly classified with LABELDESCTRAINING but not in zero-shot settings.

Table 10: Verbalizers selected for each dataset.

Table 12: Patterns for sentiment classification, where x refers to the given text.

Table 13: Hyperparameters (learning rate, training steps) selected by tuning on 20NG with RoBERTa.

Table 14: Tuned pattern and pattern id for each model.

Table 15: Statistics of the datasets we used, with '#' denoting the number of labels; LD refers to LABELDESC data.

Table 18:
It was t. | A(n) t experience. | Just t. | Overall, it was t.
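These templates can be expanded by substituting each label's sentiment terms for t; a minimal sketch, written with a `{t}` placeholder in place of the table's t (the term lists below are illustrative, not the paper's verbalizer sets):

```python
# Sketch: expand hand-crafted sentiment templates by filling the slot t
# with each label's terms, yielding one LABELDESC example per combination.

TEMPLATES = [
    "It was {t}.",
    "A(n) {t} experience.",
    "Just {t}.",
    "Overall, it was {t}.",
]

def expand_templates(terms_by_label):
    """terms_by_label: {label: [term, ...]} -> list of (text, label)."""
    return [
        (tpl.format(t=term), label)
        for label, terms in terms_by_label.items()
        for term in terms
        for tpl in TEMPLATES
    ]

examples = expand_templates({
    "positive": ["great", "good"],
    "negative": ["terrible", "bad"],
})
# 4 templates x 2 terms x 2 labels = 16 balanced examples
```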

Table 19: LABELDESC data for Yelp-2, SST-2, Amz-2 and IMDB.

Society & Culture
Wiki. a group of individuals involved in persistent social interaction, or a large social group sharing the same spatial or social territory, typically subject to the same political authority and dominant cultural expectations. | Culture is an umbrella term which encompasses the social behavior, institutions, and norms found in human societies, as well as the knowledge, beliefs, arts, laws, customs, capabilities, and habits of the individuals in these groups.

Science & Mathematics
Wiki. Science is a systematic endeavor that builds and organizes knowledge in the form of testable explanations and predictions about the universe. | Mathematics is an area of knowledge that includes such topics as numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes.
dict. a branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws | the systematic treatment of magnitude, relationships between figures and forms, and relations between quantities expressed symbolically.

Education & Reference
Wiki. Education is a purposeful activity directed at achieving certain aims, such as transmitting knowledge or fostering skills and character traits. | Reference is a relationship between objects in which one object designates, or acts as a means by which to connect to or link to, another object.
dict. the act or process of imparting or acquiring general knowledge, developing the powers of reasoning and judgment, and generally of preparing oneself or others intellectually for mature life. | a book or other source of useful facts or information, such as an encyclopedia, dictionary, etc.

Computers & Internet
Wiki. a digital electronic machine that can be programmed to carry out sequences of arithmetic or logical operations (computation) automatically. | The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices.
dict. a programmable electronic device designed to accept data, perform prescribed mathematical and logical operations at high speed, and display the results of these operations. Mainframes, desktop and laptop computers, tablets, and smartphones are some of the different types of computers. | Usually the internet (except when used before a noun). a vast computer network linking smaller computer networks worldwide. The internet includes commercial, educational, governmental, and other networks, all of which use the same set of communications protocols.

Business & Finance
Wiki. Business is the activity of making one's living or making money by producing or buying and selling products (such as goods and services). | Finance is the study and discipline of money, currency and capital assets.
dict. the purchase and sale of goods in an attempt to make a profit. | the management of revenues; the conduct or transaction of money matters generally, especially those affecting the public, as in the fields of banking and investment.

Entertainment & Music
Wiki. Entertainment is a form of activity that holds the attention and interest of an audience or gives pleasure and delight. | Music is generally defined as the art of arranging sound to create some combination of form, harmony, melody, rhythm or otherwise expressive content.
dict. the act of entertaining; agreeable occupation for the mind; diversion; amusement | an art of sound in time that expresses ideas and emotions in significant forms through the elements of rhythm, melody, harmony, and color.

Table 20: LABELDESC data for Yahoo Answers.

Company
Wiki. abbreviated as co., is a legal entity representing an association of people, whether natural, legal or a mixture of both, with a specific objective.
dict. a number of persons united or incorporated for joint action, especially for business

Athlete
terms: athlete | sports | footballer | weightlifter
Wiki. An athlete (also sportsman or sportswoman) is a person who competes in one or more sports that involve physical strength, speed, or endurance.
dict. a person trained or gifted in exercises or contests involving physical agility, stamina, or strength; a participant in a sport, exercise, or game requiring physical skill.

OfficeHolder
Wiki. Someone who's been appointed to a position by a company or organisation but doesn't have a contract or receive regular payment may be an office-holder.
dict. a person filling a governmental position; public official.

MeanOfTransportation
Wiki. Transport (in British English), or transportation (in American English), is the intentional movement of humans, animals, and goods from one location to another.
dict. a means of transporting or conveying, as a truck or bus.

NaturalPlace
Wiki. The natural environment or natural world encompasses all living and non-living things occurring naturally, meaning in this case not artificial.
dict. existing in or formed by nature (opposed to artificial)