A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

Extremely Weakly Supervised Text Classification (XWS-TC) refers to text classification based on minimal high-level human guidance, such as a few label-indicative seed words or classification instructions. There are two mainstream approaches for XWS-TC, which, however, have never been rigorously compared: (1) training classifiers based on pseudo-labels generated by (softly) matching seed words (SEED) and (2) prompting (and calibrating) language models using classification instructions (and raw texts) to decode label words (PROMPT). This paper presents the first XWS-TC benchmark to compare the two approaches on fair grounds, where the datasets, supervisions, and hyperparameter choices are standardized across methods. Our benchmarking results suggest that (1) both SEED and PROMPT approaches are competitive and there is no clear winner; (2) SEED is empirically more tolerant than PROMPT to changes in human guidance (e.g., seed words, classification instructions, and label words); (3) SEED is empirically more selective than PROMPT about the pre-trained language models it works with; (4) recent SEED and PROMPT methods have close connections, and a clustering post-processing step based on raw in-domain texts is a strong performance booster for both. We hope this benchmark serves as a guideline for selecting XWS-TC methods in different scenarios and stimulates interest in developing guidance- and model-robust XWS-TC methods. We release the repo at https://github.com/ZihanWangKi/x-TC.


Introduction
Recently there has been significant advancement in text classification with the emergence of Extremely Weakly Supervised Text Classification (XWS-TC) methods (Meng et al., 2020b; Wang et al., 2021; Zhang et al., 2021b; Zhao et al., 2022; Park and Lee, 2022), which require no human-annotated datasets. Instead, these methods rely on minimal human guidance, such as the names of the classes or instructions describing the classification task. There are two main approaches to XWS-TC: one based on matching seed words (SEED), and the other on prompting a language model (LM) with instructions (PROMPT). We give a brief introduction in the following paragraphs, and a more thorough review is in Section 3. SEED methods for XWS-TC rely on a user-specified list of seed words for each class, as well as an unlabeled in-domain corpus. These seed words are then expanded into a larger set of related words for each class through statistical methods (Mekala and Shang, 2020), embedding similarity (Wang et al., 2021), or masked language model predictions (Meng et al., 2020b). These related words are used to assign a pseudo-class to each text in the unlabeled corpus through some matching strategy (e.g., assign a text to a class if it contains the related words for that class). The pseudo labels are then used to train a classifier through standard fully-supervised fine-tuning.
On the other hand, PROMPT methods for XWS-TC rely on reformulating text using an instruction template and prompting the language model to generate the likelihoods of each label in the classification task (Brown et al., 2020). For example, in a sentiment classification task with an instruction template of <text>. sentiment:, the model generating "happy" or "sad" helps classify the sentiment of the text. Naive zero-shot prompting takes the highest-likelihood label as the answer; recent improvements for more accurate likelihoods include calibration of likelihood scores (Holtzman et al., 2021; Zhao et al., 2021; Han et al., 2022) and verbalizers that find more label words to better represent the class (Schick and Schütze, 2021; Ma et al., 2023; Hu et al., 2022).
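The naive zero-shot prompting loop described above can be sketched as follows. This is a minimal illustration, not any paper's implementation: `toy_logprob` is a made-up stand-in for a real language model's next-token scoring function, and the template and label words are illustrative assumptions.

```python
import math

def zero_shot_classify(text, label_words, next_token_logprob):
    """Score each label word as the next token after the instruction
    template and return the highest-scoring label (naive zero-shot)."""
    prompt = f"{text}. sentiment:"  # instruction template: <text>. sentiment:
    scores = {w: next_token_logprob(prompt, w) for w in label_words}
    return max(scores, key=scores.get), scores

# Toy stand-in for a language model's next-token log-probability;
# a real implementation would query a GPT-style model instead.
def toy_logprob(prompt, word):
    hit = any(cue in prompt for cue in {"great", "loved", "wonderful"})
    if word == "happy":
        return math.log(0.7 if hit else 0.3)
    return math.log(0.3 if hit else 0.7)

label, scores = zero_shot_classify("The movie was great", ["happy", "sad"], toy_logprob)
```

Swapping `toy_logprob` for an actual LM call is the only change needed to turn the sketch into real zero-shot prompting.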
Both SEED and PROMPT methods have demonstrated strong performance in XWS-TC. However, there has been a lack of comprehensive comparison between these two approaches, due to the perception that they are unrelated and the lack of standardization in datasets, supervision, and hyperparameter choices across methods.
We are motivated to construct a benchmark that fairly evaluates the performance of XWS-TC methods. The benchmark consists of 11 datasets covering four domains, along with their fine-grained variants and different numbers of classes. In addition, we make an effort to use the same hyperparameters across datasets for each method, as there should not be a development set to tune the hyperparameters in the XWS setting (Perez et al., 2021).
Our benchmarking results suggest that both SEED and PROMPT approaches are competitive, with no clear winner. SEED tends to perform better when both approaches use a similar-sized pre-trained model and is more tolerant to changes in human guidance (such as seed words, classification instructions, and label words). On the other hand, PROMPT methods can handle more general types of human guidance (such as descriptions of class names, rather than specific words) and do not have a strict requirement for an unlabeled corpus. When the underlying pre-trained language model changes, PROMPT is more robust and scales better with the language model than SEED. We also examine two specific methods, one from each approach, X-Class (Wang et al., 2021) and ProtoCal (Han et al., 2022), which independently proposed a post-processing step that calibrates class predictions through clustering on an unlabeled in-domain corpus to improve classification performance. Our results show that this subroutine can be a universal booster for both SEED and PROMPT approaches.
Through this benchmark, we aim to advance the study of XWS-TC methods and call for the development of methods that are robust to different human guidance and language models. We firmly believe that this paper will serve as a guide for selecting the appropriate method in different scenarios and contribute to the advancement of the field.

Different Types of Weak Supervision
Extremely Weak Supervision is a setting that assumes access to only high-level human inputs, such as names of classes or instructions about classification criteria. We briefly discuss different types of minimal supervision in the following paragraphs.
Few-shot Supervision Few-shot supervision is the setting where there are only a small number of labeled examples for each class. An intuitive way to use them is to directly train the classifier on the few-shot data, but this usually yields subpar performance. Another popular way is in-context learning, where the few-shot supervision is used as context to prompt the LM for the answer (Brown et al., 2020). Various methods have been proposed to improve it by searching for better label words (Schick and Schütze, 2021; Ma et al., 2023), stabilizing the output (Lu et al., 2022), and efficient fine-tuning (Gao et al., 2021).
Distant Supervision Distant supervision includes supervision from external resources such as encyclopedias or gazetteers. There have been efforts to incorporate external knowledge into prompting (Hu et al., 2022), phrase mining (Shang et al., 2018), and named entity recognition (Liang et al., 2020). External models can also be used to help with extremely weak supervision. One line of research leverages models trained on natural language inference data to suggest better related words (Park and Lee, 2022) or directly classify the text (Yin et al., 2019; Gera et al., 2022).
No Supervision Unsupervised methods fall into this category, as they require no supervision. These methods typically take one of two approaches: (1) clustering (Aharoni and Goldberg, 2020) or (2) topic modeling (Blei et al., 2003). However, both approaches lack control over the generated clusters/topics, i.e., the classes. For example, a text corpus can be categorized on several bases, including topic, location, and sentiment; an unsupervised method cannot handle such scenarios. It would be beneficial to retrieve all possible classifications of a corpus in an unsupervised manner, but as far as we are aware, no methods have this ability.

Weak Supervision Benchmarks
We introduce two other weak supervision benchmarks and discuss how they differ from this work.
Wrench (Zhang et al., 2021a) is a benchmark that explores various types of weak supervision labeling functions (i.e., rules used to label the text). It synthesizes the performance of different labeling functions, ways to combine them, and the fine-tuning process that learns from the pseudo-training data. In our benchmark, we analyze extremely weak text classifiers that go beyond labeling functions and compare their performance and robustness with zero-shot prompting.
AutoWS-Bench-101 (Roberts et al., 2022) is another benchmark that analyzes how labeling functions help text classification along with additional few-shot supervision. It concludes that pre-trained models are strong baselines for in-domain settings and should be considered for integration with weak supervision methods. In this work, we focus on extremely weak supervision methods without any labeled data. The SEED and PROMPT methods compared in this benchmark are all based on pre-trained language models.

Verbalizers
Verbalizers are a type of PROMPT method that finds a larger set of label words so that the class choices are accurately represented. We did not consider verbalizer methods in this benchmark since they mostly rely on additional supervision, such as few-shot data (Schick and Schütze, 2021; Ma et al., 2023) or an external knowledge base (Hu et al., 2022).

Background
Extremely Weak Supervision in Text Classification refers to using only a few pieces of high-level human guidance as supervision. This guidance is typically in the form of seed words that describe each class, or an instruction paired with label words that define the task. There are two main approaches for XWS-TC: matching seed words (SEED) and prompting language models (PROMPT).

Seed Matching Methods
SEED approaches are provided with a few class-indicative seed words and unlabeled documents as input. These methods typically involve seed word expansion, where more words related to the provided seed words are identified in the unlabeled corpus through statistics-based (Salton and Buckley, 1988; Mekala and Shang, 2020) or deep learning-based strategies (Meng et al., 2020b; Wang et al., 2021; Zhang et al., 2021b). Using these expanded seed words, each unlabeled document is pseudo-labeled. Different heuristics have been explored for pseudo-labeling, such as string matching (Meng et al., 2018). Recently, the matching approach has also evolved toward softer strategies such as embedding-based matching (Wang et al., 2021) and graph-based matching (Zhang et al., 2021b), which can address conflicts in a principled manner during pseudo-labeling.
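As an illustration of the string-matching heuristic described above, here is a minimal pseudo-labeler sketch. The seed lists and toy corpus are our own made-up examples, not drawn from the benchmark:

```python
def pseudo_label(corpus, seed_words):
    """Assign each unlabeled text the class whose (expanded) seed words
    occur most often in it; texts matching no class stay unlabeled (None)."""
    labels = []
    for text in corpus:
        tokens = text.lower().split()
        counts = {c: sum(tokens.count(w) for w in words)
                  for c, words in seed_words.items()}
        best = max(counts, key=counts.get)
        labels.append(best if counts[best] > 0 else None)
    return labels

seeds = {"sports": ["game", "team", "score"],
         "politics": ["election", "senate", "vote"]}
corpus = ["the team won the game", "the senate vote passed", "weather today"]
labels = pseudo_label(corpus, seeds)
```

Real SEED methods replace the raw counts with soft, embedding- or graph-based scores, but the pseudo-labels feed a standard classifier in the same way.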
We include 4 strong-performing SEED methods in our benchmark. LotClass (Meng et al., 2020b) obtains related words by predicting masked tokens with a masked-language-modeling-trained model (Devlin et al., 2019) over an unlabeled corpus. It matches text to related words by fine-tuning a model to predict the related words given a text. X-Class (Wang et al., 2021) obtains related words by finding words that have similar representations. It constructs class-oriented representations for text and matches the text to related words by representation similarity. It also showed that performance can be improved significantly by matching based on clusters of text representations. ClassKG (Zhang et al., 2021b) models the dependence of related words as an annotating problem on the keyword graph. NPPrompt (Zhao et al., 2022) obtains related words through embedding similarity from a pre-trained LM. The related words are used as label words to prompt a generative LM for predictions, which are then aggregated as the matching result. To some extent, NPPrompt belongs to the intersection of PROMPT and SEED methods.

Prompt Methods
Prompting language models is another approach to extremely weak supervision in text classification. This approach prompts a generative language model with an instructive text and extracts the likelihoods of different label words. It does not require an unlabeled in-domain corpus and can be used to predict text in an online fashion. However, language models are known to be biased toward text sequences more common in pre-training data, leading to instability in zero-shot and few-shot settings. Recently proposed post-processing methods (Holtzman et al., 2021; Han et al., 2022) attempt to address this by calibrating the predicted probabilities using estimates of the model's bias toward each verbalized label. We describe 2 calibration methods. DC-PMI (Holtzman et al., 2021) uses a null prompt to obtain the raw likelihoods of the language model predicting each label. Then, for each text, it modifies the likelihood of the predicted label by marginalizing out the raw ones. ProtoCal (Han et al., 2022) considers an unlabeled corpus and obtains the predicted likelihoods on that corpus. The likelihood vectors are then clustered to better obtain the prediction boundary for each class. Instead of maximum likelihood, this prediction boundary is used to predict the class.
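A rough sketch of the DC-PMI idea: rescore each label by subtracting the log-likelihood the model assigns it under a content-free ("null") prompt. The numeric likelihoods below are hypothetical; a real implementation would obtain both sets of log-probabilities from a language model.

```python
import math

def dcpmi_classify(text_logprobs, null_logprobs):
    """Pick the label maximizing log p(label | text prompt) - log p(label | null prompt),
    discounting labels the model favors regardless of the input."""
    return max(text_logprobs, key=lambda y: text_logprobs[y] - null_logprobs[y])

# Hypothetical numbers: the raw likelihoods favor "positive" for this text,
# but the null prompt reveals a strong prior bias toward "positive".
text_logprobs = {"positive": math.log(0.55), "negative": math.log(0.45)}
null_logprobs = {"positive": math.log(0.80), "negative": math.log(0.20)}
prediction = dcpmi_classify(text_logprobs, null_logprobs)
```

Here raw argmax would output "positive", while the calibrated score flips the prediction to "negative", which is exactly the kind of bias correction the calibration methods aim for.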
Some more SEED and PROMPT methods are described in Appendix A.

Benchmark
In order to establish a benchmark that can accurately evaluate various XWS-TC methods, it is essential to consider a range of factors: dataset choices, instructions, label words, hyperparameter control, the use of pre-trained language models, and metrics, and to ensure their consistency across all experiments. We discuss each of these factors in detail in the following sections.

Dataset
We consider datasets from prior evaluations (Holtzman et al., 2021; Wang et al., 2021; Meng et al., 2020b) that contain data from diverse domains. To facilitate evaluation, the size of the evaluation set for each dataset has been controlled to a few thousand instances. Additionally, since many XWS-TC methods require an unlabeled in-domain corpus, a similar-sized sample has been drawn from the training split to serve this purpose, with the evaluation set and unlabeled corpus kept disjoint. The datasets have been uniformly sampled without altering the distribution of labels, thus preserving the imbalance ratio, defined as the ratio between the sizes of the largest and smallest classes. The statistics of the datasets are presented in Table 1. Details of the sources of the datasets are in Appendix B.
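The label-distribution-preserving sampling described above can be sketched as follows; the function names and toy data are our own illustrative assumptions, not the benchmark's code:

```python
import random
from collections import Counter

def stratified_sample(texts, labels, n, seed=0):
    """Downsample to about n instances while preserving the label
    distribution (and hence the imbalance ratio)."""
    rng = random.Random(seed)
    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)
    total = len(texts)
    sample = []
    for y, items in by_class.items():
        k = max(1, round(n * len(items) / total))  # proportional per-class quota
        sample.extend((t, y) for t in rng.sample(items, min(k, len(items))))
    return sample

def imbalance_ratio(labels):
    """Ratio between the sizes of the largest and smallest classes."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy corpus: 300 texts of class "a" and 100 of class "b" (ratio 3.0).
texts = [f"text {i}" for i in range(400)]
labels = ["a"] * 300 + ["b"] * 100
subsample = stratified_sample(texts, labels, 40)
```

Because each class is downsampled by the same proportion, the subsample keeps the original 3.0 imbalance ratio.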

Instructions and Label/Seed Words
To fairly compare SEED and PROMPT methods, we need to provide equal amounts of human supervision. That means, for SEED methods, we should only allow a single seed word for each class, matching the amount used for label words. For instructions, we consider simple ones that hint at the classification criteria (Holtzman et al., 2021). Detailed choices can be found in Appendix C.

Metrics
For evaluation metrics, we use the macro F1 score on a dataset-by-dataset basis, which values each class within a dataset equally. To summarize the performance of a method across all datasets, we employ two metrics: the average of the macro F1 scores, and a ranking-based metric that combines the rankings of methods on each dataset to obtain a value robust to scale differences across datasets (Colombo et al., 2022).
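For concreteness, a from-scratch macro F1 (equivalent in spirit to a library macro-averaged F1; the toy labels below are illustrative):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores, so every class counts
    equally regardless of its size."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Class "a": F1 = 2/3; class "b": F1 = 4/5; macro F1 = 11/15.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"])
```

Averaging per-class F1 (rather than pooling all predictions) is what makes the metric sensitive to minority-class performance under imbalance.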

Hyperparameters
Another crucial aspect of the benchmark is the number of hyperparameters utilized by each method. In the context of extremely weak supervision, we argue that it is unrealistic to use different hyperparameters for different datasets, as doing so would necessitate a separate development set, thereby defeating the purpose of using only high-level human supervision (Perez et al., 2021). Therefore, we slightly tune the hyperparameters on one of the datasets to rule out failing scenarios and then stick with a single choice of hyperparameters across all datasets. Under this hyperparameter enforcement, the ideal method should exhibit consistent performance across all datasets.

Pre-trained Language Models
PROMPT methods use generative language models such as GPT, while SEED methods use representation-encoding language models such as BERT. To fairly compare methods from these two approaches on XWS-TC, we have to consider the ability of the language models as a factor. We use the number of parameters of the pre-trained language model as an approximation of its power. Since all the language models use the transformer as the backbone, this implies that the number of layers and the size of the hidden states are controlled. A further discussion is in Appendix D.

Large Language Models
This benchmark specifically excludes the evaluation of (multi-task) fine-tuned language models such as T0 (Sanh et al., 2022), large language models (LLMs) such as GPT-3, and human-feedback-trained language models like Instruct-GPT (Ouyang et al., 2022) and ChatGPT, because there are no equivalent representation-encoding language models for the SEED approaches. We discuss this in more detail and include an evaluation of ChatGPT on a single dataset as a reference in Appendix E.

Main Results
In Table 2 we show the performance of all SEED and PROMPT methods considered in the benchmark across the 11 datasets and report the average macro F1 performance and the rank score.
Performance of PROMPT Methods We note that the performance of the standalone PROMPT method is about 20 points lower than its counterparts with calibration. The use of additional instance-independent instructions (DC-PMI) or an additional clustering based on unlabeled text (ProtoCal) is crucial for PROMPT methods to work well in XWS (zero-shot) text classification.
Performance of SEED Methods All the SEED methods exhibit strong performance, with X-Class performing stably well across all datasets and ClassKG performing best on several datasets but losing on certain fine-grained ones.
Comparing PROMPT and SEED Methods First, on absolute performance, we can see that SEED methods perform better overall than PROMPT methods, even when appropriate calibration is added for PROMPT. However, we also observe that a larger pre-trained GPT model increases the performance of PROMPT methods quite significantly, while SEED methods improve less when a larger pre-trained language model is used. This effect is further studied in Section 5.2.3.

Robustness
Through this benchmark, we hope not only to decide which method performs best, but also to analyze which method is more robust under dynamic circumstances. Different choices of label words/seed words, instructions, and pre-trained language models can occur in real life. Therefore, the robustness of methods when these ingredients are reasonably varied indicates how stable each method is under varying circumstances. Due to the cost of multiple runs of each method, we focus on 4 datasets covering different domains, imbalance ratios, and numbers of classes: Yelp, AGNews, NYT-S, and DBpedia. We leave out two methods, LotClass and NPPrompt, to save computational resources.

Different Seed/Label words
In Table 3 we explore the effect of using a different choice of label words and seed words. For example, for Yelp-2, we chose negative/positive, terrible/great, bad/good, awful/fine, and nasty/nice as the variants. We report the performance of the methods on each of the five choices, as well as the aggregated performance over the 4 aforementioned datasets. We notice that PROMPT methods in general have high instability. While DC-PMI and ProtoCal can remedy the variance a bit, SEED methods are still more robust to changes of seed words.

Table 3: Performance of PROMPT and SEED methods when the label words/seed words are changed to similar-meaning alternatives. We show the performance on 5 choices of label words on Yelp-2 (4 alternatives + 1 default), its median, average, and standard deviation, and the averaged metrics across all datasets.

Different Instructions
A high variance is also observed when the instructions are changed for the PROMPT methods, as shown in Table 4. A noticeable trend is that when the pre-trained model is larger, the performance increases, but so does the variance brought by instructions or label words. This could be alarming for PROMPT methods.

Different Pre-trained Language Models
In Table 5 we analyze how changes in the pre-trained language model affect the performance of SEED and PROMPT methods (see Appendix H for the full table). Although SEED performs better than PROMPT, PROMPT methods show a strong increasing trend with the size of the pre-trained language model (e.g., changing from BERT-base to BERT-large). Also, X-Class and NPPrompt fail on RoBERTa and BERT respectively, which we hypothesize is because assumptions made in these methods are not general to all pre-trained language models; for example, the distribution of similarities of representations generated by a language model might differ across models. This scaling trend is a factor that should be taken into account when selecting XWS-TC methods, especially when the language model size differs from those evaluated in this benchmark.

Table 4: Performance of PROMPT methods when the instructions are changed to similar-meaning alternatives. We show the performance on 5 choices of instructions on Yelp-2 (4 alternatives + 1 default), its median, average, and standard deviation, and the averaged metrics across all datasets.

Connections between Recent SEED and PROMPT Methods
While PROMPT was introduced by the seminal GPT-3 paper (Brown et al., 2020) not too long ago, SEED has a longer history and can be traced back to early tf-idf retrieval methods (Salton and Buckley, 1988).
In recent years, SEED and PROMPT methods have been exploring similar ideas. SEED methods have been leveraging pre-trained language models to better understand the semantics of seed words, for example by asking the language model to fill in masks (Meng et al., 2020b) or through representation similarities (Wang et al., 2021; Zhao et al., 2022). PROMPT methods have been exploring calibration and verbalizers to improve and stabilize their predictions. Verbalizers include a step of finding more label words that better represent the class, a similar approach to that used in SEED. We show that a recent representative SEED method, X-Class, and two PROMPT techniques, verbalizers and ProtoCal, have high similarities and deep connections in their design. This is particularly interesting as both directions have been developed independently. In Figure 2, we provide a pipeline of the methods and highlight the similarities.

Obtaining Text Representations
X-Class matches text to classes by learning class-oriented text representations from an encoder-based language model. X-Class views class representations as the union of representations describing the words. The text representation in X-Class is defined as a weighted average of individual token representations, where the weights are based on their respective similarity to the class representations. On the other hand, general prompting relies on a decoder-based language model to produce a next-token representation. In the penultimate layer of the decoder, the last token representation is computed by an attention mechanism over all other tokens, which essentially produces a weighted average of all the token representations.
In both methods, the text representation is obtained using an attention-like weighted average over the tokens in the text. The attention is guided such that the output representation is indicative of the class: X-Class uses signals from class names to guide the attention, while prompting relies on the model's understanding of the instruction.
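A toy sketch of such a class-guided weighted average, using 2-dimensional vectors for illustration (real token and class representations would come from a language model; the softmax weighting is our simplifying assumption):

```python
import math

def class_oriented_repr(token_vecs, class_vec):
    """Weighted average of token vectors, where each token's weight is its
    softmax-normalized dot-product similarity to the class representation."""
    sims = [sum(a * b for a, b in zip(t, class_vec)) for t in token_vecs]
    exps = [math.exp(s) for s in sims]
    weights = [e / sum(exps) for e in exps]
    dim = len(token_vecs[0])
    return [sum(w * t[i] for w, t in zip(weights, token_vecs)) for i in range(dim)]

# Two orthogonal "token" vectors; the class vector aligns with the first,
# so the resulting text representation leans toward that token.
rep = class_oriented_repr([(1.0, 0.0), (0.0, 1.0)], (1.0, 0.0))
```

The class vector plays the role of the attention query, which is the structural parallel to the instruction-conditioned attention in prompting.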

Obtaining Predicted Likelihoods
PROMPT methods obtain class likelihoods by comparing the similarity of the next-token representation to representations of the label words. A recent line of research on improving prompting for classification enlarges the set of label words to capture more diverse meanings of the classes, known as verbalizers, such as PET (Schick and Schütze, 2021), ProtoVerb (Ma et al., 2023), and KPT (Hu et al., 2022). The notion of verbalizers is very similar to seed-word expansion in SEED methods. For example, X-Class and verbalizers both obtain a list of related words and use it to aggregate a class representation that replaces the naive usage of the label/seed word representation. Notably, the verbalizer methods require external supervision to find the related words, such as few-shot data (Schick and Schütze, 2021; Ma et al., 2023) or a knowledge base (Hu et al., 2022), while SEED methods detect related words through an unlabeled corpus. Both approaches could be useful under different input settings.

Unlabeled Corpus Clustering
Finally, a SEED method, X-Class, and a PROMPT method, ProtoCal, independently introduced a post-processing step that clusters an unlabeled corpus, with the goal of obtaining a better decision boundary. X-Class clusters the text representations and initializes the clusters with the prior text-class similarity so that the clusters and classes are aligned. ProtoCal clusters the predicted likelihoods and aligns the clusters to classes by post-matching the cluster centers to the classes. We further explore the effect of the two clustering ideas; a summary is in Table 6 (full table in Appendix I). We show that adding such a post-clustering process can almost freely (apart from requiring an unlabeled corpus) improve performance consistently for five different methods.
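To illustrate, here is a minimal version of the ProtoCal-style step: cluster the likelihood vectors from an unlabeled corpus, align each cluster to the class its center favors, and classify by nearest center instead of raw argmax. The k-means uses a naive deterministic initialization and the likelihood vectors are made up; this is a sketch of the idea, not the paper's implementation.

```python
def kmeans(points, k, iters=20):
    """Minimal k-means with a fixed, deterministic init (first k points)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old center if a cluster goes empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def protocal_predict(likelihoods, k):
    """Cluster likelihood vectors, map each cluster to the class its center
    favors, then predict by nearest cluster center."""
    centers = kmeans(likelihoods, k)
    cluster_to_class = {i: max(range(k), key=lambda c: centers[i][c])
                        for i in range(k)}
    preds = []
    for p in likelihoods:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        preds.append(cluster_to_class[j])
    return preds

# Hypothetical per-text class likelihoods over two classes.
likelihoods = [(0.9, 0.1), (0.85, 0.15), (0.2, 0.8), (0.25, 0.75)]
preds = protocal_predict(likelihoods, 2)
```

The decision boundary is now placed between the empirical clusters of model outputs rather than at the fixed 0.5 likelihood threshold, which is what makes the step a calibration.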

Implications
Given these connections between SEED and PROMPT methods and the previous analysis of robustness, a natural extension is to analyze the cause of the stability issues with respect to label/seed words and model differences. We presented one empirical analysis of the clustering step in X-Class and ProtoCal and showed that this step can improve performance for various methods discussed in the benchmark (Section 6.3). Further analysis of other components is left as future work. For example, one could reason that the introduction of related words makes the model less sensitive to the given label/seed words. This would require an exploration of the quality of the related words found by different SEED and verbalizer methods, and whether the related words can be used interchangeably between methods.

Conclusions and Future Work
In this work, we introduce a benchmark to quantitatively evaluate different SEED and PROMPT approaches for extremely weakly supervised text classification. Through the benchmark, we raise awareness of SEED approaches, which are strong competitors to the better-known zero-shot prompting (with calibration). We also experiment on the robustness of the two approaches and show that SEED methods are more tolerant to changes in the given human guidance, while also being more selective about the pre-trained language models they work with. We analyzed the connections between SEED and PROMPT approaches through the lens of a few representative methods and showed that the methodologies have been converging recently. Finally, we include a study on clustering as a calibration technique that was independently proposed for both approaches, and show that it can be a good performance booster. We envision future work in two directions. The first is to understand the source of the robustness differences and design a method that takes the best of both worlds (see Section 6.4). The other is to scale up the experiments and test whether the conclusions still hold for larger pre-trained language models.

Limitations
Limitation of Model Scale The benchmark only includes the evaluation of moderate-size language models and does not experiment on large language models. We justify our reasons in Section 4.6 and Appendix E and include an evaluation of ChatGPT in Appendix E, showing that even human-feedback fine-tuned large language models are far from perfect on XWS-TC. However, we acknowledge that the current state of extremely weak supervision would be better understood and assessed if complete evaluations on state-of-the-art large language models, such as Instruct-GPT (Ouyang et al., 2022), PaLM (Chowdhery et al., 2022), and ChatGPT, existed. While we lack the computational resources to perform such an evaluation, we hope this work can stimulate interest in XWS-TC and that such a study can be completed.

Limitation of Text Classification Another limitation is the scope of text classification. While PROMPT and SEED methods have shown strong performance on text classification, this performance does not extend to other general classification tasks, such as natural language inference/entailment (Zhao et al., 2022).

Ethics Statement
This paper establishes a benchmark for extremely weakly supervised text classification frameworks. We provide empirical results on various SEED and PROMPT methods, test their robustness, and analyze their connections. We give intuitions and insights on which method one should use for XWS-TC in different circumstances. We believe that we are on the ethical side and do not find any ethical concerns in this work.
A Other SEED and PROMPT Methods

More SEED methods. There are other SEED methods that we briefly describe here. WeSTClass (Meng et al., 2018) is one of the earlier weakly supervised methods; it utilizes seed words to train a classifier by generating pseudo-documents instead of generating pseudo-labels. ConWea (Mekala and Shang, 2020) explores the multiple senses of words and proposes to view seed words with different meanings as different words. Lime (Park and Lee, 2022) uses a model fine-tuned on a natural language inference dataset to suggest the seed words.
More PROMPT methods. There are other post/pre-processing techniques that we briefly describe here. ContextualCal (Zhao et al., 2021) and PromptOrder (Lu et al., 2022) work for in-context learning (in the few-shot scenario) and address the stability issue of the few-shot context in prompts. NoisyChannel (Min et al., 2022) considers the likelihood of generating the document based on the label, rather than generating the label based on the document.

B Dataset Sources
The datasets are first introduced in the following papers: • IMDB (Maas et al., 2011)

C Detailed instructions and Label/Seed Words
We provide Table 7 showing the instructions and label words used in the main experiment of the benchmark.

D Comparing Pre-trained Language Models
We are aware that a similar number of parameters in language models does not directly imply similar abilities. We notice that GPT-family LMs do tend to have lower fine-tuning performance on natural language understanding tasks (Wang et al., 2019) when compared with BERT/RoBERTa. However, we also notice that similar-sized GPT models do have similar performance on zero-shot prompting as RoBERTa, as observed in Table 8. Since we are comparing under an XWS setting, instead of fully supervised fine-tuning, we believe it is fair to compare similar-sized GPT models and RoBERTa models. We do acknowledge that BERT might be at a disadvantage, since RoBERTa is better than BERT at both fully supervised fine-tuning (Liu et al., 2019) and zero-shot prompting (Table 8).

2 http://qwone.com/~jason/20Newsgroups/
However, as we note in Section 5.2.3, certain SEED methods that work well on BERT might not be easily transferable to RoBERTa.

E Excluding Large Language Models
We did not include large language models in this benchmark.Here, we elaborate on two specific reasons.
From the design purpose of the benchmark, the focus is to understand the strengths of different SEED and PROMPT methods, which would be fruitful for moderate businesses or individuals deciding which method to use for XWS-TC. Therefore, analyses and comparisons on moderate-sized language models (100M-300M parameters in the benchmark) are more meaningful.
From a fair evaluation principle, all the models mentioned above are only developed as generative language models, which are not typically used for SEED approaches. Using a more powerful language model for one approach would defeat the purpose of a fair comparison between the approaches. Further, fine-tuned language models have already seen many classification tasks that are the same as or very similar to the datasets in this benchmark. Therefore, it would be hard to assess the true performance of the methods, as the similarity of the fine-tuned tasks to the evaluation tasks becomes another factor.
We also include an evaluation of ChatGPT on the benchmark. It is hard to fairly evaluate such a model, since (1) we do not know how it is trained and whether it saw the datasets in the benchmark, and (2) there is no easy way to do large-scale evaluation. We decided to evaluate it on the NYT-S-Fine dataset, since we believe it is unlikely to have been trained on such a fine-grained dataset. We pick 4 examples from each class, resulting in 104 examples in total. Since we cannot retrieve the likelihoods, we embed the choice of classes in the prompt as follows: <instruction> <text> Answer:, where

G Computation Costs
We ran experiments on A6000 and A5000 GPUs.The total estimated GPU hours is 600.

H Full version of Table 5
We show Table 9, the detailed version of Table 5 that includes performances on individual datasets.

I Full version of Table 6
We show Table 10, the detailed version of Table 6 that includes performances on individual datasets.

Figure 1 :
Figure 1: Illustrations of the XWS-TC problem and the SEED and PROMPT approaches.

Figure 2 :
Figure 2: We highlight similarities (green) between a SEED method X-Class (orange) and two PROMPT methods Verbalizers and ProtoCal (blue).

Table 1 :
Dataset statistics in our benchmark.

Table 2 :
Performance of PROMPT and SEED methods on the benchmark with standard models, prompt instructions, label words, and seed word choices. For all scores, higher is better.

Table 5 :
Performance of PROMPT and SEED methods when the choice of the pre-trained model is alternated.

Table 6 :
Performance of PROMPT and SEED methods with and without the clustering post-processing.