Zero-Shot Text Classification via Self-Supervised Tuning

Existing solutions to zero-shot text classification either conduct prompting with pre-trained language models, which is sensitive to the choice of templates, or rely on large-scale annotated data of relevant tasks for meta-tuning. In this work, we propose a new paradigm based on self-supervised learning to solve zero-shot text classification tasks by tuning the language models with unlabeled data, called self-supervised tuning. By exploring the inherent structure of free texts, we propose a new learning objective called first sentence prediction to bridge the gap between unlabeled data and text classification tasks. After tuning the model to predict the first sentence of a paragraph based on the rest, the model is able to conduct zero-shot inference on unseen tasks such as topic classification and sentiment analysis. Experimental results show that our model outperforms the state-of-the-art baselines on 7 out of 10 tasks. Moreover, our analysis reveals that the model is less sensitive to prompt design. Our code and pre-trained models are publicly available at https://github.com/DAMO-NLP-SG/SSTuning.


Introduction
Recent advances in pre-trained language models (PLMs) have brought enormous performance improvements in a large variety of NLP tasks (Radford and Narasimhan, 2018; Devlin et al., 2019). These paradigm shifts towards leveraging generic features learnt by PLMs are driven by the high data cost of learning each new NLP task afresh. One promising method that echoes this paradigm shift is zero-shot text classification, which predicts text labels on unseen tasks. Zero-shot text classification has attracted considerable research attention in recent years (Wei et al., 2022; Sanh et al., 2022; Yang et al., 2022), as labeled data is no longer a necessity for learning feature representations for new, specific tasks.
Existing studies on zero-shot text classification can be broadly classified into two types, as shown in Figure 1. The first type is prompting, which uses PLMs to predict labels with designed templates and verbalizers (Figure 1 (a)). This can be achieved by leveraging the generation capability of large language models (Brown et al., 2020; Chowdhery et al., 2022), or by reformulating the text classification task as a mask-filling task (Schick and Schütze, 2021). Likewise, generation-based methods (Meng et al., 2022; Ye et al., 2022) and mining-based methods (van de Kar et al., 2022) also rely on prompting to generate or filter noisy labeled samples, which are used for further fine-tuning. The second type is meta-tuning, which fine-tunes a PLM on a collection of labeled data from related tasks before conducting inference on unseen tasks (Figure 1 (b)). By reformulating the annotated data into instruction templates (Wei et al., 2022; Sanh et al., 2022), question-answer pairs (Khashabi et al., 2020; Zhong et al., 2021), multiple-choice questions (Yang et al., 2022) or entailment pairs (Yin et al., 2019; Ding et al., 2022; Du et al., 2023), and fine-tuning on them, PLMs perform well on unseen tasks.
Despite the achieved performance, existing methods have several limitations. Prompting has been shown to be sensitive to the choice of patterns and verbalizers (van de Kar et al., 2022), which makes it difficult to design templates specifically for each task. In addition, generation-based and mining-based methods require fine-tuning PLMs for each downstream task, which is inefficient for deployment. On the other hand, meta-tuning relies on labeled data of relevant tasks, or in specific formats, to facilitate the learning of desired patterns. The requirement for such large-scale annotated data narrows its application scope.
To address the above issues, we propose to leverage self-supervised learning (SSL) for zero-shot text classification tasks. SSL has been widely used during the pre-training stage of PLMs to alleviate the need for large-scale human annotations (Devlin et al., 2019; Lan et al., 2020) by exploiting the intrinsic structure of free texts. Therefore, with a suitable SSL objective, the model is able to capture certain patterns from auto-constructed training data and can be applied to a wide range of downstream tasks in a zero-shot manner without task-specific designs. To the best of our knowledge, this is the first work to exploit SSL at the tuning stage for zero-shot classification, which we refer to as self-supervised tuning (SSTuning).
The biggest challenge in applying SSTuning to zero-shot text classification is designing a proper learning objective that can effectively construct large-scale training samples without manual annotation. Intuitively, the core of a text classification task is associating the most suitable label with the text, given all possible options. Motivated by this observation, we propose a new learning objective named first sentence prediction (FSP) for the SSTuning framework to capture such patterns. In general, the first sentence tends to summarize the main idea of a paragraph. Therefore, predicting the first sentence from the rest of the paragraph encourages the model to learn the matching relation between a text and its main idea (its "label"). To generate training samples, we use the first sentence of a paragraph as the positive option and the rest as the text. The first sentences of other paragraphs are used as negative options. If negative options come from the same article as the positive option, we regard them as hard negatives, since sentences in the same article usually share similarities, such as describing the same topic. Hard negatives force the model to learn the semantics of the text instead of simply matching keywords to complete the task.
In the inference phase, we convert all possible labels of a sample into options, which can be done in two simple ways: 1) use the original label names; 2) convert labels with templates (like "This text is about [label name]."). The text and options are then combined to create the final input, and the tuned model retrieves the most relevant option as the predicted label. Since the tuned model has seen a large number of samples with varied first sentences as options, some of which are likely similar to the options seen at inference time, its performance is less sensitive to verbalizer design. In this way, SSTuning enables efficient deployment of a PLM for classifying texts of unseen classes on the fly, without further tuning on labeled data or unlabeled in-domain data.
Our main contributions are:
• We propose a new learning paradigm called self-supervised tuning (SSTuning) to solve zero-shot text classification tasks. A simple yet effective learning objective named first sentence prediction is designed to bridge the gap between unlabeled data and text classification tasks.
• We conduct extensive experiments on 10 zero-shot text classification datasets. The results show that SSTuning outperforms all previous methods on overall accuracy for both topic classification and sentiment analysis tasks. Our analysis further demonstrates that our model is less sensitive to prompt design.

Proposed Method
In this section, we present our proposed framework, SSTuning, in three parts: the dataset preparation process based on the idea of first sentence prediction (FSP), the tuning phase, and the zero-shot inference phase.

First Sentence Prediction
Text classification can be regarded as selecting the most relevant label for a text, given all possible labels. Based on this observation, we propose the FSP task, which creates datasets for SSTuning by mimicking the same structure.
We design the FSP task by considering both the nature of the unlabeled corpus and the input/output format of classification tasks. In this subsection, we describe in detail how to construct the tuning and validation sets from the unlabeled corpus. Figure 2 shows the core procedures of our dataset generation.
Data filtering. We first filter the data to select appropriate paragraphs for tuning (more details are given in A.1). Removing meaningless sentences ensures data quality, which helps improve model performance.
First sentence as the positive option. We consider an article A_n that contains M paragraphs, i.e., A_n = [P_1^n, P_2^n, ..., P_M^n]. For the m-th paragraph P_m^n, we take its first sentence as the positive option O^{n,m} and the remaining sentences as the text x^{n,m}. As shown in Figure 2, for the first paragraph of the article we can retrieve the first sentence "Jim Berryman (born February 17, 1947) is a ..." as the positive option and the rest of the paragraph "He is the former mayor of Adrian ..." as the text.
Negative sampling. After obtaining the positive option, we randomly sample J "first sentences" from other paragraphs as negative options, where J is a random number satisfying 1 ≤ J ≤ N_maxLabel − 1. Here N_maxLabel denotes the maximum number of options that are first sentences, pre-defined to keep the total number of option tokens from growing too long; it is less than or equal to N_model, the number of labels in the model's output layer. Having a random number of negative options bridges the gap between tuning and zero-shot inference, since the number of classes in the evaluation datasets may vary from 2 to N_model.
Hard negatives. During negative sampling, if a negative option comes from the same article as the positive option, we call it a hard negative. Inspired by the successful application of hard negatives in Gao et al. (2021b), we purposely add more hard negatives to enhance model performance. When reading articles, we often notice the same words appearing in both the first sentence and the rest of the paragraph. As shown in Figure 2, the word "Berryman" lets us quickly find the corresponding first sentence for the text. However, if we add the hard negative "On January 6, 2012, Berryman ...", the model has to understand the true semantics to choose the positive option.
Option padding. We pad the options with the special "[PAD]" token to make the input format consistent between the tuning phase and the inference phase. Specifically, if the total number of options after negative sampling is (J + 1) < N_model, we add (N_model − J − 1) [PAD] options, so that the final option list always has N_model entries.
Generating the final text and label. We shuffle the option list because the position of the positive option is random in the evaluation datasets. Suppose that after shuffling the option list is [O_0, O_1, ..., O_{N_model−1}], where the positive option O^{n,m} = O_c. The label for this sample is then L^{n,m} = c. The final input text x_inp^{n,m} is the concatenation of the index-indicated options and the text, i.e., "[CLS] (A) O_0 (B) O_1 ... [SEP] x^{n,m} [SEP]", where [SEP] is the separator token used by Devlin et al. (2019).
Thus the final text-label pair (x_inp^{n,m}, L^{n,m}) is the generated sample. We can repeat this process to generate a large number of samples as the tuning set; the validation set is generated in the same way. Note that if we select a corpus that contains standalone paragraphs rather than articles, we can treat each paragraph as an article, in which case no hard negatives are generated.
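The generation steps above (positive option, negative sampling, option padding, shuffling, and label assignment) can be sketched as follows. This is a minimal illustration with our own helper names and toy data, assuming paragraphs are given as (first_sentence, rest_of_paragraph) pairs grouped by article; it is not the released implementation.

```python
import random

N_MODEL = 20       # N_model: size of the classification output layer
N_MAX_LABEL = 10   # N_maxLabel: max number of real (non-[PAD]) options
PAD = "[PAD]"

def make_fsp_sample(articles, ai, pi, rng):
    """Build one (input_text, label) pair from articles[ai][pi]."""
    first_sent, rest = articles[ai][pi]
    # Candidate negatives: first sentences of every other paragraph.
    # Those drawn from the same article (a == ai) act as hard negatives.
    negatives = [par[0] for a, art in enumerate(articles)
                 for j, par in enumerate(art) if (a, j) != (ai, pi)]
    # Random number of negatives J, 1 <= J <= N_MAX_LABEL - 1, so the
    # option count varies as it does across evaluation datasets.
    J = rng.randint(1, min(N_MAX_LABEL - 1, len(negatives)))
    options = [first_sent] + rng.sample(negatives, J)
    rng.shuffle(options)
    label = options.index(first_sent)            # index of the positive option
    options += [PAD] * (N_MODEL - len(options))  # option padding
    # Prefix each option with an index indicator (A, B, C, ...).
    option_str = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"[CLS] {option_str} [SEP] {rest} [SEP]", label

rng = random.Random(0)
articles = [
    [("Jim Berryman is an American politician.",
      "He is the former mayor of Adrian."),
     ("On January 6, 2012, Berryman announced his candidacy.",
      "He ran for the state senate.")],
    [("The General Motors Company is an American automaker.",
      "It is headquartered in Detroit.")],
]
text, label = make_fsp_sample(articles, 0, 0, rng)
```

Repeating this over every kept paragraph of the corpus yields the tuning set; a held-out slice generated the same way serves as the validation set.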

Network Architecture
We employ BERT-like pre-trained masked language models (PMLMs) as the backbone, such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020). Following Devlin et al. (2019), we add an output layer for classification. Such models offer both bidirectional encoding capability and simplicity; generative models are unnecessary since we only need to predict the index of the correct option. We make no changes to the backbone so that the method can be easily adapted to different backbones. To cover all test datasets, we configure the number of labels for the output layer as the maximum number of classes across all test datasets, denoted by N_model.

Learning Objective
Traditional text classification with PMLMs like BERT maps each output of the classification layer to a class. Such a design requires a dedicated output layer for each dataset, since datasets have different classes. Instead, our learning objective for FSP with the same network is to predict the index of the positive option. In this way, we can use one output layer for both tuning and inference, and for various kinds of datasets.
As shown in Figure 2, we concatenate the options and the text as input. The outputs are the indices (0, 1, 2..., corresponding to A, B, C), just as in traditional classification datasets. We use a cross-entropy loss for tuning the model.
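Concretely, the objective reduces to standard cross-entropy over the option indices: the model produces N_model logits, and the target is the index of the positive option. A minimal pure-Python sketch (our own stand-in for the usual framework loss, shown only to make the objective explicit):

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy loss for a single sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

logits = [0.2, 2.5, -1.0, 0.1]   # model scores for options A, B, C, D
loss = cross_entropy(logits, 1)  # the positive option is at index 1 ("B")
```

Because the target is always an option index rather than a dataset-specific class, the same output layer serves every dataset.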

Zero-Shot Inference Phase
During the zero-shot inference phase, we can infer directly by converting each input sample to the same format as in the tuning phase.

Input Formulation
As shown in Figure 2, the zero-shot inputs are formulated in the same way as in the tuning phase, with two exceptions: 1) instead of using first sentences as options, we convert the class names to options. We can simply use the original labels or simple templates like "This text is about [label name]." for the conversion, so little to no effort is needed. 2) No shuffling is needed. Since the converted inputs and outputs during the SSTuning and zero-shot phases are the same, no further adjustment of the model is required.
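A minimal sketch of this input formulation, using the template conversion described above (the helper name and padding style are our own, mirroring the tuning-phase format):

```python
N_MODEL = 20
PAD = "[PAD]"

def build_inference_input(text, class_names, template="This text is about {}."):
    """Convert class names to options and format them like a tuning sample."""
    options = [template.format(c) for c in class_names]
    options += [PAD] * (N_MODEL - len(options))   # pad; no shuffling at inference
    option_str = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"[CLS] {option_str} [SEP] {text} [SEP]"

inp = build_inference_input("The match went to extra time.",
                            ["politics", "sports"])
```

The model's predicted index then maps directly back to a class name, with no retraining.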

Constrained Prediction
Since the dimension of the output logits (N_model) may differ from the number of classes in a dataset (N_L), predictions may fall out of range (e.g., the model may output 3 for a dataset with 2 classes). To solve this issue, we simply make predictions over the first N_L logits:

P = argmax_{0 ≤ i < N_L} logit_i,

where P is the index of the predicted positive option.
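Constrained prediction is a one-line restriction of the argmax to the dataset's label set; a minimal sketch:

```python
def constrained_predict(logits, n_classes):
    """Argmax over only the first n_classes logits (N_L <= N_model)."""
    head = logits[:n_classes]
    return max(range(len(head)), key=head.__getitem__)

logits = [0.1, 0.4, 2.3, 0.0]          # N_model = 4, dataset has 2 classes
pred = constrained_predict(logits, 2)  # index 2 is ignored -> pred is 1
```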
Experiment Setup

SSTuning Datasets
We choose English Wikipedia and the Amazon review dataset (2018) (Ni et al., 2019) for SSTuning. The Wikipedia corpus had more than 6.2M articles by the end of 2021, while the Amazon Review Data contains around 233.1M reviews. Wikipedia articles typically use formal expressions, while Amazon reviews contain informal user-written text; together they cover different genres of text.
For English Wikipedia, we collect articles up to March 1st, 2022. To balance the dataset, we select up to 5 paragraphs from each article. The generated dataset has 13.5M samples. For the Amazon review dataset, we only use the review text to create our SSTuning dataset, ignoring other information such as summary and vote. The Amazon review dataset has 29 categories. To keep the model from being dominated by any single category, we select up to 500k samples from each category, collecting 11.9M samples in total.
To build a balanced tuning set, we sample 2.56M examples from the Wikipedia dataset and 2.56M from the Amazon review dataset, for a total of 5.12M samples. In addition, we sample 32k from each of the two datasets, forming a validation set of 64k samples.
Following the baselines (Yang et al., 2022; van de Kar et al., 2022; Gera et al., 2022), we report accuracy on the test set when available, falling back to the original validation set for SST-2.

Baselines
We choose the following baselines for comparison after considering their relevance, impact, checkpoint availability, and model sizes:
• Textual entailment (TE) (Yin et al., 2019): Following Gera et al. (2022), we download the off-the-shelf models trained on MNLI and use them for zero-shot evaluation.
• TE-Wiki (Ding et al., 2022): This model is also trained with entailment methods, but on a dataset constructed from Wikipedia.
• Prompting-based method (Schick and Schütze, 2021): We compare with the results using multiple verbalizers reported in van de Kar et al. (2022).
• Mining-based (van de Kar et al., 2022): The method has three steps: mine, filter, and fine-tune. We compare with the reported results.
• UniMC (Yang et al., 2022): We download the released checkpoint and test the model without question prompts, since the reported results on text classification tasks are better on average.
We follow the setups and verbalizers of the original works as much as possible. If the original work does not provide verbalizers for a dataset, we use the same or comparable verbalizers as ours, as shown in Table 7.

Implementation Details
To test the performance of the proposed method across model sizes and architectures, we tune three versions of the model, based on RoBERTa-base, RoBERTa-large (Liu et al., 2019), and ALBERT-xxlarge (V2) (Lan et al., 2020), denoted SSTuning-base, SSTuning-large, and SSTuning-ALBERT, respectively. We set the maximum token length to 512 and run only one epoch. By default, we repeat all experiments 5 times with different seeds. The experiments on SSTuning-base and SSTuning-large are run on 8 NVIDIA V100 GPUs, and the experiments on SSTuning-ALBERT on 4 NVIDIA A100 GPUs.
The hyperparameters for fine-tuning and SSTuning are shown in Table 8. We set the batch size based on hardware constraints and perform a simple hyperparameter search for the learning rate. We do not add hard negatives for the Amazon review dataset, since the reviews are not organized into articles. We also tried using negative options from the same product category as hard negatives but found no meaningful improvement. We set N_model to 20 and N_maxLabel to 10 after simple experiments.

Main Results
The main results are shown in Table 1. We have the following observations: 1) Our method SSTuning-ALBERT achieves new state-of-the-art results on 7 out of 10 datasets and significantly reduces the gap between fine-tuning and zero-shot methods compared to UniMC (from 10.6 to 7.2), showing the superiority of our proposed method. 2) With the same backbone, SSTuning-ALBERT outperforms UniMC by 3.4% on average. Note that, unlike UniMC, we do not utilize any labeled data for meta-tuning but rely purely on auto-constructed data for self-supervised tuning, which not only provides data at a much larger scale but also offers more varied options (first sentences). 3) Among methods based on RoBERTa-base, RoBERTa-large, and BART-large, our SSTuning-large and SSTuning-base are the two best-performing models on average. We also observe that SSTuning-large outperforms UniMC, despite the latter possessing a stronger backbone. 4) Our models do not perform very well on SST-5, a fine-grained sentiment analysis task. Generating more fine-grained options from the unlabeled corpus may improve performance on such tasks; we leave this as future work.

Ablation on Tuning Datasets
We utilize both the Amazon review dataset and English Wikipedia during the tuning stage. To evaluate their effectiveness, we conduct ablation studies with two model variants, each trained on only one dataset, keeping the number of samples at 5.12M in every case for a fair comparison. As shown in Table 2, both datasets contribute to the final performance, so discarding either leads to a performance drop. Interestingly, tuning with Amazon review data performs the same as tuning with Wikipedia on topic classification tasks. This is unexpected, since Wikipedia is intuitively more related to topic classification. We suspect the reason is that the backbone models have already been pre-trained on Wikipedia, so further tuning on it brings no significant advantage.

Alternative Tuning Objectives
We have proposed first sentence prediction (FSP) as the tuning objective to equip the model to associate label and text at the inference stage. For comparison, we consider some alternative objectives: 1) last sentence prediction (LSP), which treats the last sentence as the positive option for the rest of the paragraph; 2) next sentence selection (NSS), which treats the first sentence of a consecutive sentence pair as the text and the next as the positive option; 3) random sentence prediction (RSP), which randomly picks a sentence in a paragraph as the positive option and treats the rest as the text. The comparison between the four settings is shown in Table 3. We find that FSP performs best, especially for topic classification tasks. Among the alternatives, LSP leads to the best performance, which is expected since the last sentence of a paragraph usually also carries the central idea, serving a similar function to the first sentence. Unlike on topic classification, the four settings perform similarly on sentiment analysis tasks, possibly because every sentence in a paragraph tends to share the same sentiment.

Impact of Verbalizer Designs
During self-supervised tuning, the model sees a large number of first sentences as options, which may include options similar to those of unseen tasks, giving it better generalization capability. To test how robust the model is to verbalizer changes compared with UniMC, we design 10 sets of verbalizers for SST-2 and IMDb, covering various scenarios: 1) verbalizers with a single word; 2) verbalizers with different punctuation marks; 3) combinations of single verbalizers; 4) different formats for different classes. For a fair comparison, we use only one of our checkpoints and compare it with the released UniMC checkpoint. The results are shown in Table 4. We find that SSTuning-ALBERT performs better on average and is more stable. For the most challenging case, "Terrible!" and "I like the movie! It is wonderful!", SSTuning-ALBERT outperforms UniMC by 20.4 points on SST-2 and 17 points on IMDb.

Classification Mechanism
To investigate how our models make correct decisions, we conduct a case study on a movie review example. As shown in Figure 3, we use SSTuning-base (with the number of labels configured as 2) to classify whether the movie review "A wonderful movie!" is negative or positive. We set the verbalizers to "Bad." and "It's good." to see how option length affects the decision. The model's prediction is 1, which is correct. We find that the [CLS] token attends more to the second option, especially to the tokens around the index indicator "B", in the last layer. This matches our intuition: when humans perform classification, we compare the options and select the one that best matches the text. We show additional attention maps and analysis in Appendix B.2.3.

Importance of Index Indicators
To further understand how the index indicator guides the model's prediction, we employ different indicator designs during the tuning and inference stages. Specifically, we consider three formats of the index indicator: 1) alphabet characters (A, B, C...), the default format; 2) numerical indexes (0, 1, 2...); 3) the same index indicator for all options (0, 0, 0...). During inference, we also consider two special indicators: 4) the same alphabet character for all options (A, A, A...), and 5) rearranged alphabet characters (B, A, D, C...). The results are shown in Table 5. There is little difference between alphabet characters and numerical indexes (cases 1 and 2). Using the same character for all options (case 3) degrades performance, but only slightly, which suggests the model can rely on the position embedding of the index indicator to make correct predictions. Using inconsistent index indicators (cases 4 and 5) greatly degrades performance, which further verifies the importance of consistent index indicators for correct predictions.

Impact of Hard Negative Samples
Intuitively, adding more hard negatives makes the task more difficult, forcing the model to better understand the semantics of the sentences. We test the impact of hard negatives in two settings: 1) training with both Amazon reviews and Wikipedia, each with 2.56M samples; 2) training with only 2.56M Wikipedia samples. We do not train with only Amazon reviews, since they contain no hard negatives. The results with 0, 1, 3, 5, 7, and 9 hard negatives are shown in Figure 4.
In general, adding more hard negatives improves performance. When both datasets are used, the impact of hard negatives is small, because the Amazon review data alone already achieves good performance, as shown in Table 2. However, hard negatives have a significant impact when tuning with Wikipedia only. A possible reason is that without hard negatives the model may learn only keyword matching instead of semantics, since keywords may appear many times in the same Wikipedia article.

Additional Analysis
We report additional analysis in Appendix B.2. As shown in Figure 5, we can further improve performance by increasing the tuning sample size. We also compare SSTuning-base with different numbers of output labels N_model; as shown in Appendix B.2.2, we can increase N_model to run inference on datasets with more classes.

Related Work
Zero-shot text classification. Zero-shot learning has the advantage that no annotated data is required for downstream tasks. Prompting-based methods (Brown et al., 2020; Chowdhery et al., 2022; Schick and Schütze, 2021; Gao et al., 2021a) that reformulate the inputs as prompts can perform much worse in the zero-shot setting than in few-shot settings, as it may be hard for PLMs to interpret the templates. A better option may be the mining-based method (van de Kar et al., 2022), which mines labeled data from an unlabeled corpus for fine-tuning on each downstream task. Similarly, generation-based approaches (Meng et al., 2022; Ye et al., 2022) generate labeled data with a generative PLM.
More work on zero-shot text classification is based on transfer learning. Instruction-tuning-based models like FLAN (Wei et al., 2022) and T0 (Sanh et al., 2022) fine-tune PLMs on a collection of datasets described by instructions or prompts to improve performance on unseen tasks. PLMs can also be meta-tuned (Zhong et al., 2021) on text classification datasets and then applied zero-shot to other classification datasets. UniMC (Yang et al., 2022) converts several tasks to multiple-choice tasks and performs zero-shot inference on any task that can be formulated in the same format. Another line of work converts text classification problems into textual entailment problems: by fine-tuning on natural language inference datasets (Yin et al., 2019) or a dataset built from Wikipedia (Ding et al., 2022), the models can run inference directly on text classification datasets. Instead of using annotated datasets, we need only unlabeled data to generate a large number of labeled samples for tuning and validation by exploiting the inherent structure of text.
Self-supervised learning. Self-supervised learning has been widely applied during language model pre-training by leveraging the input data itself as supervision signals (Liu et al., 2021). Left-to-right language modeling (Radford and Narasimhan, 2018) and masked language modeling (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020) help learn good sentence representations. To capture the sentence-level relations of downstream tasks, Devlin et al. (2019) pre-train with a next sentence prediction task and Lan et al. (2020) use a sentence order prediction task to model inter-sentence coherence. Wang et al. (2020) combine the two objectives into a three-way classification task. Instead of modeling inter-sentence relations, Meng et al. (2021) employ sequence contrastive learning to align corrupted text sequences that originate from the same input and to encourage uniformity of the representation space. Our work uses a harder learning objective called first sentence prediction: given several options and a text, find the first sentence that precedes the text.

Conclusions
In this work, we propose a new learning paradigm called SSTuning for zero-shot text classification tasks. By forcing the model to predict the first sentence of a paragraph given the rest, the model learns to associate a text with its label for classification tasks. Experimental results show that our proposed method outperforms state-of-the-art baselines on 7 out of 10 tasks and that its performance is more stable under different verbalizer designs. Our work shows that self-supervised learning is a promising direction for zero-shot learning. In the future, we plan to apply SSTuning to other tasks by designing appropriate learning objectives.

Limitations
In this work, we proposed SSTuning for zero-shot text classification tasks. During inference, we may still need to design verbalizers, even though templates like "This text is about [label name]." are available. For simplicity and fair comparison, we only follow previous works for such designs, which may be sub-optimal. As shown in Table 4, the verbalizers "Terrible." and "Great." work better than "It's terrible." and "It's great." for the SST-2 and IMDb tasks reported in the main results. If a labeled validation set is available, the model may perform better by choosing verbalizers based on it.
Due to limited computational resources, we only tuned the model with 5.12 million samples, a small portion of the available data. We believe that tuning the model on a larger dataset would help improve performance. Even though the computational cost will also increase, it is worthwhile since no further training is needed at the inference phase. In addition, we did not conduct extensive hyperparameter searches beyond the learning rate, which may further improve performance.
In our experiments, we only tested the method with discriminative models like RoBERTa and ALBERT; its performance with generative models is unknown. Testing on such models is non-trivial, since generative models can perform both natural language understanding and natural language generation tasks. We leave this as future work.

A.1 Tuning Datasets
The original unlabeled datasets can be noisy, and some paragraphs are not suitable for generating tuning data. We filter out paragraphs with any of the following features: 1) the paragraph contains only 1 sentence; 2) the first sentence contains 3 or fewer characters; 3) the first sentence contains only non-alphabetic symbols; 4) the paragraph is a repeat. Some of the final generated samples from English Wikipedia and Amazon product reviews are shown in Table 9.
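The four filter criteria can be sketched as a small predicate; this is our own illustration with a naive regex sentence splitter standing in for whatever the released pipeline uses:

```python
import re

def keep_paragraph(paragraph, seen):
    """Return True if the paragraph passes the four A.1 filters.

    seen is a set used to reject repeated paragraphs (criterion 4).
    """
    # Naive sentence split on ., !, ? followed by whitespace.
    sents = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if len(sents) <= 1:                   # 1) single-sentence paragraph
        return False
    first = sents[0]
    if len(first) <= 3:                   # 2) first sentence too short
        return False
    if not re.search(r"[A-Za-z]", first): # 3) only non-alphabetic symbols
        return False
    if paragraph in seen:                 # 4) repeated paragraph
        return False
    seen.add(paragraph)
    return True

seen = set()
ok = keep_paragraph("Jim Berryman is a politician. He was mayor of Adrian.", seen)
bad = keep_paragraph("Only one sentence here.", seen)
```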

A.2 Evaluation Datasets
We summarize the statistics of the evaluation datasets in Table 6. We download all datasets from Huggingface (Lhoest et al., 2021), except 20newsgroup. For Yahoo Topics, we concatenate the question and answer as inputs. For DBPedia and Amazon, we concatenate the title and content. For 20newsgroup, we follow the recommendations to remove headers, footers, and quotes. However, if a text becomes empty after removing these components, we use the original text instead.
The verbalizers for each dataset are shown in Table 7. We try to unify the verbalizer design for similar tasks. For topic classification tasks, we use the template "This text is about []." after converting the class names to meaningful words. For binary sentiment classification, we use "It's terrible." for the negative class and "It's great." for the positive class. For SST-5, we refer to Gao et al. (2021a) to design the verbalizers. Some of the reformulated texts for the evaluation datasets are shown in Table 10.

B.2.1 Impact of Tuning Sample Size
To test how the tuning sample size affects performance, we train SSTuning-base with 320k, 640k, 1.28M, 2.56M, and 5.12M samples, half generated from Wikipedia and half from Amazon reviews. The results are shown in Figure 5. With more samples, performance generally increases, especially for topic classification tasks. This suggests performance can be further improved by increasing the tuning sample size. Even though tuning on larger datasets is more computationally expensive, it is worthwhile since no further training is required for downstream tasks.

B.2.2 Impact of the Number of Output Labels
In our main results, we set the number of output labels N_model to 20. However, a classification dataset may have more than 20 classes. To test the scalability of the label number, we tune another variant of SSTuning-base, using numerical indexes (0, 1, 2...) as the index indicator and setting N_model to 40. The comparison between the two versions is shown in Table 11. Increasing N_model from 20 to 40 only degrades performance by 1.4 points (75.9% to 74.5%), showing the good scalability of our approach. As an alternative for datasets with even more classes, we can split the labels and perform multi-stage inference.

B.2.3 Classification Mechanism
We plot more attention maps for the example discussed in Section 4.3.2 in Figure 6. We focus on a few important tokens, including the classification token <s>, the option indicators A and B, and the separator token </s>. In Layer 0, <s> attends to all the options and the text, A and B attend more to their own options, and </s> attends more to the text tokens. In higher layers, A and B attend even more to their own option tokens (Layer 1) but also have some interactions (Layer 4). In Layer 9, A and B again attend more to their own option tokens as well as the period mark, while </s> attends to both the text tokens and the option tokens for B (the positive option). In the end, <s> attends to B, which is the positive option. Based on these observations, we hypothesize that the model encodes the options and the text separately, compares the options with the text, and chooses the positive option in the end.
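Attention maps of this kind can be obtained by running the backbone with `output_attentions=True` (supported by Hugging Face transformers models) and slicing out the row for the token of interest. A minimal helper for the slicing step, with illustrative shapes (the function name is ours):

```python
import numpy as np

def token_attention_row(attentions, layer, token_idx, head=None):
    """attentions: sequence of per-layer arrays with shape
    (num_heads, seq_len, seq_len), i.e. one example's attention
    weights from a model run with output_attentions=True.
    Returns the attention from `token_idx` to all tokens,
    averaged over heads unless a specific head is given."""
    layer_att = np.asarray(attentions[layer])
    row = layer_att[:, token_idx, :]  # (num_heads, seq_len)
    if head is not None:
        return row[head]
    return row.mean(axis=0)
```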

Figure 3: Attention map of the [CLS] token (which is <s> here for the RoBERTa backbone) in the last layer for a movie review. This figure is generated with BertViz (Vig, 2019).

Figure 4: Zero-shot accuracy with different numbers of hard negatives.

Figure 5: Zero-shot accuracy with different training sample sizes. Mean accuracies over the 4 topic classification tasks, the 6 sentiment analysis tasks, and all tasks are reported.

Figure 6: Attention map for a movie review example. The original text is "A wonderful movie!" and the verbalizers are "Bad." and "It's Good.". The model is SSTuning-base with 2 classes.

Figure 2: Data construction for SSTuning (top) and zero-shot inference (bottom). The number of labels N_model is set as 5 here. The SSTuning example is from Wikipedia and the inference example is from the AG News dataset.

Table 1: Main results for 4 topic classification tasks and 6 sentiment analysis tasks. ❖: the original training sets (see dataset sizes in Table 6) are used to provide results under supervised settings, serving as an upper bound; otherwise zero-shot results are reported. *: results are taken from the corresponding papers. "Labeled" indicates whether the model uses labeled (✓) or unlabeled (✗) data. "Avg" is the arithmetic mean accuracy over all the datasets. For SSTuning models, we report the mean accuracy of 5 runs with different seeds. The best results for each dataset are in bold.

Table 2: Zero-shot results with different tuning datasets. The best result is in bold.

Table 3: Zero-shot results with different tuning objectives. The best results are in bold.

Table 4: Comparison of zero-shot results for 2 sentiment analysis tasks with different verbalizers. The best average results are in bold.

Table 5: Performance with the same and different index indicators during tuning and inference. "Std" indicates standard deviation.

Table 6: Dataset statistics for the evaluation datasets.

Table 7: Verbalizers for the evaluation datasets.
Yahoo Topics: "This text is about society & culture.", "This text is about science & mathematics.", "This text is about health.", "This text is about education & reference.", "This text is about computers & internet.", "This text is about sports.", "This text is about business & finance.", "This text is about entertainment & music.", "This text is about family & relationships.", "This text is about politics & government."
AG News: "This text is about politics.", "This text is about sports.", "This text is about business.", "This text is about technology."
DBPedia: "This text is about company.", "This text is about educational institution.", "This text is about artist.", "This text is about athlete.", "This text is about office holder.", "This text is about mean of transportation.", "This text is about building.", "This text is about natural place.", "This text is about village.", "This text is about animal.", "This text is about plant.", "This text is about album.", "This text is about film.", "This text is about written work."
20 Newsgroup: "This text is about atheism.", "This text is about computer graphics.", "This text is about microsoft windows.", "This text is about pc hardware.", "This text is about mac hardware.", "This text is about windows x.", "This text is about for sale.", "This text is about cars.", "This text is about motorcycles.", "This text is about baseball.", "This text is about hockey.", "This text is about cryptography.", "This text is about electronics.", "This text is about medicine.", "This text is about space.", "This text is about christianity.", "This text is about guns.", "This text is about middle east.", "This text is about politics.", "This text is about religion."