ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-shot Generalization

We propose ZeroPrompt, a multitask pretraining approach for zero-shot generalization, focusing on task scaling and zero-shot prompting. While previous models are trained on only a few dozen tasks, we scale to 1,000 tasks for the first time using real-world data. This leads to a crucial discovery that task scaling can be an efficient alternative to model scaling: with a sufficiently large number of training tasks, a small model can match the zero-shot performance of a much larger one.


Introduction
Recent progress such as GPT-3 (Brown et al., 2020) demonstrates the possibility of prompting large-scale models for zero-shot learning, but zero-shot generalization still falls short of fully-supervised finetuning on many tasks. To address this, other works proposed to include a set of supervised tasks in pretraining (Zhong et al., 2021; Wei et al., 2021; Sanh et al., 2021), with prompts often used to unify the tasks under one framework. Zhong et al. (2021) converted different datasets into a unified "yes/no" question answering format with label descriptions. FLAN (Wei et al., 2021) extended the scope by considering more task types and a larger model. T0 (Sanh et al., 2021) collected a large set of diverse prompts for each task to further enhance performance.
While these works exploit the effects of model scaling and prompt scaling (Wei et al., 2021; Sanh et al., 2021), it remains unclear how scaling the number of training tasks to hundreds or even thousands affects the performance of multitask pretraining. We hypothesize that task scaling plays an important role in training generalizable zero-shot systems and explore the limits of task scaling using 1,000 tasks. Interestingly, our empirical study reveals that task scaling can be an efficient alternative to model scaling, as shown in Figure 1. With an extremely large number of training tasks, model size has less impact on performance: a 0.4B model can achieve zero-shot performance comparable to that of a 12B model, improving training efficiency by 30 times in terms of FLOPs, as well as serving efficiency.
Our contributions can be summarized as follows.
• We scale the number of tasks to 1,000 in multitask pretraining for the first time. Our study reveals a crucial finding that, on the datasets we consider, task scaling is an efficient alternative to model scaling.
• Our experiments demonstrate that task scaling improves both the efficiency and the performance of zero-shot learning.
Related Work
It has been shown that augmenting unsupervised pretraining with supervised data can significantly improve task performance during finetuning (Chen et al., 2020; Gururangan et al., 2020). Some recent studies followed this idea and obtained improved few-shot or zero-shot generalization in the same manner. For instance, Mishra et al. (2021) built a dataset with task instructions, and CROSSFIT (Ye et al., 2021) introduced a repository of few-shot text-to-text tasks. FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) applied instruction tuning over many tasks with 137B and 11B parameters, respectively. ExT5 (Aribandi et al., 2021) applies multitask pretraining as well, but focuses on multitask co-training transfer rather than zero-shot generalization. Our ZeroPrompt utilizes labeled data in the pretraining phase, and we aim to study the task scaling law of zero-shot generalization by adopting 1,000 real-world tasks.

ZeroPrompt
We follow the same framework of multitask zero-shot learning as in (Wei et al., 2021; Sanh et al., 2021), where models are pretrained on a variety of tasks and then tested on held-out unseen tasks.

Datasets for Scaling to 1,000+ Tasks
We collected 80 public Chinese NLP tasks and further acquired over 1,000 real-world datasets from our production systems to investigate the task number scaling law. The number of tasks in each task type is listed in Table 1, where we define task types following previous work and intuitive knowledge. The task taxonomy of the production datasets is presented in Appendix A.1, consisting of 6 task types from 10 different domains. We split the public datasets and the production datasets into training tasks and testing tasks, as shown in Table 1. Different from FLAN (Wei et al., 2021) or T0 (Sanh et al., 2021), our test set contains a more diverse set of task clusters. Detailed train/test splits can be found in Table 8. To simulate real-world NLP production systems at scale, where the cost of data labeling is high, we sample 128 examples per class for each classification task and 256 examples for each generation task to build the training set.
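For illustration, the sketch below shows one way such a few-shot training pool could be subsampled, with 128 examples per class for classification tasks and 256 examples per generation task; the example format (a list of dicts with a "label" field) and the seeding are assumptions for the sketch, not the paper's actual pipeline.

```python
import random
from collections import defaultdict

def sample_task(examples, task_type, per_class=128, per_gen_task=256, seed=0):
    """Subsample one task's labeled data: a fixed number of examples per
    class for classification tasks, a fixed number overall for generation
    tasks. `examples` is assumed to be a list of dicts with a "label" field
    for classification tasks (hypothetical format)."""
    rng = random.Random(seed)
    if task_type == "classification":
        by_label = defaultdict(list)
        for ex in examples:
            by_label[ex["label"]].append(ex)
        sampled = []
        for _, items in by_label.items():
            rng.shuffle(items)
            sampled.extend(items[:per_class])
        return sampled
    # Generation tasks: a flat random sample of examples.
    pool = list(examples)
    rng.shuffle(pool)
    return pool[:per_gen_task]
```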

Prompt Design
Although large-scale pretrained models with prompting show promising results on zero-shot generalization to unseen tasks without any labeled data, prompt design is of vital importance to their performance. We apply both hard prompts, composed of label candidates and task descriptions, and soft prompts at the multitask pretraining stage; details of the prompt design can be found in Appendix A.4.
We use an encoder-decoder model and apply both unsupervised pretraining and multitask prompted supervised pretraining. Training details of ZeroPrompt can be found in Appendix A.3.

Power of Task Scaling
To study the law of task scaling, we trained ZeroPrompt on a mixture of public data and production data, and increased the number of production training tasks from 20 to 800. Zero-shot performance on unseen production test tasks is presented in Figure 1. Larger models have much better zero-shot performance when the number of training tasks is limited. However, the gains from larger models diminish as more training tasks are added. In general, when we scale the number of training tasks, small models can still achieve impressive zero-shot performance, substantially improving training efficiency by 30 times in FLOPs (0.4B vs 12B) as well as serving efficiency.

Comparison with Other Baselines
Results on the reserved testing tasks show ZeroPrompt outperforming previous pretrained models, CPM-2 and Pangu-α, by a large margin of 28 points. Notably, ZeroPrompt is comparable to or even better than a finetuned RoBERTa-large model on some academic and production datasets. In terms of the overall score, ZeroPrompt falls only 4.7 points short of the finetuned RoBERTa. This is remarkable considering that ZeroPrompt did not use any labeled data for tuning.

Task Scaling vs Sample Scaling
While task scaling by definition also increases the number of training samples, we decouple the effects of task scaling and sample scaling in Table 3. The total number of samples is the same for "80 tasks with 1280 shots" and "800 tasks with 128 shots", but the latter shows considerably better performance: 4.8 and 3.0 points of improvement for the 0.4B and 1.5B models, respectively.

Unsupervised Data vs Supervised Data
Zero-shot performance is attributed to both the supervised tasks and the LM task. As we increase the number of supervised tasks, they outweigh the LM task. Meanwhile, these supervised tasks have much less data to fit than the LM task, which makes smaller models viable choices. Table 4 shows that smaller models have similar losses on the supervised tasks (0.19, 0.17, and 0.19 for the 0.4B, 1.5B, and 12B models) but higher LM losses (1.9, 1.7, and 1.5, respectively) compared to larger models. This explains why task scaling can be an alternative to model scaling.

Effect of Task Distribution
To validate zero-shot performance on cross-task-type tasks, we select production tasks from two task types for testing and use the rest for training, as presented in Figure 2. Task scaling still leads to significant improvements in zero-shot performance on cross-task-type tasks.
On the other hand, Figure 3 shows the zero-shot performance on public datasets. For some tasks such as INTENT, the scaling of production tasks is helpful, but the result can be different for other tasks such as SENTI. The average performance over all public datasets does not increase monotonically with more training tasks. We attribute this to the task distribution of the production data differing from that of the public tasks; therefore, only some public tasks benefit from the scaling of production training tasks. We also study the effect of cross-task-type transfer on public tasks; the results can be found in Appendix A.6.

Conclusions
In this paper, we propose ZeroPrompt, a multitask prompted pretraining method that significantly improves the zero-shot generalization ability of language models. In our experiments, we collect over 1,000 real-world production tasks to study the task scaling law. We find that, on the datasets we consider, the zero-shot performance gap between small and large models becomes less significant as the number of training tasks grows. As a result, task scaling can substantially improve training and serving efficiency.

Limitations
Our results regarding the effect of task scaling on zero-shot performance still have a few limitations. Specifically, we control our study by only increasing the number of tasks collected from our production system, and these might represent only a subset of all NLP problems. In addition, for some testing tasks in the public datasets, zero-shot performance might not increase with the scaling of production training tasks. Therefore, the conclusion that task scaling can significantly boost zero-shot performance is limited to the case where training and test tasks share some similarity in distribution; it is not a general conclusion for arbitrary distributions. It also remains an open problem how to quantitatively characterize the distribution similarity between training and test tasks. We hope our results encourage future work on addressing these limitations to further explore the potential of zero-shot learning.

A.1 Datasets
For a fair evaluation of zero-shot generalization, we investigate and collect diverse public Chinese NLP datasets with different task types. A summary of all datasets used in the experiments is presented in Table 8, including the train/test task split and the metric for each task. In total, we have 13 task types of public datasets and 6 task types of production datasets.

A.1.1 Public Datasets
• Sentiment Analysis requires the model to determine whether the sentiment of a piece of text is positive or negative.
• News Classification asks the model to predict the topic of a news article.
• Intent Classification asks the model to predict the intent of a person given one of his/her words.
• Machine Reading Comprehension Question Answering requires the model to answer a question given a document where the answer can be derived.
• Natural Language Inference asks the model to tell whether the relation between two sentences is neutral, entailment, or contradiction.
• Sentence Similarity asks the model to predict whether two sentences are similar or not.
• Paraphrase asks the model to tell whether two sentences with much lexical overlap are semantically equivalent.
• Question Answer Matching asks the model to reason whether the given two sentences can form a valid question answering pair.
• Named Entity Recognition requires the model to find all entities in the given piece of text.
• Summarization requires the model to give a one- or two-sentence summary of the given long document.
• Keywords asks the model to extract keywords from the given sentence.
• Winograd Schema Challenge, where each sample consists of a sentence, a pronoun, and an entity in the sentence, requires the model to tell whether the pronoun refers to the entity.
• App Classification asks the model to tell which type of app the given introduction is about; there are hundreds of target app categories.

A.1.2 Production Datasets
The task taxonomy of the production datasets is presented in Figure 4, consisting of 6 task types from 10 different domains. As illustrated in Figure 4, the taxonomy covers six types of natural language understanding tasks. We provide detailed explanations here and several examples in Table 9.
• Objection tasks are datasets gathered from our production scenario, in which the model must decide whether the speaker is raising an argument in opposition to the previous contents.
• Profile tasks, also gathered from the realistic industrial scenario, are similar to intent classification: the model must tell whether the current sentence describes a certain intention.
• Mention tasks require the model to judge whether the given sentence mentions sales keywords.
• Violation tasks require the model to tell whether the speaker violates the sales guidelines.
• Acception tasks ask the model to tell whether the speaker follows the system's instructions and conveys the sales keywords to the customer.
• Execution tasks require the model to determine whether a salesperson follows the predefined sales guidance when talking to a customer.

A.1.3 Avoid Test Set Contamination
Although we split the datasets into training and testing, there is non-negligible overlap between some of the training datasets and the test set. To avoid test set contamination, we follow the filtering method of (Brown et al., 2020). Specifically, we directly remove from the training phase all examples that have a 30-gram overlap with any example in the test phase.
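A minimal sketch of this overlap filter follows, assuming character-level 30-grams; the exact matching granularity is not stated here (the paper only says it follows Brown et al. (2020)), so it is an assumption of the sketch.

```python
def char_ngrams(text, n=30):
    """Character-level n-grams; a character granularity is a plausible choice
    for Chinese text, though the paper does not specify the unit."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def decontaminate(train_examples, test_examples, n=30):
    """Drop any training example sharing an n-gram with any test example.
    Examples are assumed to be plain strings."""
    test_grams = set()
    for ex in test_examples:
        test_grams |= char_ngrams(ex, n)
    return [ex for ex in train_examples
            if char_ngrams(ex, n).isdisjoint(test_grams)]
```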

A.2 Metric
The metrics used for the diverse NLP tasks in this paper are described below. AUC is the abbreviation of Area Under the ROC Curve; typically, its value lies between 0.5 and 1.0.
ROUGE is the abbreviation of Recall-Oriented Understudy for Gisting Evaluation, an evaluation method oriented to the recall rate of n-grams. We use ROUGE-1 in this paper.
Micro-F1 is used to evaluate multi-label classification tasks. It is the harmonic mean of the averaged precision and recall over all labels.
F1 measures the overlap between the prediction and the ground truth and is typically used in span prediction tasks.
Pos-F1 is customized for NER tasks in a text-to-text form, as shown in Table 16. It is the averaged string F1 score over positive samples, i.e., samples whose true label is not "blank".
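For concreteness, below is a minimal token-overlap F1 of the kind typically used for span prediction; whether the overlap is computed at the character or word level is not stated here, so the character-level tokenization is an assumption.

```python
from collections import Counter

def overlap_f1(prediction, reference):
    """Token-overlap F1 between a predicted span and the ground truth."""
    pred_tokens = list(prediction)   # character-level tokens (assumption)
    ref_tokens = list(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```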

A.3 Training Details
In the unsupervised pretraining stage, our base T5 model is pretrained for 100k steps on a 300GB web-crawled Chinese corpus with a batch size of 4096 and a sequence length of 512. In the multitask prompted training stage, ZeroPrompt is trained with the Adam optimizer for 1,500 more steps with a batch size of 64 and a learning rate of 3.5e-5. We repeat all experiments, including multitask pretraining and the finetuning of RoBERTa and T5, five times with different random seeds to reduce variance.
At the unsupervised pretraining stage, we apply the span corruption objective, a variant of Masked Language Modeling (MLM), following T5 (Raffel et al., 2020). We also add MLM as an auxiliary loss in the multitask pretraining phase to overcome catastrophic forgetting.
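For readers unfamiliar with span corruption, the sketch below illustrates the general idea of replacing random spans with sentinel tokens and predicting the dropped spans on the decoder side; the corruption rate, span length, and sentinel format are illustrative assumptions rather than ZeroPrompt's exact settings.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """T5-style span corruption (rough approximation): random spans in the
    input are replaced by sentinel tokens, and the target contains the
    sentinels followed by the dropped spans, in order."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * corruption_rate))
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if budget > 0 and rng.random() < corruption_rate:
            span = min(mean_span_len, len(tokens) - i, budget)
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span])
            sentinel += 1
            budget -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

# Example usage on a whitespace-tokenized sentence:
inp, tgt = span_corrupt("我 今天 很 开心 因为 天气 很 好".split())
```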
The multitask pretraining loss is given in Equation 1, where L is the overall training loss, L_sup is the multitask supervised loss, L_MLM is the MLM loss, and λ is the loss weight: L = L_sup + λ · L_MLM. (1) According to Table 18, ZeroPrompt gains 1.3 points by adding the MLM loss, supporting our hypothesis that it helps avoid catastrophic forgetting.
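A minimal training-step sketch of this combined objective is shown below, assuming a HuggingFace-style seq2seq interface in which the model returns a cross-entropy loss when given labels; the value of λ and the way the two batches are mixed are not specified here and are assumptions.

```python
def multitask_step(model, sup_batch, mlm_batch, optimizer, lam=1.0):
    """One optimization step on the combined loss L = L_sup + lam * L_MLM.

    `model` is assumed to be a PyTorch encoder-decoder exposing a
    HuggingFace-style interface (model(**batch).loss); `lam` is an
    assumed value, not the paper's setting.
    """
    loss_sup = model(**sup_batch).loss   # prompted supervised tasks
    loss_mlm = model(**mlm_batch).loss   # auxiliary MLM / span-corruption loss
    loss = loss_sup + lam * loss_mlm     # Equation (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```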

A.4 Prompt Design
In this subsection, we describe the prompt design of our choice and some other tested variants.
In the simplest form of a prompt template T, the prompting method constructs T from a handcrafted prompt P and the input text sequence X: T = {P, X, [MASK]}, where [MASK] is the blank to be filled with an answer to complete the sentence. This is known as sentence in-filling.
As illustrated in Figure 5, our optimized prompt P is further decomposed into three parts, E, V, and D, where E is the task-specific soft prompt, V the verbalizer prompt, and D the task description prompt. As a result, our prompt template T can be expressed as T = {E, V, D, X, [MASK]}. To disentangle the task-specific and task-agnostic knowledge in multitask pretraining, we install a continuous prompt embedding as a prefix, referred to as the task-specific soft prompt in Figure 5. We first validate the importance of including the task-specific soft prompt and the verbalizer prompt in our choice of prompt design, and then compare different methods for building new task-specific prompt embeddings. Ablation results on the optimized prompt design are shown in Table 5 (-V: without the verbalizer prompt; -E: without the task-specific soft prompt; -E, V: without both). We can see that task-specific soft prompts and verbalizer prompts are useful when applied separately, and yield an even greater gain of 4 points when combined in ZeroPrompt.
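To make the template concrete, the sketch below assembles the textual part of T from a verbalizer, a task description, and the input; the soft prompt E is a learned embedding prefix and is therefore only indicated in a comment. The ordering, separators, and example prompt strings are assumptions, not prompts taken from the paper.

```python
def build_prompted_input(x, verbalizer, description, mask_token="[MASK]"):
    """Assemble the textual part of T = {E, V, D, X, [MASK]}.

    The task-specific soft prompt E is a continuous embedding prepended at
    the embedding level, so it does not appear in the text itself. The
    layout below (verbalizer, then description, then input) is an assumed
    ordering for illustration only.
    """
    verbalizer_str = "/".join(verbalizer)   # label candidates as the verbalizer
    return f"{verbalizer_str} {description} {x} {mask_token}"

# Hypothetical sentiment example (not an actual prompt from the paper):
text = build_prompted_input(
    x="这部电影太好看了",
    verbalizer=["好评", "差评"],
    description="这条评论的情感是什么？",
)
```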
For unseen tasks, we need to build task-specific soft prompts without any labeled samples. First, we tune a classifier on the mixture of training data to predict which training task a given text belongs to; for a new sample from a test task, the classifier then predicts the similarity of this sample to each training task. Formally, for pretrained task i, we denote its task-specific prompt embedding as E_i and the classifier's output probability for training task i as prob_i. In our experiments, we have tried three methods to build the test task prompt embedding E_new: weighted, top1, and random, sketched in code after the list below.
1) weighted. We set E_new as the weighted average of the pretrained task prompt embeddings according to the classifier probabilities, E_new = Σ_i prob_i · E_i. Note that the weighted average can be computed at the sample level as well as the task level.
2) top1. We assign the prompt embedding of the most similar training task to the new task, E_new = E_j with j = argmax_i prob_i.
3) random. We initialize the task prompt embedding E_new randomly.
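The following sketch implements the three strategies, assuming the task classifier produces a probability vector over training tasks for each test sample; the array shapes and the random-initialization scale are illustrative assumptions.

```python
import numpy as np

def new_task_prompt(train_prompts, probs, method="random", seed=0):
    """Build a soft prompt E_new for an unseen task.

    train_prompts: shape (num_train_tasks, prompt_len, dim), the soft
        prompts E_i learned during pretraining.
    probs: classifier probabilities over training tasks for one test sample
        (or averaged over a task), shape (num_train_tasks,).
    """
    rng = np.random.default_rng(seed)
    if method == "weighted":
        # E_new = sum_i prob_i * E_i
        return np.tensordot(probs, train_prompts, axes=1)
    if method == "top1":
        # E_new = E_j with j = argmax_i prob_i
        return train_prompts[int(np.argmax(probs))]
    # random: the best-performing variant in Table 6 (scale is an assumption)
    return rng.normal(scale=0.02, size=train_prompts.shape[1:])
```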
Ablation results are given in Table 6. Note that for weighted avg and top1 we report only per-sample results; results with all samples are given in Table 19. The winning approach is, surprisingly, random initialization: directly reusing task prompt embeddings seen in training, in any of the above ways, is slightly worse than random initialization, and the worst-performing method is none, as expected. To explain the results on random init and top1, we conjecture that different tasks, even with similar input data distributions, still have different mappings X → y. Therefore, it is often difficult to find the proper task-specific soft prompt seen in the training phase for a new task in the zero-shot learning setting.

A.5 Data Retrieval and Self-training
To fully exploit unsupervised data, we adopt a self-training framework similar to (Lee et al., 2013; Du et al., 2021). Given a supervised training set D_train and an unlabeled dataset D_un, we retrieve task-similar data from the unsupervised corpus according to sentence embedding similarity, and the self-training process may repeat several times. For the sentence embeddings used in retrieval, a pretrained BERT is finetuned on both the unsupervised and supervised corpora using SimCSE (Gao et al., 2021). We select news classification and production datasets to study the impact of data retrieval and self-training, considering that similar data are available in the unsupervised pretraining corpus. Results are summarized in Table 7. Self-training improves validation set performance by 0.96 and 0.10 for NEWS and the production tasks, respectively, and improves test zero-shot performance by 3.90 and 1.23. Self-training thus shows a larger improvement on unseen tasks than on training tasks. We attribute this to pseudo-labeled data increasing the diversity of the training data, resulting in better zero-shot generalization.
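A rough sketch of the retrieval-plus-self-training loop is given below, assuming precomputed SimCSE-style sentence embeddings and a generic model interface; the similarity criterion, confidence threshold, and number of rounds are assumptions, as the paper does not specify them here.

```python
import numpy as np

def retrieve(query_embs, corpus_embs, corpus, top_k=100):
    """Retrieve task-similar unlabeled sentences by cosine similarity of
    sentence embeddings (SimCSE-style embeddings assumed precomputed)."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = (c @ q.T).max(axis=1)          # best match against any query
    idx = np.argsort(-scores)[:top_k]
    return [corpus[i] for i in idx]

def self_train(model, d_train, d_retrieved, rounds=3, threshold=0.9):
    """Iteratively pseudo-label confident retrieved examples and retrain.
    `model.fit` and `model.predict` are hypothetical stand-ins for the
    actual training and inference interfaces."""
    pool = list(d_train)
    for _ in range(rounds):
        model.fit(pool)                               # hypothetical training API
        for x in d_retrieved:
            label, confidence = model.predict(x)      # hypothetical inference API
            if confidence >= threshold:
                pool.append((x, label))
    return model
```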

A.6 Effect of Cross Task Type Transfer
Following previous works (Wei et al., 2021; Sanh et al., 2021), we study whether held-out task types can benefit from multitask prompted pretraining. Specifically, we choose NLI and NEWS as testing task types and various other datasets as training task types. We add different training tasks in sequence, as shown in Figure 6. For NEWS, zero-shot performance increases from 17 to 49 after adding INTENT, while adding sentence-pair tasks (STS, QAM, PARA) leads to a performance drop of 7 points. Other training task types such as SENTI, SUMM, NER, and MRC have only marginal impact on performance. As a sanity check, we add NEWS in the training phase at the end, and performance increases from 50 to 81 as expected. The zero-shot performance on NLI rises from 32 to 37 after adding more sentence-pair tasks, and then to 39 with INTENT, but other training tasks do not further boost performance. In conclusion, we find that zero-shot performance on held-out task types benefits only from some task types, and more labeled data in other task clusters does not always guarantee continuous improvement.
In comparison, our main results on task scaling indicate that performance improves as the number of training tasks increases under a fixed task distribution. Note that the task distribution is orthogonal to scaling the number of tasks. How to further improve zero-shot generalization by optimizing the task distribution is left to future work.

A.7 Hard Prompt Examples
In this section, we provide details of the hard prompts used in this paper. For tasks within each Chinese task cluster, we use similar handcrafted prompts, as shown in Tables 9-17. We use both prefix prompts and cloze prompts. For text classification clusters such as SENTI and NEWS, [X] denotes the sample text. For sentence-pair task clusters such as NLI and STS, [X1] denotes the first sentence and [X2] the second sentence. For the MRC cluster, [X1] denotes the corpus and [X2] the question. For the SUM cluster, [X] denotes the corpus, and a similar prompt form is applied for KEYS. For NER, [X1] is the sample text and [X2] denotes the target entity type. For WSC, [X1] is the sample text and [X2] is the pronoun. For all prompts above, '_' denotes the target position where the answer is filled in.
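As a small illustration of how these placeholders are instantiated, the helper below substitutes [X], [X1], and [X2] into a template and leaves '_' as the answer slot; the example template itself is invented for illustration, not copied from Tables 9-17.

```python
def fill_prompt(template, **fields):
    """Substitute [X], [X1], [X2] placeholders; '_' stays as the answer slot."""
    out = template
    for name, value in fields.items():
        out = out.replace(f"[{name}]", value)
    return out

# Hypothetical cloze-style sentence-pair prompt (not from the paper's tables):
prompt = fill_prompt("[X1]？_，[X2]", X1="他今天很开心", X2="他心情不好")
```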
Example (summarization with an augmentation prompt):
Input [X]: China News, February 25: Foreign media report that Gabi, an Italian girl who loves animals, has received gifts from crows in return for feeding them snacks and family leftovers. Gabi reportedly feeds the crows peanuts, dog food, and some leftovers regularly, and says she does not ask for a reward but simply loves nature. Lately, the crows have been bringing her shiny things, usually buttons, stationery, and hardware; in a few cases she has received earrings. They even helped her mother find the cover of a camera she had lost. According to bird experts, crows do have the ability to make friends with humans, so it is not the little girl's imagination that they return the favor.
Augmentation Prompt: [X] 这个领域的领域词典中收录的单词，应该是_。 ([X] The words in the domain dictionary of this field should be _.)
Target: 意大利女童用零食喂乌鸦，乌鸦送"礼物"报恩 (Italian girl feeds snacks to crows, who return the kindness with "gifts")

Figure 1: Task scaling vs. model scaling. The horizontal axis is the number of training tasks, and the vertical axis is the zero-shot performance on unseen tasks. RoBERTa-Large was finetuned in a fully-supervised manner, while Pangu-α, CPM-2, and our ZeroPrompt were zero-shot prompted.

Figure 3: Zero-shot performance of the 1.5B model on public datasets with different numbers of production training tasks.

Algorithm 1: Self-training
Require: M, D_un, D_train, T
Ensure: M*
1: Initialize D*_train ← D_train
2: for each t ∈ [0, T] do
3:   M* ← train M on D*_train
4:   for each task i do
5:     D^i_pseudo ← select samples in D^i_un on which M* is confident and assign pseudo labels
6:     D*_train ← D*_train ∪ D^i_pseudo
7:   end for
8: end for
9: return M*

Figure 6: Zero-shot performance on NLI and NEWS with different held-out task types.

Table 1 :
The number of tasks for each task type. Numbers in brackets stand for the number of tasks for training and testing, respectively; e.g., SENTI has 4 tasks for training and 13 for testing.

Table 3 :
Task scaling vs sample scaling.

Table 4 :
Language modeling (LM) and supervised (Sup) validation loss of models with different sizes.

Table 7 :
Experimental results on data retrieval + self-training

Table 8 :
Summary of collected datasets

Table 17 :
Illustrations of prompts in Summarization.

Table 18 :
Detailed ablation results on prompt design and MLM loss

Table 19 :
Detailed ablation results on building new task-specific soft prompts