Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective

We propose a new paradigm for zero-shot learners that is format agnostic, i.e., compatible with any format and applicable to a range of language tasks, such as text classification, commonsense reasoning, coreference resolution, and sentiment analysis. Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training. Our approach converts zero-shot learning into multiple-choice tasks, avoiding the problems of commonly used large-scale generative models such as FLAN. It not only adds generalization ability to models but also significantly reduces the number of parameters. Our method shares the merits of efficient training and deployment. Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification. Our model achieves this success with only 235M parameters, substantially smaller than state-of-the-art models with billions of parameters. The code and pre-trained models are available at https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/unimc .


Introduction
Remarkable advances in large-scale language models have brought substantial improvements to a wide variety of tasks such as text classification, natural language inference, and commonsense reasoning (Brown et al., 2020; Chowdhery et al., 2022). This progress brings opportunities to Zero-Shot Learning (ZSL) (Sanh et al., 2021; Chowdhery et al., 2022), which aims to predict labels on datasets from novel domains. Most solutions can be framed in the prompt tuning framework, which activates specific parameters in a PLM (Xu et al., 2022; Liu et al., 2021) to adapt to zero-shot tasks. A powerful variant of prompt tuning is instruction tuning (Wei et al., 2021), which shares knowledge from different domains. We summarize the mainstream large-scale frameworks in Figure 1.
Despite their success, these frameworks suffer from inherent problems that limit their potential as zero-shot learners. Firstly, prompt-related models have an extremely large number of parameters, e.g., GPT-3 has 175B, FLAN has 137B, and PaLM (Chowdhery et al., 2022) has 540B. One immediate problem is that these models are often hard to train, making deployment and consumption difficult. Secondly, manual processing is required when addressing zero-shot problems. For instance, T0 builds 2,073 prompts to handle more than 170 tasks (Sanh et al., 2021). Lastly, existing models employ a single-direction paradigm, either auto-regressive or sequence-to-sequence, resulting in inadequate use of information from both directions. As an example, a PMLM-based zero-shot learner is shown in Figure 1 (c). Note that recent work (Liu et al., 2019a) states that PMLMs are more suitable than PLMs for Natural Language Understanding (NLU) tasks. However, a PMLM has to be fine-tuned on task-specific samples to initialize its classifier, rather than initializing the classifier randomly. Therefore, the ability of PMLMs is limited in zero-shot scenarios.
To address the aforementioned problems, we introduce a lightweight framework, called the Unified Multiple Choice model (UniMC), and propose a novel MC tuning. The proposed MC tuning has the following advantages: i) parameter updating only happens in the MC training phase, and ii) it facilitates deployment. To reduce manual processing, we formulate only one candidate option prompt format and one question prompt format. Note that we also consider the case without any question prompt. Under this setting, we can treat labels as options rather than building verbalizer maps, and provide their text information to the model directly; we can therefore learn information from the labels themselves. To this end, we convert the problematic classifiers into options. One immediate question is how to choose an option efficiently and unambiguously. Therefore, as shown in Section 3.2, we develop an option-mask token [O-MASK] to predict "yes" or "no" before each option. A two-step process is introduced to output the desired options. First, similar to Masked Language Modeling (MLM) (Devlin et al., 2019), we conduct Option MLM (O-MLM) to recover the "yes" or "no" for each option. Next, we propose an Option Prediction (OP) method to compute the proper option.
With extensive experiments on multiple challenging benchmarks, we demonstrate that our approach outperforms state-of-the-art baselines while reducing the model size by two orders of magnitude, as shown in Figure 2. This success suggests the potential of leveraging UniMC on large datasets. Our contributions are as follows.
• We propose a new zero-shot paradigm that converts zero-shot learning into multiple-choice tasks.
• We develop an effective and efficient method to implement an MC-based zero-shot learner.
• Our proposed method achieves up to a 48% improvement on Dbpedia over SOTA baselines that are a few hundred times larger than our model.

Related Work
Unified NLP Task Formats

NLP tasks often have diverse formats due to the fast emergence of datasets, such as machine reading comprehension and text classification tasks. Recent research shows the necessity of unifying formats to bridge the gap across various tasks (Sanh et al., 2021; Wei et al., 2021; Sun et al., 2021). By developing a natural-language prompted form, T0 builds an application to map original NLP datasets into target templates with custom prompts (Sanh et al., 2021). FLAN groups multiple datasets into 12 task clusters, and then designs 10 unique instruction templates to unify formats (Wei et al., 2021). Despite being effective, this line of work focuses on generative styles and thus cannot be adapted to the many label-based tasks that require selection. This motivates us to unify label-based tasks, for which we develop a unified Multiple Choice (MC) format.

Label Information
Label semantics are an important information source, for example in few-shot tasks (Hou et al., 2020; Mueller et al., 2022; Luo et al., 2021). The L-TapNet framework (Hou et al., 2020) integrates label information with manually designed prompts in the inputs to solve few-shot slot tagging tasks. In addition, LSAP (Mueller et al., 2022) obtains strong few-shot performance by introducing label semantics into the pre-training and fine-tuning phases of PLMs. Together, these successful uses of labels in low-resource settings inspire us to bring label semantics into our unified MC inputs to handle the zero-shot scenario.

Zero-Shot Learning
Large-scale Pre-trained Language Models (PLMs) with billions of parameters, such as GPT-3 (Brown et al., 2020), have shown impressive performance across various few-shot tasks. However, they have limited competence on zero-shot tasks, which have broader applications in practice.
Recent efforts try to mitigate this issue from different perspectives. FLAN (Wei et al., 2021) designs specific instruction templates for each task and utilizes over 60 labeled datasets to "fine-tune" a 137B language model. T0 (Sanh et al., 2021) unifies all tasks into a source-target format by collecting a large variety of prompt templates, specifically 2,073 manually constructed prompts, and trains the model with multi-task learning. Along this line, ZeroPrompt (Xu et al., 2022) applies over 1,000 supervised datasets and proposes a genetic prompt search method to find prompts for new tasks. However, these methods require significant manual effort, such as prompt engineering and template designing. Moreover, the pre-training and tuning phases of large-scale PLMs take enormous amounts of computational resources; therefore, deploying them on new tasks can be very difficult. In comparison, our proposed UniMC is lightweight, i.e., it has 235M parameters and requires only a few manual input text transformations, making it suitable for more general scenarios.

Approaches
In this section, we outline the proposed framework, i.e., UniMC, and provide the training and inference approaches in detail.
3.1 The UniMC framework

Unified Input
A unified input format facilitates the generalization of models, promoting the sharing of knowledge across different tasks. To achieve this, we frame all tasks' objectives together as a multiple-choice (MC) problem, as shown in Figure 3. An MC problem often consists of three components: options, a question, and a passage. We now discuss how to obtain these components. We can often get the passage component effortlessly because it already exists in the original data. As to the question part, we can either use the raw question directly or provide a corresponding question when it is missing. The transformation of options depends on whether or not we can get a straightforward expression of the classes. On the one hand, we can convert all classification tasks into options directly, since each label carries specific information for the choice. On the other hand, we may have to construct an option prompt to generate particular choices. Details of this transformation can be found in Appendix A. In effect, this allows us to abandon label indices as used in classification tasks, which carry much less information than our options.

(Figure 3 shows unified input examples: an NLI passage about Mount Olympus with two inference options; a Hellaswag commonsense passage about a hand car wash with four candidate continuations; and a Dbpedia topic question with 13 category options, ranging from Company to Written Work.)
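To make this concrete, the following is a minimal sketch of the transformation, assuming a Python helper of our own design (the function name and prompt wording are illustrative, not part of the released code; the actual templates are listed in Appendix A):

```python
# Minimal sketch of the unified MC transformation (illustrative only;
# the real templates and special-token layout are described in the paper).

def to_unified_mc(passage, options, question=None):
    """Convert a label-based sample into the unified MC input text.

    Each candidate option is prefixed with an [O-MASK] token, which the
    model later recovers to "yes" or "no" via O-MLM. Label names are used
    as options directly, instead of opaque label indices.
    """
    option_text = " ".join("[O-MASK] {}.".format(o) for o in options)
    question = question or ""  # the question prompt is optional
    return "[C] {} [S] {} [S] {} [S]".format(option_text, question, passage)

# Example in the style of SST-2 sentiment classification:
print(to_unified_mc(
    passage="It's a cookie-cutter movie, a cut-and-paste job",
    options=["it's great", "it's terrible"],
    question="What is the sentiment of the review?",
))
```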

Network
In our framework, we employ BERT-like PMLMs as the backbone, such as ALBERT (Lan et al., 2020) and RoBERTa (Liu et al., 2019b), to integrate the bidirectionally modeled input x_inp. In addition, a discussion of backbone models is given in Appendix B. Instead of using the original embedding methods directly, we develop a new treatment of the segment id, position id, and attention mask matrix to fit multiple-choice tasks.

Tokenization: In this framework, the key to addressing MC tasks is to set up proper options. We thus introduce option-mask tokens ([O-MASK]) that stand in for "yes" or "no" in the input text for better representation ability. Here, [O-MASK] inherits the ability of [MASK], so we can still use token predictions to determine which option is correct. Consider, as an example, an input set (o, q, x) that includes: i) one passage x = x_1 . . . x_|x|, ii) N_Q questions q = q_1 . . . q_|q|, and iii) N_O candidate options o = o_1 . . . o_|o|. The input tokens x_inp are formulated as follows:

x_inp = [C] [O-MASK] o_1 . . . [O-MASK] o_{N_O} [S] q [S] x [S]. (1)

(Figure 4 illustrates this unified MC format on an example from SST-2 (Socher et al., 2013): Passage: "It's a cookie-cutter movie, a cut-and-paste job"; Question: "What is the sentiment of the review?"; Options: [1] it's great. [2] it's terrible. In the unified text, the first [O-MASK] should be recovered to "no" and the second to "yes".)

Id embeddings and attention mask matrix: Note that a unified input text contains multiple options, which can lead to undesired mutual influence between options and a misunderstanding of the answer. We address this issue from three perspectives: segment id, position id, and attention mask matrix. Firstly, we assign segment ids to distinguish option information from context (question and passage) information, as shown in Fig. 4 (a). Secondly, we update the position ids to mark the intra-option information; since PMLMs obtain position information only through position embeddings rather than from the tokens themselves, this controls how the model treats each token's position. Lastly, we control the information flow between options with the mask M_mask in self-attention, as shown in Fig. 5: black squares mask part of the input attention matrix, ensuring the disentanglement between different options, so that the tokens of different options cannot attend to each other. We place −inf on the masked slots, in the same way BERT masks tokens. Finally, we obtain the encoded hidden vectors T = [T_1 . . . T_n] from multiple Transformer-based layers:

T = encoder(x_inp, pos, seg, M_mask). (2)
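As an illustration of the id embeddings and attention mask matrix, here is a small PyTorch sketch of how such inputs might be constructed (our own reconstruction under stated assumptions, not the released implementation; the exact special-token and position-id handling may differ):

```python
import torch

def build_mc_ids_and_mask(option_lens, context_len):
    """Sketch: segment ids, per-option position ids, and the attention mask
    M_mask that prevents tokens of different options from attending to each
    other. option_lens gives the token length of each option span (each
    starting with its [O-MASK]); context_len covers question + passage.
    """
    total = sum(option_lens) + context_len
    seg = torch.zeros(total, dtype=torch.long)   # 0 = option, 1 = context
    pos = torch.zeros(total, dtype=torch.long)
    mask = torch.zeros(total, total)             # additive mask: 0 keeps, -inf blocks

    spans, start = [], 0
    for n in option_lens:
        pos[start:start + n] = torch.arange(n)   # positions restart per option
        spans.append((start, start + n))
        start += n
    seg[start:] = 1                              # question + passage segment
    pos[start:] = torch.arange(context_len)

    for i, (a, b) in enumerate(spans):           # block cross-option attention
        for j, (c, d) in enumerate(spans):
            if i != j:
                mask[a:b, c:d] = float("-inf")
    return seg, pos, mask

seg, pos, m_mask = build_mc_ids_and_mask(option_lens=[4, 5], context_len=16)
```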

MC tuning
Recall that the backbones are pre-trained models and thus excel at capturing commonsense knowledge. Intuitively, we can employ them as base modules to take advantage of their rich knowledge. More specifically, we use the outputs of pre-trained models as the initial states for the subsequent MC tasks, leading to a two-stage tuning paradigm. In the MC training phase, we train the models on MC tasks and obtain a strong initialization for selecting the correct option. In the zero-shot phase, we apply the unified MC model to unseen zero-shot tasks.

MC training phase
We now introduce the proposed Option Masked Language Modeling (O-MLM) and Option Prediction (OP) methods in detail. Masked Language Modeling (MLM) is a pre-training task in BERT (Devlin et al., 2019) for self-supervised learning:

L_MLM = − Σ_{t̂ ∈ m(T̂)} log p(t̂ | T̂\m(T̂)),

where T̂ is the randomly perturbed token sequence from T; m(T̂) and T̂\m(T̂) are the masked tokens and the rest of the tokens, respectively. In practice, we randomly replace tokens in the passage sequence x with the special token [MASK], as opposed to masking over the whole sequence as in standard BERT. The main difference between O-MLM and MLM is the way of masking: we always mask the [O-MASK] tokens and predict "yes" or "no" for them, as shown in Figure 4 (b). Therefore, the loss L_O-MLM shares the same form as L_MLM.
Once the prediction probabilities of "yes" or "no" are obtained, we introduce OP to teach the model to solve MC tasks, as shown in Figure 4 (b). To learn the mutually exclusive characteristics between options, OP takes the "yes" logit of each option sequence to generate a label distribution. OP then computes a cross-entropy loss against the ground-truth label distribution Y:

L_OP = − Σ_{i=1}^{N_O} Y_i log softmax(y^yes)_i,

where y^yes_i is the "yes" logit at the i-th [O-MASK] position. Recent studies show that mixing tasks in a batch improves the generalization ability of neural networks (Aghajanyan et al., 2021). When facing mixed tasks, we mask all output logits except those at [O-MASK] positions during the Softmax operation to compute the OP loss in a mini-batch, as shown in Figure 6. This logit-masking approach allows our UniMC to handle MC tasks with different numbers of options in a single batch.
In summary, the overall training objective in MC training is given by:

L = L_O-MLM + L_OP.
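For concreteness, a sketch of the combined objective in PyTorch might look as follows (the vocabulary ids and shapes are illustrative assumptions, and the mini-batch logit masking of Figure 6 is reduced to the single-example case here):

```python
import torch
import torch.nn.functional as F

def mc_tuning_loss(token_logits, omask_pos, gold_option, yes_id, no_id):
    """Sketch of L = L_O-MLM + L_OP for one example.

    token_logits: (seq_len, vocab_size) logits over the vocabulary.
    omask_pos:    positions of the [O-MASK] tokens, one per option.
    gold_option:  index of the correct option.
    """
    omask_logits = token_logits[omask_pos]           # (num_options, vocab_size)

    # O-MLM: recover each [O-MASK] to "yes" (correct) or "no" (others),
    # trained exactly like ordinary MLM targets.
    targets = torch.full((len(omask_pos),), no_id)
    targets[gold_option] = yes_id
    loss_omlm = F.cross_entropy(omask_logits, targets)

    # OP: the "yes" logits of all options compete in one softmax,
    # capturing the mutual exclusivity between options.
    yes_logits = omask_logits[:, yes_id]             # (num_options,)
    loss_op = F.cross_entropy(yes_logits.unsqueeze(0),
                              torch.tensor([gold_option]))
    return loss_omlm + loss_op

# Toy call with random logits; the token ids are made up for illustration.
loss = mc_tuning_loss(torch.randn(32, 30000), omask_pos=[1, 6],
                      gold_option=1, yes_id=2748, no_id=2053)
```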

Zero-shot phase
After obtaining a unified MC model, we simply apply O-MLM and OP to predict the answer on unseen zero-shot datasets. Since the ground-truth labels are missing, no loss can be computed; instead, the model still recovers each [O-MASK] to "yes" or "no" with O-MLM, and OP selects the most confident option.
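In code, the zero-shot prediction step reduces to an argmax over the "yes" logits at the [O-MASK] positions; a sketch under the same illustrative assumptions as above (a HuggingFace-style model whose output exposes `.logits`) is:

```python
import torch

@torch.no_grad()
def predict_option(model, encoded_inputs, omask_pos, yes_id):
    """Zero-shot inference sketch: O-MLM recovers the [O-MASK] tokens,
    and OP picks the option most confidently recovered to "yes"."""
    token_logits = model(**encoded_inputs).logits[0]   # (seq_len, vocab_size)
    yes_logits = token_logits[omask_pos, yes_id]       # one score per option
    return int(torch.argmax(yes_logits))               # predicted option index
```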

Discussion
Interestingly, the MC training stage and the zero-shot stage are consistent in their processing objectives. Recall that previous models tend to have divergent learning objectives across stages, which may cause potential oscillation. Our proposed method is more task-driven and thus has a better chance of delivering high learning quality on task-specific outputs.

Experimental Setup
We follow the preparation in T0 (Sanh et al., 2021) to cluster the label-based NLP datasets into 6 groups. In particular, we collect publicly available NLP datasets on HuggingFace, and assign each label-based dataset to one of the task groups, as shown in Fig. 7. For each group, we design a corresponding transformation rule to convert it into a unified MC format; detailed examples are presented in Sec. 3.1.1, and more details of dataset descriptions and unified MC formats are given in Appendix A. Next, we split the datasets into two parts for the two phases in our framework, i.e., one part for MC training and the other for the zero-shot scenarios. It is worth mentioning that using the MC tasks only in the MC training phase avoids intensive computation. Following the general setting (Du et al., 2021; Wei et al., 2021), we use accuracy on all datasets. For the overall average accuracy, we take the average accuracy within each task and then calculate the arithmetic mean across tasks.
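As a small worked example of this metric (the numbers below are made up purely for illustration):

```python
# Overall average accuracy: first average within each task group,
# then take the arithmetic mean across task groups.
per_task_accs = {
    "nli": [0.80, 0.75, 0.70],          # dataset accuracies (illustrative)
    "classification": [0.90, 0.85],
}
task_means = {t: sum(a) / len(a) for t, a in per_task_accs.items()}
overall = sum(task_means.values()) / len(task_means)   # (0.75 + 0.875) / 2 = 0.8125
```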

Baselines
In the experiments, we compare our method with state-of-the-art baselines, including GPT-2 (Radford et al., 2019), GPT-3 (Zhao et al., 2021), T0 (Sanh et al., 2021), FLAN (Wei et al., 2021), PaLM (Chowdhery et al., 2022), GLaM (Du et al., 2021), and UnifiedQA (Khashabi et al., 2020). We report the accuracy of each method to measure its performance. If a baseline was run multiple times, we present only its average outcome. Besides, we include random guessing as a naive baseline for comparison.

Implementation Details
In our model, we use ALBERT-xxlarge-V2 (Lan et al., 2020) as the backbone model owing to its lightweight parameterization. For fair comparison, we set the maximum token length to 512 in all experiments, as in (Lan et al., 2020). In training, we run only one epoch, following the setting in FLAN (Wei et al., 2021). We cap the number of samples for each task at 20K, aiming to prevent the model from being dominated by specific tasks. Besides, we repeat each experiment 5 times with different seeds. We run all experiments on 8 NVIDIA A100 GPUs.

Natural Language Inference
We now present our main results on the Natural Language Inference (NLI) task in Table 1. UniMC achieves the best performance on all datasets, demonstrating its capability in NLI. In particular, UniMC achieves these competitive results with as few as 235M parameters, as opposed to the hundreds of billions of parameters in other baselines. These results confirm the effectiveness of unifying formats in a multiple-choice style. Besides, the bi-directional structure of UniMC strengthens its ability to capture information, as opposed to the previous one-directional structures.

Text classification
The text classification task aims to select a label/class for given texts, which is similar in nature to the objective of an MC task. Therefore, we conduct a zero-shot text classification experiment to verify our model's capability. As shown in Table 2, UniMC outperforms previous SOTA models by a large margin. In particular, Dbpedia includes 13 categories, adding a significant challenge to the classification task. Fortunately, UniMC has a built-in advantage in dealing with multiple classes due to the similarity between choices and classes, leading to an improvement of up to 48.9%.

Table 3: A summary on natural language inference, commonsense reasoning, coreference resolution and sentiment analysis tasks.

A comprehensive comparison to FLAN
FLAN is a well-known model for zero-shot option- or label-related tasks; one of its particular merits is its zero-shot generalization ability. To better demonstrate the ability of UniMC, we report a comprehensive comparison between UniMC and FLAN, as shown in Table 3; more comparisons are described in Appendix B.3.
On the NLI task, UniMC generally achieves better performance than FLAN, which is consistent with the results in Table 1. We also select tasks such as commonsense reasoning, coreference resolution, and sentiment analysis to further explore the generalization ability of our model. UniMC gains an obvious advantage on COPA, Hellaswag, Winogrande, WSC, and DPR when evaluating the commonsense and coreference tasks. Beyond these two tasks, we find that the construction of datasets plays a critical role in performance. In general, these datasets can be grouped into two categories: understanding-style and generation-style. UniMC tends to show better performance on datasets closer to the understanding style. In sentiment tasks, the number of classes is limited, making the dataset construction style less important than in the commonsense and coreference tasks; therefore, both UniMC and FLAN achieve relatively good performance.

Ablation Studies
In this section, we verify the necessity of the key components of our UniMC, including the MC training, the prompt effect, and the flow controlling. We also show the influence of the model size.

For the question prompts, we conduct experiments on four challenging tasks, comparing the performance with and without prompts, as shown in Table 5. Although the performance across tasks moves in different directions, we hypothesize that this divergence is caused by the way of data construction. These datasets are mainly designed for two purposes: the language modeling task and the relationship choice task (Brown et al., 2020). The desire for question prompts increases when the data is closer to the language modeling task, and vice versa. Furthermore, we classify these datasets into two categories, spoken-based and written-based, according to the definition in (Alsaawi, 2019). MNLI-m/mm, CB, SNLI, SST-2 and IMDB belong to the spoken-based corpus, while the remaining datasets belong to the written-based corpus. Considering that PMLMs are usually pre-trained on written-based corpora, e.g., the pre-training datasets of BERT are Wikipedia and BookCorpus (Devlin et al., 2019), our model may have no need of questions for written-based data. This, again, confirms that data construction affects the requirements of question prompts.
For the option prompts, we present the experimental results in Table 6. We emphasize that option prompts are necessary for our UniMC; therefore, we cannot remove this component as in the above experiment. Instead, we design different option prompts to demonstrate their effects. We observe that different prompts show limited performance variation, indicating the robustness of our UniMC to option prompts. Since FLAN and PaLM are not open-sourced, we choose one of the most powerful available models, UnifiedQA-T5, as the baseline to ensure fairness of comparison. In the experiment, we find that UnifiedQA-T5 is sensitive to option prompts, with a standard deviation (Std) of up to 8.3.

How does the flow controlling affect the performance?
We design the prompt to frame the input sequences so that all datasets fit into UniMC directly. However, some recent methods need extra processing, such as putting one option together with the context (question and passage) into a sequence and aggregating multiple such sequences to obtain an answer (Sun et al., 2021). To fix this gap, we design two strategies to control the flow of information, as in Section 3.1.2. We summarize the performance of these two strategies in Table 7. We observe that AMM brings the greatest improvement to the results, much better than UIE. On the one hand, UniMC can learn the positional relationships within options; on the other hand, UniMC can distinguish between options and context. However, UIE is unable to prevent the mutual influence between options. Thanks to the self-attention mechanism, AMM makes the options unreachable to each other, eliminating the interference between options.

How does the model size affect the performance?
A common intuition in this domain is that a larger model size results in better performance (Wei et al., 2021; Chowdhery et al., 2022), particularly for large-scale PLMs. Naturally, we expect our backbone PMLM to follow this rule as well. To validate this, we run an experiment varying the model size, as shown in Figure 8. All 4 tasks show the same trend, confirming this intuition.

Conclusions
In this paper, we introduce a new zero-shot paradigm called MC tuning, which adds flexibility and generalization ability to zero-shot learners.
We propose O-MLM and OP for both the MC training and zero-shot phases, aiming to capture information from both directions. Our UniMC achieves better performance than SOTA models that are a few hundred times larger than our model. Our experiments demonstrate the effectiveness and generalization ability of UniMC on zero-shot tasks. In future work, we will extend UniMC to few-shot scenarios.

Limitations
In this paper, our main contribution is a simple and effective framework for zero-shot tasks that maintains a light weight. We aim to introduce minimal additional artificial information and reduce manual processing to the minimum. We explored how to employ question prompts in Sec. 4.3.2; however, it is non-trivial to decide whether a prompt is required for complex datasets. In addition, we only compare with limited baselines when studying the influence of the backbone in UniMC: we implement only a few comparative experiments between ALBERT and RoBERTa (Liu et al., 2019b) due to limited computational resources, as shown in Appendix B.2. In the future, we will dig deeper into the principles regarding inputs, backbones, etc.

Ethical Considerations
Natural language processing is an important technology in our society, so it is necessary to discuss its ethical influence (Leidner and Plachouras, 2017).
In this work, we develop a novel zero-shot NLP approach to enhance the generalization ability of NLP systems. As discussed in (Schramowski et al., 2019, 2022; Blodgett et al., 2020), language models might contain human-like biases, which may be embedded in both the parameters of the models and their outputs. Furthermore, we note the potential for abuse of zero-shot models, because they are often integrated into applications without justification. We encourage open debate on their utilization, such as task selection and deployment, hoping to reduce the chance of any misconduct.
B.2 Discussion of Backbone Models

ALBERT-xxlarge is reported to perform beyond RoBERTa-large (Liu et al., 2019b) in its original paper (Lan et al., 2020). In our experiments, tokenization might be another possible reason. Since O-MLM aims to predict "yes" or "no", UniMC needs a stable tokenizer to recover those words. Unlike ALBERT, RoBERTa uses a byte-level BPE tokenizer instead of a WordPiece tokenizer. Under a byte-level BPE tokenizer, a word's token id depends not only on the word itself but also on its position. Therefore, RoBERTa faces harder O-MLM and OP tasks in the MC training phase, which leads to lower scores than ALBERT. We chose ALBERT, which achieves better results, as the default backbone model in all our experiments.

B.3 Results on all datasets
In Table 12, we can see that UniMC achieves the best performance on 11 out of 17 datasets. PLMs outperform UniMC on the commonsense reasoning and coreference resolution tasks Hellaswag, Winogrande, and WSC, as these are formulated in the original language-modeling pre-training objective, as noted in (Wei et al., 2021). In addition, PLMs benefit from unsupervised language modeling on large-scale text corpora; for example, PaLM, with 540B parameters, is pre-trained on 780 billion tokens.

Figure 2: Zero-shot performance comparison on ANLI R1. Our proposed UniMC achieves the best performance w.r.t. both accuracy and model size.

Figure 3: Unified input text examples sampled from datasets in the zero-shot phase. The prompt text is underlined and the correct options are in bold.

Figure 8: Zero-shot performance on several tasks with model variants.

Table 1: Zero-shot results in natural language inference task. The best scores are in bold.

Table 2: Zero-shot results in text classification task. The best results are in bold.

Table 5: Zero-shot performance with and without question prompts.

Table 6: Zero-shot results in sentiment analysis task. "Std" indicates Standard Deviation. The best average results are in bold; the more stable performance is underlined.

Table 7: Zero-shot performance with different strategies to control the flow between options. "UIE" indicates Updating Id Embeddings, including segment id and position id. "AMM" means Attention Mask Matrix. "Improve" shows the accuracy improvement over Random Guessing.