It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.


Introduction
Pretraining ever-larger language models (LMs) on massive corpora has led to large improvements in NLP (Radford et al., 2018; Devlin et al., 2019; Raffel et al., 2020, i.a.). A standard approach is to replace the pretrained model's output layer with a task-specific head and finetune the entire model on a set of labeled training data. However, language modeling is not only a powerful pretraining objective: many tasks can be reformulated as cloze questions (e.g., by appending phrases such as "the correct answer is __"), allowing pretrained LMs to solve them without any or with only very few labeled examples (Radford et al., 2019; Schick and Schütze, 2021).
Recently, Brown et al. (2020) introduced GPT-3, a pretrained LM with an enormous 175 billion parameters, and showed that it has amazing few-shot abilities.

Figure 1: By reformulating tasks as LM problems, ALBERT with PET/iPET outperforms GPT-3 although it is much "greener" in that it has three orders of magnitude fewer parameters.
GPT-3 achieves near state-of-the-art results for some SuperGLUE (Wang et al., 2019) tasks given just 32 labeled examples. This is achieved through priming: GPT-3 is given a few demonstrations of inputs and corresponding outputs as context for its predictions, but no gradient updates are performed. While being straightforward to use, this method has two major drawbacks:
• It requires a gigantic LM to work well, making it unusable in many real-world scenarios and resulting in a large carbon footprint (Strubell et al., 2019).
• It does not scale to more than a few examples as the context window of most LMs is limited to a few hundred tokens.
An alternative to priming is pattern-exploiting training (PET) (Schick and Schütze, 2021), which combines the idea of reformulating tasks as cloze questions with regular gradient-based finetuning. While PET additionally requires unlabeled data, unlabeled data is much easier to obtain than labeled examples for many real-world applications. Crucially, PET only works when the answers to be predicted by the LM correspond to a single token in its vocabulary; this is a severe limitation, as many tasks cannot easily be worded that way.
In this work, we adapt PET for tasks that require predicting multiple tokens. We then show that in combination with ALBERT (Lan et al., 2020), PET and its iterative variant (iPET) both outperform GPT-3 on SuperGLUE with 32 training examples, while requiring only 0.1% of its parameters (Figure 1). Moreover, training with PET can be performed in several hours on a single GPU without requiring expensive hyperparameter optimization. Finally, we show that similar performance can also be achieved without unlabeled data and provide a detailed analysis of the factors contributing to PET's strong performance: its ability to combine multiple task formulations, its resilience to wordings that are hard to understand, its usage of labeled data, and characteristics of the underlying LM. Given PET's "green" properties, we see our work as an important contribution to an environmentally sound NLP.

Related Work
Enabling LMs to perform zero-shot learning by providing task descriptions was proposed by Radford et al. (2019) and has been applied to text classification (Puri and Catanzaro, 2019), commonsense knowledge mining (Davison et al., 2019) and argumentative relation classification (Opitz, 2019). It is also commonly used for probing the knowledge contained within LMs (Trinh and Le, 2018; Petroni et al., 2019; Talmor et al., 2020; Schick and Schütze, 2020; Ettinger, 2020, i.a.).
As finding ways to reformulate tasks as cloze questions that are understood well by LMs is difficult, Schick and Schütze (2021) propose PET, a method that uses knowledge distillation (Hinton et al., 2015) and self-training (e.g., Scudder, 1965; Yarowsky, 1995; Brin, 1999; McClosky et al., 2006) to easily combine several reformulations. Our modified version of PET uses masked language models (Devlin et al., 2019) to assign probabilities to sequences of text; this is similar to using them in a generative fashion (Wang and Cho, 2019) and has previously been investigated by Salazar et al. (2020) and Ghazvininejad et al. (2019). In contrast to PET, which uses gradient-based optimization, Radford et al. (2019) and Brown et al. (2020) investigate priming, where examples are given as context but no parameter updates are performed. Finally, our focus on reducing the amount of compute required for few-shot learning is closely related to other efforts in Green AI (Schwartz et al., 2020a) that aim to improve model efficiency, including techniques for knowledge distillation (e.g., Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2020; Mao et al., 2020; Anderson and Gómez-Rodríguez, 2020), pruning (Han et al., 2015, 2016; Sanh et al., 2020) and quantization (Gong et al., 2014; Zafrir et al., 2019; Stock et al., 2021), as well as early exit strategies for inference (e.g., Schwartz et al., 2020b).

Figure 2: Application of a PVP p = (P, v) for recognizing textual entailment: An input x = (x_1, x_2) is converted into a cloze question P(x); q_p(y | x) for each y is derived from the probability of v(y) being a plausible choice for the masked position.

Pattern-Exploiting Training
Let M be a masked language model (MLM), T its vocabulary and __ ∈ T the mask token; we denote the set of all token sequences as T*. For some z ∈ T* containing at least k masks and t ∈ T, we denote with q^k_M(t | z) the probability that M assigns to t at the kth masked position in z; the model's logits before applying softmax are denoted with s^k_M(t | z). We consider the task of mapping inputs x ∈ X to outputs y ∈ Y, for which PET requires a set of pattern-verbalizer pairs (PVPs). Each PVP p = (P, v) consists of
• a pattern P : X → T* that maps inputs to cloze questions containing a single mask;
• a verbalizer v : Y → T that maps each output to a single token representing its task-specific meaning in the pattern.
As illustrated in Figure 2, the core idea of PET is to derive the probability of y being the correct output for x from the probability of v(y) being the "correct" token at the masked position in P(x). Based on this intuition, a conditional probability distribution q_p of y given x is defined as

q_p(y | x) = exp(s_p(y | x)) / Σ_{y'∈Y} exp(s_p(y' | x))    (1)

where s_p(y | x) = s^1_M(v(y) | P(x)) is the raw score of v(y) at the masked position in P(x).
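Eq. 1 amounts to a softmax over the verbalized outputs only. A minimal sketch (all logits, labels and tokens below are invented for illustration; nothing here comes from the paper's actual models):

```python
import math

# Hypothetical logits s^1_M(t | P(x)) at the single masked position.
mask_logits = {"yes": 2.0, "no": 0.5, "maybe": -1.0}
# Verbalizer v: maps each output to one token.
verbalizer = {"entailment": "yes", "not_entailment": "no"}

def q_p(y, mask_logits, verbalizer):
    """Softmax over the verbalized outputs only (Eq. 1)."""
    scores = {label: mask_logits[tok] for label, tok in verbalizer.items()}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / z
```

Tokens outside the verbalizer's range (here "maybe") do not influence the distribution, since the normalization runs over Y only.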
For a given task, identifying PVPs that perform well is challenging in the absence of a large development set. Therefore, PET enables a combination of multiple PVPs P = {p_1, ..., p_n} as follows:
1. For each PVP p, an MLM is finetuned on training examples (x, y) by minimizing the cross entropy between y and q_p(y | x). In practice, Schick and Schütze (2021) train three MLMs per pattern, as performance can vary substantially between runs.
2. The ensemble of finetuned MLMs is used to annotate a set of unlabeled examples; each unlabeled example x ∈ X is annotated with soft labels based on the probability distribution

q_P(y | x) ∝ exp( (1/Z) Σ_{p∈P} w_p · s_p(y | x) ),  Z = Σ_{p∈P} w_p    (2)

defined similarly to Eq. 1, where w_p is a weighting term that is proportional to the accuracy achieved with p on the training set before training.
3. The resulting soft-labeled dataset is used to train a regular sequence classifier by minimizing the cross entropy between its output and q_P.
As steps (2) and (3) above closely resemble knowledge distillation (Hinton et al., 2015), we also refer to them simply as distillation. Importantly, this process does not require holding the entire ensemble of MLMs in memory at the same time, as each model's predictions can be computed sequentially; it is therefore no more memory-intensive than using a single model. To give MLMs trained on different patterns further opportunity to learn from one another, Schick and Schütze (2021) also propose iPET, an iterative variant of PET in which several generations of models are trained on datasets of increasing size that are labeled by previous generations. This is achieved as follows: First, an ensemble of MLMs is trained as in regular PET. For each model, a random subset of the other models is then used to generate a new, larger training set by labeling unlabeled examples on which this subset is confident; each model of the next generation is trained on its enlarged dataset, and this process is repeated for several generations.
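Steps (2) and (3) hinge on combining per-PVP scores into soft labels. A toy sketch of this weighted combination, with invented scores and accuracies (the real implementation operates on full logit matrices):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def soft_label(pvp_scores, weights):
    """Weighted average of per-PVP raw scores, then a softmax,
    yielding the soft label for one unlabeled example (step 2)."""
    total = sum(weights.values())
    n = len(next(iter(pvp_scores.values())))
    avg = [sum(weights[p] * pvp_scores[p][i] for p in pvp_scores) / total
           for i in range(n)]
    return softmax(avg)

# Illustrative numbers: three finetuned MLMs' scores for a 2-label task,
# weighted by their (hypothetical) training-set accuracies.
scores = {"p1": [2.0, 0.1], "p2": [1.5, 0.3], "p3": [0.2, 0.9]}
acc = {"p1": 0.8, "p2": 0.7, "p3": 0.4}
label = soft_label(scores, acc)
```

The final classifier in step (3) is then trained against such soft labels with cross entropy.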

PET with Multiple Masks
An important limitation of PET is that the verbalizer v must map each output to a single token, which is impossible for many tasks. We thus generalize verbalizers to functions v : Y → T*; this requires some modifications to inference and training. We further generalize PET in that we do not assume the output space to be identical for each input: for each x ∈ X, we denote with Y_x ⊆ Y the set of possible outputs given x as input. Given a PVP p = (P, v), we define l(x) = max_{y∈Y_x} |v(y)| to be the maximum number of tokens required to express any output in Y_x, and P^k(x) to be P(x) with the mask token replaced by k masks. As a running example, we consider the task of binary sentiment classification for restaurant reviews with labels Y = {+1, −1}. We use the pattern P(x) = x. It was __ . and a verbalizer v that maps +1 to the single token great and −1 to the sequence terri • ble; i.e., we assume that the MLM's tokenizer splits the word "terrible" into the two tokens terri and • ble. For this example, l(x) = 2 for all x; P^2(x) is illustrated in Figure 3 (a).
Inference For x ∈ X, y ∈ Y_x and |v(y)| = k, we redefine q_p(y | x) in an autoregressive fashion: starting from P^k(x), we perform k consecutive predictions, where we always select the next token to predict based on the MLM's confidence. That is,

q_p(y | x) = q(v(y) | P^k(x))    (3)

where q(t_1 ... t_k | z) = 1 for k = 0 and otherwise q(t_1 ... t_k | z) = q^j_M(t_j | z) · q(t_1 ... t_{j−1} t_{j+1} ... t_k | z'), with j the masked position at which M is most confident and z' obtained from z by replacing the jth mask with t_j. Note that unlike in original PET (Eq. 1), q_p is not a probability distribution as its values do not sum to one.

For our sentiment classification example, Figure 3 illustrates how q_p(−1 | x) is computed: As |v(y)| = |{terri, • ble}| = 2, we first use z = P^2(x) to compute the probability of each token in v(y) (Figure 3a). We then choose the token with the highest probability, put it in place of the corresponding mask token, and use the resulting cloze question z' to compute the probability of the remaining token (Figure 3b). The overall score for y = −1 is the product of these two probabilities.

Training Computing q_p(y | x) as in Eq. 3 for each training example (x, y) would be prohibitively expensive. To enable computation of all required probabilities in a single forward pass, we approximate q_p(y | x) by (i) always inserting the maximum number l(x) of mask tokens required to express any output and (ii) for each y' ∈ Y_x, predicting all tokens in v(y') = t_1 ... t_k in parallel, where we simply ignore the model's predictions for all l(x) − k superfluous mask tokens:

q̃_p(y' | x) = Π_{i=1}^{k} q^i_M(t_i | P^{l(x)}(x))    (4)

For our running example, this means we approximate the scores q_p(y | x) by computing

q̃_p(−1 | x) = q^1_M(terri | z) · q^2_M(• ble | z),

which can be done in a single forward pass, as it only requires processing the cloze question z = P^2(x) shown in Figure 3 (a) once. As q̃_p is not a probability distribution over Y_x, cross entropy is not an ideal training objective, as it can also be minimized by reducing the probability assigned to sequences z ∉ v(Y_x) that are not part of the output space, despite this having no effect on the model's prediction.
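The max-first inference order can be sketched with a stand-in for the MLM; `toy_probs` below is a fabricated confidence oracle, not real model output:

```python
def max_first_score(target_tokens, token_probs):
    """Multiply in target-token probabilities, always resolving the
    masked position the model is most confident about next."""
    remaining = dict(enumerate(target_tokens))  # position -> target token
    filled = {}
    score = 1.0
    while remaining:
        probs = token_probs(filled, remaining)   # {pos: prob of target}
        pos = max(probs, key=probs.get)          # most confident position
        score *= probs[pos]
        filled[pos] = remaining.pop(pos)
    return score

# Fabricated oracle: the model is more confident about "ble" than "terri"
# at first, and confidence rises once one mask has been resolved.
def toy_probs(filled, remaining):
    base = {0: 0.6, 1: 0.9}      # before any position is filled
    boosted = {0: 0.8, 1: 0.95}  # after the other token is fixed
    return {p: (boosted if filled else base)[p] for p in remaining}

score = max_first_score(["terri", "ble"], toy_probs)  # 0.9 * 0.8
```

Note how the second factor (0.8) is conditioned on the first prediction, unlike the parallel training-time approximation of Eq. 4.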
We instead opt for multiclass hinge loss (Weston and Watkins, 1999; Dogan et al., 2016) and minimize

L(x, y) = Σ_{y'∈Y_x\{y}} max(0, 1 − log q̃_p(y | x) + log q̃_p(y' | x))    (5)

That is, we require the difference between the log probability of y and the log probability of any output y' ∈ Y_x \ {y} to be at least 1.
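The hinge objective itself is a one-liner; the log-scores below are invented to show the two cases (margin satisfied vs. violated):

```python
def multiclass_hinge(log_q, gold):
    """Hinge over all competing outputs: log q̃(gold) must exceed
    log q̃(y') by a margin of 1 for every other output y'."""
    return sum(max(0.0, 1.0 - log_q[gold] + log_q[other])
               for other in log_q if other != gold)

# Illustrative log-scores for the two sentiment labels (invented numbers).
log_q = {"+1": -0.2, "-1": -2.5}
loss_good = multiclass_hinge(log_q, "+1")  # margin satisfied -> 0.0
loss_bad = multiclass_hinge(log_q, "-1")   # margin violated -> 3.3
```

Because only relative differences between outputs in Y_x enter the loss, probability mass on sequences outside v(Y_x) is irrelevant, which is exactly the property cross entropy lacks here.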

Experiments
We compare PET and GPT-3 on SuperGLUE (Wang et al., 2019), a natural language understanding benchmark consisting of eight challenging tasks. We cannot evaluate PET using the exact same training data as GPT-3: for most tasks, GPT-3 uses a different set of training examples for each test example, and for the remaining tasks, training sets were not available upon request; however, the exact choice of examples has little impact on GPT-3's performance. We thus create new training sets by randomly selecting 32 examples for each task using a fixed random seed.
We additionally create sets of up to 20,000 unlabeled examples for each task; this is done by removing all labels from the original training sets. We refer to the resulting sets of training examples and unlabeled examples as FewGLUE.
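Such a fixed-seed subsample can be sketched as follows; the function name and the exact selection details are assumptions for illustration, not the released FewGLUE code:

```python
import random

def make_fewglue(examples, n_train=32, n_unlabeled=20000, seed=0):
    """Sketch of the described setup: a fixed-seed sample of 32 labeled
    examples plus label-stripped unlabeled examples (the released
    FewGLUE may differ in details)."""
    rng = random.Random(seed)
    train = rng.sample(examples, min(n_train, len(examples)))
    rest = [ex for ex in examples if ex not in train]
    unlabeled = [{k: v for k, v in ex.items() if k != "label"}
                 for ex in rest[:n_unlabeled]]
    return train, unlabeled

data = [{"text": f"example {i}", "label": i % 2} for i in range(100)]
train, unlabeled = make_fewglue(data)
```

Fixing the seed makes the subsample reproducible across runs, which matters given how sensitive few-shot results can be to the choice of examples (cf. Section 5.5).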

Tasks
Below, we describe each of the SuperGLUE tasks and our corresponding PVPs. We use a vertical bar (|) to mark boundaries between text segments. Of the eight tasks considered, only COPA, WSC and ReCoRD require the use of PET with multiple masks as introduced in Section 3.1.
BoolQ (Clark et al., 2019) is a QA task where each example consists of a passage p and a yes/no question q. We use the following patterns:
• p. Question: q? Answer: __.
• p. Based on the previous passage, q? __.
• Based on the following passage, q? __. p
We define two verbalizers mapping questions containing a true statement to yes/true and others to no/false, respectively, for a total of six PVPs.
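The cross-product of the three patterns and two verbalizers can be written down directly (the `__` mask placeholder and helper names are illustrative):

```python
# How three BoolQ patterns and two verbalizers combine into six PVPs.
MASK = "__"

patterns = [
    lambda p, q: f"{p}. Question: {q}? Answer: {MASK}.",
    lambda p, q: f"{p}. Based on the previous passage, {q}? {MASK}.",
    lambda p, q: f"Based on the following passage, {q}? {MASK}. {p}",
]
verbalizers = [
    {True: "yes", False: "no"},
    {True: "true", False: "false"},
]
pvps = [(P, v) for P in patterns for v in verbalizers]  # 6 PVPs

cloze = patterns[1]("Oil prices fell", "did oil prices rise")
```

Each of the six PVPs then yields one finetuned MLM (three, with the default of three runs per pattern) whose predictions are later distilled.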
CB (De Marneffe et al., 2019) and RTE (Dagan et al., 2006) are textual entailment tasks like MNLI, so we use PVPs similar to Schick and Schütze (2021). For a premise p and hypothesis h, we use the patterns
• "h"? | __, "p"
• h? | __, p
and a verbalizer that maps entailment to yes, disagreement to no and neutral to maybe.
Given a premise p, the task in COPA (Gordon et al., 2012) is to determine the cause or effect of the premise given two options c 1 and c 2 . For determining the effect, we use the following patterns:
• "c 1 " or "c 2 "? p, so __.
• c 1 or c 2 ? p, so __.
For determining the cause, we use the same patterns but replace so with because. The verbalizer for c 1 and c 2 is the identity function.
For WiC (Pilehvar and Camacho-Collados, 2019), given a word w and two sentences s 1 and s 2 in which it occurs, the task is to decide if w is used with the same sense in both sentences. We use the following patterns:
• "s 1 " / "s 2 ". Similar sense of "w"? __.
• s 1 s 2 Does w have the same meaning in both sentences? __
• w . Sense (1) (a) "s 1 " ( __ ) "s 2 "
For the first two patterns, we use yes as verbalization for words used in the same sense and no for other words; for the third pattern, we use b and 2.
For WSC (Levesque et al., 2011), each example consists of a sentence s with a marked pronoun p and noun n, and the task is to determine whether p refers to n. We follow Raffel et al. (2020) and Brown et al. (2020) and treat WSC as a generative task. We highlight p in s by putting it in asterisks and use the following patterns:
• s The pronoun '*p*' refers to __.
• s In the previous sentence, the pronoun '*p*' refers to __.
• s In the passage above, what does the pronoun '*p*' refer to? Answer: __.
We use the identity function as verbalizer for n. Note that WSC is different from other tasks in that it requires free-form completion. This in turn requires some modifications during training and inference that are discussed in Appendix A.
MultiRC (Khashabi et al., 2018) is a QA task. Given a passage p, a question q and an answer candidate a, the task is to decide whether a is a correct answer for q. We use the same verbalizer as for BoolQ and similar patterns:
• p. Question: q? Is it a? __.
• p. Based on the previous passage, q? Is "a" a correct answer? __.
For ReCoRD (Zhang et al., 2018), given a passage p and a cloze question q, the task is to decide which of a given set of answer candidates is the correct replacement for the placeholder in the cloze question. As this task is already presented in the form of a cloze question, there is little room for designing PVPs, so we only use a trivial one: the concatenation of p and q as pattern and the identity function as verbalizer. With only one PVP, there is no need to perform knowledge distillation so we directly use the resulting model as our final classifier.

Setup
As underlying LM for PET we choose ALBERT-xxlarge-v2 (Lan et al., 2020), the best-performing MLM on SuperGLUE when training is performed on the regular, full-size training sets. We use the same model, supplemented by a sequence classification head, as our final classifier. We run PET on the FewGLUE training sets for all SuperGLUE tasks. We do not use any development set to optimize hyperparameters; instead, we use the exact same setup and hyperparameters as Schick and Schütze (2021). For COPA, WSC and ReCoRD, we use our proposed modification of PET to support verbalizers mapping labels to multiple tokens; for all other tasks, we use regular PET. We train iPET on all tasks except COPA and WSC, whose unlabeled sets contain well below 1,000 examples, and ReCoRD, for which iPET makes no sense as we only use a single PVP. For these three tasks, we simply reuse the results of regular PET.

Results
Our main results are shown in Table 1. As can be seen, ALBERT with PET performs similarly to the largest GPT-3 model, which is larger by three orders of magnitude.

Analysis
We investigate the importance of several factors for few-shot performance: the choice of patterns and verbalizers, the usage of both unlabeled and labeled data, and properties of the underlying language model. We also look into our proposed modification for PET to work with multiple masks and compare it to various baselines. Finally, we measure how choosing different sets of training examples affects performance. Our analysis focuses on PET as GPT-3 is not publicly available.

Patterns
The way in which tasks are reformulated as cloze questions can have a huge impact on performance (Schick and Schütze, 2021). These reformulations can be arbitrarily complex; for example, the pattern used by GPT-3 for WSC contains an introductory section of almost 30 words, and it is unclear if and how this formulation has been optimized. To investigate the importance of patterns and verbalizers, we compare three sets of PVPs: our initial set as defined in Section 4.1 (denoted p ours), the single PVP used by GPT-3 (p GPT-3), and the combination of both (p comb). We train ALBERT using PET with all three sets of patterns; results for selected SuperGLUE tasks are shown in Table 2 (top). As can be seen, the PVP used by GPT-3 outperforms our PVPs on RTE, whereas our initial set of patterns performs much better on MultiRC. These large differences in performance highlight the importance of finding good ways to express tasks as cloze questions. As it is difficult to ascertain which patterns perform well without trying them on a large set of examples, a key challenge for few-shot approaches is to compensate for PVPs that the LM fails to understand well. As seen in the performance of the model trained with p comb, PET is able to do so: not only does combining all PVPs compensate for the worse performance of p ours on RTE and of p GPT-3 on MultiRC, it even further improves average performance across the three tasks compared to the best-performing set of patterns. This clearly demonstrates the potential of carefully engineering a set of suitable patterns as opposed to just choosing a single formulation without means of evaluating its effectiveness.

Unlabeled Data Usage
Unlike GPT-3, PET requires unlabeled data to distill the knowledge of all models based on individual PVPs into a single classifier; for iPET, unlabeled data is additionally used to generate training sets for future generations. The underlying assumption is that unlabeled data can easily be obtained, which may not always be the case in real-world settings. We thus investigate the importance of unlabeled data for regular PET. To this end, we compare the performance of the final classifier in PET to that of directly using the ensemble of models corresponding to individual PVPs. While using this ensemble entirely removes the need for unlabeled data, the ensemble for k PVPs is larger than the distilled model by a factor of 3 · k as we follow the default setting of PET and train three models per PVP. However, even for a large number of PVPs the ensemble is smaller than GPT-3 by two orders of magnitude. Results without distillation can be seen in Table 2 (bottom). Averaged across the three tasks, the ensemble performs even better than the distilled classifier. This shows that if the goal is only to achieve good performance, then unlabeled data is not necessary; however, it is required to obtain a single, lightweight model as final classifier. Figure 4 illustrates the benefit of training multiple generations with iPET. For all tasks except MultiRC, there are substantial improvements from one generation to the next (cf. Dodge et al., 2020). Of course, there are further ways to leverage unlabeled data, such as keeping an auxiliary language modeling objective during finetuning (Chronopoulou et al., 2019). While we leave investigating the impact of additionally using such methods to future work, we note that they can easily be applied to PET, while there is no straightforward way to combine them with priming.

Labeled Data Usage
We next investigate the effect of how labeled data is used, which is one of the key differences between priming and PET. We first compare PET with regular supervised training (i.e., without using any patterns), and with a fully unsupervised model (i.e., an ensemble using all PVPs but no labeled training examples). Given 32 examples, PET clearly outperforms both baselines (Table 3).
We next compare PET directly to priming. However, we cannot do so using ALBERT as it is only able to process sequences of up to 512 tokens, which is not enough for a set of 32 examples; we instead use XLNet (Yang et al., 2019) for this comparison. As shown in Table 3, XLNet in general performs worse than ALBERT. More importantly, XLNet with PET performs much better than priming. We were not able to obtain results with priming on MultiRC because the 32 examples in FewGLUE would require more than 10,000 tokens, so processing them with a standard Transformer (Vaswani et al., 2017) is infeasible. While there are some Transformer variants that can deal with much longer contexts (e.g., Kitaev et al., 2020; Beltagy et al., 2020), it has yet to be investigated to what extent such models make good use of priming examples over long context spans. We further investigate the effectiveness of priming by looking more closely at results obtained with GPT-3. To this end, Figure 5 shows the performance difference between priming GPT-3 with 32 examples and priming it with just a single example for each task and model size. As can be seen, priming with 32 examples only slightly improves performance for most tasks and model sizes. For some tasks, adding more examples even leads to worse performance, especially for smaller models. For ReCoRD, even the largest model's performance slightly drops when adding more examples.
The bottom row of Figure 5 shows the performance difference between ALBERT trained with PET (without distillation) and a fully unsupervised ALBERT model on all tasks. While results are not directly comparable due to different underlying models and PVPs, PET results in much stronger performance improvements compared to priming and does not worsen results for any task. (We do not compare priming to zero-shot performance because, for unknown reasons, zero-shot GPT-3 performs well below random guessing on some tasks, e.g., 0.0% accuracy for WiC; to not overestimate the benefit of priming, we therefore show gains from providing 32 examples compared to just one.)

Model Type
We next look into the impact of the underlying LM on PET by comparing ALBERT with RoBERTa-large (Liu et al., 2019) and GPT-2-medium (Radford et al., 2019). As GPT-2 is a unidirectional model similar to GPT-3, it can only process patterns where the mask token is the very last token. We therefore use p GPT-3 for CB and RTE; for MultiRC, we stick with our original set of patterns as they already fulfill this requirement. We also do not perform distillation and instead report the ensemble's performance, as there is no established way of equipping GPT-2 with a sequence classification head.
Results for training all three LMs with PET in Table 4 show that using ALBERT as underlying LM is crucial for PET's strong performance; exchanging ALBERT with RoBERTa results in an average performance drop of 8 points. However, RoBERTa still clearly outperforms GPT-3 13B, which is larger by two orders of magnitude. Importantly, PET with GPT-2 performs much worse than with the two other models. As anticipated by Brown et al. (2020), a reason for this drop in performance may be that like GPT-3, GPT-2 is unidirectional, making tasks that require comparing two sequences a challenge. However, it is important to note that there are also other substantial differences between GPT-2 and the other two models, most notably the pretraining dataset. Regardless of whether unidirectionality is the reason for GPT-2's bad performance, bidirectionality of the underlying LM is important for PET as it removes the need for the mask token to be at the very end and thus allows for more flexibility in the creation of patterns.

PET with Multiple Masks
We modified PET to work for outputs that require more than a single token. To investigate the impact of this modification, we look at the three tasks for which this is required: COPA, WSC and ReCoRD. We compare our decoding strategy of predicting tokens in order of the probability assigned to them, to which we refer as max-first, with two alternatives: decoding left-to-right (ltr) as is common for many autoregressive language models, and decoding all tokens simultaneously (parallel) as is done during training. Additionally, we compare PET with untrained ALBERT to measure the effectiveness of our proposed training loss.

Table 5: Results on selected tasks for our proposed variant of PET as well as other decoding strategies and for untrained ALBERT.

Results are shown in Table 5. PET clearly outperforms untrained ALBERT for the three tasks. Not performing distillation hurts performance for COPA, but leads to slight improvements on WSC; for ReCoRD, we did not perform distillation in the first place as we only use a single PVP. Our decoding strategy is clearly superior to parallel decoding except for WSC, for which most predictions consist only of one or two tokens, and it performs slightly better than left-to-right decoding.
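The three decoding orders differ only in which masked position is resolved next and what each probability is conditioned on. A toy contrast with a fabricated confidence table (all numbers invented):

```python
# conf[pos][n_filled]: probability of the target token at position `pos`
# after `n_filled` other masked positions have already been resolved.
conf = {0: [0.6, 0.8], 1: [0.9, 0.95]}

def parallel(conf):
    """All tokens predicted at once from the fully masked input."""
    return conf[0][0] * conf[1][0]

def left_to_right(conf):
    """Fill masks left to right, reconditioning after each step."""
    return conf[0][0] * conf[1][1]

def max_first(conf):
    """Fill the most confident mask first, then recondition."""
    first = max(conf, key=lambda p: conf[p][0])
    rest = (set(conf) - {first}).pop()
    return conf[first][0] * conf[rest][1]
```

In this toy setup, max-first benefits most because its later predictions are conditioned on the tokens the model was already sure about; with one- or two-token outputs (as in WSC), the three orders barely differ.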

Training Examples
Recall that we conduct our experiments with training examples from FewGLUE, a randomly selected subset of the original SuperGLUE training examples. We used a fixed random seed s 0 to generate FewGLUE. Let Σ i be the randomly selected subset of SuperGLUE for random seed s i , so Σ 0 = FewGLUE. In this subsection, we create two additional subsets of SuperGLUE, Σ 1 and Σ 2 , based on different seeds. This allows us to investigate how different sets of training examples affect performance. To this end, we run PET for CB, RTE and MultiRC using the three Σ i . To measure only the effect of varying the training set while ignoring unlabeled examples, we do not use distillation.

Table 6: Results on selected tasks for GPT-3 and for PET using training sets Σ 0 , Σ 1 , Σ 2 .

As Table 6 shows, the average performance of PET is similar to that of GPT-3 for all seeds.
While our results may seem contrary to the insight that for GPT-3, the exact choice of examples does not play a major role, we suspect this to be due to the fact that priming benefits much less from training examples than PET (cf. Section 5.3); accordingly, the influence of the exact set of training examples on the model's performance is smaller.

Conclusion
We have proposed a simple yet effective modification of PET, enabling us to use it for tasks that require predicting multiple tokens. In extensive experiments, we have identified several factors responsible for the strong performance of PET combined with ALBERT: the possibility to concurrently use multiple patterns for transforming examples into cloze questions, the ability to compensate for patterns that are difficult to understand, the usage of labeled data to perform parameter updates, and the underlying LM itself.
We have shown that using PET, it is possible to achieve few-shot text classification performance similar to GPT-3 on SuperGLUE with LMs that have three orders of magnitude fewer parameters. This not only lowers financial cost, but above all reduces environmental impact immensely and leads to a much smaller carbon footprint. We see this as an important contribution to achieving the goal of an environmentally more friendly NLP. To enable comparisons with our work, we make our code, models and datasets publicly available.
For future work, it would be interesting to see whether PET also works for generative tasks when combined with generative LMs and whether further improvements are possible in multi-task settings.

A Training Details
Our implementation can be found in the supplementary material. It extends the original implementation of PET by Schick and Schütze (2021) which, in turn, is based on the Transformers library (Wolf et al., 2020) and PyTorch (Paszke et al., 2017). All dependencies are listed in requirements.txt. Detailed instructions on how our results can be reproduced using this implementation can be found in README.md. Unless explicitly stated differently, we use the exact same set of hyperparameters as Schick and Schütze (2021) (Table 7), with the only difference that for iPET, we only train 3 generations of models to speed up training. All of our experiments were conducted using a single GPU with 11GB RAM (NVIDIA GeForce GTX 1080 Ti). With this GPU, training a single PET model for 250 steps took approximately 45 minutes. Depending on the task, labeling unlabeled examples took 0.2-1.5 hours per model. Training the final classifier for 5,000 steps on the soft-labeled dataset took 2.5 hours on average. Below, we list task-specific implementation details for all tasks in SuperGLUE.
COPA For COPA, we randomly switch the two options c 1 and c 2 during training with a probability of 50% to make the input more diverse; for inference, we always keep the original order. For distilling the final PET model, we obtain logits for unlabeled examples x from individual PVPs p as s_p(y | x) = log q_p(y | x); we use the input format proposed in prior work.
WiC Similar to COPA, we randomly switch the input sentences s 1 and s 2 during training. Given a word w and two sentences s 1 and s 2 , we use the sequence w: s 1 | s 2 as input for the final sequence classification model, where | marks the boundary between two text segments.
WSC Unlike other SuperGLUE tasks, the WSC formulation of Raffel et al. (2020) and Brown et al. (2020) requires free-form completion, meaning that for each sentence s and pronoun p, we only have a single correct choice n that the model needs to predict, but we do not provide any alternatives. During training, we thus use regular cross entropy loss between n andq p (n | s, p) as defined in Eq. 4. However, in many cases this would allow the LM to easily identify the correct target based on the number of masks provided, so we modify each target by randomly adding up to three additional mask tokens, for which we require the model to predict a special <pad> token. For inference, we always just add a single mask token to ensure consistent results across multiple evaluations and perform greedy decoding as described in Section 3. We then follow Raffel et al. (2020) to map the output produced by the LM to a label y ∈ {true, false}. For distillation, given an unlabeled example x we set s p (y | x) = 1 if the model's output for x was mapped to y and s p (y | x) = 0 otherwise. We provide inputs to the final PET model in the format s | n where | is the boundary between two text segments and mark p in s with asterisks.
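The random mask-padding trick can be sketched as follows; the function name and the exact padding mechanics are assumptions based on the description above:

```python
import random

MASK, PAD = "[MASK]", "<pad>"

def wsc_masks_and_target(target_tokens, rng, max_extra=3):
    """Sketch of the described trick: add up to three extra masks so
    the mask count does not give away the answer length; the model
    must predict <pad> for the superfluous positions."""
    extra = rng.randint(0, max_extra)
    masks = [MASK] * (len(target_tokens) + extra)
    target = list(target_tokens) + [PAD] * extra
    return masks, target

masks, target = wsc_masks_and_target(["Mark"], random.Random(0))
```

At inference time, by contrast, exactly one extra mask is added every time, so that the decoding procedure is deterministic across evaluations.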
MultiRC Deviating from the hyperparameters used by Schick and Schütze (2021), we use a maximum sequence length of 512 tokens for MultiRC both during training and inference because we found many passages to be much longer than 256 tokens. Input for the final sequence classification model is of the form p | q | a where p is the passage, q is the question, a is the answer candidate and we use | to mark boundaries between text segments.
ReCoRD For ReCoRD, we again use a maximum sequence length of 512 because many passages require more than 256 tokens. For some questions q, the ReCoRD training set contains a huge number of answer candidates. To facilitate training, we split each example into multiple examples as follows: let C be the set of answer candidates