Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

Some NLP tasks can be solved in a fully unsupervised fashion by providing a pretrained language model with “task descriptions” in natural language (e.g., Radford et al., 2019). While this approach underperforms its supervised counterpart, we show in this work that the two ideas can be combined: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. For several tasks and languages, PET outperforms supervised training and strong semi-supervised approaches in low-resource settings by a large margin.


Introduction
Learning from examples is the predominant approach for many NLP tasks: A model is trained on a set of labeled examples from which it then generalizes to unseen data. Due to the vast number of languages, domains and tasks and the cost of annotating data, it is common in real-world uses of NLP to have only a small number of labeled examples, making few-shot learning a highly important research area. Unfortunately, standard supervised learning often performs poorly on small training sets; many problems are difficult to grasp from just a few examples. For instance, assume we are given the following pieces of text:

• T1: This was the best pizza I've ever had.
• T2: You can get better sushi for half the price.
• T3: Pizza was average. Not worth the price.

Furthermore, imagine we are told that the labels of T1 and T2 are l and l′, respectively, and we are asked to infer the correct label for T3. Based only on these examples, this is impossible because plausible justifications can be found for both l and l′. However, if we know that the underlying task is to identify whether the text says anything about prices, we can easily assign l′ to T3. This illustrates that solving a task from only a few examples becomes much easier when we also have a task description, i.e., a textual explanation that helps us understand what the task is about.

With the rise of pretrained language models (PLMs) such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), the idea of providing task descriptions has become feasible for neural architectures: We can simply append such descriptions in natural language to an input and let the PLM predict continuations that solve the task (Radford et al., 2019; Puri and Catanzaro, 2019). So far, this idea has mostly been considered in zero-shot scenarios where no training data is available at all.
In this work, we show that providing task descriptions can successfully be combined with standard supervised learning in few-shot settings: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that uses natural language patterns to reformulate input examples into cloze-style phrases. As illustrated in Figure 1, PET works in three steps: First, for each pattern a separate PLM is finetuned on a small training set T . The ensemble of all models is then used to annotate a large unlabeled dataset D with soft labels. Finally, a standard classifier is trained on the soft-labeled dataset. We also devise iPET, an iterative variant of PET in which this process is repeated with increasing training set sizes.
On a diverse set of tasks in multiple languages, we show that given a small to medium number of labeled examples, PET and iPET substantially outperform unsupervised approaches, supervised training and strong semi-supervised baselines.

Related Work
Radford et al. (2019) provide hints in the form of natural language patterns for zero-shot learning of challenging tasks such as reading comprehension and question answering (QA). This idea has been applied to unsupervised text classification (Puri and Catanzaro, 2019), commonsense knowledge mining (Davison et al., 2019) and argumentative relation classification (Opitz, 2019). Srivastava et al. (2018) use task descriptions for zero-shot classification but require a semantic parser. For relation extraction, Bouraoui et al. (2020) automatically identify patterns that express given relations. McCann et al. (2018) rephrase several tasks as QA problems. Raffel et al. (2020) frame various problems as language modeling tasks, but their patterns only loosely resemble natural language and are unsuitable for few-shot learning. Another recent line of work uses cloze-style phrases to probe the knowledge that PLMs acquire during pretraining; this includes probing for factual and commonsense knowledge (Trinh and Le, 2018; Petroni et al., 2019; Wang et al., 2019; Sakaguchi et al., 2020), linguistic capabilities (Ettinger, 2020; Kassner and Schütze, 2020), understanding of rare words (Schick and Schütze, 2020), and the ability to perform symbolic reasoning (Talmor et al., 2019). Jiang et al. (2020) consider the problem of finding the best pattern to express a given task.

Other approaches for few-shot learning in NLP include exploiting examples from related tasks (Yu et al., 2018; Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019; Yin et al., 2019) and using data augmentation (Xie et al., 2020; Chen et al., 2020); the latter commonly relies on back-translation (Sennrich et al., 2016), requiring large amounts of parallel data. Approaches using textual class descriptors typically assume that abundant examples are available for a subset of classes (e.g., Romera-Paredes and Torr, 2015; Veeranna et al., 2016; Ye et al., 2020). In contrast, our approach requires no additional labeled data and provides an intuitive interface to leverage task-specific human knowledge.

Pattern-Exploiting Training
Let M be a masked language model with vocabulary V and mask token ___ ∈ V, and let L be a set of labels for our target classification task A. We write an input for task A as a sequence of phrases x = (s_1, ..., s_k) with s_i ∈ V*; for example, k = 2 if A is textual inference (two input sentences). We define a pattern to be a function P that takes x as input and outputs a phrase or sentence P(x) ∈ V* that contains exactly one mask token, i.e., its output can be viewed as a cloze question. Furthermore, we define a verbalizer as an injective function v : L → V that maps each label to a word from M's vocabulary. We refer to (P, v) as a pattern-verbalizer pair (PVP).
Using a PVP (P, v) enables us to solve task A as follows: Given an input x, we apply P to obtain an input representation P(x), which is then processed by M to determine the label y ∈ L for which v(y) is the most likely substitute for the mask. For example, consider the task of identifying whether two sentences a and b contradict each other (label y_0) or agree with each other (y_1). For this task, we may choose the pattern P(a, b) = a? ___, b. combined with a verbalizer v that maps y_0 to "No" and y_1 to "Yes". Given an example input pair x = (Mia likes pie, Mia hates pie), the task now changes from having to assign a label without inherent meaning to answering whether the most likely choice for the masked position in P(x) = Mia likes pie? ___, Mia hates pie. is "No" or "Yes".
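
To make this concrete, here is a minimal Python sketch of such a PVP; the names pattern, verbalizer and MASK are illustrative assumptions, not taken from the paper's reference implementation:

```python
# A hypothetical PVP for the contradiction/agreement example above.
MASK = "<mask>"  # RoBERTa's mask token; BERT would use "[MASK]"

def pattern(a: str, b: str) -> str:
    """P(a, b) = a? ___, b."""
    return f"{a.rstrip('.')}? {MASK}, {b.rstrip('.')}."

# The verbalizer maps each label to a single word from the vocabulary.
verbalizer = {"contradiction": "No", "agreement": "Yes"}

print(pattern("Mia likes pie", "Mia hates pie"))
# Mia likes pie? <mask>, Mia hates pie.
```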

PVP Training and Inference
Let p = (P, v) be a PVP. We assume access to a small training set T and a (typically much larger) set of unlabeled examples D. For each sequence z ∈ V* that contains exactly one mask token and each w ∈ V, we denote with M(w | z) the unnormalized score that the language model assigns to w at the masked position. Given some input x, we define the score for label l ∈ L as

$$s_p(l \mid x) = M(v(l) \mid P(x))$$

and obtain a probability distribution over labels using softmax:

$$q_p(l \mid x) = \frac{e^{s_p(l \mid x)}}{\sum_{l' \in L} e^{s_p(l' \mid x)}}$$

We use the cross-entropy between q_p(l | x) and the true (one-hot) distribution of training example (x, l), summed over all (x, l) ∈ T, as the loss for finetuning M for p.
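
As a sketch of what this inference step looks like in practice, the following snippet scores each label with a masked language model via the Transformers library; the helper name label_distribution and the handling of leading spaces in verbalizations are our own assumptions, not the paper's code:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
model.eval()

def label_distribution(cloze: str, verbalizer: dict) -> dict:
    inputs = tokenizer(cloze, return_tensors="pt")
    # position of the single mask token in P(x)
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # M(w | P(x)) for all w
    # s_p(l | x) = M(v(l) | P(x)); RoBERTa tokenizes mid-sentence words
    # with a leading space, so we look up the " word" variant
    ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + w)[0])
           for w in verbalizer.values()]
    q = torch.softmax(logits[ids], dim=0)  # q_p(l | x)
    return dict(zip(verbalizer.keys(), q.tolist()))

print(label_distribution("Mia likes pie? <mask>, Mia hates pie.",
                         {"contradiction": "No", "agreement": "Yes"}))
```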

Auxiliary Language Modeling
In our application scenario, only a few training examples are available and catastrophic forgetting can occur. As a PLM finetuned for some PVP is still a language model at its core, we address this by using language modeling as an auxiliary task. With L_CE denoting cross-entropy loss and L_MLM language modeling loss, we compute the final loss as

$$L = L_{\text{CE}} + \alpha \cdot L_{\text{MLM}}$$

This idea was recently applied by Chronopoulou et al. (2019) in a data-rich scenario. As L_MLM is typically much larger than L_CE, we found in preliminary experiments that a small value of α = 10^{-4} consistently gives good results, so we use it in all our experiments. To obtain sentences for language modeling, we use the unlabeled set D. However, we do not train directly on each x ∈ D, but rather on P(x), where we never ask the language model to predict anything for the masked slot.
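
A minimal sketch of this combined objective, assuming the two losses have already been computed by the respective heads (tensor and constant names are placeholders):

```python
import torch

ALPHA = 1e-4  # L_MLM dwarfs L_CE, so it is scaled down heavily

def pet_loss(ce_loss: torch.Tensor, mlm_loss: torch.Tensor) -> torch.Tensor:
    # L = L_CE + alpha * L_MLM; the MLM loss is computed on P(x) for
    # unlabeled x in D, never asking for a prediction at the masked slot
    return ce_loss + ALPHA * mlm_loss
```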

Combining PVPs
A key challenge for our approach is that in the absence of a large development set, it is hard to identify which PVPs perform well. To address this, we use a strategy similar to knowledge distillation (Hinton et al., 2015). First, we define a set P of PVPs that intuitively make sense for a given task A. We then use these PVPs as follows: (1) We finetune a separate language model M p for each p ∈ P as described in Section 3.1.
As T is small, this finetuning is cheap even for a large number of PVPs.
(2) We use the ensemble M = {M_p | p ∈ P} of finetuned models to annotate examples from D. We first combine the unnormalized class scores for each example x ∈ D as

$$s_M(l \mid x) = \frac{1}{Z} \sum_{p \in P} w(p) \cdot s_p(l \mid x)$$

where Z = Σ_{p∈P} w(p) and the w(p) are weighting terms for the PVPs. We experiment with two realizations of this weighting term: either we simply set w(p) = 1 for all p, or we set w(p) to the accuracy obtained using p on the training set before training. We refer to these two variants as uniform and weighted; a short code sketch of this annotation step follows after the list below. Jiang et al. (2020) use a similar idea in a zero-shot setting.
We transform the above scores into a probability distribution q using softmax. Following Hinton et al. (2015), we use a temperature of T = 2 to obtain a suitably soft distribution. All pairs (x, q) are collected in a (soft-labeled) training set T C .
(3) We finetune a PLM C with a standard sequence classification head on T_C.
The finetuned model C then serves as our classifier for A. All steps described above are depicted in Figure 2; an example is shown in Figure 1.
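
As referenced in step (2), here is a sketch of the annotation step, assuming scores[p] holds a |D| × |L| matrix of s_p(l | x) values for each PVP p; names and shapes are illustrative:

```python
import torch

def soft_label(scores: dict, weights: dict, temperature: float = 2.0):
    """Combine per-PVP scores and soften them for distillation."""
    Z = sum(weights.values())
    # s_M(l | x) = (1/Z) * sum_p w(p) * s_p(l | x)
    s_M = sum(weights[p] * s for p, s in scores.items()) / Z
    # temperature-2 softmax yields the soft labels collected in T_C
    return torch.softmax(s_M / temperature, dim=-1)
```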

Iterative PET (iPET)
Distilling the knowledge of all individual models into a single classifier C means they cannot learn from each other. As some patterns perform (possibly much) worse than others, the training set T C for our final model may therefore contain many mislabeled examples.
To compensate for this shortcoming, we devise iPET, an iterative variant of PET. The core idea of iPET is to train several generations of models on datasets of increasing size. To this end, we first enlarge the original dataset T by labeling selected examples from D using a random subset of trained PET models (Figure 2a). We then train a new generation of PET models on the enlarged dataset (b); this process is repeated several times (c).
More formally, let M^0 = {M_1^0, ..., M_n^0} be the initial set of PET models finetuned on T, where each M_i^0 is trained for some PVP p_i. We train k generations of models M^1, ..., M^k where M^j = {M_1^j, ..., M_n^j} and each M_i^j is trained for p_i on its own training set T_i^j. In each iteration, we multiply the training set size by a fixed constant d ∈ N while maintaining the label ratio of the original dataset. That is, with c_0(l) denoting the number of examples with label l in T, each T_i^j contains c_j(l) = d · c_{j−1}(l) examples with label l. This is achieved by generating each T_i^j as follows: 1. We obtain N ⊂ M^{j−1} \ {M_i^{j−1}} by randomly choosing λ · (n − 1) models from the previous generation, with λ ∈ (0, 1] being a hyperparameter.
2. Using this subset, we create a labeled dataset T_N = {(x, arg max_{l∈L} s_N(l | x)) | x ∈ D}, where s_N is defined analogously to s_M but uses only the models in N. To avoid training future generations on mislabeled data, we prefer examples for which the ensemble of models is confident in its prediction. The underlying intuition is that even without calibration, examples whose labels are predicted with high confidence are typically more likely to be classified correctly (Guo et al., 2017). Therefore, for each label l we draw a subset T_N(l) ⊆ T_N containing c_j(l) − c_0(l) examples with label l, setting the probability of drawing each (x, l) proportional to s_N(l | x).
3. We define T_i^j = T ∪ ⋃_{l∈L} T_N(l). As can easily be verified, this dataset contains c_j(l) examples for each l ∈ L; a minimal code sketch of one such generation step follows below.
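
The following sketch shows one generation step for a single model, assuming s_N is a |D| × |L| tensor of averaged ensemble scores; all names are illustrative rather than taken from the reference implementation:

```python
import torch

def next_train_set(s_N: torch.Tensor, T: list, c0: int, d: int, j: int) -> list:
    probs = s_N.softmax(dim=-1)
    preds = probs.argmax(dim=-1)      # arg max_l s_N(l | x) for each x in D
    new_set = list(T)                 # always keep the original labeled set
    for l in range(s_N.size(1)):
        pool = (preds == l).nonzero().flatten()
        if len(pool) == 0:
            continue
        need = c0 * d**j - c0         # c_j(l) - c_0(l) new examples of label l
        # draw proportionally to the ensemble's confidence s_N(l | x)
        picks = torch.multinomial(probs[pool, l], min(need, len(pool)))
        new_set += [(int(x), l) for x in pool[picks]]
    return new_set
```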
After training k generations of PET models, we use M^k to create T_C and train C as in basic PET.
With minor adjustments, iPET can even be used in a zero-shot setting. To this end, we define M^0 to be the set of untrained models and set c_1(l) = 10/|L| for all l ∈ L, so that M^1 is trained on 10 examples evenly distributed across all labels. As T_N may not contain enough examples for some label l, we create all T_N(l) by sampling from the 100 examples x ∈ D for which s_N(l | x) is highest, even if l ≠ arg max_{l′∈L} s_N(l′ | x). For each subsequent generation, we proceed exactly as in basic iPET.

Experiments
We evaluate PET on four English datasets: Yelp Reviews, AG's News, Yahoo Questions (Zhang et al., 2015) and MNLI (Williams et al., 2018). Additionally, we use x-stance (Vamvas and Sennrich, 2020) to investigate how well PET works for other languages. For all experiments on English, we use RoBERTa (large) (Liu et al., 2019) as the language model; for x-stance, we use XLM-R (Conneau et al., 2020). We investigate the performance of PET and all baselines for different training set sizes; each model is trained three times using different seeds, and average results are reported.
As we consider a few-shot setting, we assume no access to a large development set on which hyperparameters could be optimized. Our choice of hyperparameters is thus based on choices made in previous work and practical considerations. We use a learning rate of 1 · 10^{-5}, a batch size of 16 and a maximum sequence length of 256. Unless otherwise specified, we always use the weighted variant of PET with auxiliary language modeling. For iPET, we set λ = 0.25 and d = 5; that is, we select 25% of all models to label examples for the next generation and quintuple the number of training examples in each iteration. We train new generations until each model has been trained on at least 1000 examples, i.e., we set k = ⌈log_d(1000/|T|)⌉. As we always repeat training three times, the ensemble M (or M^0) for n PVPs contains 3n models. Further hyperparameters and detailed explanations for all our choices are given in Appendix B.
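
As a worked example of the formula for k, with d = 5 and |T| = 10 three generations suffice, since 10 · 5³ = 1250 ≥ 1000:

```python
import math

d, train_size = 5, 10
k = math.ceil(math.log(1000 / train_size, d))
print(k)  # 3
```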

Patterns
We now describe the patterns and verbalizers used for all tasks. We use two vertical bars (||) to mark boundaries between text segments.3

Yelp For the Yelp Reviews Full Star dataset (Zhang et al., 2015), the task is to estimate the rating that a customer gave to a restaurant on a 1- to 5-star scale based on their review's text. We define the following patterns for an input text a:

P1(a) = It was ___. a
P2(a) = Just ___! || a
P3(a) = a. All in all, it was ___.
P4(a) = a || In summary, the restaurant is ___.

We define a single verbalizer v for all patterns that maps 1-5 to "terrible", "bad", "okay", "good" and "great".

AG's News AG's News is a news classification dataset: given a headline a and text body b, news have to be classified as belonging to one of the categories World (1), Sports (2), Business (3) or Science/Tech (4). For x = (a, b), we define the following patterns:

P1(x) = ___ : a b
P2(x) = a ( ___ ) b
P3(x) = ___ - a b
P4(x) = a b ( ___ )
P5(x) = ___ News: a b
P6(x) = [ Category: ___ ] a b

We use a verbalizer that maps 1-4 to "World", "Sports", "Business" and "Tech", respectively.

3 The way different segments are handled depends on the model being used; they may e.g. be assigned different embeddings (Devlin et al., 2019) or separated by special tokens (Liu et al., 2019; Yang et al., 2019). For example, "a || b" is given to BERT as the input "[CLS] a [SEP] b [SEP]".
Yahoo Yahoo Questions (Zhang et al., 2015) is a text classification dataset. Given a question a and an answer b, one of ten possible categories has to be assigned. We use the same patterns as for AG's News, but we replace the word "News" in P5 with the word "Question". We define a verbalizer that maps categories 1-10 to "Society", "Science", "Health", "Education", "Computer", "Sports", "Business", "Entertainment", "Relationship" and "Politics".
MNLI The MNLI dataset (Williams et al., 2018) consists of text pairs x = (a, b). The task is to find out whether a implies b (0), a and b contradict each other (1) or neither (2). We define

P1(x) = "a"? || ___, "b"
P2(x) = a? || ___, b

and consider two different verbalizers v1 and v2: v1 maps 0, 1 and 2 to "Right", "Wrong" and "Maybe", respectively, while v2 maps them to "Yes", "No" and "Maybe". Combining the two patterns with the two verbalizers results in a total of 4 PVPs.
X-Stance The x-stance dataset (Vamvas and Sennrich, 2020) is a multilingual stance detection dataset with German, French and Italian examples.
Each example x = (a, b) consists of a question a concerning some political issue and a comment b; the task is to identify whether the writer of b supports the subject of the question (0) or not (1). We use two simple patterns and define an English verbalizer v_En mapping 0 to "Yes" and 1 to "No", as well as a French (German) verbalizer v_Fr (v_De) that replaces "Yes" and "No" with "Oui" and "Non" ("Ja" and "Nein"). We do not define an Italian verbalizer because x-stance does not contain any Italian training examples.

Results
English Datasets Table 1 shows results for English text classification and language understanding tasks; we report mean accuracy and standard deviation for three training runs. Comparing against the semi-supervised baselines UDA (Xie et al., 2020) and MixText (Chen et al., 2020), Table 2 shows that PET and iPET substantially outperform both methods across all tasks, clearly demonstrating the benefit of incorporating human knowledge in the form of PVPs.

X-Stance
We evaluate PET on x-stance to investigate (i) whether it works for languages other than English and (ii) whether it also brings improvements when training sets are of medium size. In contrast to Vamvas and Sennrich (2020), we do not perform any hyperparameter optimization on dev and use a shorter maximum sequence length (256 vs. 512) to speed up training and evaluation. To investigate whether PET brings benefits even when numerous examples are available, we consider training set sizes of 1000, 2000 and 4000; for each of these configurations, we separately finetune French and German models to allow for a more straightforward downsampling of the training data. Additionally, we train models on the entire French (|T_Fr| = 11 790) and German (|T_De| = 33 850) training sets. In this case we do not have any additional unlabeled data, so we simply set D = T. For the French models, we use v_En and v_Fr as verbalizers, and for German, v_En and v_De (Section 4.1). Finally, we also investigate the performance of a model trained jointly on French and German data (|T_Fr| + |T_De| = 45 640) using v_En, v_Fr and v_De.
Results are shown in Table 3; following Vamvas and Sennrich (2020), we report the macro-average of the F1 scores for labels 0 and 1, averaged over three runs. For Italian (column "It"), we report the average zero-shot cross-lingual performance of German and French models, as there are no Italian training examples. Our results show that PET brings large improvements across all languages even when training on much more than a thousand examples; it also considerably improves zero-shot cross-lingual performance.

Analysis
Combining PVPs We first investigate whether PET is able to cope with situations where some PVPs perform much worse than others. For |T| = 10, Table 4 compares the performance of PET to that of the best and worst performing patterns after finetuning; we also include results obtained using the ensemble of PET models corresponding to individual PVPs without knowledge distillation. Even after finetuning, the gap between the best and worst pattern is large, especially for Yelp. However, PET is not only able to compensate for this, but even improves accuracy over using only the best-performing pattern across all tasks. Distillation brings consistent improvements over the ensemble; additionally, it significantly reduces the size of the final classifier. We find no clear difference between the uniform and weighted variants of PET.

Auxiliary Language Modeling
We analyze the influence of the auxiliary language modeling task on PET's performance. Figure 3 shows performance improvements from adding the language modeling task for four training set sizes. We see that the auxiliary task is extremely valuable when training on just 10 examples. With more data, it becomes less important, sometimes even leading to worse performance; only for MNLI do we find language modeling to help consistently.
Iterative PET To check whether iPET is able to improve models over multiple generations, Figure 4 shows the average performance of all generations of models in a zero-shot setting. Each additional iteration does indeed further improve the ensemble's performance. We did not investigate whether continuing this process for even more iterations gives further improvements. Another natural question is whether similar results can be obtained with fewer iterations by increasing the training set size more aggressively. To answer this question, we skip generations 2 and 3 for AG's News and Yahoo and for both tasks directly let ensemble M^1 annotate 10 · 5^4 examples for M^4. As indicated in Figure 4 through dashed lines, this clearly leads to worse performance, highlighting the importance of only gradually increasing the training set size. We surmise that annotating too many examples too early leads to a large percentage of mislabeled training examples.

In-Domain Pretraining Unlike our supervised baseline, PET makes use of the additional unlabeled dataset D. Thus, at least some of PET's performance gains over the supervised baseline may arise from this additional in-domain data.
To test this hypothesis, we simply further pretrain RoBERTa on in-domain data, a common technique for improving text classification accuracy (e.g., Howard and Ruder, 2018; Sun et al., 2019). As language model pretraining is expensive in terms of GPU usage, we do so only for the Yelp dataset. Figure 5 shows results of supervised learning and PET both with and without this in-domain pretraining. While pretraining does indeed improve accuracy for supervised training, the supervised model still clearly performs worse than PET, showing that the success of our method is not simply due to the use of additional unlabeled data. Interestingly, in-domain pretraining is also helpful for PET, indicating that PET leverages unlabeled data in a way that is clearly different from standard masked language model pretraining.

Conclusion
We have shown that providing task descriptions to pretrained language models can be combined with standard supervised training. Our proposed method, PET, consists of defining pairs of cloze question patterns and verbalizers that help leverage the knowledge contained within pretrained language models for downstream tasks. We finetune models for all pattern-verbalizer pairs and use them to create large annotated datasets on which standard classifiers can be trained. When the initial amount of training data is limited, PET gives large improvements over standard supervised training and strong semi-supervised approaches.

A Implementation
Our implementation of PET and iPET is based on the Transformers library (Wolf et al., 2020) and PyTorch (Paszke et al., 2017).

B Training Details
Except for the in-domain pretraining experiment described in Section 5, all of our experiments were conducted using a single GPU with 11GB RAM (NVIDIA GeForce GTX 1080 Ti).

B.1 Hyperparameter Choices
Relevant training hyperparameters for both individual PET models and the final classifier C as well as our supervised baseline are listed in Table 5. All hyperparameters were selected based on the following considerations and experiments:

Batch size / maximum length Both batch size and maximum sequence length (or block size) are chosen so that one batch fits into 11GB of GPU memory. As Devlin et al. (2019) and Liu et al. (2019) use larger batch sizes of 16-32, we accumulate gradients for 4 steps to obtain an effective batch size of 16.
Learning rate We found a learning rate of 5e−5 (as used by Devlin et al. (2019)) to often result in unstable training for regular supervised learning with no accuracy improvements on the training set. We therefore use a lower learning rate of 1e−5, similar to Liu et al. (2019). Experiments with various learning rates can be found in Appendix D.
Training steps As the number of training epochs recommended by Liu et al. (2019) in a data-rich scenario is in the range 2-10, we perform supervised training for 250 training steps, corresponding to 4 epochs when training on 1000 examples. For individual PET models, we subdivide each batch into one labeled example from T to compute L_CE and three unlabeled examples from D to compute L_MLM. Accordingly, we multiply the total number of training steps by 4 (i.e., 1000), so that the number of times each labeled example is seen remains constant (16 · 250 = 4 · 1000). For the final PET classifier, we train for 5000 steps due to the increased training set size (depending on the task, the unlabeled set D contains at least 20 000 examples). Deviating from the above, we always perform training for 3 epochs on x-stance to match the setup of Vamvas and Sennrich (2020) more closely. The effect of varying the number of training steps is further investigated in Appendix D.
Temperature We choose a temperature of 2 when training the final classifier following Hinton et al. (2015).
Auxiliary language modeling To find a suitable value of α for combining language modeling loss and cross-entropy loss, we first observed that in the early stages of training, the former is a few orders of magnitude higher than the latter for all tasks considered. We thus selected a range {1e−3, 1e−4, 1e−5} of reasonable choices for α and performed preliminary experiments on Yelp with 100 training examples to find the best value among these candidates. To this end, we split the training examples into a training set and a dev set using both a 90/10 split and a 50/50 split and took the value of α that maximizes average dev set accuracy. We adopt this value for all other tasks and training set sizes without further optimization.
Models per ensemble As we always train three models per pattern, for both iPET and training the final classifier C, the ensemble M (or M^0) for n PVPs contains 3n models. This ensures consistency, as randomly choosing only one of the three models for each PVP would result in high variance. In preliminary experiments, we found this choice to have only a small impact on the final model's performance.
iPET dataset size For iPET, we quintuple the number of training examples after each iteration (d = 5) so that only a small number of generations is required to reach a sufficient amount of labeled data. We did not choose a higher value because we presume that this may cause training sets for early generations to contain a prohibitively large amount of mislabeled data.
iPET dataset creation We create training sets for the next generation in iPET using 25% of the models in the current generation (λ = 0.25) because we want the training sets for all models to be diverse while at the same time, a single model should not have too much influence.
Others For all other hyperparameters listed in Table 5, we took the default settings of the Transformers library (Wolf et al., 2020).

B.2 Number of parameters
As PET does not require any additional learnable parameters, the number of parameters for both PET and iPET is identical to the number of parameters in the underlying language model: 355M for RoBERTa (large) and 270M for XLM-R (base).

B.4 Baselines
For MixText, we use the original implementation (https://github.com/GT-SALT/MixText) and the default set of hyperparameters. Specifically, each batch consists of 4 labeled and 8 unlabeled examples, we use layers 7, 9 and 12 for mixing, we set T = 5 and α = 16, and we use a learning rate of 5 · 10^{-6} for RoBERTa and 5 · 10^{-4} for the final classification layer. We optimize the number of training steps for each task and dataset size in the range {1000, 2000, 3000, 4000, 5000}.

For UDA, we use a PyTorch-based reimplementation (https://github.com/SanghunYun/UDA_pytorch). We use the same batch size as for MixText and the hyperparameter values recommended by Xie et al. (2020); we use an exponential schedule for training signal annealing and a learning rate of 2 · 10^{-5}. We optimize the number of training steps for each task and dataset size in the range {500, 1000, 1500, ..., 10000}.

B.5 In-Domain Pretraining
For the in-domain pretraining experiments described in Section 5, we use the language model finetuning script of the Transformers library (Wolf et al., 2020); all hyperparameters are listed in the last column of Table 5. Pretraining was performed on a total of 3 NVIDIA GeForce GTX 1080 Ti GPUs.

C Dataset Details
For each task and number of examples t, we create the training set T by collecting the first t/|L| examples per label from the original training set, where |L| is the number of labels for the task. Similarly, we construct the set D of unlabeled examples by selecting 10 000 examples per label and removing all labels. For evaluation, we use the official test set for all tasks except MNLI, for which we report results on the dev set; this is due to the limit of 2 submissions per 14 hours for the official MNLI test set. An overview of the number of test examples and links to downloadable versions of all used datasets can be found in Table 6.
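
A short sketch of this construction, assuming examples is a list of (text, label) pairs in the original order (function and variable names are our own):

```python
from collections import defaultdict

def build_sets(examples, labels, t):
    per_label = defaultdict(list)
    for text, label in examples:
        per_label[label].append(text)
    # first t/|L| examples per label, with labels kept
    T = [(x, y) for y in labels for x in per_label[y][: t // len(labels)]]
    # 10,000 examples per label, labels removed
    D = [x for y in labels for x in per_label[y][:10_000]]
    return T, D
```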
Preprocessing In some of the datasets used, newlines are indicated through the character sequence "\n". As the vocabularies of RoBERTa and XLM-R do not feature a newline, we replace this sequence with a single space. We do not perform any other preprocessing, except shortening all examples to the maximum sequence length of 256 tokens. This is done using the longest first strategy implemented in the Transformers library. For PET, all input sequences are truncated before applying patterns.
Evaluation metrics For Yelp, AG's News, Yahoo and MNLI, we use accuracy. For x-stance, we report macro-average of F1 scores using the evaluation script of Vamvas and Sennrich (2020).

D Hyperparameter Importance
To analyze the importance of hyperparameter choices for PET's performance gains over supervised learning, we look at the influence of both the learning rate (LR) and the number of training steps on their test set accuracies.
We try values of {1e−5, 2e−5, 5e−5} for the learning rate and {50, 100, 250, 500, 1000} for the number of training steps. As this results in 30 different configurations for just one task and training set size, we only perform this analysis on Yelp with 100 examples; results are shown in Figure 6. For supervised learning, the configuration used throughout the paper (LR = 1e−5, 250 steps) turns out to perform best, whereas for PET, training for fewer steps consistently performs even better. Importantly, PET clearly outperforms regular supervised training regardless of the chosen learning rate and number of training steps.