Improving and Simplifying Pattern Exploiting Training

Recently, pre-trained language models (LMs) have achieved strong performance when fine-tuned on difficult benchmarks like SuperGLUE. However, performance can suffer when there are very few labeled examples available for fine-tuning. Pattern Exploiting Training (PET) is a recent approach that leverages patterns for few-shot learning. However, PET uses task-specific unlabeled data. In this paper, we focus on few-shot learning without any unlabeled data and introduce ADAPET, which modifies PET's objective to provide denser supervision during fine-tuning. As a result, ADAPET outperforms PET on SuperGLUE without any task-specific unlabeled data. Our code can be found at https://github.com/rrmenon10/ADAPET.


Introduction
Pre-trained language models (LMs) have shown significant gains across a wide variety of natural language processing (NLP) tasks in recent years (Devlin et al., 2019; Radford et al., 2018; Raffel et al., 2020). Most of these gains are obtained by fine-tuning language models on labeled data for a particular task. However, performance can suffer when there is very limited labeled data available for a downstream task (Xie et al., 2020; Chen et al., 2020).
Recently, GPT-3 (Brown et al., 2020) demonstrated how language models, when scaled to hundreds of billions of parameters, can learn well when primed with only a few labeled examples. However, the scale of GPT-3 (175B parameters) makes it impractical to study. There is, therefore, a need to develop smaller language models that can work equally well with limited labeled data.
Pattern-Exploiting Training (PET; Schick and Schütze, 2021a,b) reformulates natural language understanding tasks as cloze-style questions and performs gradient-based fine-tuning. In doing so, PET outperforms GPT-3 with few labeled examples using ALBERT (Lan et al., 2020). However, PET uses additional task-specific unlabeled data.
We propose ADAPET (A Densely-supervised Approach to Pattern Exploiting Training) that uses more supervision by decoupling the losses for the label tokens and a label-conditioned masked language modeling (MLM) objective over the full original input. On SuperGLUE (Wang et al., 2019) with 32 labeled examples per task, ADAPET outperforms iPET without any unlabeled data.

Background
Cloze-style questions and MLM. A cloze task is a problem where certain parts of a text are removed, and the goal is to replace the missing portion based on the context (Taylor, 1953). Here, the text that has some parts removed is considered a cloze-style question. Inspired by cloze tasks, BERT introduces the MLM objective that tries to predict the original word at the masked out positions in a cloze question.

Figure 2: The blue boxes refer to the inputs from a task (entailment, in this case). Figure 2a shows the decoupling label objective. The model has to predict the correct and incorrect labels at the masked out position, using a BCE loss over all labels. For the label conditioning objective in Figure 2b, the input text either includes the correct or an incorrect label. At a randomly masked out position, the model should predict the original token when the input text has the correct label, and should not predict the original token when the input text has an incorrect label.
Notation. Let G represent a language model, x represent the input example converted into a cloze-style question, and y represent the label at the masked location m. We are interested in the quantity [[G_m(x)]]_z, which represents the logit value for a specific token z at the mask location m.
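To make the notation concrete, the following is a minimal sketch of reading off such a logit with HuggingFace Transformers (the model name, example text, and candidate token are placeholders for illustration, not the exact setup used in our experiments):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder checkpoint; any masked language model works for this sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# A cloze-style input x built from an entailment example (cf. Figure 2a).
x = f"Oil prices fall back ? {tokenizer.mask_token} , Oil prices rise"
inputs = tokenizer(x, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# m: index of the mask location; z: id of a candidate label token (e.g. "no").
m = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
z = tokenizer.convert_tokens_to_ids("no")
logit_z_at_m = logits[0, m, z]  # this is [[G_m(x)]]_z in the notation above
```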

Unlabeled Data Access
Schick and Schütze (2021a,b) assume access to task-specific unlabeled data. For some applications such as sentiment analysis, unlabeled data can be cheap to acquire. But for SuperGLUE, where the examples are pairs of text with a label constructed to test a model's natural language understanding abilities, unlabeled data can be much more expensive to acquire. For example, the construction of BoolQ requires annotators to filter good question-article pairs before assigning labels (Clark et al., 2019). Hence, for our setup, we do not assume access to task-specific unlabeled data, which aligns with the setup in Brown et al. (2020).

PET
Our work primarily builds on top of PET (Schick and Schütze, 2021a,b). PET converts an example into a cloze-style question, similar to the input format used during pre-training. The query form in PET is defined by a Pattern-Verbalizer Pair (PVP). Each PVP consists of

• a pattern, which describes how to convert the inputs into a cloze-style question with masked out tokens. We illustrate this for an entailment task in Figure 2a. Here, we convert the premise ("Oil prices fall back") and the hypothesis ("Oil prices rise") into a cloze-style question with the pattern: <premise> ? <mask>, <hypothesis>.
• a verbalizer which describes the way to convert the classes into the output space of tokens. In Figure 2a, the verbalizer maps "Not Entailment/Entailment" to "No/Yes".
After hand-designing a PVP for a given task, PET obtains logits from the model, G_m(x) (in the single-token label case). Given the space of output tokens Y (in Figure 2a, {"Yes", "No"}), PET computes a softmax over y ∈ Y using the logits from G_m(x). The final loss is shown in Equation 2.
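For reference, the single-token PET objective can be reconstructed from this description as follows (with v(·) denoting the verbalizer mapping from classes to tokens and y* the gold label):

$$q(y \mid x) = \frac{\exp\big([\![G_m(x)]\!]_{v(y)}\big)}{\sum_{y' \in \mathcal{Y}} \exp\big([\![G_m(x)]\!]_{v(y')}\big)}, \qquad \mathcal{L}_{\mathrm{PET}} = -\log q(y^* \mid x).$$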
PET additionally distills knowledge from an ensemble of models trained with different patterns on both labeled and unlabeled data. iPET is an iterative variant of PET that trains models over several iterations: the size of the training set gradually increases with each iteration, based on the labels predicted in previous iterations. For a description of the different patterns used across the tasks (Schick and Schütze, 2021b), we refer the reader to Appendix A.1.

ADAPET
Our proposed approach, called ADAPET, modifies the objective from PET so that it can provide more supervision and learn without task-specific unlabeled data.

Decoupling Label Losses
PET computes class probabilities using the logits that correspond to the labels for a specific task. This discards the information from all the other logits in the vocabulary that do not correspond to a label. For example, in Figure 2a, "oil" is not a class token so the LM head should assign a low probability to "oil". However, because PET only extracts the token logits that correspond to labels, the non-label tokens will never have any gradient signal.
One solution is to change the objective to a regular MLM objective. In that case, there would be no distinction between tokens corresponding to incorrect classes and any other token in the vocabulary. For example, in Figure 2a, the model would be trained to treat "Yes" (the incorrect token) the same as any other token such as "oil". While we want the model to discourage "oil", the training objective should still specifically suppress "Yes".
In ADAPET, we penalize incorrect class tokens and encourage correct class tokens. Specifically, the model computes the probability of each token as a softmax normalized across all tokens in the vocabulary, so that each probability is influenced by the logits of all the vocabulary tokens. Then, we maximize the probability of the correct class tokens and minimize the probability of incorrect class tokens. This is equivalent to binary cross entropy, as shown in Figure 2a. Formally, if y* is the true label for an example and P(y | x) denotes this vocabulary-normalized probability of the token for label y at the mask location, the loss can be written as a binary cross entropy over the label tokens:

$$\mathcal{L}_D = -\log P(y^* \mid x) \;-\; \sum_{y \in \mathcal{Y} \setminus \{y^*\}} \log\big(1 - P(y \mid x)\big) \qquad (5)$$
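A minimal PyTorch sketch of this decoupled loss is shown below (the function and variable names are ours for illustration; `mask_logits` is assumed to be the model's logits at the mask position):

```python
import torch
import torch.nn.functional as F

def decoupled_label_loss(mask_logits, correct_id, incorrect_ids):
    """mask_logits: (vocab_size,) MLM logits at the [MASK] position.
    correct_id: vocabulary id of the correct label token.
    incorrect_ids: vocabulary ids of the incorrect label tokens."""
    # Softmax over the full vocabulary, so every logit receives gradient signal.
    probs = F.softmax(mask_logits, dim=-1)
    # Encourage the correct label token ...
    loss = -torch.log(probs[correct_id])
    # ... and explicitly suppress each incorrect label token.
    for i in incorrect_ids:
        loss = loss - torch.log(1.0 - probs[i])
    return loss

# Toy usage with random logits; the token ids here are made up.
logits = torch.randn(30000)
print(decoupled_label_loss(logits, correct_id=2053, incorrect_ids=[2748]))
```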

Unified Loss for Different Tasks
For tasks where the label is exactly one token, PET uses the formulation described in Equation 2. For WSC (Levesque et al., 2012), which does not have incorrect class labels, PET uses the original MLM objective rather than Equation 2. In ADAPET, this is equivalent to Equation 5 without the second term.
For other tasks with multi-token labels (COPA (Roemmele et al., 2011), ReCoRD (Zhang et al., 2018)), PET computes the probability of the classes as the sum of the log probabilities of the individual tokens. However, it is not obvious how to convert these label probabilities into a valid probability distribution.
Rather than normalizing the probabilities, PET uses a hinge loss to ensure a margin between the correct label and the incorrect labels.
In ADAPET, for each token in the label, L_D discriminates the correct token from every other token. Concretely, the decoupled label loss from Equation 5 is applied independently at each mask position t of the label:

$$\mathcal{L}_D = -\sum_{t} \log P(y^*_t \mid x) \;-\; \sum_{y \in \mathcal{Y} \setminus \{y^*\}} \sum_{t} \log\big(1 - P(y_t \mid x)\big)$$

This objective splits a single loss based on multiple tokens into multiple losses over single tokens. As a result, we do not need to multiply the probabilities of the individual tokens, and thus do not run into normalization issues.
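A rough sketch of the multi-token variant follows (the names and data layout are ours; `per_label_logits` is assumed to map each candidate label to the MLM logits at that label's mask positions):

```python
import torch
import torch.nn.functional as F

def multi_token_decoupled_loss(per_label_logits, per_label_ids, correct_label):
    """per_label_logits: dict label -> (num_tokens_in_label, vocab_size) logits.
    per_label_ids: dict label -> list of that label's token ids.
    correct_label: the key of the correct candidate label."""
    loss = torch.tensor(0.0)
    for label, logits in per_label_logits.items():
        probs = F.softmax(logits, dim=-1)  # softmax over the full vocabulary
        ids = per_label_ids[label]
        for t, tok in enumerate(ids):
            if label == correct_label:
                loss = loss - torch.log(probs[t, tok])        # encourage correct tokens
            else:
                loss = loss - torch.log(1.0 - probs[t, tok])  # suppress incorrect tokens
    return loss
```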

Label Conditioning
The PET objective encapsulates the question: "Given the input, what is the right label?" However, since the input space and output space both consist of tokens, we can also ask the inverse question: "Given the answer, what is the correct context?" That is, the model is trained to predict the input given the label. Formally, let x′ be the original input x modified by randomly masking out tokens from the context, and let x_m be the original context tokens masked out in x′. In the label conditioning objective, we are interested in the quantity P(x_m | x′, y), which encourages the model to predict the masked out tokens in the input given the label.
During training, if the input is conditioned on the correct label, the model has to predict the original token at each masked position, as shown in Figure 2b; if it is conditioned on an incorrect label, the model is trained not to predict the original token. This objective is the same as the decoupled label losses approach described in Equation 5, except with different inputs and outputs.
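A rough PyTorch sketch of this objective (the names are ours; the input is assumed to already contain the candidate label in place of the label mask, with random context tokens masked out):

```python
import torch
import torch.nn.functional as F

def label_conditioned_mlm_loss(logits, original_ids, mask_positions, label_is_correct):
    """logits: (seq_len, vocab_size) MLM logits for the label-filled, masked input.
    original_ids: (seq_len,) token ids of the unmasked input.
    mask_positions: indices of the context tokens replaced with [MASK]."""
    probs = F.softmax(logits, dim=-1)
    loss = torch.tensor(0.0)
    for pos in mask_positions:
        p_orig = probs[pos, original_ids[pos]]
        if label_is_correct:
            loss = loss - torch.log(p_orig)        # predict the original token
        else:
            loss = loss - torch.log(1.0 - p_orig)  # do not predict the original token
    return loss / max(len(mask_positions), 1)
```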
The final loss for ADAPET is a sum of the decoupled label loss and the label-conditioned MLM loss.

Results and Analyses
We run experiments on SuperGLUE, and follow the same data split as Schick and Schütze (2021b), which consists of 32 labeled examples for each task.
Our code is implemented in PyTorch (Paszke et al., 2019) using HuggingFace (Wolf et al., 2020). We use the same pre-trained model and hyperparameters as PET, except that we increase the number of training batches to 1k and choose the best checkpoint on the dev set, since it has been shown that training longer can help even with few samples (Zhang et al., 2021). For all ablation experiments, we only use the first pattern and train for 250 batches. We refer the reader to Appendix B for more details.
Since we do not assume access to unlabeled data (see Section 2.1), we do not apply the three-step training procedure of PET and iPET to ADAPET. We still assume access to the full development set to choose the best masking ratio and checkpoint, since PET presumably used the full development set to choose the hyperparameters that we copy. Table 1 and Table 2 show our results on the validation and test sets of SuperGLUE. We compare against GPT-3 and PET/iPET. Note that PET/iPET uses unlabeled data and a three-step training procedure (Schick and Schütze, 2021b). For a fair comparison, we train PET with a single pattern (sPET) for 1k batches, and report scores for the best-performing pattern on the validation set. We include a further analysis of how well the models perform for each pattern in Appendix A.2.

Results
On the dev set, ADAPET outperforms all models that do not use unlabeled data, and even outperforms PET's iterative variant, iPET, by 0.5 absolute points. Surprisingly, sPET outperforms PET, but still trails iPET by 2.6 points. This is in line with the ablation from Schick and Schütze (2021b), which shows that ensembling sPET models trained with only labeled data outperforms PET. Also, Gao et al. (2021) show that the model with the best-performing pattern outperforms ensembling sPET models.
On the test set, ADAPET outperforms all other models, including iPET, despite not having access to the unlabeled examples (∼9k on average per task), and achieves state-of-the-art few-shot performance on SuperGLUE.

Conclusion
In this paper, we propose ADAPET, a new method for few-shot natural language understanding. Crucially, our work does not use unlabeled data and instead leverages more supervision to train the model. Assuming the same data budget, our model outperforms GPT-3 on SuperGLUE using just 0.1% as many parameters. However, our method has limitations; for example, we use a naive random masking strategy, which might not make sense for label conditioning. Future work could look into better masking strategies for label-conditioned MLM, such as masking important tokens based on the gradients of the logits for an example, as has been done for interpreting models (Simonyan et al., 2014).

A.1 Patterns and Verbalizers

BoolQ (Clark et al., 2019). For this QA task, we are given a paragraph p and a yes/no question q. We use two forms of labels for this task: yes/no and true/false.

RTE. This is a textual entailment task similar to CB, except that we have just two labels for classification, entailment and not entailment. We map these two labels to yes and no respectively in the PVPs.

WiC. In this task, we are given two sentences s1 and s2 and we need to identify if a word w occurs in the same sense in both sentences.
• Pattern : "s 1 " / "s 2 " Similar sense of "w"? ___ . Verbalizer: yes/no Here, we are given a sentence s that contains some nouns and pronouns. We are tasked with finding the correct noun that a specific pronoun p refers to. Within the FewGLUE dataset, we are provided with the only positive examples and hence our verbalizer contains just the correct noun phrase.

Verbalizer: correct noun
• Pattern: s In the passage above, what does the pronoun '*p*' refer to? Answer: __.

Verbalizer: correct noun
A.1.7 MultiRC (Khashabi et al., 2018)

In this task, we are given a passage p and multiple questions q. We are tasked with finding the right answer from a list of candidate answers e. Here, we pose it as a binary classification task, where we predict yes if e answers q given the context p, else no.
Verbalizer: yes/no
• Pattern: p. Based on the previous passage, q? Is "e" a correct answer? __.
Verbalizer: yes/no

A.1.8 ReCoRD (Zhang et al., 2018)

For this task, given a passage p and a cloze question q, we are supposed to find the right replacement for the '@placeholder' token in the question. Since the task itself is already framed in a cloze-style format, we merely concatenate the passage with the cloze question to form the input to the language model.

A.2 Results on Individual Patterns
We train the sPET and ADAPET models using the same experimental setup mentioned in Section 4 and report results across all patterns for all datasets on the validation set of SuperGLUE. Note that the dev numbers in Table 1 are the best numbers from this table. Our results can be found in Table 4.

C.1 Training Longer

We trained ADAPET for 1k batches and compared to PET/iPET, which were trained for 250 batches. In this section, we compare sPET and ADAPET trained for 250 and 1k batches in Table 6. Note that training for 1k batches is not guaranteed to outperform training for 250 batches, even if we checkpoint every 250 batches, since the learning rate scheduler has to accommodate a different number of total batches. Overall, ADAPET gets a boost from training longer, especially on ReCoRD, while sPET peaks at 250 batches.

C.2 Multi-Task Multi-Pattern Training
We also tried training the model with multiple patterns at once, as opposed to ensembling and distilling them. We formulated this as a multi-task training problem, where different patterns are viewed as different tasks, and the model samples a pattern to train on for each batch. We compare sPET, ADAPET, and ADAPET without the label conditioning objective. The results are shown in Table 7. In general, multi-task multi-pattern training hurts performance for ADAPET, is mixed for sPET, and is beneficial for ADAPET without the label conditioning objective.

C.3 Replacement Token Detection (RTD)
In our formulation, the decoupled label objective can be viewed as a binary classifier that seeks to assign high probability to the correct label token, and low probability to the incorrect label token. In reality though, the model has a softmax classifier head on top that is converted into a one-vs-all classifier. Another way to achieve the same objective would be to use a binary classifier head on top. Rather than feeding in the "[MASK]" token, we would feed in either the correct label token or the incorrect label token, and the model must distinguish whether these tokens make sense in context or not. This objective would be very similar to the RTD objective for ELECTRA (Clark et al., 2020).
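A minimal sketch of such a binary head follows (a hypothetical module for illustration, not the head used in ELECTRA or in our released code):

```python
import torch
import torch.nn as nn

class BinaryTokenHead(nn.Module):
    """Scores whether a candidate label token placed at the mask position fits the context."""
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, label_position):
        # hidden_states: (batch, seq_len, hidden) from the encoder, run on an input
        # where [MASK] has been replaced by a candidate label token.
        h = hidden_states[:, label_position, :]
        return torch.sigmoid(self.scorer(h)).squeeze(-1)  # probability the token fits
```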
Inference would be slower, since the number of forward passes scales up with the number of labels. For multi-token labels, though, because there is no need to condition on other label tokens, the number of forward passes scales down by the number of tokens in the labels. Table 8 shows the results of using the RTD objective with a binary classifier. Overall, the RTD objective performs worse than the decoupled label objective. There are several reasons why using an RTD head might perform worse. First, the RTD head has |V| times fewer parameters, but relative to the whole model, the change in the number of parameters is not substantial. Second, the softmax classifier has been pre-trained and contains a lot of information, which is lost when we discard it and randomly initialize a binary classifier head from scratch.
We also experimented with initializing the binary classifier head from ELECTRA, but the results were the same, so we omit them from the table. We note that ALBERT (xxlarge-v2) is a much better performing model than BERT, and ELECTRA is more comparable to BERT than to ALBERT (xxlarge-v2).

C.4 Label Conditioning with Important Words Masked Out
For the label conditioning component, we randomly mask out tokens in the input text, and the model tries to predict the original token when conditioned on the correct label, and to not predict the original token when conditioned on an incorrect label. This makes sense if the masked out token is an influential token that affects the label, like "Yes" in Figure 2a, but makes less sense if the masked out token is an unimportant word like "the". We experiment with only masking out important words, using TF-IDF as an approximation of how important a word is. The results are shown in Table 9. Overall, using TF-IDF to select important words for masking hurts performance.
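As a rough illustration, important words can be approximated with a TF-IDF score as in the sketch below (the helper name and details are ours, not the exact implementation used for Table 9):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def important_words(train_texts, text, top_k=3):
    """Return the top_k highest TF-IDF words of `text`, as candidates for masking."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train_texts)  # e.g., fit on the 32 labeled training examples
    scores = vectorizer.transform([text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, scores), key=lambda pair: -pair[1])
    return [word for word, score in ranked[:top_k] if score > 0]

print(important_words(["oil prices fall back", "oil prices rise"], "oil prices fall back"))
```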

C.5 Ensembles
PET/iPET ensemble and distill with unlabeled data. However, it is not clear how beneficial unlabeled data is for ensembling, so we show results of ensembling models trained only on labeled data with different patterns and different seeds. For ensembling, we average the logits across the different models.

C.5.1 Across Patterns

Table 10 shows our results when ensembling across patterns. In general, ensembling across patterns provides mixed results for ADAPET and sPET. This corroborates the finding in Gao et al. (2021) that the best performing single pattern sometimes outperforms ensembling across patterns.

C.5.2 Across Seeds

Table 11 shows our results when ensembling across seeds. We fix the pattern (pattern 1) and train with different seeds. For this experiment, we ensemble models trained with seeds 41, 42, and 43. From the results in Table 11, we find that ensembling across seeds also provides mixed results. Hence, we do not apply ensembling for our final results.
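Concretely, the logit averaging used for ensembling looks roughly like this (a toy sketch with made-up values):

```python
import torch

def ensemble_logits(all_logits):
    """all_logits: list of (num_labels,) tensors, one per trained model."""
    return torch.stack(all_logits, dim=0).mean(dim=0)

# Example: three models' label logits for one example.
preds = ensemble_logits([torch.tensor([0.2, 1.1]),
                         torch.tensor([0.5, 0.3]),
                         torch.tensor([1.0, 0.9])])
print(int(preds.argmax()))  # predicted label index
```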

C.6 Masking Ratio
We experiment with several different masking schemes in Table 12, where we mask out either a fixed percentage of tokens (FIXED) or up to a fixed percentage (VARIABLE). If x is the number of tokens masked out under FIXED masking, we mask out between 1 and x tokens under VARIABLE masking. For the ablation, we tested masking ratios that are multiples of 1.5% (in addition to 10%), to match the 15% ratio of ALBERT pre-training. From our results in Table 12, we find that a 10.5% VARIABLE mask ratio provides the best trade-off across models. Hence, we choose it for our final experiments in the main paper.
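A small sketch of the two schemes (the helper name is ours):

```python
import random

def num_tokens_to_mask(seq_len, ratio, scheme="VARIABLE"):
    """Number of context tokens to mask under the FIXED and VARIABLE schemes."""
    max_masked = max(1, round(ratio * seq_len))
    if scheme == "FIXED":
        return max_masked                 # always mask ratio * seq_len tokens
    return random.randint(1, max_masked)  # VARIABLE: between 1 and that many

print(num_tokens_to_mask(seq_len=40, ratio=0.105))  # e.g., 10.5% VARIABLE masking
```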
C.7 What if we had unlabeled data?
One of the key motivations of our work is to eliminate the need for unlabeled data during few-shot training on language understanding tasks. In this section, we push that limitation of prior methods