Few-Shot Text Generation with Natural Language Instructions

Providing pretrained language models with simple task descriptions in natural language enables them to solve some tasks in a fully unsupervised fashion. Moreover, when combined with regular learning from examples, this idea yields impressive few-shot results for a wide range of text classification tasks. It is also a promising direction for improving data efficiency in generative settings, but there are several challenges to combining task descriptions and example-based learning for text generation. In particular, it is crucial to find task descriptions that are easy for the pretrained model to understand and to ensure that the model actually makes good use of them; furthermore, effective measures against overfitting have to be implemented. In this paper, we show how these challenges can be tackled: We introduce GENPET, a method for text generation that is based on pattern-exploiting training, a recent approach for combining textual instructions with supervised learning that has so far only been applied to classification tasks. On several summarization and headline generation datasets, GENPET gives consistent improvements over strong baselines in few-shot settings.


Introduction
Pretraining large neural networks with a language modeling objective has led to significant improvements throughout NLP (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020, i.a.). Further improvements are often possible by choosing a different pretraining objective that more closely matches the downstream task of interest. Examples include casing prediction for named entity recognition (Mayhew et al., 2020), gap sentence generation for summarization (Zhang et al., 2020), and sentence unshuffling for discourse representations.

Figure 1: Example input: "Banking accounts are now setup again for accessing. The login id is still your main account with the password being reset to the last six (6) digits of your SSN." Without any instructions, the model simply generates a continuation of the given input (top). Providing an instruction makes it generate an appropriate summary (center) or e-mail title (bottom) even in zero-shot settings and enables much more data-efficient learning.
While such approaches can significantly reduce the amount of training data required, they typically still do not perform well if only a handful of examples is available for the downstream task, which is a common scenario for many real-world uses of NLP. In such few-shot settings, however, significant gains are possible by reversing what is adapted to what: Instead of making pretraining more similar to a downstream task, we can reformulate the downstream task to make it more similar to the pretraining objective. For masked language models (e.g., Devlin et al., 2019; Lewis et al., 2020), one such reformulation technique is to convert inputs to cloze questions by adding a text snippet that contains some form of task description, often in the form of a short prompt (Radford et al., 2019; Schick and Schütze, 2021a). Besides making pretraining and finetuning more similar, this approach has the compelling benefit of enabling users to explain a task to a pretrained model, making it much easier for the model to understand the task. This is illustrated in Figure 1, where a pretrained language model is given the same input with different instructions and adapts its output accordingly. The idea of providing task descriptions even works in an unsupervised setting (Radford et al., 2019) or when examples are simply provided as additional context (Brown et al., 2020); however, it only unfolds its full potential when combined with gradient-based training on a handful of labeled examples (Schick and Schütze, 2021b). Unfortunately, current approaches for doing so are limited to text classification tasks (Schick and Schütze, 2021a). Inspired by their success, we investigate whether the underlying idea can also be transferred to more challenging text-to-text tasks that require the generation of text sequences given an input text, such as abstractive summarization.
We introduce GENPET, a novel method based on PET (Schick and Schütze, 2021a) that enables finetuning of generative language models using both instructions and labeled examples. We show that GENPET is a highly data-efficient method that enables us to finetune a pretrained PEGASUS model (Zhang et al., 2020) with as few as 10 or 100 training examples. We evaluate our approach on a diverse set of six English headline generation and text summarization tasks, both in zero-shot and few-shot settings, and show that PEGASUS trained with GENPET clearly outperforms regular finetuning.
In summary, our contributions are as follows:
• We introduce GENPET, a finetuning procedure for generative language models that achieves great data efficiency by using both textual instructions and training examples.
• We show that training PEGASUS with GENPET outperforms standard finetuning across a broad set of tasks and training set sizes.
• We analyze the factors contributing to GENPET's strong performance and quantify the impact of all its components.

Related Work
Masked language modeling was proposed as a pretraining objective by Devlin et al. (2019). Several variants of this objective that involve generating sequences of text have been proposed, including T5 (Raffel et al., 2020), BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020), of which we make use in this work. The idea of rephrasing tasks as cloze questions is commonly used to probe the knowledge contained within masked language models (e.g., Petroni et al., 2019; Wang et al., 2019; Talmor et al., 2020; Ettinger, 2020; Kassner and Schütze, 2020; Sakaguchi et al., 2020). Schick and Schütze (2021a) propose PET, which combines this idea with gradient-based learning for efficient few-shot text classification. Jiang et al. (2020), among others, consider the problem of finding the best way to rephrase a given task as a cloze question. Schick and Schütze (2021b)'s version of PET can generate multiple tokens, but still requires a text classification objective and does not scale to long output sequences. Radford et al. (2019) consider task descriptions for text generation tasks, but do so only in a zero-shot setting. In a similar spirit, Brown et al. (2020) investigate the ability of pretrained language models to leverage task descriptions and examples without any gradient-based optimization.
Other approaches to few-shot learning in NLP commonly require large sets of examples from related tasks (Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019; Ye et al., 2020), parallel data for consistency training (Xie et al., 2020), or highly specialized methods tailored towards a specific task (Laban et al., 2020). In contrast, GENPET requires no additional labeled data and provides an intuitive interface for leveraging task-specific human knowledge.
Our work is also related to prefix-constrained decoding in interactive machine translation for making suggestions on how to complete a partial translation (Knowles and Koehn, 2016;Wuebker et al., 2016). Keskar et al. (2019) and He et al. (2020) similarly use prompts and keywords for controllable text generation, but require specific pretraining procedures and do so only in high-resource settings.

PEGASUS Pretraining
We briefly summarize the pretraining procedure of PEGASUS (Zhang et al., 2020), the model to which we apply GENPET. PEGASUS is a standard Transformer encoder-decoder architecture (Vaswani et al., 2017) that is pretrained using gap-sentence generation, an objective tailored to text summarization tasks. This pretraining objective requires a set of documents consisting of multiple sentences. The key idea is to preprocess each document by (i) picking a subset of m informative sentences,² (ii) replacing each of these sentences by a mask token, and (iii) concatenating all removed sentences into a pseudo-summary. The Transformer model is then trained to generate this pseudo-summary given the partially masked document. Similar to prior work (e.g., Raffel et al., 2020; Lewis et al., 2020), this is done by having the encoder process the entire masked document and the decoder generate the output autoregressively. Zhang et al. (2020) train two variants of PEGASUS: PEGASUS-base, a 12-layer model with approximately 223M parameters, and PEGASUS-large, a 16-layer model with 568M parameters. As only the latter version is publicly available in a variant that is not finetuned on any downstream task, all our experiments are based on PEGASUS-large.

Figure 2: The input x is converted into a cloze question P(x). The probability p(y | x) of each label y is derived from the probability that a pretrained model M assigns to its verbalization v(y) at the masked position.
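The gap-sentence generation preprocessing can be sketched in Python as follows. This is a simplified illustration, not the actual PEGASUS implementation: the function name is ours, a crude unigram-overlap F1 stands in for the Rouge1 F1 used to measure informativeness, and "[MASK1]" is an assumed mask token.

```python
def gsg_example(sentences, m):
    """Select the m most informative sentences, mask them in the document,
    and concatenate them into the pseudo-summary target."""
    def overlap_f1(sent, rest):
        # Unigram-overlap F1 between a sentence and the rest of the document
        # (a simple stand-in for Rouge1 F1).
        a = set(sent.lower().split())
        b = set(" ".join(rest).lower().split())
        inter = len(a & b)
        if inter == 0:
            return 0.0
        p, r = inter / len(a), inter / len(b)
        return 2 * p * r / (p + r)

    # Rank sentences by informativeness, highest first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: overlap_f1(sentences[i],
                                             sentences[:i] + sentences[i + 1:]),
                    reverse=True)
    picked = set(ranked[:m])
    # Replace each picked sentence by the mask token in the input document.
    masked_doc = " ".join("[MASK1]" if i in picked else s
                          for i, s in enumerate(sentences))
    # The removed sentences, in document order, form the generation target.
    pseudo_summary = " ".join(sentences[i] for i in sorted(picked))
    return masked_doc, pseudo_summary
```

The model is then trained to generate `pseudo_summary` given `masked_doc`.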

Pattern-Exploiting Training
Pattern-Exploiting Training (PET, Schick and Schütze (2021a)) is a finetuning method for text classification tasks. That is, PET can be applied to problems where a text sequence x ∈ X must be mapped to a label y from a finite set Y. As shown in Figure 2, PET enables data-efficient text classification by converting inputs into cloze questions; this drastically reduces the number of examples required (Schick and Schütze, 2021a,b).
Let M be a masked language model, V its vocabulary of tokens and __ ∈ V the mask token; we denote the set of all token sequences as V*. Given an input sequence z ∈ V* that contains exactly one mask token, let p_M(t | z) denote the probability assigned to t ∈ V by M at the masked position in z. As illustrated in Figure 2, PET requires:
• a pattern P : X → V* that maps each input x to a cloze question containing exactly one mask token;
• a verbalizer v : Y → V that maps each label y to a single token representing its meaning in the pattern.

² The most informative sentences are selected, where informativeness is measured as the Rouge1 F1 score (Lin, 2004) between the sentence and the remaining document.
The probability of y given x is then derived from the probability that M assigns to v(y) at the masked position in P(x):

p(y | x) ∝ p_M(v(y) | P(x)), normalized over all labels y ∈ Y    (1)

For finetuning, the cross-entropy between p(y | x) and the true label of x is used as the training objective.
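PET's classification rule can be sketched with toy numbers; the mask-position probabilities and the sentiment verbalizer below are hypothetical, not outputs of a real masked language model.

```python
def pet_label_probs(mask_token_probs, verbalizer):
    """Derive p(y | x) from the MLM's distribution at the masked position.

    mask_token_probs: dict token -> p_M(token | P(x)) at the mask position.
    verbalizer: dict label -> single token v(y).
    """
    # Score each label by the probability of its verbalization token.
    scores = {y: mask_token_probs.get(v, 0.0) for y, v in verbalizer.items()}
    # Normalize over the label set Y only.
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Toy example: binary sentiment with verbalizer {pos: "great", neg: "terrible"}.
probs_at_mask = {"great": 0.30, "terrible": 0.05, "the": 0.10}
p = pet_label_probs(probs_at_mask, {"pos": "great", "neg": "terrible"})
```

Here `p["pos"]` = 0.30 / 0.35 ≈ 0.857: the label whose verbalization the model finds more plausible at the mask wins.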

Generation with Instructions
We now introduce GENPET, our method for finetuning language models with instructions for text generation. Similar to PET, we provide instructions by means of patterns P : X → V* that we use to modify the original input. However, we do not require a verbalizer as our output space already consists of natural language sentences, i.e., Y ⊆ V*. In designing GENPET, we tackle three key challenges for few-shot text generation with instructions:
1. How should we provide an instruction to an encoder-decoder model so that the model can make the best possible use of it? (§5.1)
2. How can we cope with the fact that performance can vary strongly with the instruction used, without a large development set to select the best one? (§5.2)
3. How can we prevent the model from overfitting the handful of training examples? (§5.3)

Notation Let P be a pattern, x ∈ X and y ∈ Y input and output text sequences, and z = P(x) the result of applying P to x, i.e., a text sequence containing a single mask token. Furthermore, let y = y_1 ... y_n, z = z_1 ... z_m and let the mask token in z be at some position h ≤ m. We denote the subsequence y_i ... y_j by y_{i:j}. We consider an encoder-decoder model M pretrained by masked language modeling. That is, the model must be able to compute a probability p_M(y | z) that measures to what extent y is a plausible substitute for the mask in z. We further require that this is done by decomposing the joint probability of y as follows:

p_M(y | z) = Π_{i=1}^{n} p_M(y_i | z; y_{1:i−1})    (2)

where p_M(y_i | z; y_{1:i−1}) is obtained by processing z using the encoder and y_{1:i−1} using the decoder.
If we happen to already know some prefix y_{1:k−1} of y, we denote with

p_M(y_{k:n} | z; y_{1:k−1}) = Π_{i=k}^{n} p_M(y_i | z; y_{1:i−1})    (3)

the probability that M assigns to the remaining sequence y_{k:n} if the prefix y_{1:k−1} was already processed with the decoder.
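This autoregressive factorization can be illustrated with a toy computation; the per-token conditionals below are hypothetical numbers, not real model outputs.

```python
import math

def seq_log_prob(cond_log_probs, start=0):
    """Sum per-token conditionals log p_M(y_i | z; y_{1:i-1}).

    start=0 gives log p_M(y | z); start=k-1 gives the log-probability of the
    remaining sequence given that the prefix y_{1:k-1} was already decoded.
    """
    return sum(cond_log_probs[start:])

# Hypothetical conditionals for a 4-token output y.
lp = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.5)]
full = seq_log_prob(lp)             # log p_M(y | z)
suffix = seq_log_prob(lp, start=2)  # log-prob of y_{3:4} given prefix y_{1:2}
```

The joint probability is just the product of the conditionals (here 0.5 · 0.25 · 0.5 · 0.5 = 0.03125), and conditioning on a known prefix simply drops the prefix terms from the sum.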

Using a Single Instruction
As M is an encoder-decoder language model, we have several options for how to apply a pattern P, i.e., how to ingest an instruction when computing the probability of y given x: We may process the entire sequence P(x) = z with the encoder, but we may also choose some index j < h and process z_{1:j−1} z_{h:m} using the encoder and z_{j:h−1} using the decoder. For example, if z = "Summary: __ Text: x", we can process the prefix "Summary:" using the encoder or the decoder; that is, we may compute either of the following (cf. Figure 3):

p_1(y | x) = p_M(y | Summary: __ Text: x)    (4)
p_2(y | x) = p_M(y | __ Text: x ; Summary:)    (5)

Figure 3: Generation process of an output y = y_1 ... y_n for input x when the instruction is entirely processed using the encoder (top) and when parts of it are processed using the decoder (bottom). We use ⟨s⟩ to denote the model's start-of-sequence token. The seemingly subtle difference between the two setups can lead to quite different generations: Instructions processed by the decoder have a stronger impact on the model's predictions than those processed by the encoder.

In preliminary experiments, we found tokens that belong to the partially generated output sequence (i.e., tokens that are processed using the decoder) to have a much stronger impact on the model's predictions than regular input tokens (i.e., those processed by the encoder). This applies all the more to PEGASUS, which is pretrained to always generate full sentences: If the pattern consists of a partial sentence (e.g., a short prompt) that is to be completed by the model, and this prefix is processed with the encoder, PEGASUS tends to simply start a new sentence that does not relate to the given prefix. Based on this observation, we supplement each pattern P with a decoder prefix d ∈ V* that is given to the model as part of the generated sequence rather than the observed input. Accordingly, we define the probability of y given x as

p_(P,d)(y | x) = p_M(y | P(x) ; d)    (6)

In Eqs. 4 and 5, probability p_1 corresponds to using pattern P_1(x) = "Summary: __ Text: x" with an empty decoder prefix d_1, whereas p_2 corresponds to using the pattern P_2(x) = "__ Text: x" with a decoder prefix d_2 = "Summary:". Both variants are illustrated in Figure 3. We finetune M on a set of training examples (x, y) simply by minimizing the cross-entropy between p_(P,d)(y | x) and y using teacher forcing.
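The two placements of the instruction can be sketched as follows. This is a toy illustration of how the input is split between encoder and decoder; `split_for_model` and the token lists are hypothetical, not the actual PEGASUS interface.

```python
def split_for_model(pattern_tokens, decoder_prefix):
    """Return (encoder_input, decoder_start) for one instruction placement.

    pattern_tokens: the pattern applied to x, containing the mask token "__"
                    that the generated output will replace.
    decoder_prefix: tokens given to the model as part of the generated
                    sequence rather than the observed input.
    """
    encoder_input = pattern_tokens             # processed by the encoder
    decoder_start = ["<s>"] + decoder_prefix   # processed by the decoder
    return encoder_input, decoder_start

x = ["Text:", "the", "article"]
# p1: instruction entirely in the encoder, empty decoder prefix d1.
enc1, dec1 = split_for_model(["Summary:", "__"] + x, [])
# p2: instruction moved into the decoder as prefix d2 = "Summary:".
enc2, dec2 = split_for_model(["__"] + x, ["Summary:"])
```

In the second setup, generation starts from `["<s>", "Summary:"]` on the decoder side, so every generated token is directly conditioned on the instruction as if the model had produced it itself.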

Combining Instructions
As shown in previous work (Jiang et al., 2020; Schick and Schütze, 2021a), using different instructions or formulating the same input in different ways can have a strong impact on the model's performance. Unfortunately, in the absence of a large development set, instructions that work well are often hard to distinguish from those that perform poorly. We alleviate this issue by enabling the simultaneous use of multiple instructions (represented by multiple pairs of patterns and decoder prefixes) and combining them using a mechanism similar to knowledge distillation (Hinton et al., 2015). This mechanism mitigates the negative influence of instructions that are hard for the model to understand, meaning that users can simply provide all (variants of) instructions that they can think of. Further, it is much faster and more memory efficient than having to constantly use multiple instructions (and thus, multiple models) during inference. PET (Schick and Schütze, 2021a) also uses a multi-pattern approach, based on averaging the predictions obtained with different patterns, but it is not applicable in text generation settings as we cannot compute the average of multiple generated sequences in a meaningful way.
Given pairs of patterns and corresponding decoder prefixes (P_1, d_1), ..., (P_k, d_k) and a set of models M_1, ..., M_k, where each M_i was finetuned using (P_i, d_i), we aim to obtain a single model M̃ that contains the combined knowledge of all models. To do so, we require a small set of unlabeled examples U. For each x ∈ U, we first generate one output sequence y_(P_i,d_i) per pair (P_i, d_i) using greedy decoding, resulting in a set of candidate outputs C_x = { y_(P_i,d_i) | 1 ≤ i ≤ k }. To assign a score to each candidate y ∈ C_x, we first compute the log-likelihood of y for each (P_i, d_i) using the corresponding model M_i:

log p_(P_i,d_i)(y | x)    (7)

The total score of y is then simply the exponentiated average over the patterns:

s(y | x) = exp( (1/k) Σ_{i=1}^{k} log p_(P_i,d_i)(y | x) )    (8)

The model M̃ is trained on pairs (x, y) where x ∈ U and y is drawn from C_x with probability proportional to s(y | x). While we could train this final model to simply maximize p_M̃(y | x), we note that this creates a large discrepancy between pretraining and finetuning: During pretraining, masked language models only process sequences that contain at least one mask token. In the spirit of our intention to make pretraining and finetuning as similar as possible (§1), we therefore train M̃ using a trivial pattern P(x) = "__ x" that just prepends a single mask token to the input, and use an empty decoder prefix; that is, we maximize p_M̃(y | __ x) instead of p_M̃(y | x). In addition to reducing the pretraining-finetuning discrepancy, putting the mask token before the input biases the model towards generating text that is likely to precede the input. This is desirable because news articles, which abound in big language models' pretraining data, often have a headline and a short summary before the article rather than after it.
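The scoring and sampling mechanism can be sketched as follows; the log-likelihood functions below are toy stand-ins for the finetuned models M_i, and the function names are ours.

```python
import math
import random

def combined_score(y, x, pair_log_probs):
    """s(y | x): exponentiated average of the per-instruction log-likelihoods."""
    lps = [lp(y, x) for lp in pair_log_probs]
    return math.exp(sum(lps) / len(lps))

def sample_training_pair(x, candidates, pair_log_probs, rng):
    """Draw one candidate output with probability proportional to s(y | x)."""
    scores = [combined_score(y, x, pair_log_probs) for y in candidates]
    r, acc = rng.random() * sum(scores), 0.0
    for y, s in zip(candidates, scores):
        acc += s
        if r <= acc:
            return (x, y)
    return (x, candidates[-1])  # guard against floating-point rounding

# Toy setup: two instructions that disagree on which candidate is likely.
fns = [lambda y, x: math.log(0.2) if y == "a" else math.log(0.8),
       lambda y, x: math.log(0.8) if y == "a" else math.log(0.2)]
pair = sample_training_pair("x", ["a", "b"], fns, random.Random(0))
```

With these toy likelihoods both candidates receive the same combined score exp((log 0.2 + log 0.8)/2) = 0.4, illustrating how the geometric averaging dampens any single instruction's (dis)preference.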

Preventing Overfitting
In preliminary experiments, we found pretrained encoder-decoder models to strongly overfit the training data when trained on just a handful of examples: When generating new texts, they often simply reproduce phrases from training examples, even if they are not in any way related to the current input. To alleviate this issue, we introduce two modifications to our training procedure; we refer to them as unsupervised scoring and joint training.
Unsupervised Scoring For unsupervised scoring, we compute s(y | x) as in Eq. 8, but we use an untrained model (i.e., one that has not been finetuned on task-specific examples) to compute p (P i ,d i ) (y | x) in Eq. 7 for all i ∈ {1, . . . , k}.
The intuition behind this is as follows: If for a given input, a trained model simply reproduces phrases from its training set, the resulting pair of input and output texts should look strange to an untrained model, which has not seen the example from which the output is (partially) copied. Thus, sampling outputs from the candidate set C x based on the probability assigned to each example by an untrained model helps prevent overfitting: It results in the final model being primarily trained on examples that also look natural to a model that has not seen the training data.
We further use this idea to discard generated texts of particularly poor quality altogether. To this end, we sort the set C = ∪_{x∈U} C_x of all outputs for all candidate sets by their likelihood according to the untrained model, in ascending order. Let the rank r_y of each output y ∈ C be its position in this sorted list, divided by the list's size. We then remove all outputs with r_y < τ from the candidate sets C_x, where the threshold τ is a hyperparameter.
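The rank-based filtering can be sketched as follows, with a toy log-likelihood function in place of the untrained PEGASUS model; the function name is ours.

```python
def filter_candidates(candidate_sets, unsup_log_prob, tau):
    """Drop the fraction tau of all generated outputs that the untrained
    model finds least likely.

    candidate_sets: dict mapping input x to its list of candidate outputs C_x.
    unsup_log_prob: (x, y) -> log-likelihood under the untrained model.
    """
    flat = [(x, y) for x, ys in candidate_sets.items() for y in ys]
    flat.sort(key=lambda xy: unsup_log_prob(*xy))  # ascending likelihood
    n = len(flat)
    # r_y is the (0-based) position in the sorted list divided by its size;
    # outputs with r_y < tau are removed.
    kept = {xy for rank, xy in enumerate(flat) if rank / n >= tau}
    return {x: [y for y in ys if (x, y) in kept]
            for x, ys in candidate_sets.items()}

# Toy example: five outputs across two inputs; "b" is the least likely.
scores = {"a": -1.0, "b": -5.0, "c": -2.0, "d": -3.0, "e": -4.0}
cs = {"x1": ["a", "b"], "x2": ["c", "d", "e"]}
filtered = filter_candidates(cs, lambda x, y: scores[y], 0.2)
```

With τ = 0.2 and five outputs, exactly the single least likely output ("b") is discarded.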
Joint Training In §5.2, we assume the existence of an ensemble {M_1, ..., M_k} where each model was trained using a different instruction. However, instead of training an individual model M_i for each pair (P_i, d_i), we can also train a single model jointly on all instructions. To do so, we simply replicate each training instance k times and process the ith copy with (P_i, d_i). Our motivation is that forcing a single model to work well for all instructions can act as a regularizer that prevents overfitting. This approach comes with the additional benefits of being faster to train and generating less overhead. Note that we still require instruction combination (§5.2) because even with a single model that understands all instructions, it would be unclear which instruction to choose at test time, and querying the model with all instructions would be inefficient.
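The data construction for joint training is then just a replication step; a minimal sketch (the function name and tuple layout are ours):

```python
def replicate_for_joint_training(examples, instruction_pairs):
    """Build the joint training set: each (x, y) appears once per
    instruction, with the pattern applied to x and the decoder prefix kept
    alongside for use during finetuning."""
    data = []
    for x, y in examples:
        for pattern, prefix in instruction_pairs:
            data.append((pattern(x), prefix, y))
    return data

# Toy instructions mirroring the two pattern/prefix placements above.
pairs = [(lambda x: "__ Text: " + x, ""),
         (lambda x: "__ " + x, "Summary:")]
data = replicate_for_joint_training([("doc", "sum")], pairs)
```

A single model trained on `data` sees every example under every instruction, which is what lets it act as its own regularizer.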

Experiments
Tasks We evaluate PEGASUS with and without GENPET on a subset of the tasks used by Zhang et al. (2020). As our computing resources are limited, we only choose those tasks for which the maximum output length used by Zhang et al. (2020) is at most 128 tokens. We include the following tasks:
• AESLC (Zhang and Tetreault, 2019): Given an email body, predict the title of the email.
• Gigaword (Rush et al., 2015): Given the first sentence of a news article, generate its headline.
• XSum (Narayan et al., 2018): Given a news article, generate a short, one-sentence summary.

For each task, we use the entire test set for evaluation.⁴ We create two types of training sets containing either 10 or 100 training examples; in addition, we provide 1,000 unlabeled examples per task. As even small changes in the training data can have a large impact on model performance, we create three distinct training sets per size (10 and 100) and task using different random seeds, resulting in a total of six training sets per task. Scores reported in this section are always average scores across all three equal-sized sets of training examples, except for zero-shot settings where no training data is available at all.

⁴ The only exception to this is NEWSROOM, which contains more than 100,000 examples: We only consider a subset of 10,000 examples to ensure a resource-friendly evaluation.
Instructions We use the same two patterns P_1 and P_2 across all tasks, but we combine them with different, task-specific decoder prefixes; all decoder prefixes are shown in Table 1. We combine each pattern with each decoder prefix, resulting in four pairs per task: (P_1, d_1), (P_1, d_2), (P_2, d_1), (P_2, d_2).
Setup For all our experiments with GENPET, we use PEGASUS-large (Zhang et al., 2020) as the underlying language model and perform greedy decoding; our implementation is based on the Transformers library (Wolf et al., 2020) and PyTorch (Paszke et al., 2017). Unless stated differently, all experiments are performed using the same setup as Schick and Schütze (2021a) and a single GPU with 11GB RAM (NVIDIA GeForce GTX 1080 Ti). For optimizing hyperparameters, much previous few-shot work uses development sets that are larger than the training sets by multiple orders of magnitude (e.g., Xie et al., 2020); however, assuming the existence of such large development sets is inconsistent with real-world few-shot settings. In contrast, Schick and Schütze (2021a) assume no development data at all and determine hyperparameters based only on previous work and practical considerations. We choose a middle course and create a small development set of 100 examples for only one of the six tasks, XSum. We use this development set in combination with a single training set of 10 examples to determine hyperparameters for all tasks and training sets. However, we do so only for hyperparameters for which no consistent value can be derived from previous work.
Following Zhang et al. (2020), we use a maximum input length of 512 tokens, the Adafactor optimizer (Shazeer and Stern, 2018) with square root learning rate decay, a dropout rate of 0.1 and label smoothing with ε = 0.1 (Szegedy et al., 2016); we also adopt their maximum output lengths for each task. As recommended by Schick and Schütze (2021a), we train all models for 250 steps using a batch size of 8. We also tried training for 500 and 1,000 steps on our development set but found no major differences in performance. For the learning rate, we tried values of α · 10⁻⁵ with α ∈ {1, 10, 50}, as Schick and Schütze (2021a) use α = 1 and Zhang et al. (2020) use α = 50; we found α = 10 to perform best for all models. For unsupervised scoring (§5.3), we use a threshold of τ = 0.2, i.e., we discard the 20% of examples that are least likely according to an untrained model. We chose this value by looking at texts generated by PEGASUS trained on 10 examples from the XSum development set, where we found the bottom 20% to contain texts of poor quality, including random telephone numbers and repetitions of the same word. For evaluation, we follow Zhang et al. (2020) and report Rouge1, Rouge2 and RougeL (R1/R2/RL) F1 scores (Lin, 2004) after stemming with the Porter algorithm (Porter, 1997).
Results On all six tasks, we compare the following three approaches for finetuning a pretrained PEGASUS model:
• PEGASUS: The regular finetuning procedure described in Zhang et al. (2020).
• PEGASUS-M: Finetuning with a single trivial pattern that inserts a mask token before the first word.
• GENPET: Finetuning with GENPET using patterns P 1 and P 2 and the decoder prefixes in Table 1 as described above; we apply all modifications described in §5.3.
We do not compare to other few-shot approaches as they either make quite different assumptions (for example, GENPET requires manually designed patterns and some amount of unlabeled examples, whereas meta-learning approaches (e.g., Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019) require large annotated datasets for related tasks), or they cannot be transferred to a generative setting in a straightforward fashion, as is the case for consistency-based methods such as that of Xie et al. (2020). However, we note that PEGASUS is a strong baseline in terms of data efficiency: For many tasks, it almost matches the performance of prior state-of-the-art systems trained on the full datasets with as few as 100 examples (Zhang et al., 2020). Table 2 shows results for zero-shot learning and for few-shot learning with 10 and 100 training examples. In the few-shot settings, GENPET consistently outperforms PEGASUS across all tasks, resulting in an average improvement in R1 over PEGASUS of 7.20 (31.63 vs 24.43) for 10 examples and 2.58 (34.45 vs 31.87) for 100 examples. PEGASUS-M performs better than regular finetuning, indicating that even adding a single mask token at the very beginning, without any instructions, already improves performance. (Recall that the effect of the initial mask is to make finetuning more similar to pretraining and to bias the models towards generating text that is likely to appear before the input; see §5.2.) However, it still performs clearly worse than GENPET, demonstrating that PEGASUS is indeed able to make use of the instructions provided. In the zero-shot setting, GENPET also outperforms all baselines on average, but falls short on individual tasks.
Quantitative Analysis To analyze the factors contributing to GENPET's performance, Table 3 compares the performance of the best ("best only") and the worst ("worst only") performing pairs of pattern and decoder prefix to that of GENPET in a setting with 10 training examples. We see some difference in performance between using only the best and worst pairs, but this difference is not as pronounced as in previous work (Schick and Schütze, 2021a,b), possibly because our instructions are more similar to each other than patterns in prior work. Notably, our strategy for combining instructions clearly performs better than using just the best instruction across all tasks and measures (compare GENPET with "best only"). Table 3 also shows results for using the best pattern without a decoder prefix ("no dec. prefix") and instead processing the entire input using the encoder. That is, given (P, d) with P(x) = z_1 ... z_n and z_h = __, we compute p_M(y | z_1 ... z_{h−1} d z_h ... z_n) rather than p_M(y | z_1 ... z_n ; d), similar to the example shown in Figure 3 (top). While this variant still performs better than PEGASUS-M on two out of three datasets, the results clearly show that PEGASUS makes less use of task descriptions if they are processed using the encoder. The bottom two rows of Table 3 show performance when we replace unsupervised scoring (§5.3) with regular scoring using the supervised models ("sup. scoring") and when we additionally do not perform joint training ("no joint train."). As can be seen, not using joint training hurts performance for all three tasks, and supervised scoring hurts performance for two out of three tasks.

Table 3 (excerpt; R1/R2/RL on three tasks):
                 24.80/12.48/24.19  34.15/12.05/26.78  33.94/21.34/30.03
no dec. prefix   15.49/ 7.24/15.09  34.12/11.95/26.41  32.56/20.15/28.64
sup. scoring     25.33/13.41/24.87  35.68/13.19/28.06  34.37/22.04/30.53
no joint train.  24.37/12.67/24.00  35.41/13.15/27.95  34.04/21.95/30.35

Table 4: Zero-shot summaries for the news item given as "Input". PEGASUS (PG) simply creates a verbatim copy of the second part of the input. PEGASUS-M (PG-M) hallucinates ("Monday" vs. "Friday"). GENPET's summary is close in quality to gold.


Qualitative Analysis
Table 4 shows zero-shot outputs of the three methods for one selected input from Gigaword that illustrates some typical behaviors: Regular PEGASUS just creates a verbatim copy of the input's second half; this is true not only for this particular example, but can be seen frequently across all datasets. We assume this is because Zhang et al. (2020) introduce some modifications to their training procedure that encourage the model to copy text. PEGASUS-M is able to produce an output that is not just a word-for-word copy of the input, but it hallucinates information that is not backed by the input text ("monday"). We found that hallucination is a frequent problem for PEGASUS-M. This is hardly surprising given that the model has no way of knowing that it is expected to generate a factual headline summarizing the input. In contrast, GENPET generates a fluent and factual headline that covers all relevant aspects.

Conclusion
We investigated the ability of pretrained language models to make use of simple instructions with the aim of enabling more data-efficient text generation. We identified three major challenges: enabling language models to make good use of the instructions provided, ensuring that the instructions are useful, and preventing overfitting. We tackled these challenges in our proposed approach, GENPET.

A Analysis
Sequence Length We look at the performance of GENPET as a function of the maximum output length ℓ. One might be concerned that the influence of the decoder prefix on generated tokens may decrease with distance. This would mean that diminishing gains are to be expected from GENPET for tasks that require longer text sequences to be generated. To investigate whether this is a problem for GENPET, Table 5 shows the performance of PEGASUS and GENPET for all tasks with an original maximum output length of 128 tokens, using maximum output lengths of ℓ = 32 and 128.
For both values of ℓ, we compute the gains g ℓ from using GENPET as the difference in performance between GENPET and PEGASUS. On average, increasing ℓ to 128 tokens reduces the gains from GENPET over regular finetuning by just g 32 − g 128 = 0.10 points R1. This shows that instructions provided using GENPET have a strong impact on generated tokens even if there are dozens of other tokens in between. Thus, GENPET works not only for short sequences, but is also beneficial for generating long text sequences.
Unsupervised Scoring We motivated the use of unsupervised scoring in Section 5.3 by the observation that PEGASUS tends to overfit the training data. This can for example be seen when training PEGASUS with individual instructions on the 10 examples from the XSum dataset used to optimize hyperparameters. One of these examples has the gold-standard summary "Hugo Chavez [. . . ] is one of the most visible, vocal and controversial leaders in Latin America"; as shown in Table 6, this induces PEGASUS to generate the phrase "the most visible, vocal and controversial" for many other inputs, even in cases where this phrase does not make any sense given the input text. Out of the summaries generated for 1,000 unlabeled examples, we found 92 to contain this particular phrase word-for-word. Table 6 also shows the rank of each output as defined in Section 5.3 (i.e., its relative position in a list of all generated outputs that is sorted by likelihood in ascending order), both when likelihood is assigned using the trained models (r_sup) and when it is assigned using a fully unsupervised PEGASUS model (r_unsup). As can be seen, an untrained model indeed assigns much lower likelihood to those examples, thus downweighting their influence on the final model.

Variance To quantify the significance of performance improvements with GENPET over our two baselines, PEGASUS and PEGASUS-M, Table 7 shows the standard deviation of Rouge1/Rouge2/RougeL scores across the three different training sets for all tasks considered.