A Thorough Evaluation of Task-Specific Pretraining for Summarization

Task-agnostic pretraining objectives like masked language modeling or corrupted span prediction are applicable to a wide range of downstream NLP tasks (Raffel et al., 2019), but are outperformed on summarization by task-specific pretraining objectives like predicting extracted gap sentences (Zhang et al., 2020). We compare three summarization-specific pretraining objectives with the task-agnostic corrupted span prediction pretraining in a controlled study. We also extend our study to low-resource and zero-shot setups to understand how many training examples are needed before the task-specific pretraining can be ablated without quality loss. Our results show that task-agnostic pretraining is sufficient in most cases, which hopefully reduces the need for costly task-specific pretraining. We also report new state-of-the-art numbers for two summarization tasks using a T5 model with 11 billion parameters and an optimal beam search length penalty.


Introduction
Previous work mostly used task-agnostic pretraining methods like corrupted span prediction (T5; Raffel et al., 2019), masked language modeling (BERT; Devlin et al., 2018), a denoising objective (BART; Lewis et al., 2019) or a vanilla language model (GPT; Radford et al., 2019). Intuitively, it makes sense to refine the pretraining toward a setup that more closely resembles the downstream task. Prior work demonstrates that incorporating task-specific priors into BERT language model pretraining improves low-resource finetuning. This is also done by Zhang et al. (2020) with PEGASUS, where important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, to teach summarization models to do better content selection. Narayan et al. (2021) proposed a content planning pretraining objective on top of PEGASUS, by pre-pending the output sequence with the entity plans observed in it. PEGASUS achieved state-of-the-art ROUGE-1/-2/-L scores (Lin, 2004) on BBC XSum (Narayan et al., 2018) with 47.21 / 24.56 / 39.25 and on CNN/DailyMail with 44.17 / 21.47 / 41.11. These numbers could not be matched by Raffel et al. (2019), even when using a much larger model with up to 11 billion parameters. This seems to support the intuition that task-specific pretraining is important for the best performance. However, Raffel et al. (2019) used a beam search length penalty (beam alpha) of 0.6. We set the beam alpha parameter to the optimal value and report new state-of-the-art results on XSum and SAMSum (Table 1).
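The beam alpha parameter enters beam search through a length-normalization factor. A minimal sketch, assuming the GNMT-style formulation that is common in these codebases (the function name is ours):

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty: hypothesis log-probabilities are divided
    by this factor, so larger alpha discounts long outputs less and thus
    favors longer summaries."""
    return ((5.0 + length) / 6.0) ** alpha

# With alpha = 0 the factor is 1 for every length, i.e. no normalization;
# raising alpha from 0.6 toward 0.8 increasingly rewards longer hypotheses.
no_penalty = length_penalty(20, 0.0)
t5_default = length_penalty(20, 0.6)
```

Because the factor is monotone in alpha for any fixed length above one token, tuning alpha effectively tunes the model's preferred output length at decoding time, without retraining.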
Given these new results, we want to answer the question of whether task-specific pretraining objectives are still at an advantage. To avoid any influence of hyperparameters, pretraining datasets, tokenization or evaluation scripts, we reimplement all experiments in the same framework, namely the PEGASUS framework.1 To our surprise, we found that in a controlled comparison the task-agnostic pretraining method performs as well as the task-specific pretraining methods for large finetuning setups. We further extend our study to low-resource and zero-shot setups to understand how many training examples are needed before the task-specific pretraining can be ablated without quality loss. Finally, we want to see if our findings also translate to other text generation tasks. We therefore pretrain a model with corrupted text and evaluate it on grammatical error correction.

Pretraining Models
We use a transformer architecture (Vaswani et al., 2017) with 12 hidden layers, a hidden size of 768, a filter size of 3072 and 12 attention heads, with a total of 223M parameters. All models are pretrained for 1.5 million steps on the C4 corpus (Raffel et al., 2019) with a batch size of 16, Adafactor (Shazeer and Stern, 2018), a learning rate of 0.01, and maximum input/output lengths of 512 and 256, in the PEGASUS framework. If not mentioned otherwise, we do not perform any hyperparameter tuning but use the best-performing hyperparameters found in prior work. We do not explore a pretraining plus prefinetuning setup in this paper (Aghajanyan et al., 2021).
We now briefly explain the task-agnostic objective of corrupted span prediction and two task-specific objectives, salient sentence selection for summarization and text corruption for grammatical error correction. Additionally, we also experimented with the objectives of masking and predicting random and lead sentences.
Corrupted Span Prediction (T5) This pretraining objective is based on a span-prediction task, an adaptation of the masked-language objective for autoregressive seq2seq models. As in BERT, we mask out 15% of the input text. We allow masking of contiguous spans of lengths 1, 2, 3, 4 and 5 with probabilities 0.1, 0.2, 0.4, 0.2 and 0.1, respectively.

Mask Salient Sentence (PEGASUS) We follow Zhang et al. (2020) to select and mask whole sentences from documents. The concatenated gap-sentences can be seen as a pseudo-summary and serve as targets. To more closely approximate a summary, sentences that appear to be important/principal to the document are selected. As a proxy for importance, the ROUGE-1 F1 score between the sentence and the rest of the document is used.
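The salient-sentence selection can be sketched as follows. This is a simplified, independently scored variant (Zhang et al. (2020) describe several selection strategies); the function names and the minimal ROUGE-1 implementation are ours:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (a bare-bones ROUGE-1)."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, num_gaps):
    """Score each tokenized sentence against the rest of the document and
    return the indices of the top-scoring ones; their concatenation serves
    as the pseudo-summary target."""
    scores = []
    for i, sent in enumerate(sentences):
        rest = [tok for j, s in enumerate(sentences) if j != i for tok in s]
        scores.append((rouge1_f1(sent, rest), i))
    return sorted(i for _, i in sorted(scores, reverse=True)[:num_gaps])
```

Sentences with high lexical overlap with the remainder of the document score highest, which is the proxy for "importance" used above.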

Mask Random Sentence (MRNDS) We also pretrain a model with randomly selected sentences as gap-sentences. This can be seen as a sentence-level version of the masked language model (Devlin et al., 2018), as a version of T5 that generates whole sentences, or as a simplification of PEGASUS in which the content selection aspect is missing.
Mask Lead Sentence (MLEADS) In this setup we pretrain a model with the first m sentences of a document as gap-sentences. This is motivated by the fact that for some text genres, for example news, the most important information comes at the beginning of a paragraph. It is also a natural setup for summarization, since lead sentences are known to be a strong baseline to compare summarization models against.
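For illustration, the two simpler gap-sentence selection rules amount to the following (function names are ours):

```python
import random

def mask_lead_sentences(sentences, m):
    """MLEADS: the first m sentences become the pseudo-summary target."""
    return list(range(min(m, len(sentences))))

def mask_random_sentences(sentences, m, seed=0):
    """MRNDS: m uniformly sampled sentences become the target."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(sentences)), min(m, len(sentences))))
```

Both return sentence indices; the selected sentences are removed from the input and concatenated to form the output sequence, exactly as in the PEGASUS setup but without any saliency scoring.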
Text Corruption (TEXTCOR) Analogous to PEGASUS for summarization, we pretrain a task-specific model for grammatical error correction. To create pairs of broken and correct text snippets, we corrupt each sentence using a combination of the following operations: a) drop tokens, b) swap tokens, c) insert tokens, d) replace tokens, e) drop characters, f) swap characters, g) insert characters, h) lower-case a word, i) upper-case the first character of a word. We limit ourselves to the aforementioned purely unsupervised corruption techniques and do not use more sophisticated methods like replacing words with common misspellings, as done by Náplava and Straka (2019).
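A minimal sketch of such a corruption procedure, covering a subset of operations a)-i). The sampling rates and the use of the sentence's own tokens as a replacement vocabulary are our assumptions, not the paper's configuration:

```python
import random

def corrupt_sentence(tokens, rng, p=0.1):
    """Return a corrupted copy of a token list; the training pair is
    (corrupted sentence -> original sentence)."""
    corrupted = []
    for tok in tokens:
        r = rng.random()
        if r < p:                              # a) drop the token
            continue
        elif r < 2 * p:                        # d) replace with another token
            corrupted.append(rng.choice(tokens))
        elif r < 3 * p and len(tok) > 1:       # f) swap adjacent characters
            i = rng.randrange(len(tok) - 1)
            corrupted.append(tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:])
        elif r < 4 * p:                        # h)/i) flip first-character case
            corrupted.append(tok[0].swapcase() + tok[1:])
        else:
            corrupted.append(tok)
    if len(corrupted) > 1 and rng.random() < p:  # b) swap two adjacent tokens
        i = rng.randrange(len(corrupted) - 1)
        corrupted[i], corrupted[i + 1] = corrupted[i + 1], corrupted[i]
    return corrupted
```

Because the corruption is purely rule-based, unlimited pretraining pairs can be generated from any monolingual corpus without supervision.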

Finetuning Experiments
All our experiments are done in the PEGASUS framework. We validate that the numbers are roughly identical to a comparable setup in T5.2 For this, the numbers in Table 1 labeled "T5 base ours" should match the numbers in Table 2 labeled "T5 Full". Both experiments correspond to the same model size, conducted in different frameworks.

Datasets and Eval Metrics
We measure performance on three commonly used summarization benchmarks, namely CNN/DailyMail (Hermann et al., 2015), BBC XSum (Narayan et al., 2018) and SAMSum (Gliwa et al., 2019). We also report zero-shot results.

[Table 2 caption: We report the Lead-1 baseline for BBC XSum from Narayan et al. (2018) and the Lead-3 baseline for CNN/DailyMail from Rothe et al. (2020). For SAMSum, we achieve the best lead scores when we select the top 5 sentences of each input. Results in gray are worse than the lead-sentence baseline. Best results in each block are bolded. Results marked with * are not comparable, see text. For results marked with o, the untrained checkpoint at step 0 performed best on the development set.]

We use ROUGE-1, -2 and -L as metrics. The datasets differ in the degree of abstraction and in summary length. The summaries of CNN/DailyMail are more extractive in nature and have an average length of 3 sentences. The summaries of BBC XSum are single sentences and more abstractive. The SAMSum summaries consist of 2-3 meeting minutes. Finally, the CNN/DailyMail, BBC XSum and SAMSum datasets have 287k/13.4k/11.5k, 204k/11.3k/11.3k and 14.7k/818/819 training/development/test examples, respectively. We finetune our pretrained models on the full datasets and on subsampled versions with 10, 100, 1,000 and 10,000 examples. During finetuning, we use maximum input/output lengths of 1024/128 for CNN/DailyMail, 1024/64 for XSum and 512/128 for SAMSum. All models were finetuned with a batch size of 256. The best model was selected based on ROUGE-L performance on the full development set. During inference, all models were decoded with a beam alpha of 0.8 and a beam size of 5. Results shown in Table 2 are the average performance of 5 models trained on different samples, as low-resource setups are known to have high variance.

Results
We found that the performance of the span prediction objective is always better than or on par with the performance of the salient sentence prediction objective on all three datasets when using the whole training set. Linguistically, it might be more interesting to generate full sentences than spans, but empirically we found no evidence that the mask salient sentence pretraining is better at content selection than the corrupted span pretraining for summarization. In fact, we found that constraining pretraining to task-specific information, such as the most important information appearing at the beginning of a paragraph (MLEADS; CNN/DailyMail), makes it hard to generalize across datasets and leads to inferior performance compared to pretraining by generating random sentences (MRNDS).
For low-resource setups, results varied somewhat depending on the task. For abstractive datasets such as XSum and SAMSum, T5 achieved better performance than PEGASUS with as few as 10 or 100 examples. With 1,000 and 10,000 examples, results from both models were on par for SAMSum, but PEGASUS performed better than T5 on XSum. For CNN/DailyMail, PEGASUS consistently outperformed T5 in all low-resource setups. On the other hand, CNN/DailyMail is not ideal for evaluating low-resource models due to the extractive nature of its summaries; one can perform well simply by selecting the first few sentences. When 1,000 examples or fewer are provided, the Lead baseline and MLEADS are on par and outperform the other methods, even though MLEADS does not use any of the training data.
Zero-Shot We also assess how well pretrained models perform out of the box on different generation tasks (zero-shot). For this, we simply run inference on the test sets using the different pretrained checkpoints without any finetuning. The results in Table 2 are not surprising: the sentence-level pretraining in MRNDS, MLEADS and PEGASUS is better than T5 at producing well-formed summaries. It is also probably not fair to evaluate T5 on zero-shot summarization, as T5 models are pretrained to generate masked spans and not full sentences.

[Figure 1 caption: Comparison of how models adapt to target lengths from zero-shot to low-resource cases. We plot the average summary lengths for different models. We report results on XSum; similar patterns were found on CNN/DailyMail and SAMSum.]
Length comparisons It has been argued that ROUGE tends to prefer longer summaries, so we wanted to investigate (a) whether models leverage this phenomenon and (b) whether it is unfair to compare different pretraining methods trained on self-supervised targets with different length distributions. As depicted in Figure 1, for the zero-shot case we observe very different average lengths in the predicted summaries of different models, with PEGASUS being closest to the target lengths. However, by 1,000 training examples, all models start generating summaries of comparable lengths.

Grammatical Error Correction
We further investigate if our findings translate to other generation tasks. Here, we focus on the task of grammatical error correction, but other important aspects of text generation may also benefit from task-specific pretraining and are still underexplored; e.g., improving evaluation (Sellam et al., 2020), factuality (Chen et al., 2020) or planning for grounded generation (Narayan et al., 2021).

Datasets and Eval Metrics
For grammatical error correction, we finetune our pretrained models on the FCE (Yannakoudakis et al., 2011) and W&I (Bryant et al., 2019) datasets and report the F0.5 score (Dahlmeier and Ng, 2012) computed by the M2 scorer.3

Results As shown in Table 3, TEXTCOR outperforms T5 on all dataset sizes. The results also show that an unrelated task-specific pretraining objective hurts performance even when training on the full dataset. This is notable, as for example the MRNDS pretraining is not that far off from normal language model pretraining and should learn a reasonable amount about language and well-formed sentences.
Zero Shot In contrast to summarization, no easy baseline exists for grammatical error correction. A simple copy baseline would score highly on word-overlap metrics like BLEU or ROUGE, but on our main metric, F0.5, it only achieves a score of 4.24. Our pretrained TEXTCOR model achieves an F0.5 score of 18.64, with a precision of 40.94 and a recall of 5.87. The T5 model needs only 10 training examples to achieve the same F0.5 score (Table 3). We hypothesize that the zero-shot performance of the TEXTCOR pretraining could be greatly improved by tuning the hyperparameters of the text corruption to better match the distribution of the CoNLL dev and test sets. However, this would limit the scope of the pretrained model even further, as this distribution would not translate to other datasets or related tasks, like correcting OCR (optical character recognition) or ASR (automatic speech recognition) errors.
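The F0.5 metric weights precision twice as heavily as recall, which is standard for grammatical error correction since unnecessary edits are considered worse than missed ones. A minimal sketch (the function name is ours), which reproduces the zero-shot TEXTCOR score from the precision and recall reported above:

```python
def f_beta(precision, recall, beta=0.5):
    """General F_beta score; beta < 1 emphasizes precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The reported zero-shot TEXTCOR numbers are internally consistent:
# f_beta(40.94, 5.87) evaluates to roughly 18.65, matching the reported
# F0.5 of 18.64 up to rounding of precision and recall.
textcor_f05 = f_beta(40.94, 5.87)
```

This also makes the copy baseline's low score intuitive: copying the input yields near-zero precision on the edit level, and F0.5 punishes that heavily.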

Conclusion
We evaluated several pretraining techniques on two different text generation tasks, summarization and grammatical error correction. While pretraining for summarization is very important, we found no evidence that task-specific pretraining improved on common benchmarks for abstractive datasets, even in a low-resource setting. On extractive datasets, task-specific pretraining showed benefits, but the results are below a sentence selection baseline, questioning its practical usefulness. Given the trend toward larger neural network models with significant training costs, we recommend using a task-agnostic pretraining regime. Corrupted span prediction is currently our most successful candidate, with state-of-the-art results on two of the investigated summarization benchmarks. But we are curious whether even more flexible pretraining techniques will emerge. For grammatical error correction, task-specific pretraining showed superior performance, especially in a low-resource setting. We therefore believe that task-specific pretraining or prefinetuning can still be useful for important aspects of text generation.