UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization

The high annotation costs and diverse demands of various summarization tasks motivate the development of few-shot summarization. However, despite the emergence of many summarization tasks and datasets, the current training paradigm for few-shot summarization systems ignores potentially shareable knowledge in heterogeneous datasets. To this end, we propose UniSumm, a unified few-shot summarization model that is pre-trained with multiple summarization tasks and can be prefix-tuned to excel at any few-shot summarization task. Meanwhile, to better evaluate few-shot summarizers, under the principles of diversity and robustness, we assemble and release a new benchmark, SummZoo. It consists of 8 summarization tasks with multiple sets of few-shot samples for each task, covering diverse domains. Experimental results and analysis show that UniSumm outperforms strong baselines by a large margin across all sub-tasks in SummZoo under both automatic and human evaluations, and achieves comparable results to a GPT-3.5 model in human evaluation.


Introduction
There has been a recent surge of interest in summarizers based on large pre-trained language models (PLMs) (Liu and Lapata, 2019; Yang et al., 2020; Zhong et al., 2020; Yu et al., 2022), where various summarization tasks (the term task later in this paper refers to a specific summarization task, e.g., query-focused meeting summarization, which is usually associated with a corresponding dataset, e.g., QMSum, unless otherwise specified) have been proposed to meet different practical demands, such as comprehending different inputs (e.g., news (Fabbri et al., 2019) and dialogue (Zhong et al., 2022a)) and generating different outputs (e.g., headlines).

* Yulong Chen completed this work during his internship at Microsoft. † Yang Liu is the corresponding author.
Figure 1: The few-shot summarization scenario in this paper. We are interested in how to re-use previous datasets (e.g., CNNDM) to improve the few-shot performance on unseen target tasks (e.g., DIALOGSUM).
Recently, prefix-tuning (Li and Liang, 2021) has established strong baselines on many few-shot natural language generation tasks, including summarization. The main idea is to extract knowledge from PLMs by prepending and tuning additional parameters (prefixes) before each layer of the PLM. Work has been done to improve performance by designing more sophisticated prefixes (Ghazvininejad et al., 2022; Liu et al., 2022b). Despite being effective, PLMs can have limited summarization knowledge due to the salient gap between pre-training objectives (e.g., language modeling) and summarization objectives (Aribandi et al., 2022). In addition, existing summarization datasets can provide relevant knowledge to newly proposed summarization tasks, and therefore benefit them, especially under the few-shot scenario. However, existing work tends to tune PLMs directly on a new task, without exploiting cross-task knowledge from summarization datasets, which may limit the generalization and adaptation abilities of models (Zhong et al., 2019; Chen and Yang, 2021; Fang et al., 2022).
We address these issues by proposing a unified few-shot summarization framework, UNISUMM. The idea is to combine multi-task pre-training (Chen and Shuai, 2021) on existing summarization datasets with few-shot prefix-tuning (Li and Liang, 2021) on target tasks. To this end, we first build a multi-task model using a Transformer-based language model as the backbone and equip it with task-specific prefix vectors, and then pre-train the multi-task model on diverse summarization datasets. In this stage, we optimize the summarization model together with the task-specific prefixes and a universal prefix, using an asymmetrical weight decay strategy. Using prefixes in the multi-task pre-training stage has two advantages. First, the mixture of shared summarization parameters and unique task-specific parameters helps leverage mutual benefits across datasets (Ruder, 2017). Second, the pre-trained prefixes can serve as a knob for the second stage of prefix-tuning on unseen tasks. When facing an unseen few-shot summarization task, we freeze the multi-task learned backbone model and use the universal prefix as initialization for prefix-tuning.
A data obstacle for few-shot summarization research is the lack of a benchmark for fair comparison. Previous studies either focus on one type of data, e.g., news text (Liu et al., 2022b), or train their systems on non-public few-shot samples. However, because few-shot models can be highly sensitive to training data, the selection of different few-shot samples in different papers can lead to ambiguous comparisons (a.k.a. Sample Selection Bias (Cortes et al., 2008)). To address these issues, we assemble and release a new few-shot summarization benchmark, SUMMZOO, following two principles, namely diversity of tasks and robustness of evaluation. SUMMZOO collects summarization data from 8 existing datasets, which are diverse in terms of domain (news, academic papers, meetings, etc.), format (single-document and multi-document), and length on both source and target sides. For more robust evaluation, for each task, SUMMZOO provides 5 different (randomly sampled) few-shot training sets, and requires all systems to report their averaged results. Finally, SUMMZOO includes 10-shot and 100-shot settings.
We compare UNISUMM against several strong baselines, including a GPT-3.5 model (text-davinci-002) (Brown et al., 2020; Ouyang et al., 2022), on SUMMZOO and conduct thorough analysis. Automatic evaluation shows that UNISUMM outperforms baselines across all sub-tasks, and human evaluation shows that UNISUMM performs better than baselines of similar size and comparably to text-davinci-002. Additionally, UNISUMM is empirically found to be more stable and robust when facing different few-shot samples. Analysis shows that combining multi-task pre-training and few-shot prefix-tuning is essential to the performance of UNISUMM, and that other techniques, such as the universal prefix and the asymmetrical weight decay strategy, further improve its generalization ability. We release our code, model and benchmark at https://github.com/microsoft/UniSumm.

Related Work
Few-shot Summarization A critical challenge for neural summarizers is that they are data-hungry and require large-scale annotated data. To alleviate the data sparsity issue, Fabbri et al. (2021) extract characteristics of the target dataset and build pseudo summaries from the Wikipedia corpus. Small plug-in networks (Bražinskas et al., 2020) are injected into PLMs to predict the properties of the target dataset with only a small amount of labeled instances. To close the gap between pre-training and fine-tuning, a second stage of pre-training before fine-tuning large-scale generative models has also been proposed. Such challenges of summarization have also been explored in the cross-lingual setting (Bai et al., 2021; Chen et al., 2022b). Although transfer learning methods make use of external data, one still needs to carefully select source domains and tasks to avoid negative transfer (Gururangan et al., 2020; Pilault et al., 2020). Compared with them, UNISUMM can be easily prefix-tuned to any target task without the effort of building large pseudo datasets or selecting relevant data. To our knowledge, we are the first to combine prefix-tuning and multi-task learning for few-shot summarization, showing very positive results.

Existing few-shot summarization evaluation suffers from two data-related problems. First, previous studies usually focus on only one type of summarization task in their experiments (Bražinskas et al., 2020; Liu et al., 2022b), which makes it difficult to evaluate their generalization ability. Second, the few-shot settings and selections of few-shot samples are miscellaneous, which makes evaluations from different research papers not comparable with each other (Cortes et al., 2008). Therefore, in this work, we propose SUMMZOO for better benchmarking future research on few-shot summarization. To our knowledge, SUMMZOO is the first public few-shot summarization benchmark that covers a set of diverse summarization tasks.

Prompt Learning for Text Generation The idea of prompt learning was first proposed with GPT-3 (Brown et al., 2020): it aims to guide PLMs to perform different tasks without further fine-tuning by prepending task-related examples to the input, and has shown positive results on many text generation tasks, including summarization (Goyal et al., 2022). Prefix-tuning extends this idea from discrete tokens to continuous vectors (Li and Liang, 2021). It adds continuous embeddings (prefixes) to each Transformer layer as external value and key vectors. During training, only the prefixes are updated while the other parameters are unchanged. Logan IV et al. (2022) and Gu et al. (2022) propose to use pre-training to boost the low performance of few-shot learning. Li et al. (2022) combine transfer learning and prompt learning for text generation. Compared with them, we are interested in few-shot summarization and propose multi-task pre-training as an effective strategy to make use of data from related tasks to improve performance on diverse target tasks, which suits real-life scenarios.

Method
Following Chen and Shuai (2021), the task of few-shot text summarization is defined as follows. For an unseen target summarization task u, few-shot text summarization is to generate a summary Y, given an input text X, by learning from a limited number k (typically k ≤ 100) of labeled training instances of u, with the help of general knowledge K.
The overall framework of UNISUMM is shown in Figure 2. It consists of two phases: 1) learning general knowledge by multi-task pre-training on existing summarization datasets (§3.1); and 2) learning target-task knowledge by prefix-tuning on each target few-shot summarization dataset (§3.2).

Multi-Task Pre-Training with Prefix
As shown in Figure 2 (a), in the first stage, we take a Transformer-based pre-trained encoder-decoder language model (for example, BART (Lewis et al., 2020)) M = [M_en; M_de] as the summarization model, parameterized by θ. We further pre-train this model on a set of popular summarization datasets (e.g., CNNDM, PubMed and XWikis) to learn general summarization knowledge. For each task t, we inject task-specific prefix vectors for the encoder (P^t_en) and decoder (P^t_de), P^t = [P^t_en; P^t_de], into the model, parameterized by θ_{p_t}. Following Li and Liang (2021), the prefix vectors are prepended to each Transformer layer of M as additional key and value vectors:

K' = [P^t_K; K],  V' = [P^t_V; V]

For all pre-training tasks, given input text X, the multi-task optimization objective is to minimize the negative log-likelihood of generating the target summary Y = {y_1, y_2, ..., y_|Y|}:

L(θ, θ_{p_t}) = − Σ_{i=1}^{|Y|} log P(y_i | y_{<i}, X; θ, θ_{p_t})    (1)

In the multi-task pre-training stage, we optimize θ and θ_{p_t} together.
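The prefix mechanism can be sketched as follows. This is a minimal single-head NumPy illustration (the function name and shapes are ours, not from the paper): prefix key and value vectors are prepended so that every query position can additionally attend to the learned prefixes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(Q, K, V, prefix_k=None, prefix_v=None):
    """Single-head scaled dot-product attention; if prefix key/value
    vectors are given, they are prepended to K and V, so the output is
    a mixture of the input values and the learned prefix values."""
    if prefix_k is not None:
        K = np.concatenate([prefix_k, K], axis=0)  # (p + n, d)
        V = np.concatenate([prefix_v, V], axis=0)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, p + n)
    return softmax(scores, axis=-1) @ V            # (n, d)
```

During multi-task pre-training the prefix arrays would be per-task trainable parameters; during prefix-tuning they are the only parameters updated.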

Prefix-Tuning
Through multi-task pre-training, we obtain the UNISUMM model with diverse summarization knowledge. As shown in Figure 2 (b), for an unseen summarization task u (for example, WikiHow or MultiNews), given only k training samples, we conduct prefix-tuning (Li and Liang, 2021) on the UNISUMM model. A new-task prefix P^u = [P^u_en; P^u_de] is created, parameterized by θ_{p_u}, which can be initialized either randomly or from a prefix of the pre-training tasks. We then freeze the parameters θ of the shared summarization model and only tune θ_{p_u} using the objective defined in Equation 1. By doing this, we can maximize the learned summarization knowledge in UNISUMM and also avoid over-fitting the model to the very few samples.
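A minimal sketch of one few-shot update under this scheme (parameter names are illustrative, not the paper's): only parameters whose name marks them as prefix parameters are moved along their gradient, while the shared backbone stays frozen.

```python
def few_shot_step(params, grads, lr=1.5e-4, trainable_key="prefix"):
    """One prefix-tuning update on an unseen task: parameters whose
    name contains `trainable_key` take a gradient step; all other
    (backbone) parameters are returned unchanged, i.e. frozen."""
    return {
        name: (p - lr * grads[name]) if trainable_key in name else p
        for name, p in params.items()
    }
```

In a real implementation the same effect is usually achieved by registering only the prefix parameters with the optimizer.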

Universal Prefix
Empirically, given a target task, initializing the new-task prefix from the most related pre-training tasks can be helpful. However, for a brand new task, selecting meta tasks can be a complicated process that requires considerable feature-engineering effort (Chen and Shuai, 2021). Therefore, during multi-task pre-training, we also pre-train a universal prefix, which can be used as a stable initialization for few-shot prefix-tuning.
In particular, during multi-task pre-training (§3.1), we initialize universal encoder and decoder prefix vectors P* = [P*_en; P*_de], parameterized by θ_{p*}. Each training instance from task t has a 15% probability of being coupled with this universal prefix instead of its task-specific prefix P^t. The parameters θ_{p*} are optimized together with θ. Then, in prefix-tuning, we use this universal prefix as the initialization for the unseen-task parameters θ_{p_u} (§3.2).
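The per-instance prefix selection can be sketched as follows (function and variable names are ours): with 15% probability an instance is routed to the universal prefix, so the universal prefix is trained on data from every task.

```python
import random

def choose_prefix(task, task_prefixes, universal_prefix,
                  p_universal=0.15, rng=None):
    """Return the prefix used for one training instance of `task`:
    the shared universal prefix with probability p_universal,
    otherwise the task-specific prefix."""
    rng = rng or random
    if rng.random() < p_universal:
        return universal_prefix
    return task_prefixes[task]
```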

Asymmetrical Weight Decay
A potential problem in multi-task learning is the negative transfer among different pre-training tasks.
To alleviate this, inspired by previous work (Evgeniou and Pontil, 2004;Bengio, 2012;Liu et al., 2019), we set different weight decay regularizations on different parameters of UNISUMM. Specifically, we separate optimizers of the prefixes and the summarization model in pre-training. We assign a lower weight decay value d p =0.01 on the prefix optimizer, enabling prefixes to flexibly learn task-specific knowledge, and a higher weight decay value d l =0.05 on the summarization model optimizer, enforcing it to learn a broader generalization across different tasks.
Formally, at training step i:

θ^{i+1} = (1 − α^i d_l) θ^i − α^i ∇f^i(θ^i)

θ_p^{i+1} = (1 − α_p^i d_p) θ_p^i − α_p^i ∇f_p^i(θ_p^i)

where α^i and α_p^i are the learning rates for the summarization model and the prefix, and ∇f^i(θ^i) and ∇f_p^i(θ_p^i) are the batch gradients for the summarization model and the prefix, respectively.
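The asymmetric update amounts to a decoupled weight-decay step with different decay coefficients for the two parameter groups. A minimal sketch (one scalar parameter for clarity; the function name is ours):

```python
def decayed_step(param, grad, lr, weight_decay):
    """Decoupled weight decay: first shrink the parameter toward zero
    by (1 - lr * weight_decay), then take the gradient step.
    The backbone uses the larger decay d_l = 0.05, the prefixes the
    smaller d_p = 0.01."""
    return (1.0 - lr * weight_decay) * param - lr * grad
```

With the larger decay, the backbone is pulled harder toward a broadly generalizing solution, while the lightly decayed prefixes remain free to absorb task-specific knowledge.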

The SUMMZOO Benchmark
SUMMZOO is sourced from existing summarization benchmarks based on the principles of diversity and robustness, where we assemble each dataset into few-shot evaluation settings.
Diversity of Tasks As a major goal, we ensure that SUMMZOO can include a diversity of different summarization tasks, covering multiple domains, text styles and compression ratios. Thus, we carefully select 8 summarization tasks including monologue/dialogue texts and single/multi-document summarization tasks. Their domains also span an assorted set such as news, scientific papers, instructions, online forums and meetings.

Robustness of Evaluation
Our second goal is to ensure that experiments on SUMMZOO can be compared with each other in a robust manner. Also, we want to reduce the randomness from different selections of few-shot samples. Therefore, for each task, we provide 5 sets of few-shot training samples, and we ask all models to train on these 5 sets respectively and report their averaged results and standard deviations. We also formulate two few-shot training settings with the number of shots k set to 10 or 100, where the first can be considered a more extreme low-resource scenario while the second is a more commonly tested setting.

Table 1: Summary of sub-tasks in SUMMZOO. We report the sizes of test sets here. "Avg. D/S length" stands for "averaged document/summary token length". For QMSum, we concatenate the query and gold span as input.

Table 1 summarizes the statistics of sub-datasets in SUMMZOO. The detailed descriptions of each dataset can be found in Appendix A.
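The reporting protocol over the 5 few-shot sets can be sketched as follows (the function name is ours):

```python
import statistics

def summzoo_report(rouge_runs):
    """SummZoo protocol: train a model on each of the 5 few-shot
    sample sets for a task, then report the mean and the sample
    standard deviation of the metric over the 5 runs."""
    assert len(rouge_runs) == 5, "SummZoo provides 5 few-shot sets per task"
    return (round(statistics.mean(rouge_runs), 2),
            round(statistics.stdev(rouge_runs), 2))
```

Reporting the standard deviation alongside the mean is what makes the robustness comparison in § 6 possible.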
To balance the training data size of different datasets, we perform down-sampling on over-sized datasets and up-sampling on low-resource datasets respectively. The detailed descriptions of each dataset and statistics of resulting data for pretraining are shown in Appendix B and Table 8.
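The balancing step can be sketched as follows (the function name and the uniform per-task target size are our simplifying assumptions; the paper's actual per-dataset sizes are given in Table 8):

```python
import random

def balance_datasets(datasets, target_size, seed=0):
    """Down-sample over-sized datasets (without replacement) and
    up-sample low-resource ones (with replacement), so every
    pre-training task contributes about target_size examples
    to the multi-task mixture."""
    rng = random.Random(seed)
    balanced = {}
    for name, examples in datasets.items():
        if len(examples) >= target_size:
            balanced[name] = rng.sample(examples, target_size)          # down-sample
        else:
            balanced[name] = [rng.choice(examples) for _ in range(target_size)]  # up-sample
    return balanced
```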

Baseline Models
PEGASUS (Zhang et al., 2020) is a large pre-trained encoder-decoder model that is particularly designed for text summarization. The model is trained using the gap sentence generation task. We use PEGASUS-large (C4+HugeNews) (https://huggingface.co/google/pegasus-large) for comparison, which improves upon the results reported in the original paper.
BART (Lewis et al., 2020) is a pre-trained encoder-decoder language model using self-denoising tasks. We compare with the BART-large model (https://huggingface.co/facebook/bart-large) with two tuning strategies on few-shot summarization tasks, namely standard fine-tuning (BART-FT) and prefix-tuning (BART-PT). In BART-PT, the prefix vector is added in the same way as in UNISUMM.
MultiBART is a variant of BART-large. Similar to UNISUMM, it is first multi-task pre-trained on the same data (§5.1) but without prefixes, and it can then be fine-tuned or prefix-tuned to fit few-shot summarization tasks. We only show the results of prefix-tuned MultiBART because we find that fine-tuning the entire MultiBART model always leads to worse performance in the few-shot setting. This strong baseline can be considered an indicator to verify the effectiveness of using prefixes in both multi-task pre-training and few-shot tuning.
Text-davinci-002 (Brown et al., 2020; Ouyang et al., 2022) is a large language model (175B) from the GPT-3.5 family trained with instruction tuning, and has shown strong zero-/few-shot performance on many NLP tasks, including summarization. In particular, recent work finds that GPT-3.5 models can perform much better with the technique of in-context learning (ICL) (Brown et al., 2020; Liu et al., 2022a). We use text-davinci-002 with ICL for experiments, and only report 1-shot ICL performance because of its input length limitation.

Table 3: R2 scores of 1-shot text-davinci-002 (GPT-3.5) using ICL compared with 10-shot UNISUMM and 100-shot UNISUMM.

All baseline models and UNISUMM are evaluated on SUMMZOO (Appendix C shows the implementation details). We conduct both automatic and human evaluation. As described, SUMMZOO requires models to report averaged results and their standard deviations over 5 sets of different few-shot samples (except for text-davinci-002). We use ROUGE (Lin, 2004), computed with files2rouge, for automatic evaluation, which measures the n-gram overlap between the model-generated summary and the reference summary. We report the F1 scores of ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL).
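As a toy illustration of the metric (this is a simplified sketch, not the official ROUGE implementation used in the experiments), ROUGE-N F1 can be computed from n-gram overlap counts:

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=2):
    """Toy ROUGE-N: F1 over n-gram overlap between a candidate
    and a single reference, on whitespace-tokenized text."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())   # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

The official toolkits additionally apply stemming, handle multiple references, and compute ROUGE-L from longest common subsequences.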
Automatic Evaluation

Main Results
The main results are shown in Tables 2 and 3. First, compared with PEGASUS, UNISUMM outperforms it across all tasks except 100-shot XSum, and shows the best averaged scores in both 10-shot and 100-shot settings. We also find that 10-shot UNISUMM can outperform 100-shot PEGASUS on MultiNews, ArXiv and QMSum by a large margin, suggesting that UNISUMM can benefit from diverse training data and effectively adapt indirect knowledge to unseen tasks. It is notable that although the foundation BART model is inferior to PEGASUS, the BART-based UNISUMM can still outperform PEGASUS with the learned summarization knowledge. Overall, UNISUMM surpasses both BART-FT and BART-PT by a large margin on all tasks in all settings, which suggests that the addition of multi-task learning can substantially improve model performance on few-shot summarization tasks, in particular in the 10-shot setting.
UNISUMM also outperforms MultiBART by a large margin, especially in the 10-shot setting (avg. 2.86 R1 improvement). Since MultiBART is multi-task pre-trained on exactly the same data as UNISUMM, the main difference between the two is whether prefixes are used in both multi-task pre-training and few-shot tuning. The result verifies the effectiveness of the UNISUMM framework, in particular the prefix addition in the multi-task pre-training phase (§3.1).
The comparison between text-davinci-002 and UNISUMM is shown in Table 3. Generally, 100-shot UNISUMM achieves higher ROUGE scores than 1-shot text-davinci-002 on all tasks and in overall performance, and 10-shot UNISUMM shows better performance than 1-shot text-davinci-002 except on XSum and Reddit. Such improvements can be attributed to the fact that UNISUMM is few-shot trained on more samples. It is also worth noting that UNISUMM is based on BART-large (400M), while GPT-3.5 is orders of magnitude larger (175B). Also, we note that 10-shot UNISUMM can achieve higher ROUGE scores than text-davinci-002 on some tasks such as MultiNews and ArXiv. Besides the fact that UNISUMM is multi-task trained on relevant data, one possible reason is that text-davinci-002 is only presented with a 1-shot summary as ICL context, due to the length limitation. However, given the previous finding (Goyal et al., 2022) that GPT-3.5-generated summaries can be favored by human evaluators despite lower ROUGE scores, we also conduct human evaluation in § 7.

Model Robustness
Sample selection bias (Cortes et al., 2008) has been a major problem for few-shot tasks, where model performance is strongly correlated with the selection of few-shot samples, and a sound system should be robust and stable when taking different few-shot samples. To demonstrate the robustness and stability of different few-shot summarization models, we report the standard deviations of their ROUGE-1 scores on the 5 different sets of few-shot samples provided in SUMMZOO in Table 4. Overall, the standard deviations of UNISUMM are lower than those of all other baselines on most tasks in both settings, suggesting that UNISUMM is the most stable and robust when facing different few-shot samples. Also, MultiBART outperforms BART-PT and shows better averaged results than PEGASUS in the 100-shot setting, showing that reusing related summarization datasets is valuable. However, it can still be unstable in the 10-shot setting. In contrast, UNISUMM shows the lowest averaged standard deviations across all tasks in both settings. This suggests that the two-phase training with prefixes in the UNISUMM framework is essential for enhancing model robustness.
We present the full table, including standard deviations of R2 and RL scores, in Appendix D.
Overall, we find that UNISUMM is most robust and stable towards different training samples.

Human Evaluation
To better understand the outputs of different few-shot summarization systems, following Kryscinski et al. (2019, 2020), we conduct a human evaluation along four dimensions: Fluency, Consistency, Coherence and Relevance. We select 30 samples each from QMSum, WikiHow and MultiNews, covering both monologue and dialogue texts. Then, for each sample, we ask a judge with experience in human evaluation for summarization tasks to give scores from 1 to 5 (a higher score indicates better quality) along each evaluation dimension.
In human evaluation, UNISUMM outperforms PEGASUS and BART-PT on all datasets in all dimensions, achieving a higher fluency score than the gold summaries on QMSum and comparable scores on MultiNews and WikiHow, suggesting that UNISUMM can generate very fluent sentences that are comparable with human-annotated summaries. A challenge of QMSum is that models are asked to generate summaries focusing on the input queries; thus, Relevance is a very important metric for this task. However, Relevance scores are very low for PEGASUS (3.27) and BART-PT (2.80), suggesting they are weak at extracting relevant information based on user queries. In contrast, UNISUMM achieves a higher score (3.97). Text-davinci-002 also performs very well on this task, even outperforming the gold summaries on Fluency, but UNISUMM still achieves comparable results with limited training samples and much lower cost.
On MultiNews, since text-davinci-002 is given only a 1-shot summary as ICL example due to the length limitation, although it can generate very fluent (4.97) and coherent (4.73) summaries, it is less preferred by human annotators w.r.t. Consistency and Relevance. UNISUMM still outperforms the other systems and only loses to the gold summaries on these two metrics. Similar results are observed on WikiHow, where text-davinci-002 tends to generate very long summaries, which can contain some hallucination and less important content, and UNISUMM shows comparable performance on Consistency and Relevance.
We show case studies and their analysis, including an error case where UNISUMM fails, in Appendix F.

Task Scale in Multi-task Training
One common concern about multi-task training is that, when multiple tasks are combined, will newly added tasks hurt or help performance? To verify this, we add one variant of UNISUMM for comparison, whose phase-1 is multi-task pre-trained on 3 tasks instead of all 7 tasks in Table 8. For the 3 tasks, we use the combination of CNNDM, PubMed and MediaSum, which are typical datasets for news summarization (MultiNews and XSum), academic paper summarization (ArXiv) and dialogue summarization (DIALOGSUM, SAMSum and QMSum).
Results in Table 6 show that when extending the multi-task pre-training datasets from 3 to 7, UNISUMM achieves better results on multiple datasets. For example, taking ArXiv as the target task, 7-Task UNISUMM outperforms 3-Task UNISUMM in both 10 and 100-shot settings. It suggests that 7-Task UNISUMM can benefit from GovReport, XWikis, SummScreen and BillSum for scientific text summarization. On average, the R2 score improves by 0.4 for the 10-shot setting and 0.7 for the 100-shot setting. This shows that negative transfer is minor in UNISUMM and suggests that by training UNISUMM on even more datasets, its generalization can potentially be improved by learning more indirect summarization knowledge.

Different Prefix Initializations
UNISUMM is equipped with a universal prefix that is randomly coupled (with 15% probability) with instances from all tasks during multi-task pre-training (§3.3). In Table 7, we show an ablation study of different prefix initialization strategies in few-shot prefix-tuning. Due to space limitations, we show R2 scores here. We compare three strategies: initializing the prefix randomly, using the CNNDM prefix, or using the universal prefix. The CNNDM prefix is selected for comparison here because CNNDM is considered a general summarization task and has proved helpful to many tasks, e.g., SAMSum (Gliwa et al., 2019).
We see that using the universal prefix yields the best results on most tasks. Also, the universal prefix is particularly useful in the 10-shot setting, bringing a 0.23 improvement in R2 score. In addition, we find that using a task-specific prefix (CNNDM) shows the worst performance on some tasks, such as QMSum and ArXiv, and has the lowest average score. This can be explained by the fact that the task-specific prefix (CNNDM) stores abundant task-specific knowledge, which can be harmful to unseen target tasks, especially when the target task is very different from the pre-training task.
We show more analysis in Appendix G.

Conclusion
We introduced UNISUMM, a novel few-shot summarization system that can be easily prefix-tuned to excel at and generalize over a diversity of summarization tasks. We proposed combining multi-task learning and prefix-tuning by jointly training the prefixes and the summarizer on multiple existing summarization datasets. By only tuning the prefix parameters, UNISUMM shows superior performance over strong baseline systems, yielding fluent and faithful summaries across tasks. In addition, we assembled and released a new benchmark, SUMMZOO, for fairly and effectively evaluating few-shot summarization models. It covers an assorted set of summarization tasks and provides multiple few-shot sets for a more robust and fairer comparison.

Limitations
The limitations of UNISUMM can be stated from three perspectives. First, the multi-task pre-training of UNISUMM can be time- and cost-consuming, requiring large GPU resources. Second, the current framework uses prefixes of a fixed length for both multi-task training and few-shot prefix-tuning; however, different summarization tasks may prefer prefixes of different sizes. Third, in this work, we focus on summarization tasks in English. The performance of UNISUMM for languages that have a different morphology or syntactic structure from English needs further exploration.

Ethics Statement
Copyright and Citation Issue The copyright of individual datasets in SUMMZOO belongs to the original authors. The usage license of each dataset also applies to SUMMZOO. To ensure fair credit, when using SUMMZOO for evaluation, please also cite original papers, where individual datasets are introduced.
Data Availability and Safety Pre-training and fine-tuning summarization data studied in this paper are mostly publicly available, otherwise we will provide links to the access application. Although filtering has been conducted in building the original datasets, some contents can contain uncomfortable descriptions, e.g., news coverage of violent crimes and events.
Usage of Large PLM The GPT-3.5 model is used to generate text (summaries) for input documents of summarization tasks. The generated text is only used for experiments and analysis, which are presented in corresponding sections. No further usage, e.g., generating content for manuscripts, of GPT-3.5 or its family, is included in this paper.

Human Evaluation
We conduct human evaluation with the help of one judge, who obtained their postgraduate degree in the United Kingdom and has a solid experience in evaluating summarization tasks. They were compensated through a payment of around 400 USD for 450 instances ( § 7).

A Datasets in SummZoo
The final SummZoo contains the following sub-tasks:

MultiNews (Fabbri et al., 2019) is a large-scale multi-document summarization dataset. The task is to generate a summary given multiple news articles.
XSum (Narayan et al., 2018) is an extreme text summarization dataset. Given a news article, the task is to generate a one-sentence summary.
Reddit-TIFU (Kim et al., 2019) is a social post summarization dataset. The task is to generate a short summary for posts from the online discussion forum Reddit. Compared with news text, the text in Reddit-TIFU is less formal and structured.
ArXiv (Cohan et al., 2018) is a long scientific paper summarization dataset collected from ArXiv, including articles of multiple domains, such as physics, computer science, etc.
WikiHow (Koupaee and Wang, 2018) is a large-scale instruction summarization dataset. The task is to generate a short summary given a multiple-step instruction.

SAMSum (Gliwa et al., 2019) is a written conversation summarization dataset for Messenger-style chit-chats. Both dialogues and summaries are annotated by experts.

DIALOGSUM (Chen et al., 2021) is a real-life scenario dialogue summarization dataset that covers a wide range of daily-life dialogues, including diverse task-oriented dialogues. The test set of DIALOGSUM provides three reference summaries for each dialogue; we report averaged results.
QMSum (Zhong et al., 2021) is a query-based meeting summarization dataset that is derived from Augmented Multi-party Interaction (AMI) corpus (Kraaij et al., 2005), the International Computer Science Institute (ICSI) (Shriberg et al., 2004) and Committee Meetings. The task is to generate a summary given a meeting and a query.

B Multi-Task Pre-Training Datasets
We use the following datasets for multi-task pre-training:

MediaSum (Zhu et al., 2021) is an interview summarization dataset that contains 463.6k transcripts and summaries from NPR and CNN.
SummScreen (Chen et al., 2022a) consists of long TV series transcripts and human written recaps.
XWikis (Perez-Beltrachini and Lapata, 2021) is a cross-lingual summarization dataset that contains Wikipedia articles and leading paragraphs in multiple languages. We only use the English data that have paired documents and summaries.
To balance the training data size of different datasets, we perform down-sampling on over-sized datasets and up-sampling on low-resource datasets respectively. The statistics of the resulting data for pre-training are shown in Table 8.

Table 9: Comparison of model robustness towards different few-shot samples. We report the standard deviations of ROUGE scores on 5 sets of few-shot samples provided in SUMMZOO for each task and setting. D_R1, D_R2 and D_RL mean the standard deviations of R1, R2 and RL, respectively. Lower standard deviation indicates the model is more robust towards different few-shot samples. The bottom block presents the averaged results of all 8 sub-tasks.

C Implementation Details
We use BART-large (Lewis et al., 2020) to initialize the summarization model of UNISUMM. All experiments are conducted on NVIDIA A100 GPUs with PyTorch 1.11. The maximum input length and target length are set to 2,048 and 400, respectively. The hyper-parameter choices follow previous few-shot summarization work (Zhang et al., 2020; Fabbri et al., 2021; Chen and Shuai, 2021) and empirical considerations. For multi-task pre-training, we initialize from BART-large and train the model on 16 GPUs for 300,000 steps, with a batch size of 32, a learning rate of 1.5e-5, and 4,000 warm-up steps. For few-shot tuning, we prefix-tune the model on 4 GPUs for 100 and 1,000 steps in the 10-shot and 100-shot settings, respectively, with a batch size of 32, a learning rate of 1.5e-4, and warm-up over 10% of the training steps. For XSum, the training steps are set to 10 and 100 for 10-shot and 100-shot, respectively, while other configurations are unchanged.

D Robustness Analysis
Table 9 shows the standard deviations of ROUGE-1, ROUGE-2 and ROUGE-L scores on the 5 different sets of few-shot samples in SUMMZOO. Overall, UNISUMM shows the smallest standard deviations on most metrics across tasks in both settings, suggesting that it is the most robust and stable with respect to different selections of training samples.
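The robustness numbers in Table 9 are sample standard deviations over the 5 few-shot sample sets. A minimal sketch of this computation (the score values in the docstring are illustrative, not results from the paper):

```python
import statistics

def fewshot_robustness(rouge_runs):
    """Given one ROUGE-score dict per few-shot sample set, e.g.
    [{"R1": 44.2, "R2": 21.0, "RL": 41.3}, ...], return the sample
    standard deviation of each metric across runs (D_R1, D_R2, D_RL)."""
    metrics = rouge_runs[0].keys()
    return {m: statistics.stdev(run[m] for run in rouge_runs) for m in metrics}
```

`statistics.stdev` uses the (n-1)-denominator sample standard deviation, which is the natural choice when the 5 sample sets are treated as draws from a larger population of possible few-shot selections.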

E Human Evaluation
Following Kryscinski et al. (2019, 2020), we conduct human evaluation along 4 dimensions, which offer a more robust and holistic perspective for understanding summarization systems (Zhong et al., 2022b):
• Fluency evaluates the quality of individual generated sentences, including grammar, word order, etc.;
• Coherence evaluates the collective quality of generated summaries;
• Relevance evaluates the importance of the information in the generated summaries;
• Consistency evaluates the factual alignment of the generated summary against the input document.
We ask a judge to give scores from 1 to 5 along these 4 dimensions, where a higher score indicates better quality. The judge is a postgraduate student who studied in the United Kingdom and has solid experience in evaluating summarization tasks.
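System-level human-evaluation results of this kind are conventionally the per-dimension means of the judge's scores; the paper does not spell out the aggregation, so the following is our assumption, sketched minimally (the score values are illustrative):

```python
def aggregate_scores(ratings):
    """ratings: one dict per evaluated summary, with each of the
    four dimensions rated on a 1-5 Likert scale.
    Returns the per-dimension mean across all summaries."""
    dims = ("Fluency", "Coherence", "Relevance", "Consistency")
    return {d: sum(r[d] for r in ratings) / len(ratings) for d in dims}
```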

F Case Study
We qualitatively demonstrate the advantages of UNISUMM (100-shot) with cases from MultiNews and QMSum, and present an error analysis with a case from WikiHow. As shown in Table 11 (MultiNews), UNISUMM generates a summary whose events and descriptions are faithful to the gold summary. In contrast, the PEGASUS-generated summary contains a factual error ("... was last seen in a package shipped to the us from belgium."), while the summary generated by UNISUMM ("... unearthed ... shipment from belgium to newark") is consistent with the gold summary and the input ("... turned up ... shipped from belgium."). This shows that UNISUMM is able to collect important information from multiple news reports and generate a high-quality summary for a task it has never seen during multi-task pre-training.
Also, as shown in Table 12 (QMSum), although the summary generated by UNISUMM is longer than the gold summary, it is highly relevant to the query. Moreover, UNISUMM properly rephrases the key utterance from the source meeting into an objective description, which suits the characteristics of conversation summarization. In contrast, the summary generated by PEGASUS misses important content and contains irrelevant sentences compared with UNISUMM and the human annotation. This evidence shows that UNISUMM successfully learns important characteristics of the query-based meeting summarization task with only 100 samples.
An error case where UNISUMM fails can be found in Table 14 (WikiHow). UNISUMM mistakenly generates "...matches the text of the letter...", where the ground truth should be "...matches ... the one (address) ... on the envelope". Moreover, the summary generated by UNISUMM is somewhat repetitive in wording, e.g., the phrase "... on the inside of the letter ..." appears several times.
We present more cases in Table 13 (ArXiv and DIALOGSUM), Table 14 (XSum) and Table 15 (SAMSum and Reddit). Overall, we find that UNISUMM is capable of generating fluent, relevant, faithful and human-like summaries on diverse unseen tasks, which verifies its strong generalization ability in the few-shot scenario.

G Influence of Weight Decay
In § 3.4, we design a separate weight decay strategy to circumvent negative transfer in multi-task learning. In Table 10, we examine whether the combination of different weight decay rates (d_p for prefixes and d_l for the summarization model) is beneficial. Specifically, we report ROUGE-2 scores on SUMMZOO for different combinations of weight decay rates. The model performs best with d_p = 0.05 and d_l = 0.01. This asymmetric weight decay is especially helpful for 10-shot XSum, which is more distinct from the pre-training summarization tasks and relies more on general summarization knowledge.
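One way to realize the two decay rates is through PyTorch optimizer parameter groups. The sketch below assumes prefix parameters can be identified by name; the `prefix` name prefix and the helper itself are hypothetical, not our exact implementation:

```python
import torch

def build_optimizer(model, lr=1.5e-4, d_p=0.05, d_l=0.01):
    """Assign weight decay d_p to prefix parameters and d_l to the
    summarization-model parameters via two AdamW parameter groups."""
    # Assumption: prefix parameters live under a module/attribute named "prefix*".
    prefix_params = [p for n, p in model.named_parameters() if n.startswith("prefix")]
    model_params = [p for n, p in model.named_parameters() if not n.startswith("prefix")]
    return torch.optim.AdamW(
        [
            {"params": prefix_params, "weight_decay": d_p},
            {"params": model_params, "weight_decay": d_l},
        ],
        lr=lr,
    )
```

Parameter groups let a single optimizer step apply the stronger decay (d_p = 0.05) to the prefixes while keeping the milder decay (d_l = 0.01) on the backbone, without maintaining two optimizers.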

MultiNews
Input text paris (ap) the pompidou centre in paris hopes to display a long-vanished picasso painting in may, now that it has been recovered by u.s. customs authorities. this undated photo provided by the united states department of justice, shows a cubist painting entitled the hairdresserby pablo picasso. authorities say the painting worth millions of dollars was ... ||||| a stolen picasso worth millions of dollars was shipped to the u.s. in a package that described it as a $37 "art craft " but it will soon be on its way back to france...
PEGASUS -a 1911 cubist painting worth millions of dollars has been recovered after being mysteriously transported from france to the us nearly a decade ago. the pablo picasso painting, titled la coiffeuse, was reported missing from a pompidou storeroom in paris in 2001 and was last seen in a package shipped to the us from belgium, the ap reports. federal prosecutors in brooklyn filed papers thursday to seize the painting, which was hidden in a climate-controlled storage facility in queens, ny, and will soon be returned to france, reports the new york times. the center for art recovery in paris says it hopes to have the painting back in may, at a time when the cultural world is reeling from islamic state videos showing the destruction of statues in iraq. "this is a true comfort for us at a time when the cultural world is reeling from the islamic state video showing the destruction of statues in iraq," says alain seban, the director of the pompidou centre in paris. "it is a work of art that represents the genius of pablo picasso and was bequeathed to us in 2001 by the master of france's national museum."
GPT-3.5 -the new york times has a front-page story friday detailing how former fbi chief james comey tried to instruct president trump about the proper boundaries that should exist between them.
UNISUMM -a century-old picasso painting missing from paris for more than a decade resurfaced in the united states under false pretenses as a $37 holiday-themed "art craft," the ap reports. the 1911 painting, called "la coiffeuse," was unearthed in december in a fedex shipment from belgium to newark. the canvas had been smuggled out of a storeroom of the centre georges pompidou, the paris museum and arts center. officials there thought it was safe and sound until they received a loan request for it in 2001. they searched the storerooms and could not find it. on dec. 17, someone going by "robert" with an address in belgium brought the painting to a climate-controlled warehouse in long island city and was seized, the ap reports. the us attorney for the eastern district of new york filed a civil complaint thursday to forfeit the picasso and return it to france. the canvas, titled "la coiffeuse," was bequeathed to the national museums of france in 1998. the pompidou director alain seban said in a statement friday that he hopes the work can be exhibited again publicly in may. no arrests have been made in the case.
Gold -a picasso painting that was found to have vanished from a paris museum more than a decade ago has turned up in the us, in a package shipped from belgium. its papers identified it as a $37 "art craft/toy" and also included the line "joyeux noel," or merry christmas, the new york times reports. a man named "robert" attempted to send the package to a climate-controlled warehouse in queens, new york, in december, but custom officials at the port of newark seized what turned out to be picasso's 1911 la coiffeuse (the hairdresser). french museum officials traveled to new york last month and confirmed the find is indeed the missing picasso work, which the centre georges pompidou realized was missing from its storerooms in 2001 following a loan request; it was then valued at more than $2.5 million. court documents don't specify whether the sender or would-be recipient have been identified, nbc news reports, but "a lost treasure has been found," us attorney loretta lynch said in a statement. she filed a civil complaint yesterday that will have the painting returned to france. the pompidou's director, who called the painting's rediscovery a "true comfort," hopes the painting can go on display at the museum as early as may, the ap reports. (another picasso work was stolen days before this one was found.)

Table 11: Case study (a): Gold and generated summaries by 1-shot text-davinci-002 (GPT-3.5), 100-shot PEGASUS and UNISUMM on MultiNews, respectively. "|||||" separates different reports in the input.

QMSum
Input text what did barry hughes think about draft changes to cps guidance when talking about the divergence between the law in england and wales? suzy davies am: thank you. i was going to ask this question later, but i think it fits in . i've forgotten what i was going to ask; this is terrible. oh no, that's right obviously, this act is on a particular trajectory at the moment, and it's due to be introduced before the end of this assembly. do you think it would be better for us as scrutinisers of this act if we could see the draft changes to cps guidance on the public interest test before we make our final decision? barry hughes: i honestly don't think that would necessarily be helpful. i've had some discussions with kwame, who would have an involvement in this. ...
PEGASUS barry hughes thought it would be better for the members to see the draft changes to cps guidance on the public interest test before making their final decision. he added that they would need to work that up as they went along, and it would be a risk of putting the cart before the horse.
GPT-3.5 Barry Hughes thinks that it would not be helpful for the public to see the draft changes to the CPS guidance on the public interest test before the final decision is made on the bill. He believes that it would be better to take the general principles set out in the code for crown prosecutors and provide more specific details around them that relate to the issues being discussed in the bill.
UNISUMM barry hughes thought that the draft changes to cps guidance would not necessarily be helpful. he thought that they would simply want to take the present public interest factors set out very clearly in the code for crown prosecutors, and provide a degree of detail around those that relates more specifically to the issues that they were discussing. so, it would be taking principles of generality and according them a degree of specificity. and they'd need to work that up as they go along.
Gold barry hughes thought draft changes to cps guidance would not necessarily be helpful to the scrutinisers, and simply wanted to take the present interest factors. barry thought they needed to work for the principles of generality and a degree of specificity.

Table 12: Case study (b): Gold and generated summaries by 1-shot text-davinci-002 (GPT-3.5), 100-shot PEGASUS and UNISUMM on QMSum, respectively. The bold text is the query.

ArXiv
Input text rapid progress in the design and manufacture of optical fiber systems is a result of worldwide demand for ultra -high bit -rate optical communications . this explains the growing interest of the soliton community in soliton -based optical fiber communication systems . this area of research was considerably advanced in recent years @ xcite . the most remarkable results include the application of the concept of the dispersion management to _ temporal optical solitons _ and soliton -based optical transmission systems , and the discovery of the so -called _ dispersion managed soliton _ . high -speed optical communications require effective components such as high -performance broadband computer networks that can be developed by employing the concept of the bit -parallel -wavelength ( bpw ) pulse transmission that offers many of the advantages of both parallel fiber ribbon cable and conventional wavelength -division ...
PEGASUS the physical properties of multi -frequency solitary waves propagating in optical fibers and waveguides are discussed in the context of the soliton -based optical dispersion management. the emphasis is placed on the role of the spatial dispersion in the propagation of such solitary waves. it is shown that the propagation of such waves in optical fibers and waveguides is affected not only by the parametric difference between the pulse and the beam velocity, but also by the spatial variation of the dispersion profile. as a consequence, we argue that the role of the spatial dispersion in the propagation of solitary waves in optical fibers and waveguides is not limited to the parametric difference between the pulse and the beam velocity, but also includes the role of the spatial dispersion itself.  This article discusses the various types of solitary waves that can occur in nonlinear optical systems, with a focus on how to extract information about the propagation characteristics from experimental data. The examples of temporal and spatial solitons are given, along with a discussion of the differences between them. The difficulties associated with experimental measurements are also addressed, along with suggestions for ways to overcome them.
UNISUMM the aim of the present paper is to present a versatile scheme for the study of spatial and temporal solitary waves . this approach is based on the theory of spatio -temporal wave propagation , where the spatial beam profile along the transverse direction is similar to the temporal pulse profile in fibers . in all such systems , the nonlinear interaction between the waves of two ( or more ) frequencies is the major physical effect that can support coupled -mode multi -frequency solitary waves . we will illustrate three common examples : multi -wavelength solitary waves in bit -parallel -wavelength optical fiber links , multi -colour spatial solitons due to multistep cascading in optical waveguides with quadratic nonlinearities , and quasiperiodic solitons in the fibonacci superlattices .
Gold we discuss several novel types of multi -component ( temporal and spatial ) envelope solitary waves that appear in fiber and waveguide nonlinear optics . in particular , we describe multi -channel solitary waves in bit -parallel -wavelength fiber transmission systems for high performance computer networks , multi -colour parametric spatial solitary waves due to cascaded nonlinearities of quadratic materials , and quasiperiodic envelope solitons due to quasi -phase -matching in fibonacci optical superlattices .

DIALOGSUM
Input text person1: ms. dawson, i need you to take a dictation for me. person2: yes, sir... person1: this should go out as an intra-office memorandum to all employees by this afternoon. are you ready? person2: yes, sir. go ahead. person1: attention all staff... effective immediately, all office communications are restricted to email correspondence and official memos. the use of instant message programs by employees during working hours is strictly prohibited. ...
PEGASUS person1 advises dawson to sign an intra-office memorandum restricting office communications to email correspondence and official memos. person2 asks if this applies to internal and external communications. person1 tells them so.
GPT-3.5 person1 needs Ms. Dawson to take a dictation for an intra-office memorandum that is to be distributed to all employees by 4 pm. The memo is to inform all staff that effective immediately, all office communications are restricted to email correspondence and official memos; the use of instant message programs by employees during working hours is strictly prohibited. At first offense, the employee will receive a warning and be placed on probation. The second offense will result in termination.
UNISUMM person1 asks ms. dawson to take a dictation for person1. person1 restricts office communications to email correspondence and official memos and prohibits the use of instant message programs by employees during working hours.
Gold ms. dawson helps person1 to write a memo to inform every employee that they have to change the communication method and should not use instant messaging anymore.

Table 13: Case study (c): Gold and generated summaries by 1-shot text-davinci-002 (GPT-3.5), 100-shot PEGASUS and UNISUMM on ArXiv and DIALOGSUM, respectively.

WikiHow
Input text before folding the letter, make sure that the address on the inside of the letter matches with the one that you have written or printed on the envelope.this will help avoid any mixups. don't forget to check that you've signed the letter. make sure that the text is facing up and towards you. this is your last chance to proofread and double check that you haven't forgotten anything.for example, have you dated the letter? are there any spelling or grammar mistakes? , take the bottom of the letter and fold it so that it is about half an inch (about 1 cm) from the top of the page.you can place the envelope underneath the letter to use as a guide. make sure that when you have folded the letter up, it is small enough to fit inside the envelope. make sure to line up all the outside edges of the paper before creasing so that you don't end up with a crooked fold. if the fold is crooked, your letter may not fit into the envelope.use a ruler to make the crease nice and sharp. you can do this by holding a ruler sideways and using the thin edge. slide the edge of the ruler along the letter on the creases so they are nice and flat. you will now take the right part of the letter and fold it about one third of the way in.line up the top and bottom edges of the letter and then crease the fold. take the other side of the letter and fold it one third of the way in. this will be the same thing that you did with the right side.line up the top and bottom of the edges so they are perfectly straight before creasing. the last crease you made should go into the bottom of the envelope first. place it so that the folds face the backside of the envelope.this will make it easy for your recipient to find where to begin unfolding the letter.
PEGASUS take the top of the letter and fold it so that it is about half an inch (about 1 cm) from the top of the page. place the envelope underneath the letter. make sure that the envelope is flat on the inside. crease the folded side of the letter.
GPT-3.5 The document explains how to fold a letter so that it will fit into an envelope. First, make sure that the address on the letter matches the address on the envelope. Next, fold the bottom of the letter up to the top of the page, making sure to crease the fold. Then, fold the right side of the letter in to the middle, and repeat with the left side. Finally, insert the letter into the envelope so that the folds are facing the back.
UNISUMM check that the address on the inside of the letter matches the text of the letter to make sure that the address on the inside of the letter matches with the one on the page. make sure that the text matches with the one on the inside of the letter. make sure that the envelope is small enough to fit into the envelope.
Gold check the address. lay the letter on a table. fold the bottom half up. crease the fold. fold the right half of the letter inwards. fold the left half of the letter inwards. turn the letter sideways and insert into the envelope.

XSum
Input text the sunday times says the missile veered off course during a test in june last year -weeks before the commons voted to spend 40bn renewing trident . questioned by andrew marr, the pm refused to say four times if she had known about the test ahead of the vote . the snp's nicola sturgeon called for a ' full disclosure ' of what happened . according to the sunday times, an unarmed trident ii d5 missile veered off in the wrong direction towards the us -instead of towards africa -when it was launched from a british submarine off the coast of florida . in july -days after mrs may had become prime minister -mps voted overwhelmingly in favour of replacing trident . during the debate, mrs may told mps it would be ' an act of gross irresponsibility ' for the uk to abandon its nuclear weapons . mps backed its renewal by 472 votes to 117 . however, all 52 snp mps voted against it -as did labour leader jeremy corbyn . when asked on the bbc's andrew marr show whether she had known then that a misfire had happened, mrs may said: ' i have absolute faith in our trident missiles . ' when i made that speech in the house of commons, what we were talking about was whether or not we should renew our trident . ' she was asked a further three times -but did not answer the questions . the ministry of defence did not give details of the test process but said it had been a success . scottish first minister, mrs sturgeon -a long-standing opponent of trident, whose submarines are based at faslane, on the river clyde -said the apparent misfire was a ' hugely serious issue ' . she tweeted: ' there should be full disclosure of what happened, who knew what/when, and why the house of commons wasn't told . ' meanwhile, mr corbyn said the reports called for ' a serious discussion ' . ...
PEGASUS the prime minister has been accused of " side-stepping " questions about a submarine-launched nuclearcapable missile which misfired during a test.
GPT-3.5 trident is the uk's nuclear weapons system, consisting of submarines, missiles and warheads. it is operated by the royal navy and has been in use since the 1990s. reports that a missile test in june 2016 went wrong have raised questions about the system's effectiveness and whether it was deliberately kept quiet ahead of a key parliamentary vote on its renewal.
UNISUMM a nuclear missile launched a british submarine failed during a test, according to a report in the sunday times.
Gold theresa may is coming under pressure to say whether she knew about a reported misfire of the uk's nuclear weapons system before a crucial commons vote.

SAMSum
Input text ..! what happened? cora: well, there was a meet and greet with james charles in one of the malls in birmingham and about 8000 fans showed up for it. cora: it cause a gridlock around the mall and -of course -british media had to make some (quite negative) comments on it. ellie: they came for sister james?! >:( ellie: i sister snapped!! :p :d cora: haha :d cora: you shouldn't watch so much youtube, you're getting weirder and weirder. :d ellie: sister shut up :p so, what did they say? cora: ;) : * daily mail" was surprised that a meet and greet with a "virtually unknown" youtuber gathered 8000 people. :p cora: a host from lbc tried to find an answer to an unanswerable question: "who is james charles?" ...
PEGASUS ellie's meet and greet with james charles in birmingham got some negative comments from british media. cora thinks it's ridiculous that a man can wear makeup.
GPT-3.5 cora and ellie were discussing the meet and greet with james charles in birmingham. 8000 fans showed up and it caused a gridlock. british media was surprised and some hosts couldn't wrap their heads around the fact that a guy can wear makeup.
UNISUMM about 8000 fans showed up for the meet and greet with james charles in birmingham.
Gold there was a meet-and-greet with james charles in birmingham which gathered 8000 people.

Reddit
Input text tifu making lunch. made myself a lush sandwich for lunch today. smoked ham, cheddar cheese, a dash of bbq sauce on thick cut wholemeal bread. popped it in a small tupperware box, for freshness. i found some bn type biscuits in the cupboard, so i pop them in the box on top of my sandwiches. think nothing of it. eventually lunch time comes around. i discover, to my horror, that the biscuits have absorbed the moisture from the sandwich. and now i have a dried out sandwich and soggy biscuits
PEGASUS made a luch sandwich for lunch, only to have it ruined by soggy biscuits. i'm completely fucked.
GPT-3.5 made a sandwich and it turned out terrible because the biscuits absorbed all the moisture.
UNISUMM made a big sandwich, with cheddar cheese, bbq sauce and bn type biscuits. popped them in a tupperware box, and discover they've dried out
Gold don't put your sandwich and biscuits in same box, unless you like dry bread and soggy biscuits.