Towards Summary Candidates Fusion

Sequence-to-sequence deep neural models fine-tuned for abstractive summarization can achieve great performance on datasets with enough human annotations. Yet, it has been shown that they have not reached their full potential, with a wide gap between the top beam search output and the oracle beam. Recently, re-ranking methods have been proposed to learn to select a better summary candidate. However, such methods are limited by the summary quality aspects captured by the first-stage candidates. To bypass this limitation, we propose a new paradigm in second-stage abstractive summarization called SummaFusion, which fuses several summary candidates to produce a novel abstractive second-stage summary. Our method works well on several summarization datasets, improving both the ROUGE scores and the qualitative properties of fused summaries. It is especially effective when the candidates to fuse are of lower quality, such as in the few-shot setup, where we set a new state of the art. We will make our code and checkpoints available at https://github.com/ntunlp/SummaFusion/.


Introduction
Leading abstractive summarization methods typically rely on transfer learning and the pre-train-then-fine-tune paradigm. In this approach, a deep sequence-to-sequence neural model is first pre-trained on a very large text corpus, either with a general-purpose text generation objective like masked span generation, as in T5 (Raffel et al., 2019), BART (Lewis et al., 2020) and ProphetNet (Qi et al., 2021), or with a pre-training objective specific to summarization, as in PEGASUS (Zhang et al., 2020) and TED (Yang et al., 2020). Then, the model is fine-tuned on the downstream summarization dataset(s) of interest, which can have a widely varying amount of human labels, from a few thousand to hundreds of thousands.
Table 1: Qualitative sample from the Reddit TIFU dataset. Words in the summary from our SummaFusion model which are not in the source document are underlined, and those which are not among any of the first-stage candidates are in italics.
Sequence-to-sequence models are typically fine-tuned for generation tasks such as summarization with maximum likelihood estimation (MLE): the model is taught to predict only the ground-truth summary given the source, while all other potentially good summary alternatives are not considered. This is not ideal since, for a subjective task like summarization, there can be several, or even many, satisfying outputs. Besides, at training time, teacher forcing is used (Williams and Zipser, 1989), where the decoder is conditioned on the previous ground-truth tokens, while at inference, the model predicts the output sequence auto-regressively by conditioning on its own previous outputs. Once again, this procedure is not ideal, as training and inference present a discrepancy known as exposure bias (Bengio et al., 2015). Moreover, generating the most probable sequence is intractable due to the large vocabulary size, and typically a decoding method is used to approximate the best summary. Beam search (Reddy, 1977) has been a common choice for decoding, but other methods such as nucleus sampling (Holtzman et al., 2019) are gaining traction as potential alternatives, usually with a focus on diversity in the generation.
When decoding the summary, the decoding method keeps track of several hypotheses before outputting a single one and discarding the others. An oracle is defined as the summary candidate from the pool of hypotheses which maximizes the metric of choice (e.g. ROUGE (Lin, 2004)) when evaluated against the target. As observed by several recent studies (Liu and Liu, 2021; Ravaut et al., 2022), the discrepancies between training and inference, together with the approximate decoding, lead to models not being utilized to their full capacity. As a consequence, there is a wide gap between the top candidate and the oracle performance. For instance, Ravaut et al. (2022) report a 10.07 ROUGE-1 points difference between the top beam and the diverse beam search oracle on CNN/DM. This motivates second-stage methods which learn to select a better candidate while having access to a more global context, free from the autoregressive constraint which restricts access to only the previous context. Summarizing in multiple stages is arguably also closer to how humans compose a summary (Cao et al., 2018).
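As an illustration, oracle selection as defined above can be sketched in a few lines; this is a minimal sketch assuming the rouge_score Python package and mean ROUGE-1/2/L F1 as the metric of choice (the helper name is ours, not from the paper):

```python
from rouge_score import rouge_scorer

# Pick the oracle: the candidate maximizing mean ROUGE-1/2/L F1 against the target.
def oracle_candidate(candidates, reference):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def mean_rouge(candidate):
        scores = scorer.score(reference, candidate)
        return sum(s.fmeasure for s in scores.values()) / len(scores)

    return max(candidates, key=mean_rouge)
```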
Existing second-stage summarization methods design a training objective to improve candidate selection among first-stage candidates, either through a new model (Ravaut et al., 2022), or by re-using the first-stage model (Liu et al., 2022b). However, sticking to first-stage candidates may not be ideal, as they are bounded by the quality of the first-stage model. Although the average oracle results are high, their variance is very high too (see Appendix A Table 11), and in some cases all candidates are suboptimal. Alternative decoding methods to beam search, while generating more diverse summaries, do not solve the issue, as high diversity among candidates usually comes with a loss in performance of the output candidate (Narayan et al., 2022).
To bypass these limitations, in this work we propose SummaFusion, an abstractive second-stage summarization model. Re-using both the source and the first-stage summary candidates, SummaFusion generates a new summary from scratch, in an abstractive manner. It produces summaries which are closer to the ground-truth ones, resulting in relative ROUGE improvements of up to 17.98% across three very abstractive benchmarks. The model is flexible and can adjust to a varying number of summary candidates. Besides, fused summaries present several interesting properties: they are abstractive with regards to both the source and the set of first-stage candidates, more fluent, and more factually consistent. The model performs especially well on lower-quality pools of summary candidates, which is where a second-stage summarizer is needed the most. For instance, this happens often in few-shot scenarios, where the first-stage model generally lacks enough supervision to learn to produce good summaries. To get a glimpse of our model's behavior, we refer the reader to the example in Table 1, where the fused summary is much better than all candidates.
Our contributions in this paper are the following:
• We introduce summary candidates fusion, a novel approach to second-stage summarization.
• We demonstrate the fixing behavior of SummaFusion: it is designed for very abstractive summarization tasks, and works better on data points that are difficult for the base model, as well as in few-shot setups. In these cases, it dramatically drives up the ROUGE of the base model.
• We conduct a thorough qualitative analysis, and find that SummaFusion indeed generates summaries deemed better according to humans.

Related Work
Second-stage methods have recently enabled strong progress in the state of the art of abstractive summarization research. GSum (Dou et al., 2021) uses additional discrete guidance signals, such as salient sentences predicted by an extractive model, to better guide the abstractive system. While abstractive summarization models are trained with token-level maximum likelihood, second-stage methods usually work at the sequence level. ConSum (Sun and Li, 2021) and SeqCo (Xu et al., 2021) fine-tune the model with a different, contrastive loss to assign more confidence to higher-quality summary candidates. RefSum (Liu et al., 2021) uses meta-learning to re-rank summary candidates produced by several base systems. SummaReranker (Ravaut et al., 2022) and SimCLS (Liu and Liu, 2021) train a RoBERTa to re-rank candidates, the former with multi-label binary cross-entropy, the latter with contrastive learning and a ranking loss. BRIO (Liu et al., 2022b) re-uses the base model for a second round of fine-tuning with both the cross-entropy loss and a candidate-level ranking loss.
Existing fusion work in summarization focuses on sentence fusion. Fusing several sentences for the purpose of summarization was first proposed by Barzilay and McKeown (2005), paving the way for more abstractive summaries. Weiss et al. (2021) later proposed a much larger dataset for sentence fusion in multi-document abstractive summarization, driving up model performance. Through a thorough human evaluation, Lebanoff et al. (2019) ask annotators to label which type of sentence fusion is taking place, while also rating the properties of sentences generated by several abstractive systems. In a follow-up work (Lebanoff et al., 2020b), the authors build a Transformer model (Vaswani et al., 2017) enriched with sentence structure information for the explicit goal of fusing sentences, and evaluate the model on a dataset dedicated to sentence fusion. The same authors also introduce a cascade model (Lebanoff et al., 2020a) for abstractive summarization which contains a fusion mechanism based on combining highlighted phrases of the source text.
To the best of our knowledge, there is no method yet to fuse or combine in an abstractive manner entire summary candidates (not just sentences).

Model
Let $x_i$ be a source document and $y_i$ its associated target summary, and let $B$ be the first-stage model generating $m$ candidates $\mathcal{C}_i = \{C_1, \ldots, C_m\}$. The second-stage fusion model $\theta$ is trained to maximize the likelihood of the target given $x_i$ and $\mathcal{C}_i$:

$$\max_{\theta} \; \prod_{i} p(y_i \mid x_i, \mathcal{C}_i; \theta)$$

where the joint distribution over the target tokens $p(y_i \mid x_i, \mathcal{C}_i; \theta)$ is modeled as auto-regressive generation with the following cross-entropy loss for a target summary $y_i = (y_1, \ldots, y_l)$ of length $l$:

$$\mathcal{L}_{\text{gen}} = - \sum_{t=1}^{l} \log p(y_t \mid y_{<t}, x_i, \mathcal{C}_i; \theta)$$

This formulation is essentially the same as the one typically used to train the base model $B$: auto-regressive cross-entropy loss with teacher forcing. The only difference is that in the second stage, the model also conditions on the first-stage candidates $\mathcal{C}_i$ on top of the source $x_i$. Fig. 1 shows the overall architecture of our fusion model.

Fusing Source and Summary Candidates
Knowing that among the pool of first-stage summary candidates some are of high quality, we give the fusion model access to the entire set $\mathcal{C}_i$. At the same time, to enable the model to deviate from the candidates if needed, we also condition on the source $x_i$. We first encode the source $x_i$ and each candidate $C_k \in \mathcal{C}_i$ separately with the encoder:

$$h_i = \text{Enc}(x_i), \quad h_i^{k} = \text{Enc}(C_k) \;\; \forall C_k \in \mathcal{C}_i$$

We then concatenate all these token-level representations from the encoder along the sequence dimension, resulting in a unique, long context vector:

$$z_i = [\,h_i ; h_i^{1} ; \ldots ; h_i^{m}\,]$$

Finally, the decoder performs cross-attention on $z_i$ to generate the output token probabilities:

$$p(y_t \mid y_{<t}, x_i, \mathcal{C}_i; \theta) = \text{Dec}(y_{<t}, z_i)$$

Our approach of concatenating after encoding is inspired by the Fusion-in-Decoder method (Izacard and Grave, 2021), which is more suited to our problem than concatenating before encoding. Indeed, if we denote by $n$ the source length and by $l$ the summary length, the complexity of the attention when concatenating before encoding would be $O((n + m \cdot l)^2 + n \cdot l + l^2)$, while with our approach it becomes $O(n^2 + m \cdot l^2 + n \cdot l + l^2)$. Knowing that in summarization $l \ll n$, concatenating after encoding is less computationally expensive. Besides, self-attention between summary candidates does not yield any additional value for our problem while being computationally expensive.
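To make the concatenate-after-encoding scheme concrete, here is a minimal sketch with the HuggingFace transformers library and a BART backbone. It illustrates the mechanism only and is not the authors' implementation; attention-mask handling is simplified and the function names are ours:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers.modeling_outputs import BaseModelOutput

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def fused_encoder_states(source, candidates):
    # Encode the source and each candidate separately ...
    states = []
    for text in [source] + candidates:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        encoder_out = model.get_encoder()(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )
        states.append(encoder_out.last_hidden_state)  # (1, seq_len, hidden)
    # ... then concatenate along the sequence dimension: (1, n + m*l, hidden).
    return torch.cat(states, dim=1)

def fusion_loss(source, candidates, target):
    z = fused_encoder_states(source, candidates)
    labels = tokenizer(target, return_tensors="pt", truncation=True)["input_ids"]
    # The decoder cross-attends to the fused states; labels give the CE loss.
    return model(
        encoder_outputs=BaseModelOutput(last_hidden_state=z),
        labels=labels,
    ).loss
```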

Candidate-level Information
Summary candidates $C_1, \ldots, C_m$ are initially ordered following their diverse beam search order (by group, then by log-probability within each group). To enrich the model with this ranking information, we prepend to each candidate $C_k$ a special token encoding its rank. When concatenating the representations, this also gives the model information on where each summary candidate representation starts.
To further make the model aware of the quality of each summary candidate, we also add a classification component. Given a candidate $C_k$ and a summary evaluation metric $\mu$ (e.g. ROUGE (Lin, 2004)), the model has to predict whether $C_k$ maximizes $\mu$ among the summary candidates in $\mathcal{C}_i$. We frame this as a binary classification problem, and the associated binary cross-entropy loss is:

$$\mathcal{L}_{\text{clf}}^{\mu} = - \sum_{k=1}^{m} \left[ z_k^{\mu} \log p_{\theta}(C_k) + (1 - z_k^{\mu}) \log(1 - p_{\theta}(C_k)) \right]$$

where $z_k^{\mu}$ is 1 if the candidate $C_k$ maximizes the metric $\mu$, 0 otherwise, and $p_{\theta}(C_k)$ is the probability predicted by the model that the candidate is positive. Following SummaReranker (Ravaut et al., 2022), we use a multi-label approach and train the classification model jointly for several metrics $\mathcal{M} = \{\mu_1, \ldots, \mu_M\}$, and we also condition on the source representation as input to the classifier (concatenated with the candidate representation). The final classification loss is:

$$\mathcal{L}_{\text{clf}} = \frac{1}{M} \sum_{j=1}^{M} \mathcal{L}_{\text{clf}}^{\mu_j}$$

In practice, we use ROUGE-1, ROUGE-2 and ROUGE-L as the set of metrics $\mathcal{M}$. Our final model loss is a combination of the generation and classification losses:

$$\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda \, \mathcal{L}_{\text{clf}}$$

where $\lambda$ is a hyper-parameter tuned on the validation set.
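A minimal sketch of this auxiliary classifier follows, with our own naming and a simple linear scoring head per metric (the paper does not specify the head architecture, so this is an assumption for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CandidateClassifier(nn.Module):
    """One binary scoring head per metric, over [source; candidate] features."""

    def __init__(self, hidden_size, num_metrics=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden_size, 1) for _ in range(num_metrics)
        )

    def forward(self, source_repr, cand_reprs, labels):
        # source_repr: (hidden,); cand_reprs: (m, hidden)
        # labels: (num_metrics, m); labels[j, k] = 1 iff C_k maximizes metric j.
        m = cand_reprs.size(0)
        features = torch.cat([source_repr.expand(m, -1), cand_reprs], dim=-1)
        loss = 0.0
        for head, y in zip(self.heads, labels):
            logits = head(features).squeeze(-1)  # (m,)
            loss = loss + F.binary_cross_entropy_with_logits(logits, y.float())
        return loss / len(self.heads)  # averaged over metrics
```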

Input Dropout
Our model has access to several information streams at once. To make it robust and prevent it from learning to rely on a single channel (for instance, only the source document), we use input dropout (Provilkov et al., 2020). We use two variants of input dropout during training, as follows (a sketch is given after this list):
• Source dropout: To prevent the model from relying solely on the source, we replace the source $x_i$ with a placeholder token with probability $p_{src}$.
• Candidates dropout: We sample with uniform probability $k \in \{2, \ldots, m\}$ summary candidates to keep, replacing the other $m - k$ by placeholder tokens. This stops the model from simply copying the summary candidate at a fixed position (for instance, the top beam).
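A minimal sketch of both variants, assuming a generic placeholder token and an illustrative value of $p_{src}$ (the exact token and probability are configuration details not stated here):

```python
import random

PLACEHOLDER = "<mask>"  # assumed placeholder token, for illustration only

def apply_input_dropout(source, candidates, p_src=0.2):
    # Source dropout: replace the source with a placeholder with probability p_src.
    if random.random() < p_src:
        source = PLACEHOLDER
    # Candidates dropout: keep k ~ Uniform{2, ..., m} candidates,
    # replacing the other m - k by placeholders (positions preserved).
    m = len(candidates)
    k = random.randint(2, m)
    kept = set(random.sample(range(m), k))
    candidates = [c if i in kept else PLACEHOLDER for i, c in enumerate(candidates)]
    return source, candidates
```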
Experimental Settings

Abstractive Summarization Tasks
We apply SummaFusion to three popular abstractive summarization datasets, each covering a different domain. Datasets were chosen for their high level of abstraction, and are as follows:
• XSum (Narayan et al., 2018) is the task of extreme summarization. It consists of news articles being compressed into highly abstractive, single-sentence summaries. The dataset spans 227k articles from the BBC from 2010 to 2017.
• Reddit TIFU (Kim et al., 2019) corresponds to real-life stories written in the form of long blogs on the popular Reddit social media. It is made of 120k posts. As in other summarization papers (Zhang et al., 2020; Ravaut et al., 2022), we use the TIFU-long subset, containing 37k posts.
• SAMSum (Gliwa et al., 2019) is a dialogue summarization dataset made of around 16k messenger-like conversations with human-written summaries. Compression ratio is significantly higher on this dataset, as the source conversations are short.
We excluded the popular CNN/DM dataset since it is highly extractive (Hermann et al., 2015; See et al., 2017). Detailed statistics on all our datasets can be found in Table 2. As seen, on each dataset there is a high proportion of n-grams in summaries which are not found in the source, highlighting the very abstractive nature of these summarization tasks. To download the datasets, we use the HuggingFace datasets library (Lhoest et al., 2021).

Table 2: Statistics on the datasets that we used for experiments. Token counts are calculated based on PEGASUS tokenization. Compression ratio is defined as the ratio between the number of sentences in the summary and the number of sentences in the source.

Model Details
As the base model B, we use PEGASUS (Zhang et al., 2020), a strong baseline on our selected datasets. To generate candidates, we use diverse beam search (Vijayakumar et al., 2016) with 15 beams, following Ravaut et al. (2022). Diverse beam search is much better suited than beam search to our setup, due to the greater variety among the whole pool of candidates and the consequently higher oracle performance (see Appendix A for oracle results).
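For concreteness, such candidates can be produced with the HuggingFace generate() API as sketched below; the diversity penalty value is illustrative (the actual generation hyper-parameters are in Appendix B Table 14):

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

def generate_candidates(document, m=15):
    batch = tokenizer(document, return_tensors="pt", truncation=True)
    # Diverse beam search: m beams split into m groups of one beam each.
    outputs = model.generate(
        **batch,
        num_beams=m,
        num_beam_groups=m,
        diversity_penalty=1.0,  # illustrative value
        num_return_sequences=m,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```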
For the SummaFusion encoder and decoder, we use BART (Lewis et al., 2020) and experiment with both the base and the large versions, referred to as SummaFusion-base and SummaFusion-large. Although re-using PEGASUS is technically feasible, we found that SummaFusion benefits from model diversity. We train SummaFusion in both the full-shot and three few-shot setups (10-shot, 100-shot, and 1000-shot).

Training and Optimization
As a second-stage supervised method, SummaFusion suffers from an inherent train-test distribution mismatch: one cannot train SummaFusion on outputs of the base model B on the training set, as the generated summaries would follow a different distribution than the generated summaries on the validation and test sets. To alleviate this issue, we follow the 50-50 split approach used in SummaReranker (Ravaut et al., 2022). We split each training set into two equal halves, fine-tune B on each half and run inference on the other half, then train the fusion model on the concatenation of both inferred halves. At inference, we use the transfer approach and apply SummaFusion to candidates generated by another base model fine-tuned on the entire training set.
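Schematically, the 50-50 split can be sketched as below; the function names are placeholders for the fine-tuning and candidate-generation routines, not actual APIs:

```python
import random

def build_second_stage_training_data(train_set, finetune, generate):
    data = list(train_set)
    random.shuffle(data)
    half_a, half_b = data[: len(data) // 2], data[len(data) // 2 :]
    model_a = finetune(half_a)  # base model fine-tuned on half A
    model_b = finetune(half_b)  # base model fine-tuned on half B
    # Each half is summarized by the model that never saw it,
    # so candidate quality matches what is seen at validation/test time.
    examples = [(x, generate(model_b, x)) for x in half_a]
    examples += [(x, generate(model_a, x)) for x in half_b]
    return examples
```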
We follow the provided train:val:test splits for XSum and SAMSum, and use the same 80:10:10 split as Ravaut et al. (2022) on Reddit TIFU. On XSum, we use the fine-tuned PEGASUS checkpoints hosted by the HuggingFace transformers library (Wolf et al., 2020). On the other two datasets, we fine-tune our own PEGASUS, starting with the pre-trained checkpoint shared on transformers. Hyper-parameters used for fine-tuning the base PEGASUS can be found in Appendix B Table 13, and summary generation hyper-parameters in Table 14.
We initialize the SummaFusion backbone BART with the pre-trained checkpoint from (Wolf et al., 2020). To optimize the model, we train for 5 epochs with the Adam optimizer (Kingma and Ba, 2014) and a constant learning rate of 2e-5. We found λ = 1.0 to work well and used it in all the results we report. We generate SummaFusion summaries with beam search using a beam width of 10. Detailed SummaFusion fine-tuning hyper-parameters are in Appendix C.
In the few-shot setups, we use three random seeds, and for each seed we randomly sample a training set and a validation set, each of the corresponding few-shot size. We show validation and test results averaged over the three few-shot models, alongside the corresponding standard deviations.

Evaluation
We compare SummaFusion outputs with our own PEGASUS baseline with 15 diverse beams (PEGASUS (ours)), as well as with PEGASUS-random, a baseline consisting of randomly selecting a summary candidate. We also include the oracle for reference (PEGASUS-oracle) and compare with the reported PEGASUS results (Zhang et al., 2020). We use ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004), and their mean, as quantitative metrics to assess summary closeness to the target.

Full-shot Results
Full-shot results with SummaFusion-base and SummaFusion-large for all datasets are displayed in Table 3. SummaFusion-large improves ROUGE-1 compared to the PEGASUS with diverse beam search baseline by 0.30 points on XSum, 4.41 points on Reddit TIFU and 1.41 points on SAMSum. We notice that SummaFusion helps the most, relatively, on Reddit TIFU, on which the baseline performance is noticeably lower. Although based on a different type of technique, SummaFusion is comparable to other recent second-stage methods on XSum (Sun and Li, 2021).

Few-shot Results
Next, we apply SummaFusion to three few-shot scenarios: 10, 100 and 1000 labelled data points, respectively. Results are shown in Fig. 2 and Table 4.
We notice that with a lower quantity of annotated data, SummaFusion's relative gain is higher. Notably, SummaFusion dramatically improves ROUGE in 10-shot summarization. As seen in Table 4, SummaFusion is on par with or better than state-of-the-art few-shot (10-shot and 100-shot) abstractive summarization models, including the comparable second-stage method SummaReranker. PEGASUS remains a very strong 100-shot baseline. We exclude WikiTransfer (Fabbri et al., 2021) from the comparison as it was specifically designed for few-shot transfer and leverages additional data (Wikipedia) before few-shot fine-tuning. These results point in a common direction: SummaFusion works better on lower-quality candidates, "fixing" them into a much better summary. In the following section, we analyze this hypothesis.

When Is Summary Fusion Helpful?
The previous section suggested that SummaFusion is better on lower-quality base candidates, such as on Reddit-TIFU or in few-shot setups. To verify this hypothesis and better characterize the fixing behavior of SummaFusion, we split the test set across four different features:
• Summary quality: the mean ROUGE with the target, averaged over all diverse beam search candidates produced by the base model. This feature assesses the overall quality of the initial set of summary candidates.
• Candidates diversity: we compute 1 − ROUGE-1 for all pairs of summary candidates and average the results (see the sketch after this list). Since a high ROUGE-1 between candidates indicates that they overlap, the average 1 − ROUGE-1 measures how diverse the pool of summary candidates is.
• Source length: the number of words in the source document. Modeling long inputs is challenging, so we expect summarization models to work less well on longer documents.
• Compression ratio: the ratio between the number of words in the target summary and the number of words in the source. More compressive (lower ratio) data points are expected to be more challenging.
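A minimal sketch of the candidates-diversity feature referenced above, assuming the rouge_score package (the helper name is ours):

```python
from itertools import combinations
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def candidate_diversity(candidates):
    # Mean (1 - ROUGE-1 F1) over all pairs of candidates:
    # higher values mean a more diverse candidate pool.
    overlaps = [
        _scorer.score(a, b)["rouge1"].fmeasure
        for a, b in combinations(candidates, 2)
    ]
    return 1.0 - sum(overlaps) / len(overlaps)
```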
Results are shown in Fig. 3, with one subplot per feature and per dataset. There are several important takeaways from this figure:
• SummaFusion is indeed better on lower-quality base candidates (top left subfigures). On every dataset, the green curve is significantly ahead of the blue one for the summary bins of lowest quality. In fact, we notice that SummaFusion is even harmful for the summary bins of highest quality (top 20%).
• SummaFusion is better on more diverse base candidates (top right subfigures). Both the base model and SummaFusion perform worse as diversity increases, yet SummaFusion cushions the drop.
• SummaFusion is better on longer source documents (bottom left subfigures). This observation and the preceding one confirm the hypothesis that SummaFusion helps on more challenging setups.
• SummaFusion is better on longer summaries on Reddit-TIFU (bottom right subfigures).
There is no clear trend across all datasets for this last feature. It is also not intuitive which case is harder to learn: a low compression ratio means a higher degree of compression, while a high one corresponds to longer output summaries, which are more prone to decoding errors.

Abstractiveness
Because SummaFusion conditions on both the source and the first-stage candidates, we distinguish between two types of abstractiveness (a sketch of the measure follows this list):
• Source-abstractiveness: the fraction of novel n-grams in the SummaFusion summary with regards to the source document.
• Candidates-abstractiveness: the fraction of novel n-grams in the SummaFusion summary with regards to the entire pool of candidates.
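Both measures reduce to the fraction of summary n-grams unseen in a set of reference texts (the source alone, or the candidate pool). A minimal whitespace-tokenized sketch, with a helper name of our own:

```python
def novel_ngram_fraction(summary, reference_texts, n=2):
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    seen = set().union(*(ngrams(t) for t in reference_texts))
    # Fraction of summary n-grams appearing in none of the reference texts.
    return len(summary_ngrams - seen) / len(summary_ngrams)
```

Source-abstractiveness corresponds to reference_texts=[source], and candidates-abstractiveness to reference_texts=candidates.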
We analyze both types of abstractiveness in Table 5, comparing against the base PEGASUS, and also against the ground-truth summaries (for source-abstractiveness). Following other work, we measure abstractiveness with novel 1/2/3-gram counts. Surprisingly, SummaFusion is not more abstractive with regards to the source, as it is slightly less abstractive than PEGASUS. Rather, SummaFusion maintains a high level of source-abstractiveness on these highly abstractive datasets, while also offering a satisfactory level of candidates-abstractiveness.

Table 7: Input ablation on all datasets. We experiment with removing either the source or the entire set of candidates at inference.

Figure 4: Pruning input candidates on the three datasets. We make inference with SummaFusion with a gradually increasing number of first-stage summary candidates. SF is SummaFusion.

Ablation
To better understand how the components of the model interact with each other in SummaFusion, we run an ablation study. We also compare our model with a naive baseline that simply concatenates the truncated source and all summary candidates, referred to as Concat-baseline. Results are shown in Table 6. Compared to its ablated versions and the Concat-baseline, SummaFusion achieves much higher ROUGE while maintaining high source-abstractiveness and candidates-abstractiveness. The Concat-baseline, despite reaching good source-abstractiveness, is not able to produce a satisfactory level of candidates-abstractiveness (only 13.27% novel 2-grams, compared to 46.31% for SummaFusion).
To assess the importance of each input stream, we perform inference while removing either the source or the entire set of first-stage candidates. Removal is done by replacing inputs with the placeholder tokens used during input dropout (§3.3). As we can see in Table 7, both the source and the candidates are necessary for SummaFusion to reach full performance. Fig. 4 provides finer-grained insights into how performance varies with a gradually increasing number of input candidates. This confirms our choice of conditioning SummaFusion decoding on both the source and all the first-stage candidates.

Human Evaluation
We run a human study comparing a baseline summary with the SummaFusion one on 50 random samples from each dataset. As baselines, we use both PEGASUS (Zhang et al., 2020) and SummaReranker (Ravaut et al., 2022), the latter in order to compare to another second-stage method. Human volunteers are graduate students with professional English proficiency, and we select three volunteers per dataset. Human graders have to decide which summary they prefer, or declare a tie. In the former case, they also have to select at least one reason motivating the preference among the following three: the summary is more informative, more fluent or grammatical, or more factually consistent with the source.
As we see in Table 8, humans clearly prefer SummaFusion summaries over PEGASUS ones on all datasets. The difference is striking in terms of fluency, and of informativeness on Reddit-TIFU. On XSum, SummaReranker and SummaFusion summaries are deemed of equal quality, but SummaFusion is preferred on the two other datasets.

Since during training SummaFusion sees between 2 and 15 candidates (see Table 12), we have the flexibility to input any number of candidates in this range at inference. We experiment with m ∈ {5, 10, 15}.
Results on Reddit TIFU are summarized in Table 9. Impressively, SummaFusion can outperform the oracle in more than 30% of cases with 5 candidates. This unveils yet another interesting use case for SummaFusion: if the computational budget is limited at inference and the beam width is capped to a lower value, it is also very beneficial to use SummaFusion to improve on the generated summaries.

Conclusion
We introduced SummaFusion, the first method for abstractive second-stage summarization. Our model encodes the source document and each diverse beam search summary candidate individually, and fuses them in the decoder. It is designed for very abstractive summarization tasks, and works especially well on challenging data points such as longer source documents. We achieve state-of-the-art ROUGE results in 10-shot and 100-shot summarization on XSum, Reddit-TIFU and SAMSum. Besides, fused summaries are favored by humans over first-stage PEGASUS candidates.

Limitations
As a second-stage abstractive summarization model, a drawback of our approach is that it requires training an additional model on top of the base summarization model. Besides, during training, because we split the training set into halves, we actually need to train two base summarization models. We also need to generate summary candidates for each data point of the training, validation and test sets, which is time consuming. For these reasons, SummaFusion presents some computational overhead. Nevertheless, training and inference fit on a single Nvidia RTX 6000 24GB GPU.
We also observed that SummaFusion works less well with beam search candidates than with diverse beam search ones. While we attribute this to beam search candidates being too similar to each other, it remains an open question how to improve SummaFusion in such cases with very similar input candidates.

Ethics Statement
Our proposed approach is an abstractive summarization method. Therefore, it is prone to hallucinations, i.e. generating summaries containing facts not in the source document, some of which may be wrong. Model outputs should thus be analyzed with caution in critical scenarios.

A Oracle Scores
We show oracle ROUGE-1/2/L scores for the same PEGASUS decoded with 15 candidates, for both beam search and diverse beam search.

In few-shot SummaFusion fine-tuning, we make the following changes to the values from Table 15:

D More Qualitative Examples
In the following pages, we show examples of SummaFusion outputs in the same format as Table 1 on the other two datasets.

Figure 1: SummaFusion model architecture. SummaFusion encodes the source and each of the m summary candidates separately, then concatenates their representations along the sequence dimension before decoding.

Figure 2: Few-shot ROUGE results on the three datasets. Vertical bars on the dots represent standard deviation over the random seeds. Results at "all" correspond to full-dataset fine-tuning (see Table 3). Dashed lines correspond to validation results, and full lines to test set results.

Figure 3: Fine-grained analysis on the three datasets. We split data points over four features: average quality of summary candidate pools (top left plots), average diversity of summary candidate pools (top right plots), source length (bottom left plots), and compression ratio (bottom right plots). Across each feature, we split the test set into 10 bins of equal size, and within each bin compute the mean ROUGE of the baseline PEGASUS top beam as well as the mean ROUGE of SummaFusion-large.

Table 3: ROUGE results on the three datasets with the PEGASUS base model. The first block shows performance of generated summaries after the first stage, while the second block corresponds to second-stage summarization models. BS denotes beam search, DBS diverse beam search, and R-1/2/L means ROUGE-1/2/L. Gain is the relative gain over the mean of ROUGE-1, ROUGE-2 and ROUGE-L from our own PEGASUS DBS baseline. *SummaReranker is trained on its recommended setup of a mix of beam search and diverse beam search summary candidates. Results in italics are not directly comparable, as they either involve accessing the target (oracle), or are obtained on a different split (Reddit).

Table 4: Detailed few-shot ROUGE results. On Reddit TIFU, PEGASUS results are in italics as they are not directly comparable (different train:val:test split). Both second-stage methods, SummaReranker and SummaFusion (bottom blocks), are trained and inferred on the same DBS candidates from our PEGASUS baseline.


Table 5: Abstractiveness. We report proportions of novel n-grams on the test set of each dataset, with regards to both the source and the sets of candidates.

Table 6: Model ablation study on Reddit-TIFU. We cumulatively remove components of SummaFusion, and report results on the test set.

Table 8: Human evaluation on all datasets. We show mean counts over three humans rating 50 data points in each dataset, with standard deviations in parentheses. For each dataset, the first block compares the PEGASUS summary with the SummaFusion one, while the second block compares SummaReranker with SummaFusion.

Table 9: Percentages of cases where SummaFusion surpasses the first-stage oracle on the Reddit-TIFU test set.

Table 10: Oracle ROUGE-1/2/L results for two decoding methods on all three datasets with a base PEGASUS with 15 generated candidates. As seen in Table 10, diverse beam search leads to higher oracle values than beam search, motivating our choice to use it.

Table 11: Oracle standard deviations on all three datasets with a base PEGASUS decoded with diverse beam search with 15 generated candidates.

Table 12: Oracle ROUGE-1/2/L results when varying the beam width on all three datasets with a base PEGASUS with diverse beam search.

Table 13: PEGASUS fine-tuning hyper-parameters in full-shot and few-shot setups used across all three datasets. LR stands for learning rate, LS means label smoothing, BS is the effective batch size, and Eval. frequency represents the number of training batches between two consecutive evaluations of the model during training.

Table 14: PEGASUS generation hyper-parameters used to obtain 15 first-stage summary candidates with diverse beam search.

Table 15: SummaFusion fine-tuning hyper-parameters (full-shot) for each dataset. LR stands for learning rate, Optim. is the optimizer, BS is the effective batch size, and Eval. frequency represents the number of training batches between two consecutive evaluations of the model during training. We truncate each of the input first-stage candidates to the 95th-percentile value of the summary length distribution on each dataset (shown in the Max tokens per candidate column).