How Far are We from Robust Long Abstractive Summarization?

Abstractive summarization has made tremendous progress in recent years. In this work, we perform fine-grained human annotations to evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of implementing them to generate reliable summaries. For long document abstractive models, we show that the constant pursuit of state-of-the-art ROUGE results can lead us to models that generate more relevant summaries but not necessarily more factual ones. For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevance of a summary. They also reveal important limitations of factuality metrics in detecting different types of factual errors, as well as the reasons behind the effectiveness of BARTScore. We then suggest promising directions for developing factual consistency metrics. Finally, we release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.


Introduction
Pre-trained Transformers (Devlin et al., 2019; Raffel et al., 2020) have brought tremendous progress in summarizing text abstractively (Rothe et al., 2021). Unlike extractive summarization (Xiao and Carenini, 2019; Cui and Hu, 2021; Ju et al., 2021; Shi et al., 2022), abstractive summarization holds the blue-sky potential of generating summaries that are fluent and relevant to the source by intelligently paraphrasing salient content rather than merely copying from the source text (Beltagy et al., 2020; Ju et al., 2020; Zaheer et al., 2020; Huang et al., 2021). Nevertheless, even in the short document setting, Transformer-based abstractive models often generate summaries that are repetitive (See et al., 2019; Holtzman et al., 2019), ungrammatical, and factually inconsistent with the source (Durmus et al., 2020; Kryscinski et al., 2020; Maynez et al., 2020). Furthermore, current pre-trained Transformers have an input length limit that prevents them from being directly adapted to long document summarization (Lewis et al., 2020; Zhang et al., 2020), as truncation would lead to a significant loss of salient information in the remaining text. These observations naturally bring us to a question: how far are we from building a robust abstractive summarization system for long documents?
A robust abstractive summarization system should at least have (i) models that can generate high-quality summaries, and (ii) evaluation metrics that can critically assess the relevance and factuality of a summary. However, research analyzing and critiquing models (Wilber et al., 2021; Ladhak et al., 2022) and metrics (Gabriel et al., 2021; Pagnoni et al., 2021) has mainly focused on the short-document (Kryściński et al., 2019; Fabbri et al., 2021) or long-dialogue (Zhang et al., 2021) settings. Consequently, our work aims to fill the gap by systematically analyzing abstractive models and evaluation metrics under the long document setting.
To analyze the quality of current state-of-the-art long document abstractive models, we lack a set of model-generated summaries with sufficient diversity under long document settings. To this end, we implement BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020) models on arXiv (Cohan et al., 2018) and GovReport (Huang et al., 2021), as they were found to be the most effective pre-trained Transformers in a large-scale evaluation of summarization models (Fabbri et al., 2021). However, their 1,024-token input limit would lead to a significant loss of the information required to generate a high-quality summary.
Hence, closely following prior work in extending pre-trained models using sparse attention (Beltagy et al., 2020; Zaheer et al., 2020; Huang et al., 2021) and the reduce-then-summarize mechanism (Pilault et al., 2020; Zhang et al., 2022), we implement different variants of Longformer-based BART and PEGASUS to obtain a diverse set of summaries. We then perform fine-grained human analysis of the model outputs with three human annotators to qualitatively assess whether long document abstractive models can generate relevant and factually consistent summaries.
Effective evaluation metrics are also paramount, as they can critically assess model performance before release to target users. We adapt recently proposed metrics (Durmus et al., 2020; Kryscinski et al., 2020; Nan et al., 2021; Yuan et al., 2021; Laban et al., 2022) to long document settings and thoroughly analyze their strengths and weaknesses in measuring relevance and factual consistency on our annotated dataset. To the best of our knowledge, we are the first to assess abstractive models and evaluation metrics under the long document setting.
Our contributions are as follows: (1) We analyze pre-trained Transformer summarizers to encourage a rethinking of architectural designs under long document settings. (2) We release human-annotated long document abstractive model outputs to further research in human-correlated evaluation metrics across a broader range of settings. (3) We investigate summarization metrics using our annotated long document datasets to expose the limitations of metrics and provide promising directions for the future development of evaluation metrics.

Long Abstractive Models
To implement pre-trained Transformers (Devlin et al., 2019; Raffel et al., 2020) for long document summarization tasks, they have to be adapted with long document mechanisms that improve efficiency and extend their input limit (Koh et al., 2022). In this work, we focus on analyzing abstractive models after incorporating the two following long document mechanisms:

Sparse Attention This mechanism aims to reduce the quadratic complexity of Transformers to sub-quadratic complexity (Child et al., 2019; Kitaev et al., 2019; Choromanski et al., 2020) while exploiting the benefits of pre-training (Beltagy et al., 2020; Zaheer et al., 2020; Huang et al., 2021; Guo et al., 2022; Pietruszka et al., 2022). The gain in efficiency allows Transformers to be fine-tuned on downstream summarization tasks with substantially longer input text. Despite a plethora of proposals for sparse attention, Xiong et al. (2022) recently showed that simple local attention remains competitive.
Reduce-then-Summarize This approach reduces the source text to a shorter subset so that it fits within the input token limit of a Transformer. The source text can be condensed through extraction of salient sentences (Pilault et al., 2020; Zhao et al., 2020; Bajaj et al., 2021) or generation of shorter texts from segments of the source (Gidiotis and Tsoumakas, 2020; Zhang et al., 2022). These models often train Transformer-based summarizers on reduced source texts that greedily maximize ROUGE scores, and use separate retrievers during the testing stage to avoid "cheating" (Pilault et al., 2020; Manakul and Gales, 2021; Mao et al., 2022). Importantly, the retriever is also trained to maximize ROUGE to avoid a significant disconnect between the training and testing stages.

Evaluation Metrics
Given the limitations of the ROUGE metric (Chaganty et al., 2018; Kryściński et al., 2019), new metrics have been proposed to better measure two fundamental qualities of a summary: relevance and factual consistency. Relevance metrics such as ROUGE variants (Ng and Abrecht, 2015; Ganesan, 2018; ShafieiBavani et al., 2018) and BERTScore (Zhang et al., 2019) measure whether a summary contains the main ideas of the source. A factual consistency metric assesses whether a summary is factually consistent with the source (Goyal and Durrett, 2020; Wang et al., 2020). Due to the high rate of factual errors in summaries generated by short-document models (Cao et al., 2018; Maynez et al., 2020), there have been substantial efforts to develop effective metrics that can measure the factuality of a summary (Honovich et al., 2021; Xie et al., 2021; Ribeiro et al., 2022).

Generation of Model Summary
To investigate the robustness of long document abstractive systems, we need a set of model-generated summaries that can roughly represent the state of current research progress. In this section, we describe our methodology for obtaining such samples.

Model Variants
Pretraining Task We implement BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020). Both models have a 1,024 input token limit, with tokens beyond the limit truncated. We extend the input limit of BART and PEGASUS using the sparse attention and reduce-then-summarize mechanisms.
Sparse Attention We extend the input limit of the pre-trained Transformers using Longformer's adaptation to have a maximum input of 1K, 4K, and 8K tokens (Beltagy et al., 2020). Xiong et al. (2022) recently showed that local-window attentions (i.e., attending only to neighboring tokens) are sufficient and competitive against other variants. The Longformer sparse attention adaptation thus gives us a reasonable baseline representation of current long document abstractive summarizers.
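The local-window attention pattern described above can be sketched as a boolean mask in which each token attends only to its neighbors, with a few global tokens (as in Longformer) attending everywhere. This is an illustrative toy, not the actual model code; the function name and the boolean-mask formulation are our own.

```python
def local_attention_mask(n_tokens, window, global_idx=()):
    """Boolean attention mask: mask[i][j] is True iff token i may attend to j.

    Each token attends to neighbors within `window` positions on either side
    (sliding-window attention); tokens listed in `global_idx` attend to,
    and are attended by, every position (global attention).
    """
    mask = [[abs(i - j) <= window for j in range(n_tokens)]
            for i in range(n_tokens)]
    for g in global_idx:
        for j in range(n_tokens):
            mask[g][j] = True  # global token sees everything
            mask[j][g] = True  # and everything sees it
    return mask

# 8 tokens, window of 1, with token 0 (e.g., a BOS token) made global.
mask = local_attention_mask(8, window=1, global_idx=(0,))
```

Because each row has O(window) True entries (plus the global columns), the cost of attention grows linearly rather than quadratically with sequence length.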

Reduce-then-Summarize
To explore the effectiveness of the reduce-then-summarize approach, we implement an oracle retriever that greedily extracts salient sentences maximizing ROUGE-2 up to the input limit of the Transformer during both the training and inference stages. Although using reference summaries to extract salient sentences at the testing stage is considered cheating, contemporary approaches are trained to retrieve oracle summaries and are thus trained to become an oracle retriever (Manakul and Gales, 2021; Mao et al., 2022). Using an oracle retriever therefore allows us to analyze whether a ROUGE-maximizing model with a reduce-then-summarize mechanism will generate desirable summaries when the retriever is perfectly trained at its upper-bound performance. We implement models with 1K, 4K, and 8K tokens of the reduced subset.
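The greedy oracle extraction can be sketched as a loop that repeatedly adds the sentence yielding the largest gain in overlap with the reference, stopping at the token budget. For illustration we use a crude bigram-recall proxy in place of a real ROUGE-2 implementation; the function names and the proxy score are our own assumptions.

```python
def bigrams(text):
    toks = text.lower().split()
    return {(a, b) for a, b in zip(toks, toks[1:])}

def rouge2_recall(selected, reference):
    """Crude bigram-recall proxy for ROUGE-2 (illustration only)."""
    ref = bigrams(reference)
    if not ref:
        return 0.0
    return len(bigrams(" ".join(selected)) & ref) / len(ref)

def oracle_extract(sentences, reference, token_budget):
    """Greedily add the sentence that most improves the proxy score,
    stopping once no candidate fits the token budget or improves it."""
    selected, used = [], 0
    while True:
        base = rouge2_recall(selected, reference)
        best, best_gain = None, 0.0
        for s in sentences:
            if s in selected or used + len(s.split()) > token_budget:
                continue
            gain = rouge2_recall(selected + [s], reference) - base
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            return selected
        selected.append(best)
        used += len(best.split())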

Long Document Dataset
We implement the model configurations above on arXiv (Cohan et al., 2018) and GovReport (Huang et al., 2021) because they cover a wide range of topics in the scientific and general domains respectively. Both have an average source length greater than 6,000 tokens, sufficiently long to challenge pre-trained Transformers. Besides, arXiv requires models to paraphrase more compared to GovReport. Both datasets were chosen after analyzing the characteristics of 10 benchmark datasets, with details in Appendix A.5.

Kryściński et al. (2019) showed that 60% of the most important sentences lie within the leading one-third of CNN-DM articles (Nallapati et al., 2016). However, the linguistic styling and structure of a short document often differ significantly from those of a long document. To investigate how much information a model loses when processing only the leading text, we plot the distribution of salient content in arXiv and GovReport. This is done by performing human annotation on 10% of randomly sampled document-summary pairs from the arXiv (700) and GovReport (100) test sets. For each sentence in the reference summaries, we trace back the leading source paragraph position in the original document that contains the idea required to generate that sentence. The distribution plot in Figure 1 shows the source position frequency in terms of the total percentage of occurrences. The line plots illustrate the total information loss given an input limit, reflecting the information loss of a model that only takes the leading source tokens. The line plot suggests that input limits of 1K, 4K, and 8K tokens equate to roughly 80%, 40%, and 20% average information loss respectively on both datasets.
Importantly, we see more salient information distributed between 1K and 2K tokens than between 0 and 1K tokens, suggesting that the strategy of vanilla BART and PEGASUS of processing only the leading 1K tokens is sub-optimal. We hope that this result will also provide directions for future architectural designs to identify salient content.

Training Details
Given two pre-training tasks with three input limit settings for both the Longformer-based sparse attention and reduce-then-summarize settings, we obtain 12 model configurations per dataset. For 1K token configurations, we use BART-large and PEGASUS-large. For 4K and 8K token configurations, we follow Longformer's implementation in extending the position embeddings to 4K and 8K tokens by repeatedly copying the position embeddings of BART and PEGASUS. To ensure comparability, all 24 models have a fixed output length of 512 tokens and are fine-tuned independently on an RTX 3090 GPU with 24 GiB of GPU memory. We follow the original authors' train/validation/test splits of arXiv (Cohan et al., 2018) and GovReport (Huang et al., 2021). Implementation details are in Appendix A.3.

(Table 1 caption: state-of-the-art comparisons are TDT (Pang et al., 2022) for arXiv and DYLE (Mao et al., 2022) for GovReport; red represents the best dataset result and bold the best result under the sparse attention or reduce-then-summarize setting.)
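The position-embedding extension can be sketched as tiling the learned 1K table until it covers the new maximum length. Here embeddings are plain Python lists of vectors rather than model tensors, purely for illustration; the function name is our own.

```python
def extend_position_embeddings(pos_emb, new_len):
    """Tile a learned position-embedding table (a list of vectors) to a
    longer maximum length by repeatedly copying it, as in Longformer's
    adaptation of BART's 1,024 learned positions to 4K or 8K."""
    old_len = len(pos_emb)
    return [pos_emb[i % old_len] for i in range(new_len)]

# Toy table of 4 positions extended to 10; real usage would tile a
# (1024, hidden_size) weight matrix to (8192, hidden_size).
extended = extend_position_embeddings([[0.0], [1.0], [2.0], [3.0]], 10)
```

Copying (rather than randomly initializing) the new positions gives the extended model a sensible starting point, which fine-tuning then adjusts.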

ROUGE Validation
Table 1 shows that sparse attention models achieve competitive but lower ROUGE than the state-of-the-art models, arXiv's TDT (Pang et al., 2022) and GovReport's DYLE (Mao et al., 2022). Extending the vanilla BART and PEGASUS using Longformer also provides a performance boost, as the information loss is reduced exponentially when the input limit is increased from 1K to 4K and 8K. The reduce-then-summarize models achieve ROUGE scores that either match or exceed arXiv's and GovReport's state-of-the-art. As increasing the input length places more burden on reduce-then-summarize models to identify tokens that maximize ROUGE over long sequences, we see a slight decrease in ROUGE as the length is increased.
The above results indicate that the implemented Longformer-based sparse attention models can reasonably reflect current long abstractive summarization baselines, while the reduce-then-summarize models can roughly represent the summary outputs of the state-of-the-art on arXiv and GovReport. In the next two sections, we investigate whether the advancement in summarization research has brought us far enough to build a robust summarization system (i.e., model and metric), based on the summaries generated by all 24 summarizers implemented in this section. For consistency, we refer to the Longformer-based sparse attention BART and PEGASUS as BART/PEGASUS (LEAD #K), as they only take the leading input tokens, whereas the reduce-then-summarize models are referred to as BART/PEGASUS (ORACLE #K). The # symbol represents the token input length limit of the Transformer-based summarizer.

Human Evaluation of Models
To assess the overall quality of summaries, we randomly sampled 204 model-generated summaries from each dataset to be evaluated by three annotators on the relevance and factual consistency aspects. To ensure comparability between model variants, we randomly sampled document IDs from the test set and extracted all 12 corresponding model summaries to annotate. As each summary ranged from 5 to 15 sentences, we annotated 4,190 sentences, matching the scale of the large human evaluation by Pagnoni et al. (2021) of 2,250 short-document articles.

Annotation Procedures
Relevance Relevance measures whether a summary contains the main ideas of the source. As the author is arguably the best person to summarize the source, we assign relevance scores based on the percentage of the reference summary's main ideas contained in the generated summary. The relevance score of each summary is the average of the three annotation samples.
Factual Consistency Factual consistency measures whether a candidate summary is factually consistent with the source. Following Pagnoni et al. (2021), we classify each summary sentence's factuality based on seven types of errors: i) PredE - predicate in summary inconsistent with source, ii) EntityE - primary arguments or their attributes are wrong, iii) CircE - predicate's circumstantial information is wrong, iv) CorefE - co-reference error, v) LinkE - multiple sentences linked incorrectly, vi) OutE - out-of-article error, and vii) GramE - sentence(s) unreadable due to grammatical errors. Similarly, the factual consistency of a summary is the percentage of factually consistent sentences, and the final score is the average of the three samples.
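The summary-level factual consistency score described above (the percentage of error-free sentences, averaged across the three annotators) can be sketched as follows; the data layout and names are illustrative, not the authors' annotation tooling.

```python
# The seven error types from Pagnoni et al. (2021) used in the annotation.
ERROR_TYPES = {"PredE", "EntityE", "CircE", "CorefE", "LinkE", "OutE", "GramE"}

def factual_consistency(annotations):
    """annotations: one list per annotator; each list holds, per summary
    sentence, a set of error labels (an empty set means the sentence is
    factually consistent). Returns the average percentage of consistent
    sentences across annotators."""
    per_annotator = []
    for labels in annotations:
        ok = sum(1 for errs in labels if not errs)
        per_annotator.append(100.0 * ok / len(labels))
    return sum(per_annotator) / len(per_annotator)
```

For example, a 4-sentence summary judged fully consistent by one annotator but with one or two flagged sentences by the others receives the mean of the three per-annotator percentages.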
Inter-Annotator Agreement Following Fabbri et al. (2021), the inter-annotator agreement of the relevance scores between the three annotators is 0.5874, computed using Krippendorff's alpha coefficient (Krippendorff, 2011), where each score is assigned to a multiple-of-quarter interval. To calculate the inter-annotator agreement of factual consistency, we follow Durmus et al. (2020) and Pagnoni et al. (2021) in using Fleiss' kappa, κ, and the percentage, p, of annotators that agree with the majority class. Across the 4,190 sentences, we observe κ = 0.52 and p = 84%, slightly lower than but comparable to Pagnoni et al. (2021)'s result (κ = 0.58 and p = 91%).
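Fleiss' kappa over the sentence-level factuality labels can be computed as below. This is a standard textbook implementation of the statistic, not the authors' code; the data layout is our assumption.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for categorical ratings.

    ratings: one list per item (e.g., per summary sentence) containing
    the category label assigned by each rater; every item must be rated
    by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()  # overall counts per category, for chance agreement
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: fraction of agreeing rater pairs.
        p_bar += sum(v * (v - 1) for v in counts.values()) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items
    # Expected chance agreement from marginal category proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, agreement at chance level yields κ = 0, and systematic disagreement yields negative values.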

Long Abstractive Model Analysis
Relevance Benefiting from processing the oracle inputs, Figure 2 shows that BART/PEGASUS (ORACLE #K) achieve higher relevance scores than BART/PEGASUS (LEAD #K). On average, PEGASUS also performs better than BART. Looking at the models with the same pre-training tasks, we observe that BART (ORACLE #K) did not significantly outperform BART (LEAD #K) on arXiv. On the other hand, PEGASUS (ORACLE #K) shows a significant improvement over PEGASUS (LEAD #K) on both the arXiv and GovReport datasets. We hypothesize that when models take the oracle inputs, the text becomes incoherent and the immediate connection between sentences is less obvious, making it harder for BART models to understand the contextual dependencies between tokens. In contrast, PEGASUS's Gap-Sentence Generation pre-training may help models reason about the contextual dependencies of incoherent text.

Factual Consistency On average, we also observe that PEGASUS makes fewer factual errors than BART across most settings. Unlike the relevance aspect, the BART/PEGASUS (ORACLE #K) setting often achieves lower factual consistency results than BART/PEGASUS (LEAD #K). This indicates that while models can more easily capture relevant text, incoherent texts may cause them to make more factual errors. As BART/PEGASUS (ORACLE #K) utilize an oracle retriever during testing that is not allowed under normal settings, similar issues could potentially be exacerbated when a model-based retriever (Pilault et al., 2020; Manakul and Gales, 2021; Mao et al., 2022) is used to extract salient sentences from the source. Finally, this also indicates that maximizing ROUGE leads us to models with more relevant summaries that are not necessarily factual.
Summary Quality vs. Input Limit Beyond the high-level analysis of the different pre-training and mechanism results, we investigate the relationship between the input limit of the different Transformer variants and the human-annotated relevance and factual consistency scores. Table 2 shows that the relevance score increases as the input limit of the BART/PEGASUS (LEAD #K) models increases. While we see an improvement in factual consistency scores when vanilla pre-trained Transformers increase their input limits using Longformer, only BART (LEAD #K) on arXiv shows a statistically significant result. The BART/PEGASUS (ORACLE #K) models do not show conclusive results as to which configurations generate the most factually consistent summaries.

Fine-grained Analysis of Factual Errors
Under real-world scenarios, a model will not be evaluated based on the percentage of factual sentences; it is only considered robust if it generates summaries that are almost entirely error-free. However, the models generate factually inconsistent summaries, on average, 35% and 81% of the time on arXiv and GovReport respectively. The fewest errors are made by PEGASUS (LEAD 8K) on arXiv (21%) and PEGASUS (ORACLE 1K) on GovReport (60%). Given the unacceptably high number of factual errors, it is fair to conclude that the models are not sufficiently robust. Thus, it is more important to analyze the types of errors they make and how we can improve their performance on the factuality aspect. To this end, we investigate the proportion of summaries with different types of factual error instances in Figure 3.
As arXiv articles were pre-processed when the dataset was introduced by Cohan et al. (2018) while GovReport articles closely resemble the original documents (Huang et al., 2021), the task is less challenging on arXiv, and mistakes related to CorefE, EntE, and CircE are greatly reduced. Still, models under the arXiv setting generate more LinkE errors as they are required to paraphrase the source text more. We also see BART (ORACLE #K) and PEGASUS (ORACLE #K) make more LinkE errors, as the oracle input text is less coherent than the leading input text. We again observe that PEGASUS makes fewer errors than BART. The better performance of PEGASUS mostly comes from making fewer CorefE, EntE, GramE, and PredE errors.

We conclude this section by noting that while ROUGE scores show minor differences between BART and PEGASUS, human evaluation relevance and factual consistency scores reveal that PEGASUS is considerably better than BART. This conflicts with the finding of Rothe et al. (2021) that PEGASUS task-specific pre-training did not bring improvement in summarization performance, emphasizing the need to evaluate summaries based on the quality judged by a summary user rather than relying solely on the ROUGE metric.

Human Evaluation of Metrics
With high factual inconsistency rates, long abstractive summarizers remain unready for real-world implementation. It is thus paramount to ensure that the performance of future proposed models can be evaluated by metrics that are well correlated with user judgment. However, previous work on evaluation metrics has mainly focused on short document summarization settings due to (i) the lack of human-annotated long document model-generated summaries and (ii) the reliance of metrics on pre-trained language models that are fine-tuned on short document datasets (Maynez et al., 2020; Durmus et al., 2020; Wang et al., 2020). Relying on our annotated dataset, we adapt evaluation metrics proposed in prior work to the long document setting and correlate their metric scores with the average human relevance and factual consistency scores.
Factual Consistency The factual consistency metrics we assess are: OpenIE (Goodrich et al., 2019), which extracts semantic triples from the source and summary, then computes scores through embedding matching (Reimers and Gurevych, 2019); FactCC (Kryscinski et al., 2020), which adopts a weakly-supervised model approach; FEQA (Durmus et al., 2020) and QUAL (Nan et al., 2021), which evaluate factuality using a question-generation and answering (QGA) approach; and TE-MNLI (Maynez et al., 2020) and SummaC (Laban et al., 2022), which are text entailment approaches, with TE-MNLI evaluating the probability of entailment at the document level and SummaC at the sentence level. For metrics with short input limits, we extend the input limit of FactCC using Longformer and use the oracle summaries as a substitute for the source for FEQA, QUAL, and TE-MNLI. Implementation details are in Appendix A.4.

Overall Result
Relevance Contrary to past research under the short-document setting (Kryściński et al., 2019; Bhandari et al., 2020; Akter et al., 2022), Table 3 shows that ROUGE scores still correlate best with the human judgment of relevance in our settings. This provides comfort for future research to rely on the ROUGE metric for benchmarking long document abstractive models in generating relevant summaries. We hypothesize that the effectiveness of the ROUGE metric is due to the linguistic styling of long document datasets, which are often written in formal language. We caution that similar results may not be achieved by the ROUGE metric when the dataset and model-generated summaries are sufficiently abstractive.

Factual Consistency
The metrics that achieve the best overall correlation with the human factual consistency scores are fine-tuned BARTScore, followed by SummaC, FactCC, and OpenIE. Interestingly, zero-shot BARTScore also achieves the third- and fifth-best results on arXiv and GovReport respectively. Consistent with Pagnoni et al. (2021), QGA approaches do not achieve statistically significant results, except for QUAL on GovReport. From the perspective of efficiency, BARTScore and FactCC require approximately 4 days of fine-tuning per dataset on an RTX 3090 GPU, while zero-shot SummaC and OpenIE can be implemented immediately without dataset-specific training. On balance, SummaC and BARTS-FT stand out from the rest as the most effective zero-shot and fine-tuned metrics respectively. Nevertheless, it is more important to thoroughly investigate why and when the metrics will identify factual inconsistencies in model outputs.

Identification of Factual Error Types
Overall correlation with human factual consistency scores does not reveal the limitations of a metric in identifying different types of factual errors (Goyal and Durrett, 2021; Pagnoni et al., 2021). Hence, we plot the contribution of each error type to the overall correlation in Figure 4, which shows the change in correlation when the error type is excluded from the calculation. Compared to Table 3, a higher positive bar value indicates that the error type contributes more to the metric's performance, such that excluding it decreases the overall correlation.
Figure 4 shows that OpenIE and BARTScore are not able to identify entity errors (EntE) well. We hypothesize that this is because OpenIE relies on soft-embedding similarity, while BARTScore finds it reasonable to generate entities closely related to those in the source document. Nevertheless, BARTScore and OpenIE show a better ability to identify sentence linkage (LinkE) errors, as BARTScore takes the full context of the entire generated summary into account while OpenIE assesses the relationships between semantic triples. FactCC, SummaC, and QUAL, which rely only on sentence- or question-level granularity, did not see a high correlation with LinkE as they do not consider the overall context of the generated summaries. SummaC shows strong correlations with entity (EntE) and out-of-article (OutE) errors. As different metrics can better identify different factual error types, combining the advantages of various metrics to address their limitations may be worthwhile. For a simple illustration, by taking the average normalized metric scores of BARTS-FT and SummaC, we are able to increase Table 3's best Pearson correlation result on arXiv from 32% to 38% and on GovReport from 51% to 59%, representing absolute percentage point increases of 6% and 8% respectively.
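The metric combination above can be sketched as normalizing each metric's scores over the evaluated summaries and averaging them. The choice of min-max normalization here is our assumption for illustration; the paper only says "average normalized metric scores".

```python
def minmax_normalize(scores):
    """Rescale a list of metric scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine_metrics(*metric_scores):
    """Average the min-max-normalized scores of several metrics.

    Each argument is one metric's score list over the same set of
    summaries (e.g., BARTS-FT scores and SummaC scores)."""
    normalized = [minmax_normalize(s) for s in metric_scores]
    return [sum(col) / len(col) for col in zip(*normalized)]
```

Normalizing first matters because raw metric scales differ (e.g., log-likelihoods vs. entailment probabilities); without it, one metric would dominate the average.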

On the Effectiveness of BARTScore
Given the superiority of BARTScore as a factuality metric, we further analyze it in detail. BARTScore relies on BART's average log-likelihood of generating the evaluated summary conditioned on the source document: $\frac{1}{m}\sum_{t=1}^{m} \log p(y_t \mid y_{<t}, d)$, where $y_t$ represents the generated summary token at step $t$ and $d$ represents the source (Yuan et al., 2021). Under the fine-tuned variant, BARTScore is fine-tuned as a summarization model. A lower BARTScore thus indicates that the BART model assigns a lower likelihood to generating the evaluated text. This suggests that summarization models are "aware" of potentially making factual errors in the form of lower generation probability. Similar to our findings, Xu et al. (2020) found that lower generation probability (and higher entropy) leads to greater novelty in the generated tokens but a higher chance of factual inconsistency under short-document settings. Consequently, solving the factuality aspects of abstractive models and metrics from this perspective may be a fruitful direction to explore.
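Given per-token generation probabilities from a BART model, the score above reduces to an average log-likelihood. This toy assumes the conditional probabilities $p(y_t \mid y_{<t}, d)$ have already been computed by the model; obtaining them would require a forced decode of the summary, which we omit.

```python
import math

def bartscore(token_probs):
    """BARTScore as the average conditional log-likelihood of the summary:
    (1/m) * sum_t log p(y_t | y_<t, d).

    token_probs: the model's probability of each summary token given the
    preceding tokens and the source document (assumed precomputed)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)
```

A summary whose tokens the model finds unlikely (e.g., a hallucinated entity) drags the average down, which is why the score correlates with factual consistency.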
In addition, we fine-tune BARTScore on different datasets and compute its correlation with the human factual consistency scores in Table 4. BARTScore correlates better when fine-tuned on in-domain datasets. In particular, we find the best results are achieved for arXiv when BART is fine-tuned on arXiv or PubMed, and for GovReport when BART is fine-tuned on GovReport.
To validate this hypothesis, we further implement FEQA with SciBERT (Beltagy et al., 2019) fine-tuned on SQuAD (Rajpurkar et al., 2016, 2018) and QuAC (Choi et al., 2018), and we obtain a statistically significant Pearson correlation (ρ = +0.22) on arXiv, a four-fold increase compared to the original variant. This finding strongly emphasizes the importance of fine-tuning metrics on in-domain datasets. Future work on metrics could thus benefit from incorporating fine-tuning strategies (Kryscinski et al., 2020; Laban et al., 2022) rather than relying merely on publicly available models (Maynez et al., 2020; Durmus et al., 2020). Importantly, the fine-tuning strategy should be efficient and generalizable to other domains to ensure that it is not limited to short news articles.

Conclusion
In this work, we perform human evaluations of model-generated summaries to critically analyze the relevance and factual consistency aspect of models and metrics under long document settings.
For models, we highlight that the constant pursuit of higher ROUGE scores leads us to long document models with more relevant but not necessarily factual summaries. We also show that PEGASUS pre-training allows long document Transformers to make fewer factual errors and to comprehend incoherent text better, suggesting that PEGASUS can be more beneficial than BART for the reduce-then-summarize architectures that are common among long document summarizers. For metrics, we observe that ROUGE remains superior at assessing the relevance of summaries, while a fine-tuned BARTScore is most effective at evaluating the factuality of long document summaries.
We also release the annotated dataset to encourage analysis of summarization systems across a broader range of settings.We hope that this work can provide practical insights into future research to develop long document summarization systems that can be relied upon in our daily lives.

Limitations
Our findings and conclusions relied on human annotation efforts by three annotators.To balance the quality and quantity of annotation, three annotators evaluated the same 408 summary-document pairs across two datasets.While having three annotations per summary-document pair reduces the variability and enhances the final quality of annotation, increasing the size and diversity of our annotated dataset would further enhance the statistical significance of our findings.
Prior work on summarization metrics has assessed their performance on short summary-document pairs and often relied on pre-trained models with token limits that cannot be easily extended. While we have taken reasonable steps in adapting their methods to long document settings, it is plausible that better adaptation approaches can be discovered.
Finally, our experiments are conducted on the arXiv and GovReport benchmark datasets. The documents in both datasets are written in formal language. While formal language is common across long document benchmark datasets, this may result in domain bias. Our experimental processes and findings may also be limited to the English language. This is especially the case for our human-annotation process, as we relied on English grammatical rules to determine the qualitative aspects of model-generated summaries. Thus, our processes and findings are likely not applicable to long documents that are not written in English. Nevertheless, we hope that our work can indirectly inspire or be extended to research in multilingual long document summarization.

A.1 Broader Impacts
The abstractive models implemented are, in general, neural conditional generation models that have a wide range of capabilities due to their ability to carry out arbitrary language generation tasks. This may have negative societal impacts, such as generating texts that are biased towards certain minorities or unfairly discriminate against a certain group. This risk may, for example, arise from the human-annotated model dataset that we aim to release along with this work. Nevertheless, we have taken sufficient care to ensure that the potential risks of broad negative impacts are minimized. Based on our annotation, we believe that the risks of negative broader impacts are well manageable.

A.2 Human Annotated Dataset
All 408 human-annotated summaries are randomly sampled from the summaries generated by our implemented models on the arXiv and GovReport datasets. To ensure that our model summaries are annotated by human experts, we recruited three volunteers. One has years of industry experience in accounting and finance with CIMA certification, while the other two are Ph.D. students in public health and computer science. Our aim in releasing the human-annotated dataset is to encourage the development of factually consistent summarization systems (models and metrics). The dataset is intended for research use only. Beyond what is already publicly available, we have taken extra steps to ensure that the factual inconsistencies generated by the summarization models do not discriminate against any individual or uniquely identify a certain person, thereby leaking information.

A.3 Model Implementation
Our model experiments in Section 3 were implemented on arXiv and GovReport with train/validation/test splits of 203,037/6,436/6,440 and 17,519/974/973 respectively. We test two pre-trained Transformers with three input length limits each, both as baseline Longformer-only BART/PEGASUS models and as upper-bound reduce-then-summarize models. This gives us twelve model variants per dataset. For the 1K token configuration, we use BART-large and PEGASUS-large. For the 4K and 8K token configurations, we follow Longformer's implementation in extending the position embeddings to 4K and 8K tokens by repeatedly copying BART-large and PEGASUS-large's 1K position embeddings multiple times. All models are trained with teacher forcing on the same RTX 3090 GPU with 24 GiB of GPU memory. To save memory, we implement gradient checkpointing. All models have an effective batch size of 16, with a batch size of 2 and gradient accumulation steps of 8. The most expensive experiments, with the 8K limit, require approximately 3 and 4 days for Longformer-BART and Longformer-PEGASUS respectively. As ROUGE tends to prefer longer summaries (Sun et al., 2019), we fix the maximum model output length to 512 tokens. For generation, we use beam search with 5 beams and a length penalty of 2.0.
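The position-embedding extension described above can be sketched as follows. This is a minimal illustration on a toy embedding matrix; the actual implementation copies the learned 1K position embeddings inside the BART-large/PEGASUS-large checkpoints.

```python
import numpy as np

def extend_position_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Extend a position-embedding matrix to `new_len` positions by
    repeatedly copying the original table (Longformer-style extension)."""
    old_len, dim = pos_emb.shape
    n_copies = -(-new_len // old_len)  # ceiling division
    # Tile the original embeddings and truncate to the target length.
    return np.tile(pos_emb, (n_copies, 1))[:new_len]

# Toy example: a "1K" embedding table extended to "4K" positions.
toy = np.random.rand(1024, 16)
ext = extend_position_embeddings(toy, 4096)
```

Each 1K-token window thus starts from the same initialization, and the copied embeddings are further tuned during fine-tuning.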

A.4 Factual Consistency Metric Implementation
FEQA, QUAL, FactCC, and TE-MNLI were proposed to evaluate the factual consistency of model-generated summaries under short document settings. They rely on pre-trained Transformer-based models with an input limit of 1,024 tokens or lower. To extend these metric models to the long document domain, we adopt two approaches: (i) if the model requires dataset-specific fine-tuning, like FactCC, we extend the input limit of the metric model using Longformer; or (ii) if the model relies on a pre-trained model that is fine-tuned on other datasets, we extract an oracle summary of the source document whose length matches the input limit of the pre-trained model.
FactCC FactCC (Kryscinski et al., 2020) implements a BERT-based factual consistency classifier trained on synthetic data, where positive examples are non-paraphrased and paraphrased sentences from the source document, and negative examples are artificially corrupted sentences from the source document. The starting point is the uncased, base BERT model pre-trained on English data with a 512-token limit. We extend this model to 8,192 tokens using Longformer's implementation. Then, we follow the original authors' work in generating the synthetic data to train our extended BERT classifier on an RTX 3090 GPU with 24 GiB of GPU memory.
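As a rough illustration of the synthetic-data construction, the sketch below builds one labeled training pair. The word-swap corruption is a simplified stand-in for the entity, number, pronoun, and negation corruptions used in the original FactCC work.

```python
import random

def make_synthetic_pair(sentence: str, corrupt: bool, rng: random.Random):
    """Build one (claim, label) training example: positive examples keep
    the source sentence; negative examples apply a simple corruption."""
    if not corrupt:
        return sentence, "CORRECT"
    words = sentence.split()
    # Simplified corruption: swap two words to break factual consistency.
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words), "INCORRECT"

rng = random.Random(0)
claim, label = make_synthetic_pair("the model achieves strong results", True, rng)
```

The classifier is then trained to predict the label given the (document, claim) pair.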
TE-MNLI TE-MNLI (Maynez et al., 2020) is a BERT-large classifier fine-tuned on the Multi-NLI dataset (Williams et al., 2018). The classifier judges whether a summary entails the document, is neutral to the document, or contradicts the document. Since the classifier operates at the sentence level, we tokenize the candidate summary into sentences and separately evaluate the factual consistency of each sentence. The score for a candidate summary equals 1 minus the average probability of contradiction over all sentences in the candidate summary. To adapt the Multi-NLI BERT-large classifier to the long document domain, we limit the combined length of each summary sentence and the document to less than 512 tokens by replacing the source document with its oracle summary.
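The sentence-level aggregation can be sketched as follows, where the hypothetical `contradiction_prob` callable stands in for the Multi-NLI classifier applied to a (sentence, oracle summary) pair:

```python
def te_mnli_score(summary_sentences, contradiction_prob):
    """Score = 1 - mean contradiction probability over summary sentences.
    `contradiction_prob(sent)` is a stand-in for the classifier's
    P(contradiction | sentence, oracle summary of the document)."""
    probs = [contradiction_prob(s) for s in summary_sentences]
    return 1.0 - sum(probs) / len(probs)

# Toy usage with a dummy classifier in place of the real model.
score = te_mnli_score(["s1", "s2"], lambda s: {"s1": 0.2, "s2": 0.4}[s])
```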
FEQA and QUAL FEQA (Durmus et al., 2020) and QUAL (Nan et al., 2021) measure the factual consistency of summaries using a question-generation and question-answering (QGA) approach. This approach employs a question-generation model to generate questions from a given summary output.
The generated questions are then answered in two different ways: (i) conditioning on the source and (ii) conditioning on the summary. If the answers from the source and the summary match, the question is considered consistent; otherwise, it is inconsistent. QUAL attempts to improve the efficiency of this approach by combining the question-generation and question-answering steps into a single model. We limit the combined source and candidate summary length to less than 512 tokens by replacing the source document with its oracle summary.
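The answer-matching step above can be sketched as follows; the two answerer callables are hypothetical stand-ins for the source-conditioned and summary-conditioned QA models:

```python
def qga_score(questions, answer_from_source, answer_from_summary):
    """Fraction of generated questions whose source-conditioned and
    summary-conditioned answers match (matched = consistent)."""
    matches = sum(
        answer_from_source(q) == answer_from_summary(q) for q in questions
    )
    return matches / len(questions)

# Toy usage: one matching answer and one mismatch out of two questions.
score = qga_score(
    ["q1", "q2"],
    lambda q: {"q1": "2020", "q2": "paris"}[q],
    lambda q: {"q1": "2020", "q2": "london"}[q],
)
```

Real implementations compare answers with softer criteria (e.g., token-overlap F1) rather than exact string equality.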

A.5 Benchmark Dataset Comparison
The long document benchmark datasets studied in this work have been used in prior research to test and compare long document summarization models. arXiv and PubMed (Cohan et al., 2018) are scientific long document summarization datasets. BigPatent (Sharma et al., 2019) is collected from U.S. patent documents. BillSum is a dataset on summarizing state bills (Kornilova and Eidelman, 2019).
Compression Ratio measures the ratio of a source document's length to its reference summary's length. A higher compression ratio indicates a larger loss of information from the original document after summarization. Compression ratios are measured based on both tokens and sentences. Extractive Coverage and Extractive Density are introduced by Grusky et al. (2018) based on the notion of matching fragments. Fragments are obtained by greedily matching the longest shared token sequences, where F(D, S) denotes the set of fragments and each fragment f has length |f|. Extractive coverage calculates the percentage of tokens in the summary that are derived from the original source text, whereas extractive density relates to the average squared length of the extractive fragments in the summary. The former indicates the need for a model to coin novel tokens that are not in the original source text, while the latter measures whether a model can match the ground-truth summary merely by extracting from the original source text without rearranging or paraphrasing.

COVERAGE(D, S) = (1/|S|) Σ_{f ∈ F(D,S)} |f|
DENSITY(D, S) = (1/|S|) Σ_{f ∈ F(D,S)} |f|²
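The greedy fragment matching underlying these statistics can be sketched as follows. This is a simplified, token-level version of Grusky et al.'s (2018) procedure; production implementations are more efficient.

```python
def extractive_fragments(doc_tokens, sum_tokens):
    """Greedily match the longest shared token sequences (fragments)
    between a summary and its source document."""
    fragments, i = [], 0
    while i < len(sum_tokens):
        best = 0
        for j in range(len(doc_tokens)):
            k = 0
            while (i + k < len(sum_tokens) and j + k < len(doc_tokens)
                   and sum_tokens[i + k] == doc_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(sum_tokens[i:i + best])
            i += best
        else:
            i += 1  # token not found in the document; skip it
    return fragments

def coverage_and_density(doc_tokens, sum_tokens):
    """Coverage: fraction of summary tokens inside fragments.
    Density: average squared fragment length per summary token."""
    frags = extractive_fragments(doc_tokens, sum_tokens)
    n = len(sum_tokens)
    coverage = sum(len(f) for f in frags) / n
    density = sum(len(f) ** 2 for f in frags) / n
    return coverage, density

doc = "the cat sat on the mat".split()
summ = "the cat happily sat on the mat".split()
cov, den = coverage_and_density(doc, summ)
```

In this toy example the novel token "happily" lowers coverage, while the long copied span "sat on the mat" inflates density.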
Uniformity measures whether content considered important by the reference summary is uniformly scattered across the entire source document. A higher score indicates that important content is spread across the entire document with no obvious layout bias to take advantage of. It is calculated as the normalized entropy of the decile positions of salient unigrams in the source text, where salient unigrams are the top 20 keywords, excluding stopwords, extracted from the reference summary.

Fundamentals of Long Documents From Table 5, the long document datasets differ from the short document datasets in two important aspects: document length and compression ratio. Not only do long document datasets have an average document length that is 8.3 times longer than the short document datasets, they also have a considerably higher compression ratio. Compared to short documents, this suggests that either (i) there is greater compression in the summaries, and/or (ii) the source documents contain significantly more redundant information. Both aspects significantly challenge a model's ability to summarize a long document, as it is required to reason over long-range dependencies.

UNF(D, S) = -(1 / log 10) Σ_{i=1}^{10} p_i log p_i, where p_i is the proportion of salient unigrams whose position in the source document falls within the i-th decile.
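A minimal sketch of this normalized-entropy computation, with tokenization and keyword extraction abstracted away:

```python
import math

def uniformity(decile_positions):
    """Normalized entropy of salient-unigram decile positions.
    `decile_positions` lists, for each salient unigram occurrence,
    its decile (0-9) within the source document."""
    counts = [0] * 10
    for d in decile_positions:
        counts[d] += 1
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(10)  # normalize to [0, 1]

# A perfectly uniform spread scores 1; full concentration scores 0.
u_uniform = uniformity(list(range(10)))
u_concentrated = uniformity([0] * 10)
```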
Extractiveness and its Relationship with Compression Ratio Looking at the density values, BigPatent and arXiv are significantly less extractive than PubMed, BillSum, and GovReport. Thus, a summarizer needs a greater ability to paraphrase the original document under BigPatent and arXiv. This finding is important, as past work analyzing abstractive summarization of short documents has found that the quality of model-generated summaries (Tejaswin et al., 2021; Wilber et al., 2021) and the effectiveness of evaluation metrics (Gabriel et al., 2021; Pagnoni et al., 2021) vary based on the extractiveness of benchmark datasets. Intriguingly, we further observe a strongly negative correlation, ρ = −0.9186, between the extractive density and compression ratio metrics. We hypothesize that this is because, when summary length is extremely limited, summary writers are forced to intelligently paraphrase the source concisely so that the reference summaries can cover the salient contents.
Based on the findings above, for our systematic analysis of long document summarization systems (i.e., models and metrics), we choose GovReport as it is the most extractive dataset with an average compression ratio, and arXiv as it is the second most abstractive dataset with the greatest token-level compression ratio.

A.6 Human Evaluation Results for Each Model Variant
Figure 5 shows human evaluation results for each model variant on the arXiv and GovReport datasets, as annotated by our volunteers.
A.7 Fine-grained Analysis of Abstractive Summarizers' Factual Consistency

Figure 6 shows the types of factual errors that the abstractive models made on the arXiv and GovReport datasets, as annotated by our volunteers. As a long document summary has multiple sentences and can contain multiple types of errors, the total proportion may exceed 1, but the proportion for each error type should be lower than 1.
A.8 Human Correlation Results for Precision, Recall, and F1 of ROUGE and BERTScore

Table 6 shows the correlations of ROUGE and BERTScore precision, recall, and F1 scores.

Figure 1 :
Figure 1: Distribution of salient content against document length according to human annotators (left); information loss of Transformer-based abstractive summarizers for different input limits (right).

Figure 2 :
Figure 2: Average human relevance (top) and factual consistency (bottom) scores for BART and PEGASUS models with 1K, 4K, and 8K input limits.

Figure 3 :
Figure 3: Average proportion of each factual error type over all generated summaries of BART and PEGASUS models with 1K, 4K, and 8K input limits. As a long document summary has multiple sentences and can have multiple error types, the total proportion may exceed 1.

Figure 4 :
Figure 4: Change in Pearson correlation when error types are omitted. A higher value indicates a greater influence of that error type on the overall correlation result.

Figure 6 :
Figure 6: Factual consistency across different model variants. The proportion for each error type is shown as the percentage of summaries containing that type of error. As long document summaries may have multiple sentences, each summary may exhibit more than one type of error.

Table 2 :
Coefficient of simple linear regression of Relevance and Factual Consistency against Input Limit.

Table 3 :
Statistical relationship between human judgments (relevance and factual consistency) and metric scores based on Pearson correlation, ρ, and Spearman rank correlation, r, coefficients and their p-values. The upper and lower parts show results for general metrics and factual consistency metrics, respectively.

Table 4 :
Human factual consistency correlation with BARTScore variants fine-tuned on different datasets. All results are statistically significant, with p < 0.05.

Table 5 :
Comparison of short and long document summarization datasets. Intrinsic characteristics are computed as the average over test samples. Average ratios are computed from the average long document statistics over the average short document statistics.

Table 6 :
Statistical relationship between human judgments (relevance and factual consistency) and metric scores based on Pearson and Spearman rank correlation coefficients and their p-values.