Leveraging Locality in Abstractive Text Summarization

Neural attention models have achieved significant improvements on many natural language processing tasks. However, the quadratic memory complexity of the self-attention module with respect to the input length hinders their application to long text summarization. Instead of designing more efficient attention modules, we approach this problem by investigating whether models with a restricted context can be competitive with memory-efficient attention models that maintain a global context by treating the input as a single sequence. Our model is applied to individual pages, which contain parts of the input grouped by the principle of locality, during both the encoding and decoding stages. We empirically investigate three kinds of locality in text summarization at different levels of granularity, ranging from sentences to documents. Our experimental results show that our model outperforms strong baseline models with efficient attention modules, and our analysis provides further insights into our locality-aware modeling strategy.


Introduction
Neural abstractive summarization (Rush et al., 2015; Nallapati et al., 2016) is mainly formulated as a sequence-to-sequence (Seq2Seq) problem (Sutskever et al., 2014). Neural attention models, e.g., Transformers (Vaswani et al., 2017), have been widely used for such Seq2Seq tasks, allowing effective modeling of various dependencies in the input and output sequences. However, the self-attention module in such models introduces quadratic memory growth with respect to the input sequence length. Consequently, for long-text summarization datasets, recent works (Beltagy et al., 2020; Kitaev et al., 2020; Zaheer et al., 2020) have explored
using efficient attention to reduce the memory footprint while still maintaining the same global context as a full-attention model: every input token can receive information from all the other input tokens. However, efficient attention is only an approximation of full attention and can show lower performance than its counterpart (Kitaev et al., 2020). To investigate an alternative memory-efficient modeling approach, we argue that models with a restricted context, where each token only receives a subset of tokens as its context during the entire computation, can be competitive with efficient attention models if they can effectively leverage locality in text summarization. Locality, or the principle of locality, is one of the fundamental principles of virtual memory systems (Denning, 2005), and exists in a wide range of domains (Koopman et al., 2013; Fonseca et al., 2003; Zamanian et al., 2015). A classic example is the spatial locality in computer memory systems: data units that are stored closely on disk are likely to be accessed during a short time period by a computer process, so it is beneficial to read a block of data into memory as a page instead of reading only one data unit at a time. Such patterns also exist in text summarization. For example, on the arXiv dataset, we observe an intrinsic spatial locality in source documents: the closer two sentences are in the document, the more semantically similar they are (Fig. 1). This observation supports the inductive bias of window attention (Beltagy et al., 2020; Zaheer et al., 2020), which allows each token to interact with its neighboring tokens within the window size.
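For illustration, the locality analysis in Fig. 1 can be approximated with a short script. The sketch below is not the authors' analysis code: the SimCSE checkpoint name, the sentence splitter, and the pooling choice are assumptions.

```python
import itertools
from collections import defaultdict

import nltk  # requires: nltk.download("punkt")
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the paper cites Gao et al. (2021) (SimCSE) but does not
# name the exact model, so this identifier is an illustrative choice.
MODEL_NAME = "princeton-nlp/sup-simcse-bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # Use the pooled [CLS] representation as the sentence embedding (an assumption).
    return model(**batch).pooler_output


def distance_vs_similarity(document, max_distance=50):
    """Average cosine similarity of sentence pairs, grouped by their index distance."""
    sentences = nltk.sent_tokenize(document)
    emb = torch.nn.functional.normalize(embed(sentences), dim=-1)
    buckets = defaultdict(list)
    for i, j in itertools.combinations(range(len(sentences)), 2):
        distance = j - i
        if distance <= max_distance:
            buckets[distance].append(float(emb[i] @ emb[j]))
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}
```

Averaging the per-distance similarities over many documents yields a curve like the one in Fig. 1.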
We introduce a framework for leveraging locality in text summarization, which reduces the memory complexity of full-attention models while still maintaining competitive performance. Instead of viewing the input document as an entire sequence, we represent an input document as a number of pages which are constructed according to the principle of locality (Fig. 2). Each of these pages is encoded independently by the encoder of our abstractive model, and the decoder makes local predictions over each page along with local confidence scores of its predictions, which are used to combine the local predictions into final outputs. In this framework, tokens in different pages never directly interact with each other during encoding and decoding, which highlights the role of locality in text summarization. In contrast, one of the key assumptions of efficient attention models is that all tokens in the input text should interact with each other, which is made possible because (1) global tokens (Beltagy et al., 2020) or overlapping window attention maintain a global context during encoding; (2) the encoder-decoder attention takes the source document embeddings as an entire sequence during decoding.
Using the proposed framework, we are able to investigate several types of locality in text summarization: (1) spatial locality or sequential locality: neighboring sentences are grouped into the same (non-overlapping) page; (2) discourse locality: different sections in a scientific paper may cover different aspects, and are therefore viewed as different pages (Cohan et al., 2018); (3) document locality: for multi-document summarization, each document in a document cluster can be viewed as an individual page (Jin and Wan, 2020). Our approach also has other advantages: (1) Our model can take full advantage of pre-trained full-attention models (e.g., BART (Lewis et al., 2020)) because it preserves the same attention mechanism as the full-attention models, unlike most efficient attention models; (2) It reduces the overall complexity of encoder self-attention to a linear relationship with the input document length. We empirically demonstrate that our model outperforms strong baseline models built upon various efficient-attention modules on several summarization datasets. Furthermore, we conduct detailed analyses of different modeling options for our framework, shedding light on its broader uses.

Preliminaries
Abstractive summarization models aim to generate a shorter text sequence as the summary of an input document. Given an input document D and a reference summary S, the standard training algorithm of a neural abstractive summarization model g adopts the cross-entropy loss, which requires the model to predict the next token of the reference summary given the input document and the prefix of the reference summary before the current token:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{l} \log p_{g_\theta}(s_i \mid D, S_{<i}), \quad (1)$$

where θ denotes the trainable parameters of the model g, p_{g_θ} is the predicted probability over the vocabulary, l is the length of the summary S, {s_1, · · · , s_i, · · · , s_l} are the tokens in S, S_{<i} denotes the partial reference sequence {s_0, · · · , s_{i−1}}, and s_0 is a pre-defined start token.
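The loss in Eq. 1 corresponds to standard teacher-forced training of an encoder-decoder model. The following is a minimal sketch, assuming a Hugging Face style encoder-decoder that returns vocabulary logits; the argument names and the start-token handling are illustrative, not the paper's training code.

```python
import torch
import torch.nn.functional as F


def xent_loss(model, input_ids, labels, decoder_start_id, pad_id):
    """Cross-entropy training loss of Eq. 1 with teacher forcing.

    `model` can be any encoder-decoder returning vocabulary logits, e.g. a
    Hugging Face BartForConditionalGeneration; names here are illustrative.
    """
    # Build the decoder input S_{<i}: the start token followed by s_1 ... s_{l-1}.
    start = torch.full_like(labels[:, :1], decoder_start_id)
    decoder_input_ids = torch.cat([start, labels[:, :-1]], dim=-1)
    logits = model(input_ids=input_ids,
                   decoder_input_ids=decoder_input_ids).logits
    # Predict s_i from (D, S_{<i}) at every position; padding tokens are ignored.
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=pad_id)
```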

Encoder-Decoder Model
The encoder-decoder model formulates abstractive summarization as a Seq2Seq task:

$$h_i = \mathrm{Decoder}(\mathrm{Encoder}(D), S_{<i}), \quad (2)$$

where h_i is the hidden representation. The generation probability is

$$p_{g_\theta}(s_i \mid D, S_{<i}) = \mathrm{softmax}(L_{\mathrm{vocab}}(h_i)), \quad (3)$$

where L_vocab is a linear projection layer.
Neural Attention and Its Limitations Neural attention modules are essential to the success of Transformers (Vaswani et al., 2017) and pre-trained language models (Radford et al., 2019; Lewis et al., 2020; Zhang et al., 2020) for language generation tasks such as machine translation or text summarization. Given a query matrix Q, a key matrix K, and a value matrix V, the output of the dot-product attention is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad (4)$$

where d_k is the dimension of the keys. Computing Eq. 4 in a parallel manner requires O(l_Q · l_K) memory space to store the intermediate result QK^T, where l_Q and l_K are the lengths of Q and K respectively. This becomes a bottleneck of the self-attention module for long input documents, where Q, K, V come from the same input D and the space complexity becomes O(l_D^2), where l_D is the length of the input document and can be very large (e.g., more than 10,000 tokens).
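For reference, Eq. 4 and its memory bottleneck can be made concrete with a minimal implementation; the tensor shapes and the 16k-token example below are illustrative.

```python
import math

import torch


def dot_product_attention(q, k, v):
    """Eq. 4: scaled dot-product attention (a minimal reference implementation).

    q: (batch, l_q, d_k), k: (batch, l_k, d_k), v: (batch, l_k, d_v).
    The intermediate `scores` tensor has shape (batch, l_q, l_k); this is the
    O(l_q * l_k) memory term, which becomes O(l_D^2) for self-attention over a
    document of length l_D.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


# A 16k-token document already needs a 16384 x 16384 score matrix per head:
# 16384 ** 2 == 268,435,456 entries, roughly 1 GiB in fp32 for a single head of a single layer.
```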

Locality-aware Abstractive Text Summarization
To avoid the quadratic growth of memory with respect to the length of the input, we introduce a different view for modeling the input text. Specifically, instead of viewing the input document as an entire text sequence, we view it as a series of non-overlapping pages with a fixed maximum length:

$$D = \{P_1, \cdots, P_i, \cdots, P_n\}, \quad (5)$$

where P_i is the i-th page and n is the number of pages. We hypothesize that, with the principle of locality, the abstractive summarizer can make local predictions about the output summary based on individual pages without having each input token interact with the entire input document:

$$h_i^{(j)} = \mathrm{Decoder}(\mathrm{Encoder}(P_j), S_{<i}), \quad (6)$$

where h_i^{(j)} is the local hidden state of the i-th token of the summary given the j-th page. Apart from the hidden state, we also require the decoder to predict a confidence score of its local prediction:

$$c_i^{(j)} = L_{\mathrm{conf}}(h_i^{(j)}), \quad (7)$$

where L_conf is a linear layer projecting the hidden state h_i^{(j)} to a scalar. The confidence scores are normalized:

$$w_i^{(j)} = \frac{\exp(c_i^{(j)})}{\sum_{k=1}^{n} \exp(c_i^{(k)})}, \quad (8)$$

and used to combine the local hidden states for predicting the final output:

$$h_i = \sum_{j=1}^{n} w_i^{(j)} h_i^{(j)}, \qquad p_{g_\theta}(s_i \mid D, S_{<i}) = \mathrm{softmax}(L_{\mathrm{vocab}}(h_i)). \quad (9)$$

Fine-tuning from Pre-trained Models Our model can be directly initialized from a pre-trained language model (e.g., BART (Lewis et al., 2020)) except for the additional linear layer L_conf (Eq. 7).
The cross-entropy loss (Eq. 1) with label smoothing (Szegedy et al., 2016) is used for training.
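Putting Eqs. 5-9 together, the page-wise decoding can be sketched on top of a BART backbone as follows. This is a minimal sketch under several assumptions: the backbone checkpoint name, the softmax normalization of the confidence scores, and the module structure are illustrative choices rather than the released PageSum implementation, and generation details such as caching and beam search are omitted.

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration


class PageSumSketch(nn.Module):
    """Page-wise encoding/decoding with confidence-weighted combination (Eqs. 5-9)."""

    def __init__(self, backbone="facebook/bart-large-cnn"):  # assumed checkpoint
        super().__init__()
        self.bart = BartForConditionalGeneration.from_pretrained(backbone)
        hidden = self.bart.config.d_model
        self.conf = nn.Linear(hidden, 1)  # L_conf, the only newly initialized layer

    def forward(self, page_input_ids, page_attention_mask, decoder_input_ids):
        # page_input_ids: (num_pages, src_len); decoder_input_ids: (1, tgt_len).
        n_pages = page_input_ids.size(0)
        # Encode each page independently (Eqs. 5-6); no cross-page attention.
        enc = self.bart.model.encoder(input_ids=page_input_ids,
                                      attention_mask=page_attention_mask)
        # Run the decoder once per page with the same target prefix (Eq. 6).
        dec = self.bart.model.decoder(
            input_ids=decoder_input_ids.expand(n_pages, -1),
            encoder_hidden_states=enc.last_hidden_state,
            encoder_attention_mask=page_attention_mask,
        )
        h_local = dec.last_hidden_state                 # (num_pages, tgt_len, hidden)
        w = torch.softmax(self.conf(h_local), dim=0)    # Eqs. 7-8, normalized over pages
        h = (w * h_local).sum(dim=0, keepdim=True)      # Eq. 9, combined hidden states
        return self.bart.lm_head(h) + self.bart.final_logits_bias  # vocabulary logits
```

Only the `conf` layer is newly initialized; everything else is inherited from the pre-trained checkpoint, which is what allows direct fine-tuning from BART.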
Space Complexity Our model has a linear space complexity with respect to the length of input documents. Specifically, given a pre-defined maximum page length L_page, a document of length l_D will be split into at most ⌈l_D / L_page⌉ pages. The space complexity of the encoder self-attention for one page is O(L_page^2), and the complexity for all pages is O(⌈l_D / L_page⌉ · L_page^2) = O(l_D · L_page), which is linear in l_D since L_page is a constant.
Locality in Abstractive Summarization We mainly explore three types of locality for abstractive summarization, which provide the principles of splitting an input document or document cluster (in the case of multi-document summarization) into different pages; an illustrative sketch of these splitting strategies follows the list below.
(1) Spatial Locality: in the most direct form, an input document can be sequentially split into different pages. The underlying intuition is that neighboring sentences are likely to focus on the same topic. Under this setting, each document is equally split into n_p pages, where n_p is a pre-defined number.
(2) Discourse Locality: long documents usually have a hierarchical discourse structure, and discourse units at the same level have different focuses.
For example, a scientific paper usually has multiple sections with different purposes (e.g., introduction, related work, etc.), and this discourse structure can be a useful inductive bias (Cohan et al., 2018). Under this setting, each discourse unit (e.g., a section in a scientific paper) is viewed as a page.
(3) Document Locality: for multi-document summarization, we can view each single document in the document cluster as a page. Previous work (Jin and Wan, 2020) has shown that multi-document summarization can benefit from single-document summarization models by first summarizing each document and then combining the predictions.
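Below is the sketch of the three splitting strategies referenced above; the sentence splitter and the grouping heuristics are assumptions for illustration, not the paper's preprocessing code.

```python
import nltk  # requires: nltk.download("punkt")


def spatial_pages(document: str, n_pages: int) -> list[str]:
    """Spatial locality: split the document into n_pages groups of adjacent sentences."""
    sents = nltk.sent_tokenize(document)
    per_page = max(1, -(-len(sents) // n_pages))  # ceiling division
    return [" ".join(sents[i:i + per_page]) for i in range(0, len(sents), per_page)]


def discourse_pages(sections: list[tuple[str, str]]) -> list[str]:
    """Discourse locality: one page per section, prefixed with its section name."""
    return [f"{name} {body}" for name, body in sections]


def document_pages(cluster: list[str]) -> list[str]:
    """Document locality: one page per document in a multi-document cluster."""
    return list(cluster)
```

In all three cases, each page is later truncated to the maximum page length (e.g., 1,024 tokens) before encoding.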
Related Work

Efficient Attention Models
Efficient attention models aim to reduce the memory complexity of full attention models. Their most important and commonly used building blocks are window attention (Beltagy et al., 2020; Zaheer et al., 2020) and low-rank approximation (Liu* et al., 2018; Wang et al., 2020; Peng et al., 2021; Choromanski et al., 2021).
Window attention means that each token can only receive information from its neighboring tokens located in the same window. However, multi-layer models with overlapping window attention (Beltagy et al., 2020; Zaheer et al., 2020; Manakul and Gales, 2021; Guo et al., 2021) can still maintain a global context. On the other hand, non-overlapping window attention (local attention) with fixed windows (Liu* et al., 2018; Zhao et al., 2020; Pietruszka et al., 2020) has a restricted context since tokens in different windows cannot interact with each other. Instead of using fixed windows throughout the model, window attention with learnable patterns (Kitaev et al., 2020; Tay et al., 2020; Huang et al., 2021) offers more flexibility because windows can be dynamically constructed at different layers of the model, which allows a larger context. Headwise sparse attention (Qiu et al., 2020; Huang et al., 2021) is another method of reducing memory usage while preserving a global context.
Compared to these methods, our model has a distinct feature in that we maintain a local context of the input tokens at both the encoding and decoding stages. Zhao et al. (2020) proposed a similar blockwise encoder-decoder attention module which only uses a subset of input tokens (blocks) at each decoding step. However, our method differs from theirs in that our model dynamically combines the local predictions over all the individual pages into the final output (Eq. 9).

Hierarchical Summarization Models
Hierarchical attention (Yang et al., 2016) is a common building block of hierarchical summarization models. Hierarchical models have also been widely used for multi-document summarization. Hierarchical attention can focus on the sentence level (Fabbri et al., 2019), the paragraph level (Liu and Lapata, 2019), and the document level (Zhang et al., 2018; Jin and Wan, 2020; Jin et al., 2020). Ernst et al. (2021) proposed a proposition-level clustering algorithm, which generates summaries from each of the proposition clusters extracted from source documents.
Multi-stage methods for text summarization (Chen and Bansal, 2018; Xu and Durrett, 2019; Pilault et al., 2020) also have a hierarchical structure. In particular, Zhang et al. (2022) first generate a coarse summary for each part of the input document, then further summarize the generated summaries. Mao et al. (2022) first extract sentences from the source documents, and then generate the summary based on the selected sentences.
Our method introduces pages as a new, unified abstraction for hierarchical models, which can be instantiated as sentence clusters, scientific paper sections, or entire documents in a document cluster. Furthermore, unlike previous work, our model emphasizes the role of locality by preventing explicit interactions among different units (pages) at the higher levels of the hierarchy.


Experiments
Baselines We use the following top-performing models as baselines for comparison.
(3) PRIMERA (Xiao et al., 2022) shares the same architecture as LED, but has task-specific pretraining for multi-document summarization.
It uses full attention and not sparse attention.
Implementation Details We use BART (around 400M parameters) as the backbone of our model, except for the additional linear layer computing the confidence scores (Eq. 7). We initialize the model from either a checkpoint pre-trained on the CNN/DailyMail dataset (Hermann et al., 2015; Nallapati et al., 2016) or its counterpart without the CNN/DailyMail pre-training. We select the model checkpoints based on their performance on the validation set, using the cross-entropy loss (Eq. 1). We use ROUGE (Lin, 2004) as the automatic evaluation metric for performance comparison. More specifically, we report the F1 scores of ROUGE-1/2/L in our experiments. We name our model PageSum in the following experiments.
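For reference, the ROUGE-1/2/L F1 evaluation can be sketched with the `rouge-score` package as follows; the paper uses its own evaluation script, so the package choice and aggregation here are assumptions.

```python
from rouge_score import rouge_scorer


def rouge_f1(predictions, references):
    """Average ROUGE-1/2/L F1 scores over paired predictions and references."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(target=ref, prediction=pred)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: value / len(predictions) for key, value in totals.items()}
```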

Exp-I: Spatial Locality
We first investigate the case of spatial locality, where the sentences in the source document are sequentially split into different pages with the same number of sentences. The maximum number of tokens for one page is 1,024.
We report the model performance in Tab. 1 on the arXiv, PubMed, and GovReport datasets. For a fair comparison, we used publicly available LED checkpoints from Hugging Face's Transformers (Wolf et al., 2020) on arXiv ('allenai/led-large-16384-arxiv') and PubMed ('patrickvonplaten/led-large-16384-pubmed') to generate the summaries and used our own evaluation script; the difference from the originally reported results is likely because the original implementation uses window attention with 512 tokens while the Hugging Face checkpoints use 1,024 tokens. We make the following observations. (1) PageSum achieves better ROUGE scores on all three long text summarization datasets compared with the baselines that leverage efficient attention modules.

(2) On PubMed, HAT-BART achieves slightly better performance than PageSum, likely because HAT-BART uses full attention instead of efficient attention.
(3) On GovReport, increasing the maximum input length helps to improve PageSum's performance.

Exp-II: Discourse Locality
We use the arXiv dataset to explore another locality principle: discourse locality. Specifically, we view each section of the input document as an individual page. The maximum number of tokens for one page is still 1,024; however, here we allow each example to have a different number of pages because documents can have different numbers of sections. For each page, we concatenate the name of the section and its content together as the input.
The results in Tab. 2 show that PageSum with discourse locality achieves higher ROUGE scores than PageSum with spatial locality. In addition, we note that with discourse locality, PageSum can also generate more coherent summaries. Specifically, following Bommasani and Cardie (2020), we evaluate the semantic coherence of the generated summaries using the next sentence prediction (NSP) task (Devlin et al., 2019). We use a pre-trained BERT model to predict the probability (p_BERT) of one sentence S_(i−1) in the summary S being followed by the next sentence S_(i):

$$\mathrm{coherence}(S) = \frac{1}{N_S - 1} \sum_{i=2}^{N_S} p_{\mathrm{BERT}}\big(S_{(i)} \mid S_{(i-1)}\big), \quad (11)$$

where N_S is the number of sentences in the summary. Tab. 3 shows the average semantic coherence of summaries. The summaries generated by PageSum with discourse locality have higher semantic coherence, suggesting that grouping the sentences based on discourse structures helps to generate more well-structured summaries.
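A minimal sketch of this NSP-based coherence measure is shown below, assuming `bert-base-uncased` as the NSP model (the exact checkpoint used in the paper is not reproduced here); in the Hugging Face implementation, index 0 of the NSP logits corresponds to the "is next sentence" label.

```python
import nltk  # requires: nltk.download("punkt")
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()


@torch.no_grad()
def semantic_coherence(summary: str) -> float:
    """Average NSP probability over adjacent sentence pairs (Eq. 11)."""
    sents = nltk.sent_tokenize(summary)
    if len(sents) < 2:
        return 1.0
    probs = []
    for prev_sent, next_sent in zip(sents[:-1], sents[1:]):
        inputs = tokenizer(prev_sent, next_sent, return_tensors="pt", truncation=True)
        logits = nsp_model(**inputs).logits
        # Label 0 means "next_sent actually follows prev_sent".
        probs.append(torch.softmax(logits, dim=-1)[0, 0].item())
    return sum(probs) / len(probs)
```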

Exp-III: Document Locality
For multi-document summarization, we evaluate PageSum with document locality on MultiNews, where we view each document in the document cluster as a page. The other experimental settings are the same as in §5.3. In addition to the baseline systems in §5.1, we add another model, BART-Long-Graph (Pasunuru et al., 2021), for comparison, which is specifically designed for multi-document summarization and achieves top performance on MultiNews. The results are shown in Tab. 4. PageSum achieves strong performance in this setting, outperforming the previous state-of-the-art models. We also note that PageSum with document locality achieves much better performance than its counterpart with spatial locality, suggesting the importance of choosing a suitable locality for a specific task.

Analysis
We analyze several important aspects of our method to gain further insights.
Page Size To investigate how the maximum length of a page affects the model performance, we conduct experiments with different page sizes on arXiv. For a fair comparison, we first truncate each document in arXiv to 4,096 tokens, then split the document into different pages based on the page size. The results are shown in Tab. 5. We observe that increasing the page size generally helps to improve model performance. However, model performance stops increasing after the page size reaches 512 tokens.
Page-wise vs. Global Decoding Both the encoder and decoder in PageSum are designed to follow the principle of locality. Specifically, the decoder in PageSum first makes local predictions based on each encoded page (Eq. 6), which are later combined into final predictions. An alternative approach is to directly make global predictions based on the entire input document: the encoded pages are concatenated into a single sequence, which serves as the input to the decoder. We compare this option with our modeling strategy in Tab. 6. The results show that on arXiv, page-wise decoding with spatial locality has a similar performance to global decoding.
Visualizing Locality The confidence scores calculated by PageSum's decoder (Eq. 7) can be interpreted as the importance scores of different pages at each decoding step. That is, a page associated with a higher score will contribute more to the decision at the current step. Fig. 3 depicts how the importance scores change during the decoding of the reference summaries on MultiNews and arXiv using two examples. We observe two phenomena: (1) space locality: at each decoding step, only a subset of pages makes large contributions to the current prediction; (2) time locality: PageSum's decoder tends to focus on a similar subset of pages at neighboring decoding steps.

Human Evaluation for Coherence
Summary coherence is a critical aspect of summary quality, especially when the summaries are very long. Fabbri et al. (2021) show that automatic metrics have a low correlation with human evaluation results w.r.t. summary coherence, while Goyal et al. (2022) demonstrate that recent state-of-the-art summarization models can still make many coherence errors on long text summarization datasets. Therefore, we conduct a human evaluation of the coherence of system-generated summaries on the GovReport dataset to investigate this important aspect.
Following Goyal et al. (2022), we use a fine-grained human evaluation protocol which requires the annotators to identify different types of span-level coherence errors in the summaries. We adopted the taxonomy of coherence errors proposed by Goyal et al. (2022) and modified it for GovReport, resulting in four types of coherence errors (the definitions are taken and modified from Goyal et al. (2022)): (1) Missing Information/Reference about an Event/Object (RefE). These refer to coherence errors where an event or object is mentioned for the first time without the proper context or introduction. On GovReport, a common error is referring to an entity by its abbreviation without first introducing the entity and its full name.
(2) Abrupt Transition from the Previous Topic (TopicE). These refer to coherence errors where there is a sudden topic shift in the summary.
We show examples of these types of errors in Tab. 7. We randomly sampled 30 examples from the test set of GovReport and counted the number of text spans containing coherence errors in the summaries generated by PageSum and LED. All examples are annotated by three of the authors (the Krippendorff's alpha (Krippendorff, 2011) is 0.5719). We anonymized the examples for a fair comparison. The results are shown in Tab. 8.
Aligned with the findings in Goyal et al. (2022), we found that both LED and PageSum make a nontrivial number of coherence errors. However, PageSum makes fewer errors for each of the error types except for InconE.

Case Study: Long-Distance Dependencies
A global context can be much more important in the presence of long-distance dependencies for text summarization models (Fernandes et al., 2019; Xu et al., 2020a). To study this phenomenon, we leverage the notion of sentence fusion (Barzilay and McKeown, 2005) to investigate sentence-level dependencies. Specifically, following Lebanoff et al. (2019a,b), we define a fusion sentence in the reference summary to be a sentence that has significant overlaps with two or more sentences in the source document. Then, we define two sentences ŝ1, ŝ2 in the source document D to be interdependent if they have the most significant contribution to a fusion sentence h:

$$(\hat{s}_1, \hat{s}_2) := \arg\max_{(s_1, s_2) \subseteq D} \mathrm{overlap}\big((s_1, s_2), h\big),$$

where overlap(·) measures the contribution of a sentence pair to the fusion sentence h. More details can be found in Appendix B. We found that PageSum can fail to capture the dependencies where two interdependent sentences are far away from each other. We show such an example in Tab. 9, where the 14th sentence and the 410th sentence in the source document both contribute to the same fusion sentence. PageSum's output only captures the information in the 14th sentence. However, the impact of these potential failures is limited: as shown in Fig. 4, there are far fewer interdependent sentence pairs with long distances.
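The interdependent-pair definition above can be sketched as an exhaustive search over sentence pairs; the paper defers the exact overlap measure to its Appendix B, so the ROUGE-2 recall used below is an assumption.

```python
import itertools

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)


def overlap(pair: tuple[str, str], fusion_sentence: str) -> float:
    """Assumed overlap measure: ROUGE-2 recall of the fusion sentence vs. the concatenated pair."""
    return _scorer.score(target=fusion_sentence,
                         prediction=" ".join(pair))["rouge2"].recall


def interdependent_pair(source_sentences: list[str], fusion_sentence: str):
    """Return the sentence pair with the largest contribution to the fusion sentence.

    Exhaustive O(n^2) search over pairs; fine for analysis, not for production.
    """
    return max(itertools.combinations(source_sentences, 2),
               key=lambda pair: overlap(pair, fusion_sentence))
```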

Conclusions
We empirically investigate three kinds of locality in abstractive text summarization by using them as important inductive biases. Using a new abstraction that views the input document as a series of pages, our model emphasizes the role of locality in both the encoding and decoding stages. The experimental results show that our model achieves strong performance by following the principle of locality. We also show that it is important to select a suitable kind of locality for different application scenarios. The fact that our model has better or competitive performance compared with models equipped with efficient attention modules suggests that those models may fall short of their design objectives. Therefore, for future work, our findings call for more rigorous examinations of memory-efficient abstractive summarization models that aim to capture global features (e.g., long-distance dependencies) and maintain a global input context.

Limitations
Computation Resources While our approach can reduce the memory footprint of full-attention models, it still requires GPUs with large memory sizes (e.g., 48 GB) and a long time (more than 7 days on a single GPU) to train our model. We note that our model has a similar memory footprint to efficient-attention models such as Longformer (Beltagy et al., 2020). Therefore, the requirement for computational resources is a common challenge in long text summarization.
Long-Distance Dependencies The inductive bias of our approach is to emphasize the role of locality in abstractive text summarization. As a result, our approach can fail to capture long-distance dependencies. We have discussed this potential problem in §5.7. While we have shown that the ratio of sentence-level long-distance dependencies is relatively low in the datasets we investigated for this work, it is worthwhile to be aware of this limitation when extending our method to other datasets.
Human Evaluation While we have presented a fine-grained human evaluation of summary coherence in §5.6, there are other important aspects of summary quality, such as factual consistency (Maynez et al., 2020). However, evaluating an input-document-based aspect such as factual consistency is even more non-trivial on the datasets we used, as it requires reading entire input documents that can be more than 10K words long and having domain-specific knowledge to understand the context of scientific papers or government reports. We believe the research of long text summarization will benefit greatly from better human and automatic evaluation.

Figure 1 :
Figure 1: Intrinsic spatial locality in the arXiv dataset. The X-axis represents the distance of two sentences in source documents, measured by the difference of their locations (indexes). The Y-axis represents the average semantic similarity, calculated by the cosine similarity between sentence embeddings, which are generated by a pre-trained sentence embedding model (Gao et al., 2021). The dashed line shows the average similarity.

Figure 2 :
Figure 2: Model architecture. Our model views the source document as a number of non-overlapping pages, and the final output is a weighted combination of local predictions on the individual pages.

Figure 3 :
Figure 3: Visualization of importance scores of different pages at each decoding step on MultiNews and arXiv. Darker colors represent greater importance.

Figure 4 :
Figure 4: Number of interdependent sentence pairs with different distances on the GovReport and arXiv datasets. The X-axis represents sentence distances normalized by the number of sentences in the entire document.

Table 1 :
System performance comparison for spatial locality. R-1/2/L are the ROUGE-1/2/L F1 scores, respectively. The numbers in parentheses indicate the maximum input length (tokens). *: results reported in the original papers. ‡: results from our own evaluation script (and our own checkpoints). †: significantly better than LED ‡ (p < 0.01). PageSum denotes the model fine-tuned from a BART checkpoint pre-trained on the CNN/DailyMail dataset, while PageSum ⋆ is its counterpart without the CNN/DailyMail pre-training. For PageSum ⋆, the maximum token number is 7,168 on arXiv and PubMed, and 20,480 on GovReport.
Datasets We use the arXiv, PubMed, GovReport, and MultiNews datasets (Tab. 10) in our experiments. arXiv and PubMed are two scientific paper summarization datasets introduced by Cohan et al. (2018); the abstracts of the papers are used as the summaries of the main content of those papers. GovReport (Huang et al., 2021) is a long document summarization dataset based on reports published by the U.S. Government Accountability Office and Congressional Research Service. MultiNews (Fabbri et al., 2019) is a multi-document summarization dataset, with news articles and summaries collected from newser.com.

Table 2 :
System performance comparison for discourse locality on arXiv. R-1/2/L are the ROUGE-1/2/L F1 scores, respectively. The numbers in parentheses indicate the maximum input length. PageSum-Spatial uses spatial locality. PageSum-Discourse uses discourse locality. †: significantly better (p < 0.05).

Table 3 :
Semantic coherence (Eq. 11) of summaries on arXiv. reference is the reference summary. random is an oracle which randomly shuffles the reference summary sentences. spatial is PageSum with spatial locality, while discourse is PageSum with discourse locality. discourse has significantly higher (p < 0.01) coherence than spatial.

Table 5 :
Performance comparison of different page sizes on arXiv. Page Size denotes the number of tokens in one page. #Pages denotes the number of pages. R-1/2/L are the ROUGE-1/2/L F1 scores, respectively.

Table 6 :
Comparison of page-wise decoding and global decoding on arXiv and MultiNews. R-1/2/L are the ROUGE-1/2/L F1 scores, respectively.
RefE
Example: The Part D program, administered by the Centers for Medicare & Medicaid Services (CMS), pays Part D plan sponsors to provide drug coverage, and plan sponsors may charge beneficiaries monthly premiums in exchange for coverage. Plan sponsors and PBMs negotiate reimbursement rates for the drugs provided to beneficiaries. ... Seventy-four percent of the drug benefits management services provided under 624 Part D plans sponsors' contracts were performed by a pharmacy benefit manager (PBM) alone or in conjunction with a plan sponsor in 2016.
Explanation: The word PBM is an abbreviation for pharmacy benefit manager, which is mentioned without first introducing the full name.
TopicE
Example: ... The President may implement the recommendations suggested in the Commerce report, take other actions, or decide to take no action. After making a decision, the President has 15 days to implement the action and 30 days to submit a written statement to Congress explaining the action or inaction; he must also publish his findings in the Federal Register. While there is no specific definition of national security in the statute, it states that the investigation must consider certain factors, such as domestic production needed for projected national defense requirements; domestic capacity; ...
InconE
Example: ... To do this work, GAO selected seven states Arizona, Florida, Kansas, New Jersey, Pennsylvania, Tennessee, New York, Virginia, and Pennsylvania based on factors such as population size, Medicaid enrollment, and geographic location and interviewed CMS officials. ...
RepE
Example: ... The high productivity helped the operation come in under budget by $118 million a 36 percent reduction while the operation's cost was $185 million, 36 percent below the anticipated cost. ...
Explanation: The 36 percent reduction is mentioned twice in one sentence.

Table 7 :
Examples of different coherence errors on the GovReport dataset. RefE: Missing Information/Reference about an Event/Object. TopicE: Abrupt Transition from the Previous Topic. InconE: Inconsistent, Conflicting Information. RepE: Repetition.

Table 8 :
Human Evaluation for Coherence on GovReport. We report the number of different coherence errors made by PageSum and LED on 30 examples (averaged across three annotators). RefE: Missing Information/Reference about an Event/Object. TopicE: Abrupt Transition from the Previous Topic. InconE: Inconsistent, Conflicting Information. RepE: Repetition.

Table 9 :
Case study on GovReport about long-distance dependencies. Both the 14th and the 410th sentences contribute to the same reference sentence. PageSum's output fails to capture this long-distance dependency.