Entity-based SpanCopy for Abstractive Summarization to Improve the Factual Consistency

Discourse-aware techniques, including entity-aware approaches, play a crucial role in summarization. In this paper, we propose an entity-based SpanCopy mechanism to tackle the entity-level factual inconsistency problem in abstractive summarization, i.e., reducing the mismatched entities between the generated summaries and the source documents. Complemented by a Global Relevance component to identify summary-worthy entities, our approach demonstrates improved factual consistency while preserving saliency on four summarization datasets, contributing to the effective application of discourse-aware methods to summarization tasks.


Introduction
Abstractive text summarization, the task of generating informative and fluent summaries of the given document(s), has attracted much attention in the NLP community. While early neural approaches focused more on designing customized architectures or training schemas to better fit the summarization task (Nallapati et al., 2016; Tan et al., 2017; Liu* et al., 2018), recent works have shown that generation models pre-trained on large corpora (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020) generally perform better when fine-tuned on in-domain datasets.
However, even if these pre-trained and fine-tuned generation models achieve state-of-the-art performance with respect to standard automatic evaluation metrics, e.g., the ROUGE score (Lin, 2004) and BERTScore (Zhang* et al., 2020), the generated summaries still suffer from the problem of factual inconsistency, which means the generated summaries may not be factually consistent with the content expressed in the source documents (Kryscinski et al., 2020). Inconsistencies may exist either at the entity or the relation level (Nan et al., 2021). The former case is when the summary mentions an entity that does not appear in the source documents. The latter is when the summary does mention entities from the source documents, but expresses a relation between them that is different from the one stated in the source documents.
In this paper, we focus on the entity-level inconsistency problem, i.e., making the model generate summaries with fewer entities that do not appear in the source document(s), i.e., 'hallucinated' entities. Note, however, that hallucinated entities are not necessarily 'unfaithful' or 'wrong' (Cao et al., 2021), so the goal is to reduce them without excluding entities that do appear in the reference summary, i.e., without penalizing saliency. Table 1 shows an example of entity-level factual inconsistency from the XSum dataset. Although the content of the summary generated by the SOTA summarizer PEGASUS (Zhang et al., 2020) is roughly similar to that of the ground-truth summary, it does not accurately summarize the original documents with the proper entities. Specifically, it completely misses the entity 'Royal Marine', which appears in both the source document and the reference summary, and the entity 'Hampshire' is 'hallucinated', as it does not appear in the source document. Despite the fact that the city 'Portsmouth' is located in 'Hampshire' county, the entity itself is still an instance of factual inconsistency (i.e., an unnecessary generalization).
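The notion of a 'hallucinated' entity can be made concrete with a small sketch (the set-based, lowercased string matching below is purely illustrative; the paper relies on an off-the-shelf NER tool to extract the entities first):

```python
# Sketch: an entity in the generated summary that never appears in the
# source document counts as "hallucinated". Entity extraction is assumed
# to have happened already (e.g. with an off-the-shelf NER tool).

def hallucinated_entities(source_entities, summary_entities):
    """Return summary entities that do not appear in the source document."""
    source_set = {e.lower() for e in source_entities}
    return [e for e in summary_entities if e.lower() not in source_set]

source = ["Royal Marine", "Falklands", "Portsmouth", "Falklands War Memorial"]
generated = ["Hampshire", "Portsmouth"]

print(hallucinated_entities(source, generated))  # ['Hampshire']
```

As in the Table 1 example, 'Hampshire' is flagged because it appears in the generated summary but not in the source document.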
Prior work (Dong et al., 2020; King et al., 2022) mainly addresses the entity-level inconsistency problem in the post-processing stage. However, those methods either require additional sophisticated models, e.g., Dong et al. (2020) use a pre-trained QA model to 'revise' the generated summaries, or are built on arguably brittle heuristics (King et al., 2022). Recent work (Nan et al., 2021) proposes two ways to directly improve the end-to-end summarization model: either training with an auxiliary task, which is to recognize the summary-worthy entities in the source document using the hidden states from the encoder, or jointly generating the entities and the summaries, i.e., generating a chain of entities in the summary followed by the summary itself. Yet, neither method explicitly encourages the model to generate summaries with more valuable entities, as both aim to guide the model to detect the summary-worthy entities without any change to the summary generation process. Instead, aiming for a lean and modular solution, we propose the SpanCopy mechanism to explicitly copy the matched entities from the source documents when generating the summaries. One key advantage of our proposal is that it can be easily integrated into any pre-trained generative sequence-to-sequence model.
Since often only a few of the entities in the source documents can be included in the summary (which we call 'summary-worthy entities'), we also explore an additional Global Relevance component to better recognize the summary-worthy entities by automatically generating a prior distribution over all the entities in the source documents.
We test our proposal on four summarization datasets in the news and scientific-paper domains, comparing it with the SOTA PEGASUS system (Zhang et al., 2020). In a first set of experiments, as a sanity check, we assess our models on arguably easier subsets of these datasets, where all the entities in the reference summaries belong to the source document. In these cases, SpanCopy should definitely dominate PEGASUS, which is confirmed by the results. In a second set of experiments, we fine-tune and test on the full datasets. On this realistic and more challenging task, we find that SpanCopy (without Global Relevance) can strongly improve the entity-level factual consistency (+2.28 on average across datasets), with essentially no change in saliency (−0.06).

Abstractive Summarization
Early neural abstractive summarization models (Nallapati et al., 2016; Paulus et al., 2018; See et al., 2017) are mainly sequence-to-sequence models based on different variants of RNNs, e.g., LSTM or GRU, with additional components targeting different properties of the summaries, like redundancy (Tan et al., 2017) and coverage (See et al., 2017). However, all recurrent models suffer from serious weaknesses, such as long-term memory loss, and require excessive time to train.
To tackle these problems, researchers in the area of abstractive summarization started to use attention-based transformer models (Liu and Lapata, 2019a,b), recently reaching SOTA performance when pre-trained generative transformers are applied to the task, e.g., BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020) and PRIMERA (Xiao et al., 2021). The SpanCopy mechanism we propose in this paper can be advantageously injected into any such pre-trained model.

Factual Consistency
Despite the large improvements with respect to automatic evaluation metrics, recent studies (Cao et al., 2018; Kryscinski et al., 2020) show that around 30% of the summaries generated by SOTA summarization models contain factual inconsistencies. Ideally, the assessment of factual consistency should rely on human annotations (Maynez et al., 2020), but these are costly, time consuming and lack a unified standard. Thus, promising automatic evaluation metrics for the factual consistency of generated summaries have been explored in recent years. To assess relation-level factual consistency, two kinds of metrics have been proposed: one based on classification (Kryscinski et al., 2020), and one based on Question-Answering (Maynez et al., 2020; Durmus et al., 2020). For entity-level factual consistency, the focus of this paper, Nan et al. (2021) propose a simple but effective evaluation metric, based on the matched named entities in both generated and ground-truth summaries. In our work, we use this metric to evaluate whether the generated summaries are consistent with both the source documents and the reference summaries at the entity level.

Copy Mechanism

See et al. (2017) first apply the pointer-generator network in an abstractive summarization model, which facilitates copying words from the source documents by pointing, i.e., generating a distribution of probabilities to copy each word from the source. Following their work, Bi et al. (2020) propose PALM, in which the copy mechanism is applied on top of the transformer model; with a novel pre-training schema, the model achieves SOTA on several generative tasks, such as abstractive summarization and generative QA. More recently, Li et al. (2021) further explore how to make use of the copy history to predict the copy distribution for the current step. However, all the aforementioned works focus on copying at the word level, which tends to be sparse and noisy. Instead, in this paper, we aim to train the model to copy spans of text, i.e., the named entities.
Admittedly, some previous work has also investigated span-based copy mechanisms. Yet, those models either predict the start and end indices of a span (Zhou et al., 2018) or predict BIO labels for each token (Liu et al., 2021). Even if such strategies can copy any kind of span (clauses, n-grams, entities, phrases or the longest common sequence) from the source document, they may introduce unnecessary noise and break the coherence of the generated text. In this work, we focus on copying the spans of named entities, extracted by a high-quality NER tool, aiming to improve the factual consistency of the generated summary without negatively affecting saliency.

Transformer-based Summarizers
Typically, a transformer-based summarizer (Lewis et al., 2020; Zhang et al., 2020) consists of two steps: (i) the Encoding Step (performed by the Encoder, shown in yellow in Fig. 1), which encodes the source input(s) into a hidden space; (ii) the Decoding Step, which computes a probability distribution over the output vocabulary to generate each token of the resulting summary. In this paper, to better describe our methods in the context of a generic summarization model, we split the decoding process into two components: the Decoder itself (shown in green in Fig. 1), which outputs the representations of predicted tokens, and the Generator (shown in purple in Fig. 1), an MLP layer mapping those representations to the final probability distribution over the output vocabulary.
More formally, for a document with n tokens D = {d_1, ..., d_n}, the Encoder produces hidden states {h_1, ..., h_n}; at each decoding step i, the Decoder outputs a representation s_i of the predicted token, and the Generator maps s_i to a probability distribution over the vocabulary V.

SpanCopy Mechanism
A key problem with generic sequence-to-sequence transformer-based summarizers is that the decoding step is prone to generating factual inconsistencies, i.e., the model may make up entities or relations that are not entailed by the source documents. To address entity-level factual inconsistency, we introduce the SpanCopy mechanism into the Decoding Step; it can be conveniently plugged into any pre-trained model. Specifically, we first identify and match the entities in both the source document and the summary; then, instead of generating the entire summary word by word, we add an additional Span Copier to directly copy entities from the source document, with a Copy Gate predicting whether the model should generate the current token from the vocabulary or directly copy an entity from the source document.
Span Copier (shown in blue in Fig. 1) is an attention module over all the entities in the input document. Suppose there are |E| entities in the input document, with entity j being a span over tokens [d_{j_s}, d_{j_e}]; then the entity can simply be represented as e_j = avg([h_{j_s}, ..., h_{j_e}]), where h_i is the output of the encoder for token d_i. At each decoding step i, we compute the logit vector of copying each entity at the current step as o^c_i = [s_i^T W_c e_1, ..., s_i^T W_c e_|E|], where s_i is the Decoder output at step i and W_c is a learned projection, indicating how likely it is to copy each entity from the source document at that step. Notice that, to better balance the numeric difference caused by the sizes of the two selection spaces (|V| and |E|), we generate and combine the raw logit vectors (i.e., the non-normalized predictions of the model) from the Span Copier and the Generator, and take a softmax over the combined space to get the final probability.
Copy Gate (shown in red in Fig. 1) is a classifier mapping the decoder hidden state to a scalar, p^copy_i = σ(W_g s_i), which indicates the probability of copying an entity at step i. Conversely, 1 − p^copy_i represents the probability of generating a token from the vocabulary at step i.
Then the final probability, combining both generation over the vocabulary and copying over the entity space, is computed as p^final_i = softmax([(1 − p^copy_i) · o^g_i ; p^copy_i · o^c_i]) ∈ R^(|V|+|E|), where o^g_i ∈ R^(|V|) is the logit vector of token generation and o^c_i ∈ R^(|E|) is the logit vector of entity copying. As a result, the first |V| dimensions of the final probability represent the probabilities of generating the tokens of the vocabulary, while the following |E| dimensions contain the probabilities of copying the entities from the source document.
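The Span Copier, Copy Gate and combined softmax can be sketched in numpy with toy dimensions as follows. The bilinear attention via W_c, the toy scalar gate, and all dimensions are illustrative assumptions; only the entity averaging and the single softmax over the combined |V|+|E| space follow the text directly:

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, d = 8, 3, 4                     # toy vocab size, entity count, hidden size

h = rng.normal(size=(10, d))          # encoder outputs for 10 source tokens
spans = [(0, 2), (3, 5), (7, 9)]      # token spans of the |E| entities
# e_j = avg of the encoder outputs over the entity's token span
entities = np.stack([h[s:e + 1].mean(axis=0) for s, e in spans])

s_i = rng.normal(size=d)              # decoder state at step i
W_c = rng.normal(size=(d, d))
o_g = rng.normal(size=V)              # Generator logits over the vocabulary
o_c = entities @ (W_c @ s_i)          # Span Copier logits, one per entity

p_copy = 1.0 / (1.0 + np.exp(-s_i.sum()))   # toy scalar Copy Gate

# single softmax over the combined |V|+|E| logit space
logits = np.concatenate([(1 - p_copy) * o_g, p_copy * o_c])
p_final = np.exp(logits - logits.max())
p_final /= p_final.sum()

assert p_final.shape == (V + E,)
```

The first V entries of `p_final` are generation probabilities, the last E are copy probabilities, and together they sum to one.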
Note that the input of the original Decoder in the transformer model at each step is the embedding of the previous token (the ground-truth token during training, the predicted one at inference), but a span of text longer than one token does not naturally have a matching embedding. We simply use the average of the embeddings of all the tokens in the entity, following previous work using average embeddings to represent a span of text (Xiao and Carenini, 2019).

Loss
We use the standard loss for abstractive summarization, i.e., the cross-entropy loss between the predicted probability and the ground-truth labels. However, notice that, since the predicted probability distribution is over the combined space of vocabulary size and entity size (p^final_i ∈ R^(|V|+|E|)), the corresponding ground-truth labels can be either indices of words to be generated from the vocabulary, or indices of entities to be copied from the source document.

Table 2: Statistics of all the datasets (original/filtered), on the lengths (L_doc, L_summ) and number of entities (N_doc, N_summ) in the source documents and ground-truth summaries, as well as src_p(gt), the entity-level source-precision of the ground-truth summary.
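A minimal sketch of the cross-entropy loss over the combined label space: a ground-truth label is either a vocabulary index (< |V|) or |V| plus an entity index. The uniform toy distribution and the specific indices are made up for illustration:

```python
import numpy as np

V, E = 8, 3  # toy vocabulary size and entity count

def combined_nll(p_final, label):
    """Negative log-likelihood for one decoding step over the |V|+|E| space."""
    return -np.log(p_final[label])

p = np.full(V + E, 1.0 / (V + E))   # uniform toy distribution
word_label = 5                       # label: generate the 5th vocabulary token
entity_label = V + 1                 # label: copy the 2nd source entity

# both kinds of labels index into the same combined distribution
loss = combined_nll(p, word_label) + combined_nll(p, entity_label)
```

Under the uniform toy distribution both steps contribute log(|V|+|E|) to the loss, regardless of whether the label is a word or an entity.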

SpanCopy with Global Relevance
Among all the entities in the source documents, there are only a few summary-worthy entities that should be copied into the summary (e.g., around 10% in CNNDM and 1.5% in arXiv). To help the model better recognize such summary-worthy entities, we explore a Global Relevance (GR) component, which takes all the entities in the source document as inputs, and predicts how likely each entity is to appear in the final summary. We use the generated 'entity likelihood' as a prior distribution for the Span Copier component, with GR also trained as an auxiliary task.
Global Relevance is a classifier mapping the hidden state of a source-document entity to a value in [0, 1], indicating the probability that the entity should be included in the summary.
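A minimal numpy sketch of the GR component. The sigmoid classifier over averaged entity representations follows the text; how exactly the prior enters the copy logits (here an additive log-prior) is an assumption, as is every dimension and weight below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d, E = 4, 3
entities = rng.normal(size=(E, d))   # averaged entity representations
w_gr = rng.normal(size=d)            # toy GR classifier weights

gr = sigmoid(entities @ w_gr)        # P(entity appears in the summary), in (0, 1)

o_c = rng.normal(size=E)             # copy logits from the Span Copier
# one plausible way to use gr as a prior: shift each copy logit by log(gr)
o_c_prior = o_c + np.log(gr + 1e-9)

assert gr.shape == (E,)
```

Entities the classifier deems unlikely to be summary-worthy thus receive lower copy logits before the combined softmax.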
Then p^final_i in Eq. 3 is updated with gr, using the predicted relevance scores as a prior distribution over the entity-copy logits.

New Loss. As an auxiliary task, we also train the model with the ground-truth GR labels to make it more accurate. Specifically, the label y^gr_i = 1 if the i-th entity in the input document is included in the ground-truth summary. We then update the loss function with L_gr, balanced by a factor β: L = L_sum + β · L_gr.

Experiments and Analysis

Settings
SpanCopy can be plugged into any pre-trained generation model. In this paper, we use PEGASUS (Zhang et al., 2020) as our base model, since it has delivered top performance on multiple summarization datasets. We recognize named entities with an off-the-shelf NER tool. The balance factor β of GR is set by grid search on small subsets of each dataset (2k examples for training and 200 for validation).

Evaluation Metrics
To evaluate the saliency and entity-level factual consistency of the generated summaries, we apply the following metrics. Saliency metrics assess the similarity of the generated summary with the reference summary. ROUGE scores (Lin, 2004) measure the n-gram overlap between generated and ground-truth summaries; we report R-1, R-2 and R-L.
Summary-precision, -recall and -f1 (sum_p, sum_r and sum_f) (Nan et al., 2021) measure the precision/recall/f1 score of the matched entities in the generated summaries and the reference summaries. With NE(S_ref) and NE(S_gen) representing the named entities in the reference and generated summaries, respectively, sum_p = |NE(S_gen) ∩ NE(S_ref)| / |NE(S_gen)|, sum_r = |NE(S_gen) ∩ NE(S_ref)| / |NE(S_ref)|, and sum_f is their harmonic mean.
These three metrics measure the entity-level saliency of the generated summaries, i.e., they capture how many of the copied (and generated) entities are salient and should be included in the summary. Entity-level factual consistency metric: this measures the named-entity matching between the generated summaries and the source documents (Nan et al., 2021). With NE(D) and NE(S_gen) representing the named entities in the source document and the generated summaries, respectively, source-precision (src_p) measures how many entities in the generated summaries come from the source documents, i.e., src_p = |NE(S_gen) ∩ NE(D)| / |NE(S_gen)|. It is an evaluation metric for entity-level factual consistency, as it directly measures how consistent the generated summaries are with the source.
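The entity-level metrics above can be sketched as follows (set-based matching of entity strings is a simplification of how NER output is compared; the harmonic-mean f1 is the standard definition):

```python
# Sketch of the entity-level metrics of Nan et al. (2021), treating each
# summary/document as a set of named-entity strings.

def precision(pred, ref):
    """|pred ∩ ref| / |pred|, i.e. the fraction of pred covered by ref."""
    pred, ref = set(pred), set(ref)
    return len(pred & ref) / len(pred) if pred else 0.0

def entity_metrics(ne_doc, ne_ref, ne_gen):
    sum_p = precision(ne_gen, ne_ref)              # entity-level saliency
    sum_r = precision(ne_ref, ne_gen)              # recall: arguments swapped
    sum_f = (2 * sum_p * sum_r / (sum_p + sum_r)) if sum_p + sum_r else 0.0
    src_p = precision(ne_gen, ne_doc)              # entity-level consistency
    return sum_p, sum_r, sum_f, src_p

doc = {"Royal Marine", "Falklands", "Portsmouth"}
ref = {"Royal Marine", "Falklands"}
gen = {"Falklands", "Hampshire"}

print(entity_metrics(doc, ref, gen))  # (0.5, 0.5, 0.5, 0.5)
```

Here the hallucinated 'Hampshire' lowers both the saliency scores and src_p, matching the intuition behind the metrics.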

Datasets
We test and compare our SpanCopy model with the original PEGASUS on four datasets, in the domains of news (CNNDM (Nallapati et al., 2016), XSum (Narayan et al., 2018)) and scientific papers (Pubmed and arXiv (Cohan et al., 2018)). As a sanity check, we initially assess our models on subsets of these datasets where all the entities in the reference summaries belong to the source document (we call these filtered datasets). In these cases (src_p(gt) = 1), SpanCopy and GR should dominate PEGASUS, because by design they tend to generate entities from the source document. We compare the sizes of the filtered and original datasets in Table 3.
The statistics of the filtered and original datasets, on the lengths and number of entities in the documents and summaries, can be found in Table 2. src_p(gt) measures the entity-level factual consistency between the source document and the ground-truth summary, with a lower value meaning that there are more novel entities in the ground-truth summaries. The table shows that the datasets in the news domain have a higher density of entities with respect to the lengths (number of words) of both documents and ground-truth summaries, i.e., N_doc/L_doc and N_summ/L_summ are larger for the news articles. A possible explanation is that news articles tend to describe an event or a story, which may contain more names of people, organizations, locations, etc., as well as dates. Interestingly, CNNDM and Pubmed contain fewer novel entities than the other two datasets (with higher src_p(gt)), something that the proposed SpanCopy mechanism may benefit from. Comparing the filtered datasets with the original ones, we can see that the number of entities in the summaries drops for all the datasets, especially for arXiv, as the more entities in the summary, the less likely they can all be matched to the source documents.

Results and Analysis
The results on the filtered and original datasets are shown in Table 4 and Table 5, respectively.

Filtered Datasets. We first evaluate our models against the backbone model, PEGASUS, on the filtered datasets, an easier task; the results can be found in Table 4. All models are fine-tuned and tested on the filtered datasets. Since we only keep the examples in which all the entities in the summaries can be matched with entities in the source documents, the theoretical ceiling of src_p is 100. Comparing SpanCopy and PEGASUS, SpanCopy performs better regarding both saliency and entity-level factual consistency. Plausibly, this is because all the entities in the ground-truth summary can be copied from the source document, in which case the SpanCopy mechanism can better learn to copy. The SpanCopy model with the GR component performs better regarding entity-level saliency on three of the four datasets. On arXiv, the performance of SpanCopy with the GR component regarding both entity-level saliency and factual consistency is quite low. One likely reason is that identifying the salient entities in the arXiv dataset is a rather difficult task: there is a large number of entities in the source documents, but only very few are summary-worthy (164.1 vs. 2.3, as shown in Table 2), which might bring in excessive noise.
Original Datasets. In a second set of experiments, we fine-tune and test on the full/original datasets. On this realistic and more challenging task, the results are encouraging. As shown in Table 5, when the SpanCopy model is compared to PEGASUS, it improves the factual consistency of the generated summaries with the source documents (src_p) on all the datasets, while maintaining a very similar performance on the saliency metrics, i.e., ROUGE and entity-level saliency. Comparing across the four datasets, SpanCopy outperforms PEGASUS on both the saliency and factual consistency metrics on the Pubmed dataset. For easier comparison, we show the relative gains/losses with respect to PEGASUS on all the datasets, as well as the overall average results, in Table 6. It is clear that the SpanCopy model performs much better regarding entity-level factual consistency (+2.13) with essentially no change in saliency (−0.06 on average ROUGE and −0.04 on entity-level saliency). Admittedly, despite the success of the GR component on the filtered datasets on both word-level and entity-level saliency, it fails to deliver any gain on the original datasets. A plausible explanation is that GR makes the model focus excessively on the entities in the source document, therefore penalizing the generation of new, potentially summary-worthy, entities.
Comparing the entity-level factual consistency on the filtered and original datasets, the filtered datasets always have higher src_p than the original ones, and the gain is especially large on the XSum and arXiv datasets, as both contain more entity-level hallucinations in their original versions. Remarkably, the performance gain of the SpanCopy model over PEGASUS is much larger on the filtered XSum dataset than on the original one (7.98 vs. 0.66), which might be because the original XSum is highly abstractive, so the entity-level guidance is especially helpful for abstractive examples whose summaries contain consistent entities.

Qualitative Analysis
For illustration, we examine a real example from the CNNDM dataset in Table 7, a news article on the evacuation of Americans during the crossfire of warring parties in Yemen. While all three system-generated summaries are able to capture the main statement that 'it's too dangerous to evacuate the Americans', the person 'Ivan Watson' mentioned in PEGASUS's summary does not exist in the source document, i.e., it is a 'hallucinated' entity. Most likely, PEGASUS generates this hallucination because 'Ivan Watson' is a senior CNN correspondent associated several times with Yemen in other news articles in the training set, and the model 'picked the entity from memory' to generate the summary without tightly adhering to the given document. In contrast, neither of our models produces entities that are not in the source document, as the SpanCopy mechanism tends to guide the model to make more use of the entities in the source document. In addition, with the GR component, although the generated summary contains more matched entities with the source document, the model is pushed too far towards copying entities that are not salient (e.g., The State Department).

Conclusion and Future Work
In this paper, we tackle the problem of entity-level factual consistency for abstractive summarization, by guiding the model to directly copy the summary-worthy entities from the source document through the novel SpanCopy mechanism (with the optional GR component), which can be integrated into any transformer-based generative framework. By running a sanity check on arguably easier subsets of four diverse summarization datasets, SpanCopy with GR is confirmed to perform better on both entity-level factual consistency and saliency. More tellingly, the experiments on the original test sets show that the SpanCopy mechanism can effectively improve the entity-level factual consistency with essentially no change in the word-level and entity-level saliency. In the future, we plan to extend our approach towards controllable generation with given entities. Specifically, instead of using the learnt GR scores, the model could generate summaries with desired entities provided by humans.

Figure 1 :
Figure 1: Structure of the model with the Entity-based SpanCopy Mechanism, with five components: Encoder, Decoder, Span Copier, Copy Gate and Generator. The upper-left bar plot shows the Global Relevance component, predicting the prior probability of all the entities {e_1, e_2, e_3, e_4} to be copied to the summary.
Entities in Source Doc: Royal Marine, Falklands, Portsmouth, Falklands War Memorial....

Table 1 :
An example of entity-level factual inconsistency from the XSum dataset. The summary generated by PEGASUS completely misses one entity (Royal Marine) and hallucinates one entity indicating a larger area than the correct one (Hampshire).
Finally, the Generator maps those vectors to distributions over the combined space, i.e., {p_1, p_2, ..., p_m}, where each predicted index t_i ∈ [0, |V| + |E|). Specifically, if t_i < |V|, then the t_i-th token of the vocabulary is generated, and if t_i ≥ |V|, the (t_i − |V|)-th entity is copied from the source document.

Table 3 :
Number of data examples in all the datasets (original vs. filtered).

Table 4 :
Results of our models and the compared backbone model (PEGASUS) on the filtered datasets. ROUGE scores and Entity(Summ) are mainly used to measure word-level and entity-level saliency, respectively. Entity(Doc) is used to measure entity-level factual consistency. Red represents the lowest value among the three models, while Green represents the highest.

Table 5 :
Results of our models and the compared backbone model (PEGASUS) on the unfiltered datasets. See Table 4 for the details of the columns.

Table 6 :
The relative ROUGE score (average of R-1, R-2 and R-L), entity-level summary-f1 and source-precision of our models, compared with the PEGASUS model on the four (original) datasets. The last block shows the overall performance across all datasets.

Table 7 :
Example of entity-level factual inconsistency, taken from the CNNDM dataset. The first block shows the entities in the source document with high GR scores (shown in parentheses) from the SpanCopy + GR model.