Faithful to the Document or to the World? Mitigating Hallucinations via Entity-linked Knowledge in Abstractive Summarization

Despite recent advances in abstractive summarization, current summarization systems still suffer from content hallucinations where models generate text that is either irrelevant or contradictory to the source document. However, prior work has been predicated on the assumption that any generated facts not appearing explicitly in the source are undesired hallucinations. Methods have been proposed to address this scenario by ultimately improving 'faithfulness' to the source document, but in reality, there is a large portion of entities in the gold reference targets that are not directly in the source. In this work, we show that these entities are not aberrations, but they instead require utilizing external world knowledge to infer reasoning paths from entities in the source. We show that by utilizing an external knowledge base, we can improve the faithfulness of summaries without simply making them more extractive, and additionally, we show that external knowledge bases linked from the source can benefit the factuality of generated summaries.


Introduction
Despite producing fluent summaries with excellent automatic evaluation scores, current abstractive summarization methods routinely hallucinate, producing content that is not directly supported by the source text (Maynez et al., 2020). For instance, Pagnoni et al. (2021) demonstrated that 92% of the summaries generated by various summarization models on XSUM (Narayan et al., 2018) contain at least one factual error. In addition, the majority of these hallucinated errors are entity-based. Maynez et al. (2020) divided these hallucinations into intrinsic and extrinsic. While intrinsic hallucinations are errors that are incorrectly aggregated from the source, the majority of hallucinations are extrinsic, meaning they cannot be inferred directly from the source. Most prior work considers

*This work was done when the first author was an intern at Google Research.

Source
A fire crew remains at Plasgran in Manea Road, Wimblington, more than 16 hours after the incident began on Wednesday afternoon. Road closures are expected to stay in place until midday, the fire service said. About 75 firefighters worked into the night to put out the fire. They also prevented its spread to neighbouring properties. The incident was scaled down at 2300 GMT, when the fire was brought under control.

System Generated Summary
A large fire has broken out at a recycling centre in Oxfordshire, the fire service has said, forcing the closure of a road.

After Correction
A large fire has broken out at a plastic recycling centre in Cambridgeshire...

Target
An investigation has begun into the cause of a fire which has severely damaged a plastics factory in Cambridgeshire.

Reasoning Paths
1. Plasgran → industry → plastic recycling
2. Wimblington → historic county → Cambridgeshire
3. Wimblington → also known as → Wimblington, Cambridgeshire

Table 1: The target summary contains out-of-article entities (plastics factory and Cambridgeshire) that are important for comprehension. We can see that our model was able to correct the entities in the system-generated summary successfully with additional world knowledge linked from source entities (relevant facts).

Surprisingly, human annotations from Maynez et al. (2020) and Ladhak et al. (2021) suggest that many of these unfaithful hallucinations are actually factual. Thus, while they are unfaithful to the source text, they are still consistent with commonsense and world knowledge. These facts, when included in the summary, may provide additional information that is important to understand the content (Cao et al., 2022). In this work, we investigate how external knowledge/facts in open-domain KBs can both lend provenance to extrinsic hallucinations and improve the factuality of generated summaries. Table 1 shows an example of this, where our method generates the target entity 'Cambridgeshire' through a reasoning path in the knowledge base originating at the source entity 'Wimblington'. We show that the training data frequently relies on facts that are not explicitly expressed in the text but instead require external knowledge to infer correctly (Section 2). This contradicts the widely held belief that hallucinations occur as a result of ineffective learning.

Figure 1: Schematic view of building the summarization pipeline with knowledge-enhanced entity correction. A) A standard seq-to-seq T5 model produces a generated summary. B) An entity linker is used to identify and mask out entities in the generated summary to produce a skeleton summary. C, D, E) The revision model (FILM) uses the source text, skeleton, and external knowledge base to revise and correct the masked entities.

As a result, contrary to most previous work, which improves faithfulness by filtering training examples to contain only extractive entities (Nan et al., 2021; Narayan et al., 2021; Mao et al., 2020), we focus on improving the factuality of generated abstractive entities by providing additional facts that are relevant to the source. We focus on entities (e.g., person, event, location, organization) in summaries, as they are the most commonly hallucinated (Pagnoni et al., 2021; Kryscinski et al., 2020) and often contain the most salient information.
Our contributions in this work are:
• We provide a comprehensive study over XSUM and CNNDM abs analyzing the connection between source document entities and those in target summaries via facts in external knowledge bases. We find, for example, that 59.9% of location entities in the gold reference summaries of XSUM are not in the source; 31.6% of these out-of-article entities can be found by following edges in a KB originating at source document entities (Section 2.1).
• We explore methods to improve the factuality of summaries by incorporating KB facts. For instance, we propose a two-stage revision model to edit entities in the generation and consider a method that incorporates facts from an open-domain KB (Section 3).
• We propose entity-based metrics that evaluate the factuality of generations by linking entity mentions to canonical IDs in a KB, and comparing those to linked entities in the gold reference targets. This allows us to account for variations in the surface forms of entities in our evaluation (Section 4).
2 Case Study: Faithful to the Document or the World?
The motivation for this work stems from the finding that many gold reference summaries in widely used summarization datasets, such as XSUM (Narayan et al., 2018), contain entities that are not explicitly mentioned in the source. Instead, they require additional knowledge to resolve. We show that much of this knowledge can be found by identifying facts in KBs that involve source entities. On XSUM, for example, 59.9% of target location entities are not in the source. Our experiments show that 31.6% of these entities can be found in the one-hop facts linked from the source entities in the knowledge graph.

Setup
The purpose of this investigation is to see if any unmentioned entities can be found in knowledge that is closely related to the source. For our analysis, we look at two widely-used summarization datasets: XSUM and CNN/Daily Mail (CNNDM) (Hermann et al., 2015; Nallapati et al., 2016). Compared to XSUM, CNNDM is largely extractive. We therefore create CNNDM abs, an abstractive subset of the original dataset. In every CNNDM abs example, at least one location entity in the target is not present in the source. We obtained 95387/4357/3769 data instances for train/dev/test on this subset. For a given example, we seek to supplement the source document with additional world knowledge by constructing a subgraph made up of facts in the Wikidata KB originating at source entities. To identify entities and their types in the source and targets, we use Google Cloud NLU for named entity recognition (NER) (Ratinov and Roth, 2009) and entity linking (Bunescu and Paşca, 2006). Linking is necessary as the surface forms of entities often vary; for example, Wikidata ID Q30 (United States of America) has 18 different surface forms. For each of the extracted entities in the source, we collect the set of one-hop facts in Wikidata that originate at those entities, which we call our knowledge subgraph.
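
The one-hop construction described above can be sketched as follows. This is a minimal illustration over a toy in-memory list of triples, not the Wikidata pipeline used in the paper; the names `TOY_KB`, `build_index`, and `one_hop_subgraph` are hypothetical, and the third triple is an invented placeholder (the first two come from the reasoning paths in Table 1).

```python
from collections import defaultdict

# Toy KB of (subject, relation, object) triples. Entities are plain strings
# here; a real pipeline would use linked Wikidata IDs instead.
TOY_KB = [
    ("Wimblington", "historic county", "Cambridgeshire"),
    ("Plasgran", "industry", "plastic recycling"),
    # Placeholder fact about an entity that is NOT in the source document.
    ("Some Other Place", "country", "Some Country"),
]


def build_index(triples):
    """Group facts by subject so lookups by source entity are cheap."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((subj, rel, obj))
    return index


def one_hop_subgraph(source_entities, index):
    """Collect every fact whose subject is a linked source entity."""
    facts = []
    for entity in sorted(source_entities):
        facts.extend(index.get(entity, []))
    return facts


index = build_index(TOY_KB)
subgraph = one_hop_subgraph({"Wimblington", "Plasgran"}, index)
for fact in subgraph:
    print(fact)
```

Indexing by subject reflects the fact that only edges originating at source entities are collected; edges pointing into a source entity are ignored in this sketch.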

Findings
Figure 2 shows that, depending on the entity category, 60% to 79% of target entities do not appear in the source in XSUM. However, a large portion of these so-called out-of-article entities can instead be found in the knowledge subgraph that we constructed. Depending on the entity category, 8% to 19% of target entities can be identified exclusively in the knowledge subgraph, resulting in a 20.6% to 57.1% improvement in entity coverage when compared to the set of source entities.
Following single-hop KB links alone does not yield all related entities. This is mostly due to both the limited schema of the KB and the fact that the KB itself is highly incomplete (Lin et al., 2018; Ebisu and Ichise, 2019). However, relevant information can often be encoded in multi-hop paths through the graph. For example, the KB schema may not include the relation type born in state and therefore lack a single-hop path connecting a person to the state where they were born. However, this fact could instead be inferred via the two-hop path (Barack Obama → born in → Honolulu → capital of → Hawaii). To investigate this, we check whether an increase in the number of hops through the KB would result in higher coverage of target entities. Table 2 shows that in these datasets, most of the facts that are needed to connect source entities to abstractive target entities are within one hop in the KB. Including longer reasoning paths in the knowledge subgraph yields a negligible entity coverage gain but adds significantly more facts to reason over. For example, the average number of facts in the knowledge subgraph in XSUM after one hop of traversal is 790. This number increases drastically to 5365 when including two-hop paths.
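
The trade-off between hop count and coverage can be illustrated with a small sketch. It expands a toy KB breadth-first and measures which target entities become reachable; the function names `k_hop_facts` and `coverage` are hypothetical, and the two triples simply restate the Barack Obama example above rather than any data from the experiments.

```python
def k_hop_facts(source_entities, triples, hops):
    """Collect facts reachable within `hops` steps from the source entities."""
    frontier = set(source_entities)
    seen_entities = set(source_entities)
    facts = []
    for _ in range(hops):
        new_frontier = set()
        for subj, rel, obj in triples:
            if subj in frontier and (subj, rel, obj) not in facts:
                facts.append((subj, rel, obj))
                if obj not in seen_entities:
                    new_frontier.add(obj)
        seen_entities |= new_frontier
        frontier = new_frontier
    return facts


def coverage(target_entities, source_entities, facts):
    """Fraction of target entities found in the source or among fact objects."""
    targets = set(target_entities)
    if not targets:
        return 0.0
    reachable = set(source_entities) | {obj for _, _, obj in facts}
    return len(targets & reachable) / len(targets)


triples = [
    ("Barack Obama", "born in", "Honolulu"),
    ("Honolulu", "capital of", "Hawaii"),
]
source = {"Barack Obama"}
target = {"Hawaii"}

one_hop = k_hop_facts(source, triples, hops=1)
two_hop = k_hop_facts(source, triples, hops=2)
print(len(one_hop), coverage(target, source, one_hop))
print(len(two_hop), coverage(target, source, two_hop))
```

In this toy case the second hop is what makes the target reachable; on the real datasets the text reports the opposite balance, where the extra hop mostly adds facts without adding much coverage.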

Conclusions
Contrary to the common belief that the source contains all of the information in the summary, we have shown that a large portion of the gold references in XSUM and CNNDM abs require external knowledge to generate. We provide one way to discover this external knowledge by following reasoning paths in an external KB that originate from the set of source entities. However, there is still a considerable fraction of target entities that are neither in the source nor in the knowledge subgraph. This could be attributed to a variety of factors. First, like other KBs, our seed KG (Wikidata) is incomplete. Many links are missing (Min et al., 2013) as entries are provided by users through manual edits. Furthermore, the data needed to summarize XSUM and CNNDM abs may be temporally out of sync with the KG. For example, the President of the United States (Q11696) in Wikidata is Joe Biden (2021-present); however, this was not the case when the summarization datasets were constructed. Additionally, different KBs and subgraph selection methods could increase the entity recall further while reducing the number of spurious links; however, we leave this exploration for future work.

Improving Factual Consistency with World Knowledge
In this section, we explore whether we can incorporate additional knowledge into a summarization model to reduce content hallucination and produce more faithful summaries. Two approaches for incorporating external knowledge into summarization are investigated: (i) directly concatenating the knowledge subgraph to the source for a standard sequence-to-sequence summarization model; (ii) refining the initial sequence-to-sequence output with a second-stage entity revision model that has access to relevant facts.
It is worth noting that there are many options for the second-stage entity-revision model, and determining the "optimal" revision model will require more research. Instead, we aim to show that improving the factuality of generated summaries with a two-stage revision technique is a viable option.

Implementation Details
We use T5-3B (Raffel et al., 2020) as the base summarization model. The models are fine-tuned for 200k steps with a batch size of 128 on the Cloud TPU v3 Pod with a 128-core Pod slice. When fine-tuning, we use a constant learning rate of 1e-4 and apply early stopping based on a held-out validation set. The Google Cloud NLU API is used for NER and entity linking. Our second-stage entity-revision model is the Fact Injected Language Model (FILM) (Verga et al., 2021), which predicts new entities based on the additional facts linked from the source (Section 3.3).

Generation with Facts Concatenation
Sequence-to-sequence models have become the most prevalent summarization approaches (Celikyilmaz et al., 2018; Gehrmann et al., 2018; Bae et al., 2019; Zhang et al., 2019; Liu and Lapata, 2019; Dong et al., 2019; Zhang et al., 2020; Qi et al., 2021; Liu et al., 2022). These models take a source document $x = (x_1, \ldots, x_n)$ as input and produce a generated summary $y = (y_1, \ldots, y_N)$, where $y = f(x)$. Building on this paradigm, concatenating additional facts with the source input is the most straightforward way to provide the model with new external knowledge. In the case of our knowledge subgraph, this yields $x = \mathrm{concat}([x, k_1, \ldots, k_n])$, where $(k_1, \ldots, k_n)$ is the linearization of the facts in the knowledge subgraph.
In more detail, we linearize facts into literal strings with "[SEP]" as the fact separator. For example, if the knowledge subgraph contains two facts, ("Simon Coveney", "country of citizen", "Ireland") and ("Taoiseach", "subclass of", "prime minister"), the linearized form of this knowledge subgraph is "[SEP] Simon Coveney country of citizen Ireland [SEP] Taoiseach subclass of prime minister [SEP]". As each linearized fact contains at least three tokens, the linearized facts for all entity categories drastically exceed the input length limitation of popular transformer-based summarization models. We opt to train the model to correct only location-type entities, as they appear most frequently in the source (Figure 3) and have the best coverage improvement by fact linking (Figure 2). Intuitively, the direct concatenation approach aims to teach the seq2seq model to select useful facts in the source for the summary generation. Similar to the approach described in Section 2.1, we construct training data as follows. For each input document, we extract all entities in the source that are linkable to a Wikidata ID. We then build the knowledge subgraph for this document by extracting all one-hop relations in the Wikidata knowledge graph that originate at any of the source entities. On average, this construction leads to 1837 facts per source document in the knowledge subgraph for all source entities in the categories {Location, Person, Organization, Event, Art, Consumer Good, Other}.
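
The linearization described above can be sketched in a few lines. The helper names `linearize_facts` and `concat_source_and_facts` are hypothetical, and joining the source and the facts with a single space is an assumption; the text only specifies the "[SEP]" separator and the example output.

```python
def linearize_facts(facts, sep="[SEP]"):
    """Flatten (subject, relation, object) triples into one separated string."""
    if not facts:
        return ""
    parts = [" ".join(fact) for fact in facts]
    return sep + " " + (" " + sep + " ").join(parts) + " " + sep


def concat_source_and_facts(source_text, facts):
    """Append the linearized knowledge subgraph to the source document."""
    linearized = linearize_facts(facts)
    return source_text if not linearized else source_text + " " + linearized


facts = [
    ("Simon Coveney", "country of citizen", "Ireland"),
    ("Taoiseach", "subclass of", "prime minister"),
]
linearized = linearize_facts(facts)
print(linearized)
```

Running this on the two facts from the example reproduces the linearized string quoted in the text.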
Table 3 shows the results of appending location facts to the source for XSUM. One intriguing finding is that adding any additional information to the end of the source input can boost the ROUGE and FactCC scores. We can observe that simply appending random words (obtained by sampling uniformly over the T5 vocabulary) to the input resulted in higher ROUGE scores. Furthermore, by attaching random or linked location facts (Section 2.1) to the input of the seq2seq model, higher ROUGE and FactCC scores can be achieved. This suggests that using linked facts directly in seq2seq models can help enhance the factual consistency of summaries.
Limitations: Despite restricting the knowledge subgraph to facts about location entities, this approach quickly becomes intractable. Most summarization models have a length limit of up to 1024 tokens (Lewis et al., 2020a; Raffel et al., 2020), or 4096 for longer transformers (Beltagy et al., 2020; Zaheer et al., 2020). However, the location-based knowledge subgraph yielded 790 facts on average, with a fact length of 7.8 tokens on XSUM. Even Longformers (Beltagy et al., 2020) can only accommodate roughly 500 facts, not counting the input text itself. In the experiments, we set the T5 input token length to 1024, and facts that exceed this length are automatically pruned. Heuristics or trained models may be used to further prune the facts, but given the lack of direct supervision, this is itself a challenging task.
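
The length argument above is easy to check with back-of-the-envelope arithmetic. The sketch below assumes that the reported fact length of 7.8 is an average measured in tokens and ignores separator tokens and the source text itself, so the results are upper bounds; `max_facts` is a hypothetical helper.

```python
AVG_FACTS = 790          # average location facts per XSUM document (from the text)
AVG_FACT_LENGTH = 7.8    # average fact length; assumed here to be in tokens


def max_facts(token_budget, fact_length=AVG_FACT_LENGTH):
    """Upper bound on whole facts that fit, ignoring separators and the source."""
    return int(token_budget // fact_length)


total_fact_tokens = AVG_FACTS * AVG_FACT_LENGTH
print(round(total_fact_tokens))  # tokens needed for all facts of one document
print(max_facts(1024))           # budget of a standard 1024-token model
print(max_facts(4096))           # budget of a 4096-token long-input model
```

Under these assumptions the facts alone need about 6162 tokens, while a 4096-token input fits at most 525 of them, in line with the "roughly 500 facts" estimate in the text.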

Two-Stage Revision
We also consider a generate-and-revise approach that is less constrained by the number of facts. First, a conventional seq-to-seq model is used to produce a candidate summary from the source text (Figure 1-A). Next, an entity linking model identifies typed entities in the generated summary. These entities are then masked out, producing a skeleton summary (Figure 1-B). Finally, a second-stage revision model is used to predict new entities to fill those masks, yielding a final summary (Figure 1-D). For steps 1 and 2, we use the same T5-3B and Google Cloud NLU models described in Section 3.1.
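
The masking step that produces the skeleton summary can be sketched as below. The entity spans are given by hand here, standing in for the output of an entity linker, and both the `[MASK]` token and the `make_skeleton` helper are assumptions made for illustration.

```python
def make_skeleton(summary, entity_spans, mask_token="[MASK]"):
    """Replace each (start, end) character span with a mask token.

    Spans are assumed to be non-overlapping; they are processed left to right.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(entity_spans):
        pieces.append(summary[cursor:start])
        pieces.append(mask_token)
        cursor = end
    pieces.append(summary[cursor:])
    return "".join(pieces)


summary = "A large fire has broken out at a recycling centre in Oxfordshire."
start = summary.index("Oxfordshire")
skeleton = make_skeleton(summary, [(start, start + len("Oxfordshire"))])
print(skeleton)
```

The revision model then only has to fill the masked positions, which keeps the rest of the first-stage summary untouched.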
For the revision model in step 3, we consider two options: a second, separately trained standard T5 masking prediction model (T5m), and the Fact Injected Language Model (FILM) (Verga et al., 2021). While language models like T5 have been shown to implicitly store knowledge akin to a KB (Petroni et al., 2019; Roberts et al., 2020), FILM is a neural language model that includes an explicit "fact memory" populated from a KB. Importantly, the model does not concatenate a seed set of facts to the input, but instead stores them in a separate memory. The model learns to retrieve a small subset of facts from this memory and then incorporates those retrievals into its final prediction. This addresses the scaling issues of the previous section, as the model can store millions of facts, learn to retrieve a set of relevant candidates, and incorporate that factual information into its predictions.
The input to the revision model is $x = \mathrm{concat}([\text{source}, \text{skeleton}])$. T5m is used in the standard sequence-to-sequence setup to predict the masked tokens to produce the final summary. In the case of FILM, the model first produces hidden states $z = f(x)$. For the hidden state $z_i$ corresponding to a mask appearing at the $i$th token, the model performs attention over the fact memory as $a = g(z, M^{\text{key}})$. Each entry in $M^{\text{key}}$ corresponds to a single fact and is formed as a function of the subject and relation of that fact.
The model then retrieves the corresponding values for the top $K$ scoring fact keys, where each value is a function of the object set corresponding to the fact. For example, a single fact would have the key $M^{\text{key}}_j = h([\text{Barack\_Obama}, \text{born\_in}])$, with the corresponding value being $M^{\text{val}}_j = \hat{h}(\text{Hawaii})$. Finally, the model predicts an output entity as $\hat{y}_i = f(z_i, a, M^{\text{val}})$. Refer to Verga et al. (2021) for additional details on the FILM model.
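
To make the retrieval step more concrete, the following is a loose sketch of top-K attention over a key/value fact memory. It is not the FILM implementation: the text above leaves the functions $g$, $h$, and $\hat{h}$ unspecified, so dot-product scoring, a softmax over the top-K scores, and a weighted sum of value vectors are all assumptions, as are the toy vectors and the name `retrieve_top_k`.

```python
import math


def retrieve_top_k(z_i, memory_keys, memory_values, k):
    """Score every fact key against z_i, keep the top k, and mix their values."""
    # Dot-product score between the mask's hidden state and each fact key.
    scores = [sum(a * b for a, b in zip(z_i, key)) for key in memory_keys]
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    # Softmax restricted to the retrieved facts (shifted for numerical stability).
    max_score = max(scores[j] for j in top)
    exps = [math.exp(scores[j] - max_score) for j in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted combination of the retrieved value vectors.
    dim = len(memory_values[0])
    mixed = [
        sum(w * memory_values[j][d] for w, j in zip(weights, top))
        for d in range(dim)
    ]
    return top, weights, mixed


# Toy memory with three facts: 2-d keys and 3-d values (illustrative numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
top, weights, mixed = retrieve_top_k([1.0, 0.0], keys, values, k=2)
print(top, weights, mixed)
```

Because only the top-K entries are mixed, the cost of the final prediction does not grow with the size of the memory once retrieval is done, which is the property the text relies on when contrasting FILM with direct concatenation.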

Main Results
This section discusses the results of using the revision model to correct entity errors. We divide the results into two parts: 1) oracle correction on gold-reference summaries, and 2) revision of system-generated summaries.
We propose using entity correctness as the criterion for assessing the factuality of generated entities. Entity correctness matches predicted entities to the target entities. Surface forms are linked to their Wikidata IDs so that entity matching accounts for entity paraphrases (Section 2.1). All target entities in the summaries are further divided into two categories: abstractive entities that do not appear in the source and extractive entities that do. The abstractive subset frequently requires external knowledge to infer the target entity.
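
One plausible reading of this metric is sketched below: mentions are mapped to canonical IDs, and a prediction counts as correct when its ID matches a target ID. The exact scoring used in the paper is not spelled out here, so the recall-style formula is an assumption; the `linker` dictionary stands in for a real entity linker, and every ID except Q30 is a made-up placeholder.

```python
def link(mentions, linker):
    """Map surface forms to canonical IDs, dropping unlinkable mentions."""
    return {linker[m] for m in mentions if m in linker}


def entity_correctness(predicted, target, linker):
    """Fraction of target entity IDs that also appear among predicted IDs."""
    target_ids = link(target, linker)
    if not target_ids:
        return None  # nothing to evaluate for this example
    return len(link(predicted, linker) & target_ids) / len(target_ids)


def split_targets(target, source, linker):
    """Split target IDs into extractive (in source) and abstractive (not)."""
    source_ids = link(source, linker)
    target_ids = link(target, linker)
    return {
        "extractive": target_ids & source_ids,
        "abstractive": target_ids - source_ids,
    }


# Toy linker: Q30 is the ID cited in the text; the others are placeholders.
linker = {
    "United States": "Q30",
    "USA": "Q30",
    "Wimblington": "ID_WIMBLINGTON",
    "Cambridgeshire": "ID_CAMBRIDGESHIRE",
    "Oxfordshire": "ID_OXFORDSHIRE",
}

print(entity_correctness(["USA"], ["United States"], linker))
print(entity_correctness(["Oxfordshire"], ["Cambridgeshire"], linker))
print(split_targets(["Cambridgeshire"], ["Wimblington"], linker))
```

Matching on IDs rather than strings is what lets "USA" count as correct against "United States", while a wrong county still scores zero.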

Oracle Correction
To isolate the effects of the revision model from the summarization model itself, we first evaluate the revision model on entity-masked gold target summaries. Table 4 shows the oracle results of using FILM and T5m for entity revision. We observe that FILM, which incorporates the additional knowledge subgraph, outperforms T5m considerably on the abstractive subset, where external knowledge is required to infer the target entity. Relative boosts of 6.1% (73.1 → 77.5) and 8.8% (31.0 → 33.8) are observed on XSUM and CNNDM abs respectively. This indicates that when target entities are absent from the source, models utilizing external knowledge bases can better enhance the factuality of generated summaries. We also observe that both revision models struggle on the abstractive subset of CNNDM abs. This is probably because both revision models learned extractive strategies that prefer to predict source entities on this dataset. This is consistent with observations from Pagnoni et al. (2021). Additionally, as FILM frequently tries to generate abstractive entities with external knowledge, it underperforms T5m on extractive subsets.
However, this does not imply that the entities replaced by FILM are incorrect. It is worth noting that entity correctness, as an automatic evaluation metric, has its own limitations. It only matches the generated entities to target entities, with paraphrasing taken into account. It is possible that, in the extractive subset, FILM appropriately replaces entities based on the facts in the KB, but this is not counted as a correct replacement. For example, suppose "Manhattan" appears both in the source and the target, and FILM decides to swap "Manhattan" for "New York City". Despite this generation being factual, it does not match the entity IDs in the target.

Revision Model
Entity Factuality: Table 5 shows the results of using FILM and T5m for entity revision based on the masked T5-generated candidate summary. We observe a pattern that is consistent with the oracle results. Compared to T5m, using FILM in conjunction with external open-domain knowledge results in higher correctness on abstractive entities. We observe relative boosts of 6.8% (68.7 → 73.4) and 4.5% (29.0 → 30.3) on XSUM and CNNDM abs respectively.
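
The relative boosts quoted above follow the usual definition, (after − before) / before. A short check with a hypothetical helper `relative_boost`, using only the numbers reported in this paragraph:

```python
def relative_boost(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


xsum_boost = relative_boost(68.7, 73.4)
cnndm_boost = relative_boost(29.0, 30.3)
print(round(xsum_boost, 1), round(cnndm_boost, 1))
```

Rounded to one decimal place, these give the 6.8% and 4.5% figures for XSUM and CNNDM abs.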
Additionally, FILM achieves greater overall correctness and greater correctness on extractive entities on XSUM. On the other hand, T5m performs better on extractive entities and overall on CNNDM abs, because the majority of entities there are extractive (72.8%, as opposed to 33.5% on XSUM). Given the limitations of the entity correctness measure stated in Section 4.1, this does not imply that the entities replaced by FILM are inaccurate. To determine whether the substitution by FILM improves the factuality of generated summaries, we conduct the human evaluation described below.
Human Evaluation: We present entity pairs before and after the revision by FILM (in randomized order), together with the masked candidate summary sentence, to three annotators. The annotators are asked to rate the two entities based on their factuality with respect to the target sentence by choosing from the following four options: 1) entity A is better, 2) entity B is better, 3) equally factual, 4) equally non-factual. In total, 288 annotations were obtained. Compared to the original entities, we observe a relative boost of 19.7% in preferences for revised entities (22.3% → 26.7%). Using a two-sample t-test (Cressie and Whitford, 1986), we conclude that the revised entities were significantly preferred over the baseline (p < 0.05). With κ = 0.74, we can also conclude that the inter-annotator agreement for the four-category annotations is adequate (more details in Appendix).

Analysis and Discussion
Faithfulness vs. Factuality: Although factuality and faithfulness are frequently used interchangeably in the literature, in our view they measure the generation from completely different perspectives. Additionally, they can occasionally have a negative correlation if the targets have a high proportion of abstractive entities that require external knowledge to resolve, as on XSUM. This is because faithfulness metrics encourage extractive entities. In contrast, abstractive entities that match the targets are frequently encouraged by the factuality metric.
In Section 4, we demonstrated that models using external knowledge may increase the generation's factuality. This section assesses the faithfulness of the generations. We can see from Table 6 that FILM has lower faithfulness scores despite having higher factuality scores. By incorporating external knowledge, FILM seems to make the generated summary more abstractive. This may be penalized by faithfulness scores.

Quality of KB vs. Factuality by Entity Category
While our primary experiments focused on location entities, we also analyzed the performance of other entity categories using FILM as our revision model. We show that the performance of FILM is highly dependent on the quality and coverage of the knowledge subgraph. In the last column of Table 7, we see a large range in the added coverage provided by the knowledge subgraph across the different entity types. Additionally, column 2 shows a similarly wide range in the oracle prediction accuracy over those entity types.
Looking at full pipeline results in Table 8, we see that the revision accuracy of location entities is substantially higher than that of person or organization entities. This is possibly due to variance in the coverage of the KB across types or a greater difficulty in identifying relevant facts. For example, even if a fact links a source entity to a target entity, there is no guarantee that it is relevant for making a prediction about a particular piece of text.

Related Work
Faithfulness in Summarization: Faithful consistency of summarization has drawn much research interest since the proposal of FactCC (Kryscinski et al., 2020), an evaluation model that classifies the generated summary as consistent or inconsistent with the source. Later, several question answering-based summarization evaluation methods were proposed (Wang et al., 2020; Durmus et al., 2020; Nan et al., 2021; Zeng et al., 2021), in addition to diagnostic datasets (Gabriel et al., 2021b). These models measure faithfulness by evaluating answers that are produced by a QA model with inputs of (question, source) and (question, generated summary).
Numerous strategies have also been proposed to increase faithfulness by imposing constraints with respect to the source, such as quantity entity matching (Zhao et al., 2020), intermediate planning with entity chains (Narayan et al., 2021), extensible guidance (Dou et al., 2021), the document's knowledge graph (Zhu et al., 2021), and simple filtering (Nan et al., 2021). In addition, Filippova (2020) controls hallucination with unconditional and conditional LMs. Dong et al. (2020) and Cao et al. (2020) propose post-hoc error correction with QA-based models or denoising BART. Cao et al. (2018) and Zhu et al. (2021) utilize dependency parsing tools to identify and match the relations in an input document to its summary.
Factuality in Summarization: Cao et al. (2022) propose a novel detection approach that separates factual from non-factual hallucinations. Gunel et al. (2020) propose to prime summarization models with embeddings that are learned through TransE on knowledge graphs. Additionally, many recent models have been proposed for retrieval-augmented language models using passages (Guu et al., 2020; Lewis et al., 2020b), mentions (Sun et al., 2021), and facts (Verga et al., 2021). In this paper, we experiment with incorporating facts that are directly linked to the entities in the source. Several models have been proposed to combine symbolically interpretable factual information and subsymbolic neural knowledge (Cohen et al., 2020; Ren et al., 2020; Narayan et al., 2021). Different from the previous works, we investigate how facts in external open-domain knowledge can help with entity factuality of the generated summary.

Conclusion
In this paper, we show that a large portion of so-called extrinsic hallucinations in text summarization can be explained by external knowledge and verified by KB facts connecting source entities to target entities. We have explored multiple ways to incorporate this knowledge into a faithful and factual generation of summaries. Our research proposes a pipeline that, with a solid knowledge base as a foundation, can guarantee better factuality. Furthermore, we discuss some valuable insights about current limitations and promising directions for knowledge-grounded text generation.

Limitations
Alternative Approaches: We presented two promising methods for utilizing external knowledge to improve the factuality of generated entities in summarization. It is important to note that the purpose of this study was not to claim to have found the optimal method, but instead to demonstrate that external knowledge can be used to enhance the factuality of generated summaries. For example, countless other models could have been chosen as the revision model, and a future research direction would be to examine more approaches exhaustively and analyze the trade-offs between them.

A Appendix
A.1 Datasets

A.2 Finetuning FILM for summarization tasks
We recast the entity correction task in summarization as an open-domain question answering task, which FILM is designed for. The setup is as follows. We treat the source document as the context and the masked skeleton sentence, obtained from masking entities in either the gold reference summary or the system-generated summary, as the question. FILM learns to extract useful information from the open domain (knowledge base) to provide evidence for the entity prediction. We focus on a subset of entities that are answerable using entities from the knowledge base. For example, the answer "United States" is an entity in Wikidata whose identifier is Q30. As described in Verga et al. (2021), at fine-tuning time, we freeze entity embeddings E and relation embeddings R. All transformer layers with four transformation matrices are fine-tuned with the loss $\mathrm{loss}_{\text{finetune}} = \mathrm{loss}_{\text{fact}} + \mathrm{loss}_{\text{ans}}$.
The number of base parameters, including the encoder and decoder transformer parameters and the fine-tuning optimizer, is derived from the original papers. We set the max length of FILM to 512, as it only needs the skeleton summary and the original source document as the input for entity correction. The FILM models for XSUM and CNNDM abs are trained on Google Cloud TPU v3 Pod with 128-core Pod slices.

A.3 Example of usefulness of abstractive entity
Source
Mr. Cowen had to deny being drunk or hungover during the RTE interview. The taoiseach was interviewed live from his party's conference, which is taking place in Galway. . . . I would hate to think the reputation of the country or the office of taoiseach would in any way be affected by what I had to say." Mr. Cowen again denied any suggestions he was hungover. ... Simon Coveney, also of Fine Gael, who said in a Twitter message on Tuesday that Mr. Cowen sounded "half-way between drunk and hungover" in the interview, has said he accepted the taoiseach's apology. . . .

Target
Irish Prime Minister Brian Cowen has admitted that a controversial radio interview he gave on Tuesday was not his "best performance".

Gen.
Taoiseach Irish Prime Minister Brian Cowen has apologised for the "hoarseness" of his voice in an interview on Tuesday.

Table 10: The target summary contains out-of-article entities (Irish Prime Minister and Brian Cowen) that are important to include in the summary. We can see that a summarization model is able to generate this example successfully with additional world knowledge that "Taoiseach" is equivalent to "Irish Prime Minister" and "Taoiseach Mr. Cowen" refers to "Brian Cowen".
Figure 4: Example of facts in Wikidata KB that connect the source entities to abstractive target entities (entities that do not appear in the source).

A.4 Annotation Guidelines
We conduct human evaluation based on randomized pair comparison for entities before and after revision. We present the three annotators with 1) the source, 2) the target, 3) the masked system-generated summary, and 4) the entities before and after revision in random order, along with the following guidelines:
1. Read the source and the target completely.
2. Based on the masked system-generated summary, compare entity 1 and entity 2 in terms of "factuality" with respect to the source and the target.
3. Try to check on search engines if an entity cannot be directly inferred from the source or the target.
4. Select the entity that you think is more factual with respect to the source document and the target summary, or choose that they are equally factual/non-factual.

A.5 Inter-Annotator Agreement
Our common set contains 96 samples for inter-annotator agreement evaluation. Three annotators are asked to annotate the common set. We report Fleiss's Kappa (κ) to evaluate the validity of the agreement between annotators. With κ = 0.7391 (0.70 ≤ κ ≤ 0.80), we achieve a decent agreement on the four-category annotation.
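
For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Fleiss's Kappa computation. Each row of the input holds, for one sample, how many annotators chose each category. The toy counts below are invented to exercise the function and are not the annotations collected in this study; `fleiss_kappa` is a hypothetical helper name.

```python
def fleiss_kappa(counts):
    """Fleiss's Kappa for a list of per-item category counts.

    Every row must sum to the same number of raters (at least two). The value
    is undefined when all ratings fall into a single category (division by zero).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement: per-item share of agreeing rater pairs, averaged.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 samples, 3 annotators, 2 categories.
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 4))
```

With three annotators and four answer options, the study's setting would correspond to rows of length four that each sum to three.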

Figure 2: Increase of entity coverage by including external knowledge subgraphs. The knowledge subgraphs are constructed by including Wikidata facts that are one hop away from the set of entities in the source document.

Table 2: Target location entity coverage before and after including facts from different numbers of hops beginning from source entities of the KB. Green highlights the percentage of coverage improvement by the KB.

Table 4: Accuracy of entity ID matching to the targets on XSUM and CNNDM abs. The entities are predicted using gold-reference summaries with MASK. FILM outperforms T5m on abstractive subsets where target entities are not in the source. The abstractive subset contains document-summary pairs where the gold reference summary contains at least one entity that is not in the source.

Table 7: Entity matching accuracy (1st column) of FILM and entity coverage by the knowledge subgraph (Section 2) by categories on XSUM. We can see that the quality of the KBs varies across different entity categories, and this is reflected in the performance of FILM.

Table 8: Entity matching accuracy using FILM to revise T5 outputs on XSUM by entity type.

Table 9: Dataset statistics in terms of number of examples in train, dev, and test splits for three summarization datasets used in our experiments.