Don’t Say What You Don’t Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search

Abstractive summarization systems today produce fluent and relevant output, but often “hallucinate” statements not supported by the source text. We analyze the connection between hallucinations and training data, and find evidence that models hallucinate because they train on target summaries that are unsupported by the source. Based on our findings, we present PINOCCHIO, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations. Given the model states and outputs at a given step, PINOCCHIO detects likely model hallucinations based on various measures of attribution to the source text. PINOCCHIO backtracks to find more consistent output, and can opt to produce no summary at all when no consistent generation can be found. In experiments, we find that PINOCCHIO improves the consistency of generation by an average of 68% on two abstractive summarization datasets, without hurting recall.


Introduction
Abstractive text generation is an important task with the promise of compressing lengthy source material into concise summaries, satisfying application or user needs. Pretrained abstractive summarizers (e.g. BART (Lewis et al., 2020)) have recently achieved new state-of-the-art (SOTA) results across multiple datasets (Fabbri et al., 2020). However, these systems remain unusable in most real-world scenarios, because they frequently hallucinate information that is inconsistent with the input (Maynez et al., 2020).
Many researchers have proposed methods to assess and improve the consistency of summarization systems. Two popular approaches are 1) incorporating extracted knowledge (Zhu et al., 2021), possibly in the form of questions (Durmus et al., 2020), and 2) incorporating a consistency text classifier (Kryscinski et al., 2020), often based on natural language inference (NLI) (Falke et al., 2019). These methods tend to reduce the problem of generating consistent text to another difficult problem (e.g. information extraction (IE) or NLI). Given a strong IE system or a structured representation of the source information, it is possible to dramatically improve the consistency of generated text (Zhang et al., 2020b; Tian et al., 2019), but such resources are only available in a narrow subset of domains.

Table 1 (excerpt):
Source: ...The PSNI said the tablets were "as yet unidentified" but warned of the "potential dangers" they posed...
BART: A 17-year-old boy has been charged after a teenager was taken ill after taking what police have described as "potentially lethal" ecstasy tablets.
PINOCCHIO: A 17-year-old teenager has been charged with drugs offences after a teenager was treated in hospital after taking what police described as an "unidentified" drug.
We propose a different approach for generating more consistent summaries. It is based on the observation that today's abstractive summarizers are often trained on target summaries that contain statements unsupported by the source text (Matsumaru et al., 2020). This disconnect arises because the training datasets are acquired from noisy "silver" sources in order to scale, e.g. treating a news headline as a summary of its article or an encyclopedia entry as a summary of a portion of its references. We conjecture that a model optimized for likelihood and trained on target summaries containing unsupported statements will have a strong tendency to hallucinate information rather than say something less "likely" but supported (§3). Further, common automatic evaluation metrics like ROUGE reward lexical similarity significantly more than consistency, preferring hallucinated lexically similar summaries to completely consistent lexically different ones.
Our method, called PINOCCHIO, is a novel decoding algorithm that constrains beam search to only consider predicted tokens that are likely to be supported by the source text. PINOCCHIO estimates which tokens are likely supported using simple but effective heuristics based on the model's confidence and attention distribution, and on word frequency. When PINOCCHIO reaches a state where no supported token can be generated, it backtracks the search. It can also opt out of generating a summary at all, rather than produce one expected to be hallucinated. We show that PINOCCHIO significantly improves consistency on two abstractive summarization datasets with only a small decrease in fluency, measured using careful human evaluations.
To test PINOCCHIO on diverse domains, we also develop a new abstractive summarization dataset called Scientific Concept Description (SCD). Inspired by the WikiSum (Liu* et al., 2018) dataset, SCD uses Wikipedia descriptions as the target summaries and the referenced papers as the source documents, detailed in §5. SCD is motivated by the goal of automatically generating a high-quality encyclopedia for the long tail of scientific concepts described in papers, and presents a challenging workload for abstractive summarization. It comes with a total of 60k samples of scientific concepts and 118k corresponding paper identifiers, with full text for 8k of the papers.
We make the following contributions:
1. We analyze the relationship between hallucination and training on targets that are not fully supported by the source.
2. We introduce PINOCCHIO, a decoding algorithm that improves generation consistency by constraining beam search to focus on input-supported tokens. It improves consistency by an average of 68% on two abstractive summarization datasets at the expense of a minor decrease in fluency.
3. We introduce Scientific Concept Description, a challenging new abstractive summarization task, and release a dataset.
The SCD dataset, along with our code, trained models, and human evaluations, is available at https://github.com/allenai/pinocchio.

Related work
Pretrained language models have recently taken the top spots on summarization leaderboards (Fabbri et al., 2020; Huang et al., 2020). This includes models like BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020a), and UniLM (Dong et al., 2019). In a recent large-scale evaluation of summarization models, Fabbri et al. (2020) found BART and PEGASUS to be the top-performing models. We choose to focus on BART in this work.
It is widely known that SOTA summarization models tend to hallucinate facts (Maynez et al., 2020), and the works most closely related to ours are those on factual summarization. However, we avoid the term "factuality" and instead use "consistency" to denote that the generated summary is supported by the input text. As noted in Maynez et al. (2020), a summary could be hallucinated but still be factually correct. In this work, we aim to improve consistency and reduce hallucinations, which indirectly improves factuality, without directly optimizing for it.
Prior works attempt to improve consistency by correcting already-generated summaries (Dong et al., 2020; Zhu et al., 2021), using a knowledge graph (Zhu et al., 2021), filtering training data (Nan et al., 2021), constraining generation with keywords (Mao et al., 2020), and using NLI models (Barrantes et al., 2020; Mishra et al., 2020), among others. Some have focused on the data-to-text setting, which presupposes structured input (Tian et al., 2019; Wang et al., 2020b). Some works control the extractiveness of generations (Song et al., 2020). There have also been multiple works on automatically measuring consistency (Durmus et al., 2020; Kryscinski et al., 2020; Wang et al., 2020a). Matsubara and Singh (2020) noted that hallucinations come from a source-target discrepancy, where many training targets are not fully supported by their source text, and suggested addressing it by removing samples with unsupported summaries. We extend their empirical findings with similar measurements on three additional datasets, conjecture that hallucination is unavoidable in such settings, and provide evidence for the conjecture in terms of the lexical statistics of output summaries.

We use beam search for decoding, which has become standard practice for neural seq2seq models (Graves, 2012; Sutskever et al., 2014). Our approach can be viewed as a version of constrained decoding (Hokamp and Liu, 2017), but with dynamically identified constraints and the ability to backtrack. Our constraints come from various model-internal signals that indicate attribution to the source text. One such signal is entropy, where Xu et al. (2020) found that low next-token entropy indicates the model is copying. Unlike previous work, we do not attempt to imbue models with a new level of textual understanding, but rather show that we can improve the consistency of generated text using simple signals based on model internals.

Why Do Models Generate Inconsistent Summaries?
In this section, we analyze why models generate inconsistent summaries. Here, we use the definition of consistency from Fabbri et al. (2020), i.e., the factual alignment between the summary and the summarized source. We hypothesize that two factors contribute to inconsistency: 1) the maximum-likelihood training and generation strategy used in summarization models, and 2) imperfect training datasets that contain many instances where the target is difficult or impossible to deduce from the source. Specifically, we conjecture that in the presence of these two factors, models are guaranteed to hallucinate because they either 1) default to a background distribution of the most common relevant terms during generation or 2) learn spurious correlations between the source and target texts. In either case, the model generates text that is often inconsistent with the inputs.
We present our analysis in terms of a motivating example below, and provide empirical support for it in Sec. 7. The analysis inspires the design of the PINOCCHIO method in Sec. 4.

Motivating example
Consider the target summary of an article about a team signing a football player, from XSUM. Many of the details in this summary are difficult for a model to predict because they are not supported directly by the input passage. For example, the player's first name ("Brian") and position ("striker"), and the lack of a signing fee ("free transfer"), are nowhere mentioned. This mismatch between typical summary fields and the text available in the input passage is not restricted to summaries about player signings, but is observed more generally across a variety of article types in XSUM and also in our new SCD dataset.
Achieving a high likelihood on the training dataset requires that trained models output the aforementioned fields anyway: e.g., in summaries of player signings, from a sample of 43 summaries, 100% mention the player's full name, 88% the player's position, 78% the length of the signing, etc., even though these are often not supported in the source. As a result, the BART summarizer's output for the example begins nearly identically to the target, but then outputs three field values incorrectly (first name, position, and length of the contract).
The errors make sense when one considers the model's calculus for choosing a summary. Consider a single field that can be present or absent in a summary, and make the simplifying assumption that the probability of the most-likely summary with a field value is strictly monotonic in the probability of the field value (see App. B for formal details). In that case, a model that maximizes likelihood will output the field if and only if its best guess of the field value is more probable than the field's absence. In practice, the probability of field absence is often low, because training summaries on certain topics reliably cover certain fields, and the best-guess probabilities are often higher, because the model can do some inference to narrow the choice set to a limited and typically peaked distribution (e.g., to a small number of football player positions). Thus, hallucinating a best guess is often preferred by the model, even, in some cases, when the model estimates that the guess is less likely than chance to be correct. In the example, since the estimated probability that the player is a "midfielder" is relatively high ("midfielder" is relatively common, shown in Fig. 1), and the position going unmentioned is rare (about 12% of the time), the model chooses to incorrectly output "midfielder."

Of course, the assumptions in our analysis may not always hold, and hallucination is likely more complex than the single phenomenon analyzed here. But our approach, motivated by the above conjecture, can improve the consistency of summaries in practice. Further, in Section 7 we validate two aspects of our analysis empirically, showing that ground-truth training summaries for abstractive summarization do contain unsupported statements, and that summarizers do disproportionately produce more common terms in their output.
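As a concrete illustration of this calculus, the following sketch uses hypothetical probabilities (not measured from XSUM) to show how a likelihood-maximizing model ends up emitting a best guess:

```python
# Illustrative sketch of the field-output calculus described above.
# A likelihood-maximizing model mentions a field iff its best-guess
# value is more probable than omitting the field entirely.
# All numbers below are hypothetical, for illustration only.

def model_field_choice(field_value_probs, p_absent):
    """Return the field value the model outputs, or None to omit the field."""
    best_value, best_p = max(field_value_probs.items(), key=lambda kv: kv[1])
    return best_value if best_p > p_absent else None

# Hypothetical distribution over a player's position, given a source
# article that never actually states the position:
position_probs = {"midfielder": 0.40, "striker": 0.25, "defender": 0.20}
p_position_unmentioned = 0.12  # positions go unmentioned rarely

choice = model_field_choice(position_probs, p_position_unmentioned)
print(choice)  # prints "midfielder": a hallucinated best guess
```

Note that "midfielder" is emitted even though the model's own estimate (0.40) says the guess is more likely wrong than right; it only needs to beat the probability of omission.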

PINOCCHIO: Constraining Beam Search to Improve Consistency
Inspired by the previous analysis, we introduce PINOCCHIO, a modification to standard beam search for supported decoding (Alg. 1). Beam search for text generation typically works by adding to a small set of candidate generations one token at a time, keeping the top B generations according to model-predicted likelihood after each prediction timestep. After <end> has been predicted in B beams, those B candidates are rescored with a length penalty (Wu et al., 2016), and the best one is chosen as the final output. PINOCCHIO differs from regular beam search only in its use of the set R, which holds a set of disallowed generation paths; if R is always empty, Alg. 1 simplifies to standard beam search. PINOCCHIO modifies the model-predicted token scores to avoid inconsistent predictions.
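The length-penalty rescoring of completed beams follows Wu et al. (2016); a minimal sketch, with illustrative candidates and an illustrative α (not a value tuned in this work):

```python
# Length-penalty rescoring from Wu et al. (2016): completed candidates are
# ranked by total log-probability divided by a length penalty, so longer
# sequences are not unfairly penalized for accumulating more log terms.

def length_penalty(length, alpha=0.6):
    return ((5 + length) / 6) ** alpha

def rescore(candidates, alpha=0.6):
    """candidates: list of (token_sequence, total_log_prob) pairs."""
    return max(candidates,
               key=lambda c: c[1] / length_penalty(len(c[0]), alpha))

# A longer candidate with a lower per-token cost can win after rescoring,
# even though its raw log-probability is worse:
cands = [(("a", "b"), -2.0), (("a", "b", "c", "d"), -2.2)]
print(rescore(cands)[0])  # prints ('a', 'b', 'c', 'd')
```

Without the penalty, the shorter candidate (-2.0 > -2.2) would win outright.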
In particular, PINOCCHIO applies a function f_c(model state, candidate next generation) to the predicted likelihood of the top predicted tokens. If all top predicted tokens for a given timestep are inconsistent according to f_c, PINOCCHIO backtracks by removing the last predicted token from each beam, and predicts again without the ability to predict the removed tokens. The number of times this backtracking occurs, η, combined with the average entropy of the token predictions in the final output, is a good indicator of whether the model succeeded in producing a good summary. Thus, we eliminate generations with multiple backtracks (e.g., η > 2) and high entropy, as well as individual sentences with high entropy (> 2.75) from multi-sentence outputs.
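The search procedure can be sketched as follows. This is a simplified illustration of beam search with a disallowed set R and backtracking, not the exact Alg. 1; `model_topk` and the consistency check `f_c` are stand-ins for the real model and heuristics:

```python
# Simplified sketch of PINOCCHIO-style constrained beam search.
# `model_topk(seq)` returns candidate (token, log_prob) pairs and `f_c`
# is the consistency check; both are hypothetical stand-ins here.

def constrained_beam_search(model_topk, f_c, beam_size=2, max_len=10):
    beams = [((), ())]            # (token sequence, per-token log-probs)
    disallowed = set()            # R: generation paths we may not extend
    for _ in range(max_len):
        candidates = []
        for seq, lps in beams:
            if seq and seq[-1] == "<end>":
                candidates.append((seq, lps))   # already finished
                continue
            for tok, logp in model_topk(seq):
                new_seq = seq + (tok,)
                if new_seq in disallowed or not f_c(seq, tok):
                    disallowed.add(new_seq)     # block this path
                    continue
                candidates.append((new_seq, lps + (logp,)))
        if not candidates:
            # Every extension was inconsistent: mark these paths as dead
            # and backtrack one token on each beam.
            for seq, _ in beams:
                disallowed.add(seq)
            beams = [(seq[:-1], lps[:-1]) for seq, lps in beams if seq]
            if not beams:
                return None       # opt out: no consistent summary found
            continue
        beams = sorted(candidates, key=lambda c: sum(c[1]),
                       reverse=True)[:beam_size]
        if all(seq[-1] == "<end>" for seq, _ in beams):
            break
    return max(beams, key=lambda c: sum(c[1]))[0]
```

With `f_c` always true this reduces to plain beam search; with `f_c` always false it opts out and returns no summary, mirroring the behavior described above.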
Within this framework, we present an instantiation of f_c based on a set of carefully curated heuristics, which determines whether a token is allowed to be predicted or not.
The function f_c consists of a series of binary checks, which take into account both model internals and language features. If any of the checks succeeds, f_c is 1 and the model continues generating; if all of the checks fail, f_c = 0 and the model disallows the generation path. First, we consider the model's confidence in the current prediction, based on the intuition that low entropy of the token prediction probability distribution corresponds to more certain, and potentially more correct, predictions. Second, we keep track of the source text with high attention scores during the generation process: when the attended text is semantically or lexically similar to the token to be generated, that suggests the token may be supported by the source. Third, PINOCCHIO also allows tokens that are especially common (such as stopwords), as we expect these are less likely to be hallucinations. We develop a total of 8 different binary functions within the three categories above (details in §D.1). The heuristics require no additional training steps, and all the associated thresholds and hyperparameters were determined by manual inspection of a small number of samples (e.g., n=20) from each dataset. Unlike the prior work of Matsubara and Singh (2020), this non-machine-learning approach is based on scrutiny of the model's generation process. It is easy to execute and more explainable compared to black-box models.
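The three categories of checks can be sketched as below. The thresholds, stopword list, and helper names are illustrative stand-ins, not the paper's exact 8 functions:

```python
# Hypothetical sketch of the three categories of binary checks in f_c:
# model confidence, source-text attribution, and common-word allowance.
import math

STOPWORDS = {"the", "a", "an", "of", "to", "is", "was", "in"}  # illustrative

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def f_c(token, next_token_probs, attended_source_tokens,
        entropy_threshold=2.0):
    checks = [
        # 1) Confidence: a peaked (low-entropy) next-token distribution
        #    suggests the model is certain, often because it is copying.
        entropy(next_token_probs) < entropy_threshold,
        # 2) Attribution: the token matches a highly attended source
        #    token (a purely lexical version of the similarity check).
        token.lower() in {t.lower() for t in attended_source_tokens},
        # 3) Frequency: very common words are unlikely hallucinations.
        token.lower() in STOPWORDS,
    ]
    return 1 if any(checks) else 0  # all checks fail -> disallow the path
```

A rare content word predicted from a flat distribution, with no lexical support in the attended source, fails all three checks and would be disallowed.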

Tasks and Datasets
We evaluate PINOCCHIO on two distinct summarization tasks: news summarization (XSUM and CNN / Daily Mail) and scientific concept description (the newly proposed SCD dataset).

News Summarization
XSUM (Narayan et al., 2018) is a popular abstractive news summarization dataset. XSUM is challenging: the source text frequently does not entail the target text, the target task is not exactly summarization (XSUM is closer to headline generation than summarization), and the data is noisy (e.g. there are articles in another language, Welsh). Challenges aside, XSUM is highly regular, as mentioned in Sec. 3. Although this seems to make the task easier, a strong pattern matcher will reproduce dataset patterns (see Appendix G and Tab. 7 for example patterns), whether or not it is able to fill in all the details in the pattern correctly.
The CNN / Daily Mail dataset (Nallapati et al., 2016) is another commonly used dataset for news summarization. Different from XSUM, the summaries are relatively longer (more than two sentences, vs. XSUM's single sentence) and are considered to be nearly extractive (see Sharma et al. (2019) and our results in Tab. 6), as the summaries are based on summary bullets from the original news article.

Scientific Concept Description
We introduce the novel task of scientific concept description (SCD): automatically generating a brief description of a scientific concept, given the concept name and some papers discussing the concept. Test data has been manually evaluated to ensure quality.

SCD training corpus
Training an SCD system requires a large set of ground-truth descriptions. Inspired by the WikiSum dataset (Liu* et al., 2018), we construct our training set using Wikipedia intro sections as the target descriptions, with the papers cited in each description as source text.

Manually-evaluated SCD test corpus

The motivating use case for the SCD task is automatically generating a high-quality encyclopedia for the long tail of scientific knowledge presented in papers. As a result, we construct a second test set of SCD evaluation examples not from Wikipedia, but instead from a much broader set of scientific concepts mined from computer science papers using ForeCite (King et al., 2020). This set lacks target descriptions, so it requires manual evaluation.
Training on surrogate data that differs somewhat from the intended use case but can be obtained at scale is common in summarization research (e.g. abstracts as paper summaries (Cohan et al., 2018); headlines as news summaries (Narayan et al., 2018)). In our case there are two major discrepancies between train and test: the textual domain (train is mostly biomedical, test is largely computer science), and the level of supporting text (the Wikipedia-cited training inputs often have less support for the concept description than the ForeCite-mined test inputs do, as ForeCite pairs concepts with their likely introducing paper(s)).

Metrics
We rely on human evaluation, as current automatic metrics are unreliable for evaluating factuality (see §6.4). We are not targeting ROUGE metrics (Lin, 2004), but present them for completeness. For human evaluation, we use the standard dimensions of consistency (does the source entail the target?), fluency (is the target grammatical, understandable English?), relevance (does the target contain important information for understanding the source?), and coherence (do the sentences flow together coherently?), with definitions adapted slightly from Fabbri et al. (2020) via calibration with our annotators. We rate consistency and fluency on a five-point (1-5) scale, but relevance and coherence on a coarser three-point (1, 3, 5) scale. See App. E for annotation guidelines.

Manual evaluation
In Tab. 2, we report manual evaluation results, with each example annotated by one annotator (inter-annotator agreement is reported in Section 6.5). PINOCCHIO improves overall consistency. Expressing the results in terms of precision and recall, treating perfectly consistent output (i.e., a consistency score of 5) as a true positive, Table 2 shows that PINOCCHIO improves precision by 68% on average (47% on XSUM, and 89% on SCD) without hurting recall, yielding F1 improvements from 0.209 to 0.345 and from 0.287 to 0.361 on SCD and XSUM respectively. The improvements in consistency arise from two cases. First, when PINOCCHIO produces output, it is rated more consistent than BART on 44% and 24% of the examples from SCD and XSUM respectively, whereas BART is more consistent on only 16% and 13% (in the remaining cases, the two systems are equally consistent). Second, on the examples where PINOCCHIO produces no output, BART's output tends to be less factually consistent than its average, scoring 0.30 and 0.44 points lower (on the 5-point consistency scale) than its average for SCD and XSUM respectively.

We see that PINOCCHIO does reduce fluency with respect to the base BART model, and further that the sentence-level entropy filter applied in PINOCCHIO sometimes removes the key first sentence that defines the entity in SCD, resulting in a decrease in relevance. Pretrained language models are capable of producing incredibly fluent text, and prior work on steering them over-optimizes for the highest-likelihood output (Subramani et al., 2019; Subramani and Suresh, 2020). As a result, steering them away from their highest-likelihood output, as PINOCCHIO does, is bound to reduce fluency. Our results suggest that some of this fluency comes at the cost of factual consistency, as the model has learned to follow patterns to produce plausible sentences, but not necessarily while sticking to the source text (see §3 and Appendix G).

Table 5: ... denotes agreement with the annotator on which model is better, when the annotator did not rate the models as equivalent; "Mean/σ pairwise ties" gives the mean/std of the absolute difference between the metric's rating for each model, for pairs where the annotator rated the models as the same; "Mean/σ pairwise not ties" is the same, but for pairs where the annotator rated the models as different. A well-calibrated metric should have a mean near zero and low standard deviation when the models are annotated as equivalent. We find the automated metrics exhibit low agreement with our annotators.
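The precision/recall framing above can be made concrete with a short sketch. One plausible reading is that an opt-out counts as no prediction (hurting recall but not precision); the ratings below are hypothetical:

```python
# Sketch of the precision/recall framing: an output with consistency
# score 5 counts as a true positive; a withheld (opt-out) summary counts
# as no prediction. Scores below are hypothetical, for illustration.

def consistency_prf(scores):
    """scores: list of consistency ratings (1-5), or None for opt-outs."""
    produced = [s for s in scores if s is not None]
    tp = sum(1 for s in produced if s == 5)
    precision = tp / len(produced) if produced else 0.0
    recall = tp / len(scores) if scores else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ratings for ten examples; None = the system opted out.
ratings = [5, 5, 3, None, 5, 2, None, 5, 4, 5]
p, r, f = consistency_prf(ratings)
print(round(p, 3), round(r, 3), round(f, 3))  # prints 0.625 0.5 0.556
```

Under this reading, opting out of a likely-inconsistent generation raises precision without creating any new true positives, which is how PINOCCHIO improves precision "without hurting recall" when the withheld outputs would not have scored 5 anyway.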

Automatic evaluation
For completeness, we report ROUGE-1, -2, and -L (Tab. 3) for the two tasks, along with results on the CNN/Daily Mail dataset (Hermann et al., 2015a) for reference. We note that PINOCCHIO elects not to generate for a much higher portion of samples in SCD. This can be partially explained by the abstractiveness of these datasets, which we detail in Section 7.1. For the examples where PINOCCHIO generates, it lowers ROUGE moderately on the two abstractive datasets compared to BART, and by somewhat more on XSUM than on SCD. We analyze the ROUGE drop on XSUM in Appendix G. By contrast, PINOCCHIO fires only rarely on the extractive CNN/DM dataset, and therefore its ROUGE scores are unchanged from BART.
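To make the reported metric concrete, a simplified ROUGE-N (F1 variant) on unigrams can be sketched as follows. Real evaluations use the official ROUGE package (with stemming and other options); this is an illustration only:

```python
# Simplified sketch of ROUGE-N (F1 variant): overlap of n-gram
# multisets between a reference and a candidate summary.
from collections import Counter

def rouge_n_f1(reference, candidate, n=1):
    def counts(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n])
                       for i in range(len(toks) - n + 1))
    ref, cand = counts(reference), counts(candidate)
    overlap = sum((ref & cand).values())   # clipped n-gram matches
    if not overlap:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Note how a hallucinated but lexically similar candidate ("lay" vs. "sat") still scores highly, which is exactly why ROUGE rewards lexical similarity over consistency.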

Comparison against existing correctors and factuality metrics
We also compare with three recent methods for automatically correcting summaries or measuring their factuality. Here we evaluate on XSUM, which we expect to be more suitable for these methods (each was evaluated on XSUM in previous work, whereas SCD is out of domain). Our results show that PINOCCHIO yields an improvement on ~8.5% of XSUM, more than a factor of eight higher. Finally, we assess two representative automatic factuality metrics, FactCC (Kryscinski et al., 2020) and FEQA (Durmus et al., 2020). FactCC trains a <source, summary sentence> classifier; FEQA generates and answers questions from the summary, checking whether the answers are the same when using the source. We find neither metric suitable for our highly abstractive setting; each has low agreement with our XSUM annotations (Tab. 5), a result in line with a recent evaluation of factuality measures (Pagnoni et al., 2021).

Inter-annotator agreement
In Tab. 4, we report various inter-annotator agreement measures. We had three expert annotators, and the agreement statistics are averaged over all pairs of annotators, on a set of 30 examples (15 from each model) from each dataset. For model comparison, the most important metrics are the "compare" metrics, which measure how often the annotators agree on which model's output is better for a given example. The "compare" metric is the fraction of examples for which the pair of annotators agree on which model's output is better, or both say the outputs are equivalent. The "compare~" metric is similar but more lenient, as it only counts as disagreement the examples where one annotator says one model is better and the other annotator says the opposite. These kinds of strong disagreements are very rare in our data, suggesting that the relative comparisons between models in our experiments are reliable.

Table 6: Analysis of the abstractiveness of three summarization datasets. The abstractive XSUM and SCD datasets contain a substantial fraction of unsupported words, measured in terms of either automated n-gram overlap measures or manual annotation. BART+PINOCCHIO performs more backtracks η on more abstractive datasets. We report the scores for the CNN / Daily Mail dataset (See et al., 2017; Hermann et al., 2015b) for comparison because it is highly extractive.

Empirical validation of the intuition motivating PINOCCHIO
We now present two empirical analyses to verify the intuition sketched in Section 3. First, we verify our claim that the ground-truth summaries in our datasets contain unsupported terms (Table 6). We define Dataset Abstractiveness as the ratio of n-grams that appear in the summary but not in the source text. The two abstractive datasets (XSUM and SCD) show high abstractiveness, with approximately half or more of the terms in the summaries not appearing in the source. Of course, a lack of lexical overlap could arise from summaries stating supported information in different terms from the source. Thus, we also manually examine twenty examples for XSUM and CNN/DM and ten for SCD, and measure the fraction of words that are not directly supported by the source. This fraction is substantial (18-24%) for the abstractive datasets, but much smaller (2%) for the more extractive CNN/DM dataset. Finally, η, the number of times our proposed method BART + PINOCCHIO backtracks, which is a measure of how often the method estimates that generated tokens are unsupported, also correlates with the abstractiveness measures.
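One natural implementation of the abstractiveness measure, using unique n-grams (the paper's exact counting convention may differ), is:

```python
# Sketch of the "Dataset Abstractiveness" measure described above:
# the fraction of summary n-grams that never appear in the source.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def abstractiveness(source, summary, n=1):
    src = ngrams(source.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    if not summ:
        return 0.0
    return len(summ - src) / len(summ)  # novel n-grams / all n-grams

# Illustrative pair: "player", "agreed", and "deal" are novel.
source = "the striker signed a two year contract with the club"
summary = "the player agreed a two year deal"
print(abstractiveness(source, summary, n=1))  # prints 3/7 ~ 0.4286
```

On a fully extractive summary this measure is 0; the roughly 0.5 values reported for XSUM and SCD indicate that about half of the summary vocabulary never occurs in the source.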
We also verify one expected consequence of our hypothesized mechanism of hallucination. If indeed BART is defaulting to a background distribution of field values (based on frequency in the training summaries), then we would expect the more frequent training values to become even more probable in BART's output, as the model defaults to these as best guesses. We observe this effect for positions in player signings, as shown in Fig. 1. Notably, while this distribution is more peaked, it is not entirely concentrated on the most-likely field value, suggesting that the model has learned spurious correlations that lead it to output other, rarer field values, even when unsupported.
More generally, we observe a similar bias across all n-grams: compared to the original ground-truth summaries, the BART output tends to be less heavy-tailed, including disproportionately more of the high-likelihood n-grams. We show this by plotting the n-gram frequency distributions (which follow a power law) on a log-log scale in Fig. 2. The BART output generally has a less negative slope than the ground-truth distribution on these plots. The BART + PINOCCHIO method results in a distribution that is closer to the ground truth for 2- and 3-grams.
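The slope comparison can be reproduced with a simple least-squares fit to the log-log rank/frequency curve; below is a sketch with hypothetical counts (the paper's exact plotting setup may differ):

```python
# Sketch of the log-log frequency analysis described above: fit a slope
# to the n-gram rank/frequency curve, then compare the fitted slopes of
# the ground-truth and model-output distributions.
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical Zipf-like counts (frequency roughly 100 / rank):
zipf_counts = [100, 50, 33, 25, 20, 17, 14, 12, 11, 10]
print(round(loglog_slope(zipf_counts), 2))  # close to -1 for Zipf data
```

Applying the same fit to the n-gram counts of ground-truth summaries and of model outputs gives the slope comparison plotted in Fig. 2.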

Error analysis
To provide insight into dominant error types, we sample 20 PINOCCHIO generations from the SCD evaluation with inconsistent outputs, and identify three common error causes that each occur in ~20% of the samples:
1) incorrect paraphrasing or omission of meaning-changing information (e.g. "X has a long history of being used for Y" vs. "X is the model of choice for Y");
2) incorrect treatment of entities as coreferent/synonymous;
3) difficulty with heavy mathematical notation.
We also provide additional qualitative analysis of the generated outputs in Appendix G. We conclude that BART tends to exploit specific patterns in the dataset that contribute to its better ROUGE scores, but it fails to reliably apply commonsense or facts learned during training. Targeting these challenges in generative models is a promising future direction.

Conclusion
In this work, we present PINOCCHIO, a simple method for reducing hallucination in generative encoder-decoder models that requires no additional machine learning. PINOCCHIO provides a substantial lift in consistency, with only a small decrease in fluency. We analyze why existing summarizers hallucinate, showing that silver abstractive summarization datasets can contain unsupported target summaries, and presenting evidence for our conjecture that models that maximize likelihood trained on such data will tend to hallucinate. We also show that existing factuality metrics are insufficient, and further explore how patterns in the training dataset can produce misleading results on the test set. Finally, we introduce the task of scientific concept description and release a Wikipedia-based dataset for it.
We would like to clearly acknowledge the limitations of our approach. PINOCCHIO does not add new learned behavior to the model, using simple heuristics and single-step backtracking to steer the model towards more consistent output. The heuristics have settings that require some adaptation for each dataset, and while limited manual tuning was sufficient for the two datasets in our experiments, further experiments with additional datasets are necessary. Further, preliminary experiments suggest that the settings that were effective for BART do not simply work out of the box for another summarizer, PEGASUS. We also acknowledge that while PINOCCHIO offers improvements in consistency, the results are still far from perfect, and thus the system is not suitable for certain applications. We hope the approach and insights in this paper help spur further development of models that generate consistent text.

A Example Full Text
A.1 Example from Table 1

Police said the 14-year-old reported feeling unwell and required hospital treatment. He was later discharged from hospital and is recovering at home. The incident happened in Holywood, County Down, on Saturday. The PSNI said the tablets were "as yet unidentified" but warned of the "potential dangers" they posed. The 17-year-old has been charged with possessing a Class A controlled drug with intent to supply; possessing a Class B controlled drug with intent to supply; possession of a Class A controlled drug; possession of a Class B controlled drug; and supplying a Class A controlled drug. He is due to appear at Newtownards Youth Court on 14 February.

A.2 Player signings
The 29-year-old Scot has signed a two-year contract with the Gloucestershire outfit. Prior to joining Hibs in August 2016, Graham had spells at six other Scottish sides, including Dundee United, St Johnstone and Ross County. He will be available for Saturday's league visit of Crawley Town, subject to receiving international clearance. Find all the latest football transfers on our dedicated page.

B Mathematical Details of Hallucination Analysis
Formally, a summarization model is defined by a distribution P(S|P) over output textual summaries S conditioned on an input passage P. We assume the summarization system aims to maximize the probability of the summary S given the text passage, i.e., it outputs arg max_S P(S|P). While in practice (including in our experiments) summarization models use imperfect search procedures like beam search to find high-likelihood generations, and may rescore complete generations using factors other than likelihood (like length), in this analysis we ignore these details and assume the generator simply maximizes likelihood. Analyzing the impact of more complex generation aspects is an item of future work.

Let F(S) be a function denoting the value of a given "field" in the summary S, equal either to some string value or to ∅ if the field does not occur in S. A "field" is a typical piece of information that is often mentioned in a summary of a given topic (e.g., participating teams, in a summary of a sporting event; or the university where an idea was developed, in a scientific concept description). Then the model's distribution over a field value for a given passage is

P(F = f | P) = Σ_{S : F(S) = f} P(S | P).

Our analysis uses the following assumption:

Assumption A1: The model's most likely summary probability is strictly monotonic in the probability of its included field values. That is, whenever

P(F = f1 | P) > P(F = f2 | P),

then

max_{S : F(S) = f1} P(S | P) > max_{S : F(S) = f2} P(S | P).

That is, when the model thinks a field value is more likely in a summary for a given passage, then it can find a more likely summary that uses that field value. This assumption seems likely to hold often in practice (for example, we would expect that by simply swapping out a less likely field value in a summary for a more likely one, we would often arrive at a more probable summary).
The observation used in the analysis in Section 3 is then: given a passage p, a field F, and a summarization model P(S|P), if Assumption A1 holds, then a generator that maximizes likelihood will choose to output f* = arg max_f P(F = f | P) for the field's value (or omit the field, if f* = ∅). This fact is straightforward from the definitions.
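The marginalization and argmax above can be illustrated with a toy distribution. The following sketch is ours; all summaries, field values, and probabilities are invented for illustration and are not from the paper's experiments:

```python
from collections import defaultdict

# Toy model distribution P(S | P) over four candidate summaries, each
# expressing one value of a field F (here: a player's position).
p_summary = {
    "X signed striker A": 0.40,
    "X signed midfielder A": 0.35,
    "X signed striker B": 0.15,
    "X signed midfielder B": 0.10,
}
field_of = {  # F(S): the field value expressed by each summary
    "X signed striker A": "striker",
    "X signed midfielder A": "midfielder",
    "X signed striker B": "striker",
    "X signed midfielder B": "midfielder",
}

# P(F = f | P) = sum over summaries S with F(S) = f of P(S | P)
p_field = defaultdict(float)
for s, p in p_summary.items():
    p_field[field_of[s]] += p

best_field = max(p_field, key=p_field.get)        # argmax_f P(F = f | P)
best_summary = max(p_summary, key=p_summary.get)  # argmax_S P(S | P)

print(best_field)    # striker (0.40 + 0.15 = 0.55 > 0.45)
print(best_summary)  # X signed striker A
```

Here the most likely summary carries the most likely field value, which is exactly the monotonicity that Assumption A1 asserts.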

C Manual Examination of the Unsupported Dataset Samples
Identifying parts of a summary that are not supported by the source document is a challenging annotation task. In this section, we explain how we formalize this task as a binary token tagging problem, and we show one example that illustrates the difficulty of annotation.

C.1 Annotating unsupported words
Naturally, words that appear only in the summary but not in the source document tend to have a higher chance of being "hallucinated", and vice versa. Hence, we select such words from the summary, and the goal is to identify whether the meaning of these words can be deduced from the source documents. Compared to the automated measurements, the manually inspected labels are considered a better approximation of the true abstractiveness of the dataset or the samples.
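The candidate-selection step can be sketched as follows. The tokenization and the tiny stopword list are our own simplifications; flagged tokens are only candidates, which are then judged manually:

```python
import re

# A deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "was", "as", "by"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def candidate_unsupported(summary, source):
    """Flag summary tokens that never appear in the source document.

    This is only candidate selection: a flagged (novel) token may still
    be semantically supported, e.g. a paraphrase or a commonsense
    implication, so flagged tokens go to human annotators.
    """
    src = set(tokenize(source))
    return [(tok, tok not in src and tok not in STOPWORDS)
            for tok in tokenize(summary)]

source = "The PSNI said the tablets were as yet unidentified."
summary = "Police described the lethal tablets as unidentified."
print(candidate_unsupported(summary, source))
```

In this made-up pair, "lethal" and "police" are flagged as novel, while "tablets" and "unidentified" are grounded in the source.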

C.2 One challenging example
In practice, understanding the source document involves multiple (common sense) reasoning steps and subjective judgements.
Consider the following example. For one sentence in the summary, we highlight in red the words that do not appear in the source: 'The stings of most of these species (Bees) can be quite painful, and are therefore keenly avoided by many people.' The source text mentions several dangerous aspects of bee stings, but whether it follows that they are avoided by many people (a plausible commonsense implication) is a subjective judgment, and annotators often disagreed on such cases.

D.1 Heuristics in f c
We develop 8 binary checks that constitute the heuristics for f_c, which fall into three categories. Two categories use model internals: model confidence and source-text attribution for the predicted token. The third category uses language features, allowing generations that are common words.

Model confidence

• entropy of the next-token distribution is < τ for a token in the top 2 predictions
• from the top 10 predicted next tokens, the number that match a top-5 attended-to piece of source text⁹ is ≥ ½ (10 − the number that are stopwords)

Source text attribution

• the most attended-to piece of the source text contains the predicted token
• 3 out of the top 5 attended-to pieces of the source text contain the predicted token
• the sum of the attention scores of the attended-to pieces of source text (out of the top 5) that contain the predicted token is greater than ⅓ of the sum of the top-5 attention scores
• the max cosine similarity between the embedding of the predicted token and that of any word in the top 5 attended-to pieces of source text is greater than 0.15 (and the word is not capitalized or a number word)*¹⁰

Common word

• the predicted token is a stopword*
• the prediction matches¹¹ one of the top 5 predictions of roberta-base¹²

All of the components and hyperparameters above were determined via inspection on a small number of samples (e.g., n=20) from the XSUM and SCD datasets. In the subsequent sections we detail the configuration of the parameters on each dataset.

⁹ All references to "top attended-to pieces of the source text" mean a max across locations in the source text across attention heads in the final layer of the decoder's cross-attention, and a 10-wordpiece window around the attended-to location.
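A minimal sketch of how such binary checks could be combined into f_c is shown below. The `step` dictionary, the accept-if-any-check-fires rule, and the exact quantities exposed are our simplifying assumptions (only four of the eight checks are represented), not the paper's implementation:

```python
import math

def entropy(probs):
    # Shannon entropy (nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def f_c(step, tau=1.0, stopwords=frozenset({"the", "a", "of", "in"})):
    """Decide whether a predicted token looks supported by the source.

    `step` bundles quantities available at one decoding step (predicted
    token, its rank, the next-token distribution, and the top
    attended-to source spans). The combination rule here -- accept if
    any check fires -- is an assumption for illustration.
    """
    checks = [
        # --- model confidence ---
        step["token_rank"] < 2 and entropy(step["next_token_probs"]) < tau,
        # --- source text attribution ---
        step["token"] in step["top_attended_spans"][0],
        sum(step["token"] in s for s in step["top_attended_spans"][:5]) >= 3,
        # --- common word ---
        step["token"] in stopwords,
    ]
    return any(checks)

step = {
    "token": "unidentified",
    "token_rank": 0,
    "next_token_probs": [0.9, 0.05, 0.05],
    "top_attended_spans": ["tablets were as yet unidentified",
                           "warned of the potential dangers"],
}
print(f_c(step))  # True: top-ranked, low entropy, and attributed to source
```

A token that is low-ranked, high-entropy, and absent from the attended-to spans would fail every check and be treated as a likely hallucination.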

D.2 XSUM modeling details
For the configuration of PINOCCHIO on XSUM, we set τ = 1.0 and do not use the optional stopword condition, both to accommodate the highly abstractive nature of the XSUM dataset and to prevent stopwords from being accepted as part of hallucinations.
One other important detail is that XSUM has a surprising property with respect to first names. If a person appears in the source as "Mr/Ms X" and also in the headline, they always appear as "<FIRST NAME> X" in the headline. This leads BART to always guess the first name of a person, frequently incorrectly. Our f_c often identifies the first name as unsupported, but because BART is essentially unable to predict anything other than a first name in this situation, it cannot recover from this error. For this reason, when an unsupported token is identified as a name using spaCy (Honnibal et al., 2020), we deterministically replace it with Mr/Ms.¹³

¹⁰ Items marked with an asterisk (*) are optional.
¹¹ For all string matching, we lemmatize first.
¹² https://huggingface.co/roberta-base
¹³ For real applications, we suggest using a gender-neutral honorific, as gender cannot be inferred from first names.
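The deterministic name replacement can be sketched as follows. The `is_person_name` predicate is a stand-in for a real NER system (the paper uses spaCy), and we substitute a single gender-neutral honorific, as suggested above for real applications; everything here is our illustrative simplification:

```python
def neutralize_guessed_first_names(tokens, is_unsupported, is_person_name):
    """Replace unsupported tokens tagged as person names with an honorific.

    `is_person_name` stands in for an NER system; it is supplied by the
    caller. A single neutral honorific is used rather than guessing
    Mr/Ms from the first name.
    """
    out = []
    for tok, bad in zip(tokens, is_unsupported):
        if bad and is_person_name(tok):
            out.append("Mx")  # gender-neutral honorific
        else:
            out.append(tok)
    return out

tokens = ["Antonio", "Conte", "praised", "the", "team"]
flags = [True, False, False, False, False]  # "Antonio" flagged unsupported
print(neutralize_guessed_first_names(tokens, flags, lambda t: t[0].isupper()))
# ['Mx', 'Conte', 'praised', 'the', 'team']
```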

D.3 SCD modeling details
For SCD, the source consists of full papers and is too long to input to BART directly, so we train a separate BERT-based model to extractively rank chunks of the input text based on predicted ROUGE-L F1 score against the target text. This setup of ranking extractive chunks and then passing them to an abstractive model is similar to prior work on long-text summarization (Liu and Lapata, 2019). We pass the concept name/aliases and each chunk of text to rank to SciBERT-base (Beltagy et al., 2019), with a final linear layer to predict the ROUGE-L score. We then finetune BART, with the ranked extractive chunks as source, again concatenated with the concept name/aliases. For inference, we also filter the chunks to those that include the concept name or an alias.
Extractive ranker for descriptions. The extractive ranker uses SciBERT¹⁴, followed by a linear layer, and is trained with MSE loss. We also use dropout of 0.1. We train on chunks containing three sentences, and use the average ROUGE-L as the label. To reduce the size of the training set, for each target description we select the top 5 and bottom 5 chunks by ROUGE-L, and an additional 5 random chunks from the middle. We train for 3 epochs, with a batch size of 1, 8 gradient accumulation steps, and the AdamW (Loshchilov and Hutter, 2017) optimizer with weight decay 0.01, and a slanted triangular learning-rate scheduler with peak learning rate 5e-5.
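The chunk-labeling and subsampling step can be sketched as below. The ROUGE-L F1 implementation and the non-overlapping three-sentence chunking are our own simplifications of what the ranker's training labels look like; at inference time the trained SciBERT ranker predicts these scores instead of computing them against a target:

```python
import random

def lcs_len(a, b):
    # Longest-common-subsequence length via the classic DP table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # Whitespace-tokenized ROUGE-L F1 (a simplification of the metric).
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def make_training_chunks(sentences, target, k=5, window=3, seed=0):
    """Label `window`-sentence chunks with ROUGE-L F1 against the target
    description, then keep the top-k, bottom-k, and k random middle
    chunks, mirroring the subsampling described above."""
    chunks = [" ".join(sentences[i:i + window])
              for i in range(0, len(sentences), window)]
    scored = sorted(((rouge_l_f1(c, target), c) for c in chunks), reverse=True)
    top, bottom, middle = scored[:k], scored[-k:], scored[k:-k]
    random.Random(seed).shuffle(middle)
    return top + bottom + middle[:k]
```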

D.4 Finetuning BART on descriptions
BART was finetuned with the standard settings,¹⁵ a batch size of 4 with 8 gradient accumulation steps, for 10 epochs, selecting the epoch-5 model based on validation loss. The same optimizer as above was used, with 500 warmup steps. The model was trained for 5.5 hours on 3 NVIDIA Quadro RTX 8000s. We additionally filter out examples that have a target length less than 150 characters, and examples where the source and target have less than 0.2 token overlap.
For the configuration of PINOCCHIO on SCD, we set τ = 0.75 and do not use the optional cosine-similarity condition, to encourage more extractiveness.

• Consistency
- 1: completely made up
- 2: some phrases supported, but largely made up
- 3: some full details correct, but key details made up
- 4: minor details not fully supported (e.g. acronym wrong, location abstracted a bit wrong)
- 5: fully supported
- Other notes: An unresolved "it" should be assumed to refer to the main concept. If this makes it not factual, that counts against consistency; otherwise it counts against coherence.

• Coherence
- 1: all sentences/phrases don't make sense together
- 3: some sentences/phrases don't make sense together, separate from whether they are factual
- 5: no issues with how phrases/sentences are put together

• Fluency (at the sentence level)

- 1: not fluent English, to the point that it is impossible to understand/meaningless
- 2: not fluent English, to the point that it is very hard to understand
- 3: semi-fluent English (including major fluency errors resulting from copying source text), but still largely understandable
- 4: mostly fluent English (including minor fluency errors resulting from source text); does not impact understanding
- 5: fluent English

G Patterns and Hallucination
We provide additional discussion of some qualitative aspects of our results. First, we discuss the substantial drop in ROUGE on XSUM.
As alluded to in §3, we believe this is due to a pervasive regularity in the XSUM dataset, which BART is able to capture very well. In Tab. 7, we show the top examples sorted by ROUGE-L difference between BART and PINOCCHIO, along with a hand-crafted regex matching the example, how many times it matches target outputs from the training and validation sets, how many times it matches BART predictions on the test set, and how many of those predictions are completely factually consistent. Most of these examples straightforwardly map to patterns of text that occur in the training data. We also see that test-set predictions matching these patterns are largely not consistent. As discussed in §3, this is because BART assigns high likelihood to the general pattern, but guesses to fill in the details. Some of these patterns are straightforward to identify, but many are likely to be more complicated. Broadly speaking, XSUM contains a lot of regularity in the mapping between the source topic and the phrases and vocabulary used in the target summary. BART exploits this, whereas PINOCCHIO steers the model away from these patterns, which are often not supported by the source text, and this lowers ROUGE.

A related question is whether BART trained on XSUM applies facts learned during training correctly. Does it learn that Antonio Conte is the coach of the Italian football team, and thus that someone named "Conte" who coaches the Italian team is Antonio Conte? Or does it merely learn that the first name most commonly associated with "Conte" in train is "Antonio", and so everyone named "Conte" is Antonio Conte?
It is difficult to assess this automatically, so we present an example of BART's tendency to guess world knowledge. We create one three-sentence source, "Sometime last week, a fire burned down a <BUILDING>, killing a number of people. The fire took place in <LOCATION>. Investigators believe at least four people to be missing.", filling in the blanks with three made-up locations and three building types. BART produces plausible but inconsistent summaries. Nine out of nine outputs hallucinate the location, eight discuss arrests or hospitalizations, and three mention the police or fire service reporting the details of the situation. These characteristics are all due to biases present in the training data: locations are often abstracted, reported fires often result in someone being arrested or hospitalized, and they are usually reported by authorities. We present this example as evidence that BART is not learning how to reliably apply commonsense and learned facts, but rather is naively reproducing patterns and word associations.
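The probe construction can be reconstructed as follows; the slot fillers below are our own made-up values, not the ones used in the experiment:

```python
from itertools import product

# Fixed three-sentence template with two slots, as described above.
template = ("Sometime last week, a fire burned down a {building}, "
            "killing a number of people. The fire took place in "
            "{location}. Investigators believe at least four people "
            "to be missing.")

# Illustrative fillers only (the paper's actual fillers are not given).
buildings = ["warehouse", "hotel", "apartment block"]
locations = ["Northglen", "Eastmere", "Wexbridge"]

probes = [template.format(building=b, location=l)
          for b, l in product(buildings, locations)]
assert len(probes) == 9  # 3 building types x 3 locations
print(probes[0])
```

Feeding each of the nine filled templates to the summarizer and inspecting the outputs for invented locations, arrests, and attributed authorities reproduces the qualitative check described above.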

H Comparing Generated Summaries with and without PINOCCHIO
In Tab. 8 and 9, we include example summaries generated with and without PINOCCHIO. We additionally include the annotator ratings and their comments to illustrate how PINOCCHIO improves the quality of the summaries.

Source
Tourism NI said it expects a strategy to be in place by early next year. Janice Gault from the Hotels Federation told the BBC's Inside Business programme it was crucial for the industry. She said a "partnership" approach was essential. "I mean we've really urged people to get a strategy at sort of quite a high level so that everybody can buy into that," she said. "Hotels have probably spent about a billion pounds in the last decade and are set to spend more." Ms Gault said another big boom was expected in the hotel market which would probably generate another half a billion pounds. "The funny thing about the strategy is we still have the target, but we don't have the strategy. We only have one way to go and that's growth and the way for us to get that is to partnership," she added.
Generation

BART: The Northern Ireland Hotels Federation has called on the Northern Ireland Executive to set out a strategy for growth in the hotel industry.

BART+PINOCCHIO: A new strategy for Northern Ireland's hotel industry has been urged by the Hotels Federation and Tourism NI as the industry is set for another big year.

Annotator Comment
Executive abstracted; didn't call on her; strategy is not to "grow" the industry

Source
The University and College Union says the 1.1% rise offered by the universities is "an insult". But the Universities and Colleges Employers Association said the walkout was "disappointing given the very good pay offer". Unions representing university support staff are balloting on the offer, with strike action possible in the autumn. UCU says its members have suffered a real-terms pay cut of 14.% since 2009 and complains the squeeze on staff salaries has come as university leaders enjoyed hefty increases. "A 1.1% pay offer is an insult to hardworking staff, especially in light of the 5% pay rise vice-chancellors have enjoyed while holding down staff pay," said general secretary Sally Hunt. "Industrial action which impacts on students is never taken lightly, but members feel that they have been left with no alternative. "If the employers wish to see a swift end to this dispute, and avoid further disruption, they need to come back to the table with a much-improved offer." Summer exams are still running at some universities, though many have finished.
A spokesman for the employers anticipated only "minor impact and minimal student disruption". "Even for examinations which are still taking place at some higher education institutions, more than nine out of 10 report that a no to low impact is anticipated," said the spokesman. "We would like to see the UCU consulting its members on the final offer." The employers say the offer is "at, and, for some, beyond, a limit of affordability for higher education institutions and the very best offer that will be available this year". They maintain the weighting of the offer means the worst-paid university staff will get a rise of more than 5%. They say they have also offered talks on zero-hours contracts and on improving lower pay for female academics. But UCU says it rejected the 1.1% offer as it was only a marginal improvement on the original 1% on which it had balloted members. Ballots of university support staff represented by Unison and Unite are also under way on the improved offer, with both unions recommending it be rejected. Any action would take place during the autumn term, said a Unison spokeswoman. UCU is planning strike rallies in: Staff are also working to contract from Wednesday, refusing to set extra work, cover for absent colleagues or work overtime.
Generation

BART: University staff in England are to stage a one-day strike on Wednesday in a row over pay, in a dispute that could lead to further industrial action.

BART+PINOCCHIO: Universities have warned of "minor disruption" for students in England after staff rejected an improved pay offer in a dispute over a pay rise.

Ratings
BART: Consistency: 3, Fluency: 3, Relevance: 5
BART+PINOCCHIO: Consistency: 4, Fluency: 5, Relevance: 3

Annotator Comment

BART: consistency: location abstracted; "one day" on Wednesday not true. fluency: "in a" twice in same sentence, doesn't flow.
BART+PINOCCHIO: consistency: quoted text is not an actual quote; location abstracted. relevance: "improved pay" is misleading; missing key information: "1.1% marginal improvement".

Figure 1 :
Figure 1: Distribution of positions in summaries about player signings in train vs. BART output. BART output is more peaked at positions more common in train, suggesting BART defaults to these when no position is supported by the source.

'League Two club Cheltenham Town have signed Hibernian striker Brian Graham on a free transfer.'

'League Two side Cheltenham Town have signed Hibernian midfielder Scott Graham on loan until the end of the season.'

Figure 2 :
Figure 2: Comparing the n-gram frequency distribution of generated versus ground-truth summaries on the XSUM dataset. The default BART model outputs (in green) over-represent frequent n-grams (bottom right of the distribution), but PINOCCHIO is closer to the ground truth. Results on the SCD dataset are similar. The slopes of the linear fits for ground-truth text and BART generations are significantly different (p ≪ 0.05, ANCOVA), while those between ground truth and BART+PINOCCHIO generations are not (p > 0.05), for both 2-grams and 3-grams.

Table 1 :
An example of hallucination. Inconsistent words are highlighted in red italics. In this case, PINOCCHIO corrects the inconsistent detail in the BART output.

Table 3 :
ROUGE scores on different datasets with and without PINOCCHIO. Datasets with higher abstractiveness (e.g., XSUM and SCD) may suffer larger ROUGE drops when PINOCCHIO is used.

Table 5 :
Agreement between automated metrics and our annotations. tau represents Kendall's tau.

We compare against Zhu et al. (2021), a recent seq2seq fact corrector (FC) that incorporates OpenIE (Angeli et al., 2015) and knowledge graph embeddings. We take the output of their strongest model (UniLM (Dong et al., 2019) + FC) on the XSUM test set and find that it changes only ~5% of examples, and that the net improvement rate of the changes is 15% (see App. F for details). This corresponds to an improvement on <1% of the full XSUM test set. By contrast, our experiments in the previous section

Table 7 :
Top-5 BART generations by ROUGE-L gain over PINOCCHIO (#2 is excluded; it does not match an obvious pattern and is factually consistent). In all examples, BART clearly memorized training patterns, and in at least 3 it guesses the details (the 3rd output is memorized from noise in XSUM), which is not strongly penalized by ROUGE.

Table 8 :
Side-by-side comparison of the generated summaries with and without PINOCCHIO -Example 1 in XSUM.

Table 9 :
Side-by-side comparison of the generated summaries with and without PINOCCHIO -Example 2 in XSUM.