Mutual Information Alleviates Hallucinations in Abstractive Summarization

Despite significant progress in the quality of language generated by abstractive summarization models, these models still exhibit the tendency to hallucinate, i.e., output content not supported by the source document. A number of works have tried to fix—or at least uncover the source of—the problem, with limited success. In this paper, we identify a simple criterion under which models are significantly more likely to assign more probability to hallucinated content during generation: high model uncertainty. This finding offers a potential explanation for hallucinations: models default to favoring text with high marginal probability, i.e., high-frequency occurrences in the training set, when uncertain about a continuation. It also motivates possible routes for real-time intervention during decoding to prevent such hallucinations. We propose a decoding strategy that switches to optimizing for the pointwise mutual information of the source and target token—rather than purely the probability of the target token—when the model exhibits uncertainty. Experiments on the XSUM dataset show that our method decreases the probability of hallucinated tokens while maintaining the ROUGE and BERTScore of top-performing decoding strategies.


Introduction
Abstractive summarization, the task of condensing long documents into short summaries, has a number of applications, such as providing overviews of news articles or highlighting main points in technical documents. Abstractive summarization is usually performed using probabilistic text generators (Goyal and Durrett, 2020; Mao et al., 2020; Kryscinski et al., 2020), which have shown a strong ability to produce fluent, human-like text (Baevski and Auli, 2019; Radford et al., 2019; Brown et al., 2020). However, these models have been observed to hallucinate facts, i.e., add information to the output that was not present in the original text. This behavior is problematic, as presenting users with unsubstantiated content can lead to undesirable effects, such as the spread of misinformation (Bender et al., 2021; Abid et al., 2021; Liang et al., 2021). Some works have attributed this phenomenon to the specific training corpora for these models, in which ground-truth summaries often contain outside information that may not have been directly deducible from the original text (Maynez et al., 2020; Zhou et al., 2021). Others have pointed to model architectures or training strategies (Voita et al., 2021; Wang and Sennrich, 2020; Kang and Hashimoto, 2020). While these works have given us an improved understanding of the cause of hallucinations, there still does not exist an efficient and robust set of techniques for identifying and preventing them during the generation process.
This work aims to first provide a simple criterion indicating when a model is more likely to assign higher probability to content not necessarily derived from the source document. Specifically, we link the start of a hallucination during generation to high model uncertainty about the next token, which we quantify by conditional entropy. We hypothesize that hallucinations may be due to a tendency of models to default to placing probability mass on tokens that appeared frequently in the training corpus, a behavior of language models previously observed in several natural language processing (NLP) tasks (Kobayashi et al., 2020; Wei et al., 2021). As a consequence, generations with hallucinations would still be viable candidates, as standard decoding strategies for summarization optimize purely for the probability of the generation. We propose an alternative decoding strategy to combat this behavior: when a model exhibits high uncertainty, we change our decoding objective to the pointwise mutual information between the source document and target token (PMI; Li et al., 2016; Takayama and Arase, 2019), encouraging the model to prioritize tokens specifically relevant to the source document. While switching entirely to the PMI objective causes a 3.13% drop in ROUGE-L scores, this conditional and temporary change leads to only a 0.977% drop in ROUGE-L while increasing factuality according to the FACTScore metric.
In experiments, we first observe a strong correlation between conditional entropy and the start of a hallucination on an annotated subset of the XSUM dataset (Maynez et al., 2020). We next score the targets in the annotated subset under both the standard log-probability objective and PMI, and observe that the revised log-probability of hallucinated tokens under the PMI objective is indeed lower. Finally, we find that our proposed decoding strategy maintains ROUGE and BERTScore.

Preliminaries
In this work, we consider probabilistic models for abstractive summarization. Explicitly, we consider models of the distribution p(y | x), where x is the source document that we wish to summarize and y = y_0, ..., y_T is a string, represented as a sequence of tokens from the model's vocabulary V. The set of valid sequences Y is then defined as all sequences y such that y_0 := BOS and y_T := EOS, the beginning- and end-of-sequence tokens, respectively, and y_t ∈ V for 0 < t < T. Note that standard models are locally normalized, i.e., at time step t they provide a probability distribution p(· | y_<t, x) over V̄ := V ∪ {EOS}, given the source document and prior context. The probability of an entire string y can then be computed as p(y | x) = ∏_{t=1}^{T} p(y_t | y_<t, x), where for shorthand we define y_<t := y_0, ..., y_{t-1}. Generation from p is performed token by token due to the autoregressive nature of most language generators. We typically seek to generate a string that maximizes some score function:

y* = argmax_{y ∈ Y} score(y | x),   (1)

where the standard choice of score is the log-probability log p(y | x).

Evaluation. Abstractive summarization systems are usually evaluated using automatic metrics, such as ROUGE (Lin, 2004). While ROUGE generally correlates poorly with human judgments (Maynez et al., 2020; Fabbri et al., 2021) and is only weakly correlated with factuality, it is quick to compute, making it useful for rapidly testing modeling choices. Recently, entailment metrics (FactCC; Kryscinski et al., 2020) and contextual embedding methods (BERTScore; Zhang et al., 2020) have surfaced as reasonable indicators of factuality (Pagnoni et al., 2021).
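Because the model is locally normalized, the log-probability of a full summary is simply the sum of per-token conditional log-probabilities. A minimal sketch of this scoring (not the paper's implementation) follows; the `next_token_log_probs` interface, which exposes the model's conditional distribution given a prefix, is a hypothetical assumption:

```python
def sequence_log_prob(next_token_log_probs, tokens):
    """Score a full sequence under a locally normalized model.

    next_token_log_probs(prefix) is assumed to return a dict mapping each
    token in the extended vocabulary to its conditional log-probability
    given the prefix (the source document x is assumed to be baked in).
    tokens should start with BOS and end with EOS.
    """
    total = 0.0
    for t in range(1, len(tokens)):
        dist = next_token_log_probs(tokens[:t])
        total += dist[tokens[t]]  # log p(y_t | y_<t, x)
    return total
```

A uniform toy model makes the decomposition easy to check by hand: each generated token contributes one conditional log-probability term.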

Finding and Combating Hallucinations
It is not well understood when summarization models start to hallucinate, i.e., when they start to place high probability on continuations that are unfaithful (not entailed by the information presented in the source document). In this work, we hypothesize that such moments correlate with high model uncertainty. In other problem settings, it has been observed that NLP models default to placing an inappropriately large portion of probability mass on high-frequency (with respect to the training corpus) tokens; this is especially the case when making predictions for data points of a type that the model has not had much exposure to during training (Kobayashi et al., 2020; Wei et al., 2021). In this same setting, models often have high (epistemic) uncertainty about their predictions (Hüllermeier and Waegeman, 2021). We extrapolate from these findings and posit that summarization models may assign high scores to marginally likely, but perhaps unrelated, tokens in settings for which they are not well calibrated.
Fortunately, both model certainty and marginal likelihood have quantifications that can be easily computed at any given point in the decoding process, making it possible to test for relationships between these quantities and the start of hallucinations. Specifically, we can use the standard equation for Shannon entropy with our conditional distribution to quantify model uncertainty at time step t:

H(p(· | y_<t, x)) = − Σ_{y ∈ V̄} p(y | y_<t, x) log p(y | y_<t, x).

Entropy is not a holistic measure of uncertainty, but our use of it is motivated by previous research that has likewise employed it to quantify the uncertainty of model predictions in classification (Gal and Ghahramani, 2016) and summarization (Xu et al., 2020) tasks. Further, we can directly compute marginal probabilities p(y | y_<t) using a language model; this value quantifies how likely a continuation y is irrespective of the source.
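Given access to the next-token distribution, the entropy above is a one-line computation. A minimal sketch (the function name and dictionary representation of the distribution are illustrative assumptions, not the authors' code):

```python
import math

def conditional_entropy(dist):
    """Shannon entropy H(p(. | y_<t, x)), in nats, of a next-token
    distribution given as a mapping from tokens to probabilities."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)
```

A uniform distribution over k tokens gives the maximum entropy log k, while a one-hot distribution gives zero, matching the intuition that entropy measures how spread out the model's probability mass is.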

Pointwise Mutual Information Decoding
Under the premise that models place disproportionate probability mass on marginally likely, i.e., frequent, tokens, the standard log-probability decoding objective is prone to favor generic continuations regardless of the input. In order to alleviate the problem of generic outputs from neural conversation models, Li et al. (2016) propose maximizing for mutual information during decoding, which effectively introduces a penalty term for such candidates. Formally, they propose using the following score function in the problem of Eq. (1):

score(y | x) = log [ p(y | x) / p(y) ],

which is the pointwise mutual information between the source x and the target y. Note that this is equivalent to optimizing for score(y | x) = log p(y | x) − log p(y). While this score function likewise decomposes over tokens, solving for the exact maximizer is computationally intractable, for the same reasons as discussed earlier. Thus we must still resort to approximate search algorithms. In practice, one can iteratively optimize for pointwise mutual information (PMI): log p(y_t | y_<t, x) − log p(y_t | y_<t).

Our proposed decoding strategy, conditional PMI decoding (CPMI), uses the conditional entropy at a given time step to indicate when the decoding objective should be changed to PMI. This process can be formalized as follows. For a given (token-by-token) decoding strategy, we use the pointwise score function:

score(y_t | x, y_<t) = log p(y_t | y_<t, x) − λ · 1{H(p(· | y_<t, x)) ≥ τ} · log p(y_t | y_<t),

where 1{·} is the indicator function and λ and τ are hyperparameters. In words, when H(p(· | y_<t, x)) is above a certain threshold τ, we subtract a (scaled) term for the marginal log-probability of the token, i.e., we change from the standard token-wise log-probability objective to PMI.
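The per-token CPMI score function can be sketched directly in code. This is an illustrative reading of the objective under stated assumptions: the scalar log-probabilities are assumed to come from the summarization model and a separately trained language model, and the function and argument names are our own:

```python
def cpmi_score(log_p_cond, log_p_marg, entropy, lam, tau):
    """Token-level CPMI score.

    log_p_cond: log p(y_t | y_<t, x) from the summarization model
    log_p_marg: log p(y_t | y_<t) from a separate language model
    entropy:    H(p(. | y_<t, x)) at the current time step
    lam, tau:   the CPMI hyperparameters (lambda, tau)
    """
    if entropy >= tau:
        # High model uncertainty: subtract the scaled marginal
        # log-probability, i.e., switch to the PMI objective.
        return log_p_cond - lam * log_p_marg
    # Otherwise: standard log-probability objective.
    return log_p_cond
```

Note that subtracting the marginal log-probability penalizes tokens that are likely regardless of the source, which is exactly the high-frequency default behavior the method targets.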

Related Work
Understanding hallucinations. Several prior works have tried to identify the cause of hallucinations in various natural language generation tasks, along with methods for alleviating them. For example, both Wang and Sennrich (2020) and Voita et al. (2021) suggest that hallucinations stem from exposure bias, i.e., the failure of a model to predict accurate continuations following its own generations rather than the ground-truth context, a result of the discrepancy between training and inference procedures, as it causes the model to over-rely on the previously generated target context when decoding. They propose using minimum risk training (MRT), which can alleviate exposure bias, to make models more robust. However, these results show only a tentative connection to exposure bias, and are based on models for neural machine translation (NMT) rather than summarization. Other works have shown that pre-training, and training on more data, yield summaries more faithful to the source (Voita et al., 2021; Maynez et al., 2020). In contrast to these works, our method does not require any changes to model training. Rather, it intervenes during generation without the need to retrain the base model.
Detecting hallucinations. Other efforts aim to identify hallucinations rather than their cause. Token- (Zhou et al., 2021) and sentence-level (Kryscinski et al., 2020) hallucination detection, as well as textual entailment systems (Goyal and Durrett, 2020), allow hallucinations to be identified after the generation process. Some techniques even aim to correct the unfaithful span by, e.g., replacing it with text from the source (Chen et al., 2021). However, these approaches are all post-hoc.
Our approach intervenes during decoding, allowing real-time hallucination detection and prevention.
Decoding to avoid hallucinations. Perhaps most in line with this work, some prior work has modified the decoding procedure to avoid unfaithful outputs. Keyword-based methods extract keyphrases from the source and require that they appear in the summary (Mao et al., 2020). The focus attention mechanism (Aralikatte et al., 2021) biases the decoder towards tokens that are similar to the source. While this is similar to our approach, we use mutual information to bias our decoding algorithm away from high-probability, but not necessarily relevant, candidates. Another difference is that our method only runs when model uncertainty, as quantified by conditional entropy, is high, so we only bias generation when necessary. Lastly, our approach is purely abstractive and does not require resorting to extractive methods.
Mutual information decoding. Mutual information-based decoding techniques have proven helpful in a number of settings, for example, in zero-shot settings (Holtzman et al., 2021) or for promoting diversity or relevance in neural dialogue models (Li et al., 2016; Takayama and Arase, 2019). Our work is the first to use mutual information to increase the faithfulness of summaries in abstractive summarization.

Experiments
Data. We use the extreme summarization (XSUM) dataset (Narayan et al., 2018), which is composed of 226,711 British Broadcasting Corporation (BBC) articles and their single-sentence summaries. We use the same train-valid-test splits as the authors. A subset (500) of the articles from the test set are annotated, i.e., reference spans are labeled as faithful or not to the source article (Maynez et al., 2020; Zhou et al., 2021). We further process these labels to obtain token-level hallucination labels.
Models. We use the Fairseq framework (Ott et al., 2019) for all of our experiments. We evaluate several models: a transformer-based summarization model (TRANS2S) trained on the XSUM dataset with the standard maximum log-likelihood objective, as well as the BART summarization model (BARTS2S) fine-tuned on XSUM (Lewis et al., 2020). Lastly, for our language model p(y), we train a transformer-based language model.

Decoding. We generate summaries using CPMI and beam search, as well as score existing summaries under the CPMI objective. We perform a hyperparameter search to select the two hyperparameters λ and τ (see Appendix B for details). For evaluations, we would ideally use token-level faithfulness labels. However, we have only 500 such human-annotated reference summaries (Maynez et al., 2020). To obtain labels for the generated text, and for the 10,832 other reference summaries, we turn to automatic factuality detection (Zhou et al., 2021). This allows us to evaluate on metrics specific to the token label, e.g., the average probability of hallucinated tokens.
Evaluation. Lacking a good operationalization of a hallucination, we cannot directly measure the fraction of hallucinated tokens in the generated summaries. In line with previous work, we therefore rely on automatic metrics and human evaluations to estimate the incidence of hallucinations (Nie et al., 2019; Maynez et al., 2020; Zhou et al., 2021). CPMI is evaluated using standard summarization performance metrics (ROUGE and BERTScore), factuality metrics (FactCC and FACTScore), and an estimation of hallucination incidence based on scoring reference summaries with associated human-evaluated hallucination labels. The FactCC metric computes a token-level binary factuality label over a collection of source/summary pairs and returns the mean (Pagnoni et al., 2021). It uses the binary entailment classifier of the same name (Kryscinski et al., 2020). Thus we can produce a similar entailment metric using the factuality labeling generated by Zhou et al. (2021), which we denote FACTScore.

Initial Analysis
Using the 500 faithfulness annotations of a subset of XSUM summaries (Maynez et al., 2020; Zhou et al., 2021), we are able to determine whether a token is hallucinated, and further whether it is the first in a sequence of hallucinated tokens. Our preliminary investigations found that, on average, the conditional entropy under our summarization model is higher for first hallucinated tokens than for non-hallucinated tokens (4.1972 ± 0.0648 vs. 3.6893 ± 0.0209 for TRANS2S and 3.1147 ± 0.0514 vs. 2.3898 ± 0.0131 for BARTS2S, where the ± value is the standard error). This suggests that hallucinations could be connected with model uncertainty, and that the start of a hallucination could be identified when conditional entropy rises above a certain threshold, at which point the model defaults to a likely but perhaps unfaithful token.
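The ± values above are standard errors of the mean. For reference, a minimal sketch of that computation (our own illustration, not the authors' analysis code):

```python
import math

def mean_and_stderr(values):
    """Mean and standard error of the mean for a list of entropy
    values, as used for the group comparisons reported above."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with Bessel's correction (n - 1 denominator).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)
```

Applied separately to the entropies of first-hallucinated and non-hallucinated tokens, this yields the mean ± standard error pairs compared in the text.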

Results
We now perform our generation and scoring analyses, as outlined above.
How are performance and factuality metrics impacted by CPMI? From Table 1 we see that for the BARTS2S model there is very little change to performance metrics. For the TRANS2S model, performance metrics are slightly worse under CPMI, but the largest change is still within the margin of error. This suggests CPMI does not negatively impact the quality of generated sentences. While FACTScore increases under CPMI for both models, FactCC decreases. It is unclear from these factuality metrics alone what effect our approach has on the incidence of hallucinations during generation for both models. Examples of summaries generated with and without CPMI are given in Figure 1.
What happens to known unfaithful tokens when scored under CPMI? Table 2 shows how token-level score and ranking (where the highest-probability token is rank 1 and the lowest-probability token is rank |V|) change when CPMI is used instead of the standard log-probability scoring function for the 500 ground-truth summaries with human-evaluated factuality labels. Overall, we see that for hallucinated tokens, scores decrease and rankings increase, which is the desired behavior. This is particularly true for tokens at the start of an unfaithful span (denoted as Initial), for which we see a more significant impact for both models. E.g., for BARTS2S, the score decreases more for initial vs. non-hallucinated tokens (−0.13 ± 0.03 vs. −0.07 ± 0.01) and rankings likewise increase more (275 ± 114 vs. 552 ± 134). While there is also an impact on non-hallucinated tokens, it is much less significant.
For an appropriate choice of threshold, however, the PMI objective is likely not active at time steps where non-hallucinated tokens would have been chosen, meaning this change should not be a concern.

Conclusion
In this work, we link the start of a hallucination in abstractive summarization during generation to model uncertainty, as quantified by high conditional entropy, about next-token predictions.
We then propose a decoding procedure, CPMI, which switches the decoding objective to pointwise mutual information when model uncertainty is high to prevent hallucinations. Our method reduces the likelihood of generating unfaithful tokens while still outputting high-quality summaries.
In the future, it would be interesting to combine CPMI decoding with post-hoc correction methods and other modified decoding procedures, to investigate whether we can complement the existing techniques mentioned in §4.

Limitations
A clear limitation of this work is that the results have been shown only for English on the XSUM dataset, as this is the only open dataset with the annotations required for our set of experiments. Further work should consider other model architectures and other datasets, such as CNN/DM (Hermann et al., 2015). Further, we do not conduct human evaluations. Using human judges to obtain a qualitative assessment of the effect of CPMI could provide additional data about the efficacy of the decoding procedure. However, we note that human judgment of the faithfulness of summaries is far from perfect (Clark et al., 2021).
There are issues with the XSUM dataset that may confound results: some articles/summaries are in Gaelic, and previous work has shown that reference summaries often contain spans not directly inferable from the source article (Maynez et al., 2020). A limitation of the models themselves is that we truncate sources to 4096 tokens, so some information is lost to this training constraint.

Ethical Concerns
We do not foresee any ethical concerns with this work beyond those already documented for abstractive summarization systems and other text generators (Smiley et al., 2017; Zellers et al., 2019; Kreps et al., 2022).

REFERENCE SUMMARY: A drunk man who was driving his car at 119mph when he crashed into and killed an off-duty police community support officer (PCSO) has been jailed.
DOCUMENT: Alwyn Pritchard, 53, was riding his motorbike when he was struck by an Audi driven by Paul Wilson, who then fled the scene, Cardiff Crown Court heard.
[2 sentences with 41 words are abbreviated from here.] Wilson, an experienced HGV driver, admitted drinking "a couple of pints of Peroni and two bottles of Corona" but claimed he had been driving at 70mph on the Heads of the Valleys road near Abergavenny.
[12 sentences with 232 words are abbreviated from here.] Gwent Police Chief Constable Jeff Farrar described him as "a committed, kind and conscientious community support officer".
TRANS2S: A driver has been jailed for four years for causing the death of a man by careless driving.
TRANS2S WITH CPMI: A driver who crashed into a car which killed a couple has been jailed for seven years.
BARTS2S: A man has been jailed for causing the death of a police community support officer by dangerous driving in Monmouthshire.
BARTS2S WITH CPMI: A drink-driver has been jailed for causing the death of a police community support officer in Monmouthshire.
Figure 1: An abridged example from the XSUM dataset and the generated summaries under TRANS2S and BARTS2S, with and without CPMI decoding.

A Additional Results
Table 3 contains the results of evaluating beam search and CPMI on the ground-truth token labels that were processed to generate Table 2. Table 4 contains the full results of our preliminary analysis correlating the average conditional entropy of the summarization model with token labels. Algorithm 1 provides the standard beam search algorithm.
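Standard beam search, as referenced by Algorithm 1, can be sketched as follows. This is a minimal illustration rather than the Fairseq implementation; the `next_token_log_probs` interface and toy token strings are our own assumptions:

```python
def beam_search(next_token_log_probs, beam_size=5, max_len=20,
                bos="BOS", eos="EOS"):
    """Minimal sketch of standard beam search.

    next_token_log_probs(prefix) is assumed to return a dict mapping
    each candidate next token to its conditional log-probability given
    the prefix (the source document is assumed to be baked in).
    """
    beams = [([bos], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_token_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep completed hypotheses aside; refill up to beam_size live beams.
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```

The same skeleton accommodates CPMI decoding by replacing the per-token log-probability `lp` with the CPMI score when the step's conditional entropy exceeds τ.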

B Implementation Details
We train all models using the Fairseq framework (Ott et al., 2019).The code will be released upon acceptance.
Preprocessing. We tokenize the data with Moses (Koehn et al., 2007). For TRANS2S we learn and apply BPE using FastBPE (Sennrich et al., 2016), whereas for BARTS2S we follow the provided BPE preprocessing steps. We then binarize the resulting data using the fairseq-preprocess CLI tool from Fairseq.
General training and generation. We train on a single GPU with 4 CPU cores, each with 2048 MB of memory. The average runtime for training depends on the model, but was between 24 and 48 hours. We stop training early if validation performance does not improve for 5 consecutive runs. We use a maximum length of 4096 tokens and truncate longer sources. We use beam search with a beam size of 5; the same beam size is used for CPMI.
TRANS2S. We use the fairseq transformer model with parameters selected according to the transformer-wmt-en-de model. We picked the parameter update frequency to be the maximum value that did not cause out-of-memory errors: 64. We then did a grid search over dropout in [0.1, 0.3] and learning rate in [7 × 10^−4, 7 × 10^−5]. The optimal values were a dropout of 0.3 and a learning rate of 7 × 10^−5, with a validation loss of 6.225.
Language model. As the BPE step differed between TRANS2S and BARTS2S, we trained two language models, denoted by the associated summarization model name. The architecture is the fairseq transformer-lm model. The early stopping criterion was 5 runs, and the maximum length was 2048 tokens. As before, we picked the update frequency to be as large as possible without making training too slow; this was 32. We compared training sets of targets only versus both sources and targets. We then did a grid search over learning rate in [1 × 10^−4, 2.5 × 10^−4, 5 × 10^−4]. The optimal configuration was to train on both sources and targets, with a learning rate of 1 × 10^−4 for TRANS2S and 5 × 10^−4 for BARTS2S. The optimal validation metrics were a loss and perplexity of 5.6404 and 49.88, respectively, for TRANS2S and 4.5453 and 23.35, respectively, for BARTS2S.
CPMI hyperparameter search. We select the two hyperparameters λ and τ, controlling the influence of the language model and the conditional entropy threshold that triggers PMI decoding, respectively. The goal is a min-max optimization: we minimize the average log-probability (scored under the PMI objective) of initial hallucinated tokens, based on human evaluations of the target sentences, while maximizing the ROUGE-L score of generated sentences. We use the 500-example subset of XSUM with factuality annotations, which is a subset of the XSUM test set (Maynez et al., 2020). To perform the optimization, we generate a heat plot with λ and τ on the x and y axes; the z axis is a weighted combination of ROUGE score and negative log-probability, weighted to contribute in roughly a 3:1 ratio, respectively, to the z value. We then determine the optimal parameters to be the ones that maximize this metric.
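The selection step can be sketched as follows. The data structures and the exact realization of the roughly 3:1 weighting are assumptions based on the description above, not the authors' code:

```python
def select_hyperparameters(candidates, weight=3.0):
    """Pick the (lambda, tau) pair maximizing a weighted trade-off.

    candidates maps (lam, tau) -> (rouge_l, halluc_log_prob), where
    halluc_log_prob is the average PMI-scored log-probability of
    initial hallucinated tokens (lower, i.e., more negative, is better).
    The weight of 3.0 is an assumed stand-in for the ~3:1 ROUGE vs.
    log-probability contribution described above.
    """
    def objective(scores):
        rouge_l, halluc_lp = scores
        # Reward ROUGE-L; reward pushing hallucinated tokens' log-prob down.
        return weight * rouge_l - halluc_lp
    return max(candidates, key=lambda k: objective(candidates[k]))
```

In this sketch, the heat plot's z value corresponds to `objective`, and the argmax over the grid is the selected (λ, τ) pair.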
There were two evaluation runs: first with λ ∈ {2 × 10^−1, 2 × 10^−2, . . .} and τ selected from a uniform distribution about the average conditional entropy of the initial hallucinated tokens ± the standard deviation (see Table 4 for these values). The second run selected a smaller region that looked promising and then sampled 10 λ and τ values uniformly at random, for a total of 100 possible parameter pairs. The optimal values were λ = 1.3120 × 10^−1, τ = 3.5618 for TRANS2S and λ = 6.5602 × 10^−1, τ = 3.5987 for BARTS2S. The plots in Figure 2 show the results of this second run, used to select the optimal parameters.

Automatic hallucination detection. We mention in the paper that we use automatic factuality detection in order to obtain measures such as FACTScore. For this we use the code provided by Zhou et al. (2021).

Table 2: Change in average token score and ranking by ground-truth hallucination label for CPMI compared to beam search.