Faithfulness-Aware Decoding Strategies for Abstractive Summarization

Despite significant progress in understanding and improving faithfulness in abstractive summarization, the question of how decoding strategies affect faithfulness is less studied. We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization. We find a consistent trend where beam search with large beam sizes produces the most faithful summaries while nucleus sampling generates the least faithful ones. We propose two faithfulness-aware generation methods to further improve faithfulness over current generation techniques: (1) ranking candidates generated by beam search using automatic faithfulness metrics and (2) incorporating lookahead heuristics that produce a faithfulness score on the future summary. We show that both generation methods significantly improve faithfulness across two datasets as evaluated by four automatic faithfulness metrics and human evaluation. To reduce computational cost, we demonstrate a simple distillation approach that allows the model to generate faithful summaries with just greedy decoding.


Introduction
Recent developments in large pre-trained language models have achieved remarkable performance on abstractive summarization (Lewis et al., 2020; Zhang et al., 2020a). However, such models often suffer from the problem of hallucinations, where the generated summary contains facts or entities not present in the original document. Prior research has analyzed and defined potential error types and typologies (Maynez et al., 2020; Pagnoni et al., 2021; van der Poel et al., 2022), and developed methods to improve faithfulness, including post-processing models (Chen et al., 2021b; Dong et al., 2020; Liu and Liu, 2021; Ladhak et al., 2022) and faithfulness-aware training (Goyal and Durrett, 2021; Nan et al., 2021; Cao and Wang, 2021; Wan and Bansal, 2022; Zhang et al., 2022; Xiao and Carenini, 2022).
One less understood aspect of faithfulness in abstractive summarization is the effect of decoding strategies, which determine how the model generates the output strings. Our primary objective is to understand whether different types of exploration of the search space, such as traversing and maintaining multiple possible output hypotheses with beam search or encouraging diversity with nucleus sampling (Holtzman et al., 2020), have an impact on faithfulness. To this end, we first conduct a thorough analysis comparing the faithfulness of popular decoding strategies, including greedy decoding, beam search, and nucleus sampling, on two popular summarization datasets, XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015). Evaluating the generated summaries using four faithfulness metrics, BERTScore (Zhang et al., 2020b), FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2021), and QuestEval (Scialom et al., 2021), as well as human evaluation, we find a consistent trend that beam search provides the most faithful summaries with its large exploration of the search space, while the randomness introduced by sampling hurts faithfulness.
To further improve faithfulness beyond the common decoding strategies, we propose two faithfulness-aware decoding methods. First, similar to Falke et al. (2019), we make use of the multiple candidates generated by beam search and propose a simple re-ranker, which selects the best summary according to a faithfulness metric. Instead of using a specific metric, we rank and select the summaries with a composite metric, a weighted combination of popular faithfulness metrics.

Figure 1: Illustration of our proposed decoding methods. 1a shows our ranker that re-ranks the candidates produced by beam search according to faithfulness metrics. The first summary achieves a high score and would be used as the final summary for beam search, but it is not faithful. Our ranker ensures that the more faithful summary is ranked higher. 1b shows the lookahead heuristics that provide a faithfulness score given the full future summary. The model assigns a higher score to the word "World" than "British". However, by looking ahead we know that the completed summary following the most likely token will result in an unfaithful summary. Hence, the lookahead heuristics will ensure selecting the token "British" so that the resulting summary will be faithful.

Next, inspired by Lu et al. (2022), we propose a faithfulness heuristic that looks into the future: starting from the current tokens of any partially generated summary, it generates a full summary so as to provide a faithfulness score of the future summary during generation. The added heuristic ensures that the selected tokens will lead to a more faithful path in the search space. Compared to the baseline decoding strategies we analyzed, the two proposed methods significantly improve faithfulness as evaluated by four automatic faithfulness metrics and further confirmed by human evaluation.
Finally, to overcome the computational and runtime overhead of our proposed decoding methods, we explore distillation to transfer the knowledge of generating faithful summaries from a teacher model to a student model. Specifically, we use the faithfulness-aware decoding strategies as the teacher model to generate reference summaries. Then, we train student models, which have not been fine-tuned on the original task, to imitate the more faithful generation techniques using an additional cross-entropy loss between the summaries generated by the student and teacher models. Results indicate that the student model is able to generate summaries of similar faithfulness to those of the full teacher model while reducing the decoding time (seconds per example) to as little as 1/6 of what the teacher model takes. This process can be performed iteratively by using the student model as the teacher for the next iteration (see Figure 2). With each iteration, the new student model is able to generate more faithful summaries, and it outperforms the original teacher model with just two iterations.
To summarize, our contributions are:
1. An analysis of the effect of popular decoding strategies, including greedy, beam, and nucleus sampling, on the faithfulness of abstractive summarization.
2. Two faithfulness-aware generation methods, ranking and lookahead, that improve faithfulness over existing decoding strategies.
3. A simple distillation approach that allows a student model to generate faithful summaries with just greedy decoding.

Faithfulness Behavior of Popular Decoding Strategies
We first describe our experiment investigating the effect of popular decoding strategies on faithfulness. We wish to primarily investigate whether better exploration of the search space, such as the candidate expansion with beam search, can improve faithfulness, and how the randomness introduced through sampling impacts faithfulness. These investigations in turn motivate our more advanced, faithfulness-aware decoding strategies in Section 3.
Decoding Strategies (Greedy, Beam, and Nucleus Sampling). For generation, we assume the common left-to-right, auto-regressive setting where the model generates a summary y with n tokens given the input document x:

p(y | x) = ∏_{t=1}^{n} p(y_t | y_{1:t−1}, x)

The summary tokens are selected with probability according to the decoding strategy. We explore three common decoding strategies: greedy search, beam search, and nucleus sampling (Holtzman et al., 2020). Greedy search selects the most probable next token, y_t = argmax_y p(y | y_{1:t−1}, x).
Beam search extends greedy search by keeping the top-k hypotheses at each time step, where k is the number of beams. Another approach to decoding is to use sampling; we consider nucleus sampling. Holtzman et al. (2020) surprisingly find that methods that optimize probability, such as beam search, may lead to text degeneration, and thus propose nucleus sampling, a method that randomly samples from the top tokens whose cumulative probability satisfies the threshold p. A small p means less randomness and approaches greedy search, while a large p allows for more diverse output.
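To make the token-selection step concrete, the sketch below implements greedy search and nucleus sampling over a toy next-token probability vector (beam search additionally keeps the k best partial hypotheses at each step). The function names and the toy distribution are illustrative assumptions, not part of the paper's implementation.

```python
import random

def greedy_next(probs):
    """Greedy search: pick the index of the most probable next token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def nucleus_next(probs, p=0.9, rng=random):
    """Nucleus sampling: sample from the smallest set of top tokens whose
    cumulative probability mass reaches the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize over the nucleus and sample from it.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

With a very small p, the nucleus collapses to the single most probable token and the method behaves like greedy search; with a large p, low-probability tokens can be sampled, which is the randomness that the analysis above finds hurts faithfulness.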

Faithfulness-Aware Decoding Strategies
We hypothesize (and later test and confirm this in Section 6.1 and Appendix E) that current decoding methods, such as beam search, which explores a large space, may not explore the paths that focus on faithfulness directly and effectively. Hence, we propose two faithfulness-aware methods that can be applied on top of the base decoding strategies to modify how the space is explored from two different perspectives: (1) Ranking makes use of the large exploration of beam search and picks the explored path that is most faithful; (2) Lookahead directly guides the search process by adding faithfulness heuristics when selecting the next token, starting from the initial decoding process.

Ranking with Faithfulness Metrics
Since beam search already explores many different suitable candidates during the decoding process, we hypothesize that more faithful summaries exist in the list of possible candidates, even if the model score is not directly optimized towards faithfulness (we show that this is true later in Section 6.1). Thus, we propose to rerank the generated candidates from beam search according to faithfulness metrics. The process is illustrated in Figure 1a. Assuming a beam search with beam size k, we have k summaries generated by the decoding method. We compute a faithfulness metric (details of the metrics are presented in Section 5.2) over all summaries and select the summary that achieves the highest faithfulness score. In the example, the more faithful summary that was originally ranked low according to model score is now ranked as the top summary according to faithfulness.
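The selection step itself is simple; the sketch below pairs it with a toy token-overlap proxy standing in for a real reference-free faithfulness metric such as FactCC or DAE. Both function names and the proxy metric are illustrative assumptions.

```python
def rerank_by_faithfulness(candidates, faithfulness_score):
    """Return the beam candidate with the highest faithfulness score."""
    return max(candidates, key=faithfulness_score)

def token_overlap_score(document):
    """Toy stand-in for a reference-free faithfulness metric: the fraction
    of summary tokens that also appear in the source document."""
    doc_tokens = set(document.lower().split())
    def score(summary):
        tokens = summary.lower().split()
        return sum(t in doc_tokens for t in tokens) / max(len(tokens), 1)
    return score
```

For example, given a document about British troops, a candidate whose tokens are all grounded in the document would be selected over one introducing unsupported content, regardless of the candidates' original beam order.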
Re-ranking candidates for abstractive summarization has been studied primarily from the informativeness perspective (Ravaut et al., 2022a,b), whereas our focus is on improving faithfulness. Our idea is most similar to Falke et al. (2019), where the authors use NLI models to re-rank. However, their results indicate that NLI performance does not translate to improvement in faithfulness: their best-ranking model actually increases the proportion of unfaithful top-ranked summaries by 3% after re-ranking. The authors attribute this to domain shift and to NLI models relying on simple heuristics like lexical matching. We thus explore using faithfulness metrics directly for ranking.
Composite Metric. While it is possible to use one of the faithfulness metrics to rank the candidates, doing so often leads to overfitting to that particular metric (each metric can have its own domain biases and idiosyncrasies) and hurts the overall faithfulness scores evaluated by the other metrics. We instead tune a composite metric that aggregates the votes of several popular metrics (see Section 5.4). We use linear regression to provide weights for each metric, tuned on human judgments of faithfulness. We refer the readers to Appendix D and Appendix E for details and ablations of the composite metric.

Lookahead with Faithfulness Heuristics

Lu et al. (2022) use lookahead to provide a future constraint satisfaction estimate and show its effectiveness in several constrained generation tasks (commonsense generation, constrained machine translation, table-to-text generation, and constrained question generation). We extend this idea to improve the faithfulness of abstractive summarization. Instead of relying on explicit constraints that are available for the constrained generation tasks, we use reference-free faithfulness metrics on the full future summaries as an estimate. Unlike reranking, which is constrained by the search space explored by beam search, lookahead allows for exploration of a much larger number of candidates. Figure 1b shows an example of the lookahead. When selecting the next token, the usual decoding scheme would select the word "World", which has the highest probability. However, if we were to follow this path, the resulting summary would introduce hallucinations. Instead, we would like to guide the model to select the less probable token "British," which will yield a faithful summary.

Formally, each summary token is selected by:

y_t = argmax_{y_t} [ log P(y_{1:t} | x) + w · max_{y′ ∈ L_{y≤t}} h(y′) ]

where log P(y_{1:t} | x) is the model score, h(·) is a reference-free faithfulness evaluation function that assigns a score to a summary, w is the weight of the heuristic, and l is the number of tokens to look into the future.
Here, L_{y≤t} is a set of possible generated summaries that start with the summary tokens y_{1:t}. The number of summaries in L varies with the decoding strategy used to generate the future summaries: greedy search and sampling produce a single expansion, while beam search produces k summaries, where k is the beam size. Although the lookahead length l can be specified, we instead generate the full summary, as current faithfulness metrics expect full summaries as input and do not work well on partial summaries (see Appendix E).
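The selection rule above can be sketched as follows: each candidate next token is scored by its model log-probability plus a weighted faithfulness score on a greedily completed ("looked-ahead") summary. The `complete` rollout, the heuristic `h`, and the toy distribution are all stand-ins for the real model and metric.

```python
import math

def lookahead_select(partial, next_token_probs, complete, h, w=5.0):
    """Select the next token by combining the model score with a
    faithfulness score h on a completed future summary.
    `complete(prefix)` rolls out a full summary from the prefix;
    both callables are placeholders for the real model and metric."""
    best_tok, best_score = None, -math.inf
    for tok, prob in next_token_probs.items():
        future = complete(partial + [tok])
        score = math.log(prob) + w * h(future)
        if score > best_score:
            best_tok, best_score = tok, score
    return best_tok
```

In the Figure 1b example, "World" has the higher model probability, but the rollout starting with "British" scores higher on the faithfulness heuristic, so the combined score selects "British".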

Combining Ranking and Lookahead
We can combine the two methods to further improve faithfulness. We first use BEAM+LOOKAHEAD to generate faithful beam candidates and then select the best candidate with ranking.
We refer to this method as BEAM+LOOKAHEAD+RANKING.

Efficient Decoding via Distillation
One drawback of the proposed decoding methods is the heavy computational cost during decoding. We thus explore using distillation to transfer the knowledge of faithfulness-aware decoding to a student model that can generate summaries of similar faithfulness with just greedy decoding. We note here that our distillation aims at improving the decoding time rather than downsizing the model. Similar to Kim and Rush (2016), we assume that we have a teacher model and a student model. In our setting, the teacher model does not necessarily need to be a different model, but it needs to decode with more faithfulness-aware methods. Typical distillation methods use the teacher's probability distribution (Kim and Rush, 2016) as the target for the student model to imitate. In our case, however, that distribution is the same for all methods; the difference lies in how the probabilities are used to generate the next tokens. Thus, we propose a new decoding distillation loss. We use the teacher model to generate summaries y_gen as additional reference summaries, and interpolate between the cross-entropy loss using the original reference summaries and the cross-entropy loss where we consider y_gen as the reference summaries. Formally, the training loss is:

L = (1 − λ) · L_XE(y′, y) + λ · L_XE(y′, y_gen)

where L_XE is the cross-entropy loss, y′ is the summary generated by the student model, y is the original reference summary, and λ is a hyperparameter for the weight of the cross-entropy loss on the generated summaries.
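A minimal sketch of the interpolated loss follows, with `xe` standing in for any token-level cross-entropy function; the convex-combination form mirrors the equation above but is an assumption about the exact weighting.

```python
def decoding_distillation_loss(xe, y_student, y_ref, y_gen, lam=0.5):
    """Interpolate the cross-entropy against the original reference y_ref
    with the cross-entropy against the teacher-decoded summary y_gen.
    `xe` is a placeholder for a token-level cross-entropy function."""
    return (1.0 - lam) * xe(y_student, y_ref) + lam * xe(y_student, y_gen)
```

Setting lam to 0 recovers ordinary fine-tuning on the references, while lam close to 1 pushes the student to imitate only the faithfulness-aware teacher decodings.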
Iterative Distillation. While we use the student model with just greedy decoding to improve decoding speed, the student model can also benefit from using our proposed faithfulness-aware decoding methods. Thus, the student model can also serve as a new teacher model to distill more faithfulness knowledge into a new student model. The distillation process thus becomes iterative, as illustrated in Figure 2. We use the trained student model as a new teacher model, decoding with our proposed faithfulness methods to create additional reference summaries y_gen for the next iteration.
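The iterative scheme can be summarized as a short loop: at each round, the current model acts as the teacher, decoding the training documents with a faithfulness-aware strategy to produce extra references, and a freshly trained student becomes the next teacher. All callables below are placeholders for the real training pipeline.

```python
def iterative_distillation(train_docs, train_student, faithful_decode, model, n_iters=2):
    """Sketch of iterative distillation. `faithful_decode(model, doc)`
    stands in for faithfulness-aware decoding (e.g., BEAM+RANKING) and
    `train_student(docs, y_gen)` for training with the interpolated loss."""
    for _ in range(n_iters):
        # Current model is the teacher: create additional references y_gen.
        y_gen = [faithful_decode(model, doc) for doc in train_docs]
        # The newly trained student becomes the teacher for the next round.
        model = train_student(train_docs, y_gen)
    return model
```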

Datasets and Models
We perform experiments on two popular datasets for abstractive summarization, XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015). More details on the datasets are described in Appendix A.1. We use the released checkpoint of BART-large (406M) for the two datasets. The same experiment is done with PEGASUS (Zhang et al., 2020a), which is presented in Appendix B.

Evaluation Metrics
We use the F1 measure of ROUGE-L (Lin, 2004, RL), i.e., the overlap of the longest common subsequence between a generated summary and the reference summary, and the F1 measure of BERTScore (Zhang et al., 2020b, BS) to evaluate summary quality. In addition, we use BS-Fact, i.e., the BERTScore precision of a summary with respect to its source document rather than the reference summary, FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2021), and QuestEval (Scialom et al., 2021) for faithfulness evaluation. Details of the metrics are presented in Appendix A.

Human Evaluation Setup
We use Amazon Mechanical Turk (AMT) to ask human annotators to judge the faithfulness and informativeness of the summaries generated with different decoding strategies.
Faithfulness. We ask workers to judge the faithfulness of a summary sentence using a 3-star rating (1=major factual error, 2=minor factual error, 3=no factual error). Three judgments per summary are then aggregated using majority voting. We randomly select 200 examples from both datasets and use the summaries generated using greedy, sampling, and beam search, as well as the ranking and lookahead strategies applied to beam search. We report the percentage of summaries that are fully factual (i.e., the percentage of summaries rated as 3 stars) as the faithfulness score, and also report the distribution of summaries rated as 1, 2, and 3 stars. Details on qualification, payment, and other aspects of the evaluation can be found in Appendix A.4.
Informativeness. We also evaluate the generated summaries in terms of informativeness. We consider a summary to be informative if its content is important and relevant; it does not necessarily need to be long. We use best-worst scaling (BWS) for evaluating the informativeness of the generated summaries, as this method is "a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales" (Kiritchenko and Mohammad, 2017). Accordingly, for each dataset, we select 200 random articles with the corresponding summaries from the five systems in random order. We ask three annotators to select the most informative ("best") and the least informative ("worst") among the five. A rating per system is computed as the percentage of times it is chosen as best minus the percentage of times it is selected as worst. A value of 100 means that the system has been unanimously picked as "best", whereas a value of -100 means that the system has been unanimously picked as "worst". Additional details, as well as a screenshot of the annotation interface, are in Appendix A.4.
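The BWS rating described above reduces to a simple formula; the sketch below computes it from counts (the function name is ours):

```python
def bws_rating(times_best, times_worst, total_judgments):
    """Best-worst scaling score: percent chosen best minus percent chosen
    worst, ranging from -100 (always worst) to 100 (always best)."""
    return 100.0 * (times_best - times_worst) / total_judgments
```

For example, a system picked as best 30 times and worst 10 times out of 100 judgments receives a rating of 20.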

Decoding Setting Details
We describe the settings of the basic decoding methods, our faithfulness-aware decoding methods, and distillation. More details are in Appendix A.3.
Basic Decoding Methods. We compare the summaries generated using greedy search, beam search (k = 10), and nucleus sampling (p = 0.9). Additional experiments with various beam sizes and top-p values can be found in Appendix B.
Ranking and Composite Metric. We use beam search (k = 10) and rank the candidates using the composite metric introduced in Section 3.1. To train the composite metric, we explore combining FactCC, BS-Fact, DAE, and QuestEval. We use FACTCOLLECT (Ribeiro et al., 2022), a large collection of faithfulness annotations, to train a linear regression on the human-labeled faithfulness judgments. More details on the composite metric and its robustness to another domain can be found in Appendix D.
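The composite metric's training step can be sketched with ordinary least squares on toy data; the real weights are fit on FACTCOLLECT annotations, and the helper names here are ours.

```python
import numpy as np

def fit_composite_weights(metric_scores, human_labels):
    """Fit per-metric weights (plus an intercept) that map individual
    faithfulness-metric scores to human judgments via linear regression."""
    X = np.hstack([np.asarray(metric_scores, dtype=float),
                   np.ones((len(metric_scores), 1))])
    weights, *_ = np.linalg.lstsq(X, np.asarray(human_labels, dtype=float), rcond=None)
    return weights

def composite_score(weights, scores):
    """Weighted sum of metric scores plus the intercept."""
    return float(np.dot(weights[:-1], scores) + weights[-1])
```

At ranking time, each beam candidate's individual metric scores are combined with the learned weights, and the candidate with the highest composite score is selected.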
Lookahead. We use BS-Fact as the faithfulness metric for the lookahead, as it correlates highly with human judgment (Pagnoni et al., 2021) and is quick to compute without the need for additional pre-processing. We use greedy search to generate the future summaries and apply the lookahead on top of both greedy and beam search.
Distillation. We use the checkpoint of our two proposed faithfulness-aware decoding methods as the teacher model, and train the student model from BART-LARGE. We follow the original fine-tuning hyperparameters provided by the authors (Lewis et al., 2020).

Results
Baseline Decoding Results
We first show the analysis of the common decoding strategies. Inspired by prior work (2022) that hinted at the potential of better faithfulness with a large exploration of the search space, we use beam search to explore whether larger beam sizes (and hence larger exploration) yield more faithful summaries. To this end, we use all summaries generated by beam search and select the beam that would result in the highest possible score for each metric. We show the maximum score (Max) for the four faithfulness metrics and the faithfulness score of selecting the top beam (Top) given different beam sizes in Figure 3. We see a clear trend that increasing the beam size improves all faithfulness scores. This confirms our hypothesis that a larger exploration of the search space can provide additional faithfulness gains, showing the potential of our proposed decoding strategies, especially our reranking strategy, to output more faithful summaries. The faithfulness scores of Top only increase marginally compared to the increase for Max, showing the importance of better faithfulness guidance, such as our proposed faithfulness lookahead heuristics.

Faithfulness-Aware Decoding Results
We now show the impact of the faithfulness-aware methods compared with the traditional decoding methods. Ranking improves faithfulness over the beam search baseline, and we observe similar improvement for lookahead, where applying the lookahead improves faithfulness over the base decoding strategy on all faithfulness metrics. Nevertheless, the base decoding strategy is still the dominating factor, as BEAM+LOOKAHEAD generates more faithful summaries than GREEDY+LOOKAHEAD for all faithfulness metrics. GREEDY+LOOKAHEAD outperforms beam search on the XSum dataset, showing that better guidance with future faithfulness heuristics can improve faithfulness without large exploration. Finally, the combination of lookahead and ranking can further improve faithfulness as evaluated by FactCC, DAE, and QuestEval.
In terms of ROUGE score, applying faithful decoding methods decreases RL. This tradeoff between faithfulness and ROUGE has been observed in many prior works (Chen et al., 2021b; Kryscinski et al., 2020; Wan and Bansal, 2022). One reason for this phenomenon is that more than 70% of the reference summaries contain hallucinations (Maynez et al., 2020), so more faithful summaries that do not contain such hallucinations will have lower ROUGE scores. To investigate this problem, we perform a human evaluation study, in which we find that the summaries generated by BEAM+LOOKAHEAD are considered the most informative. More details are in Appendix A.4.

Human Evaluation Results
Faithfulness. The observations on automatic faithfulness metrics align with the results of the human evaluation in Table 3. For XSum, among the baseline decoding methods, we see that sampling performs the worst. Interestingly, greedy is more faithful than beam search, but the difference is only 1.5 points. Our proposed decoding strategies generate summaries that are judged more faithful than those of the baseline decoding strategies. Specifically, BEAM+LOOKAHEAD reaches 56.5, outperforming even BEAM+RANKING by 5 points. We also observe that our proposed methods are able to significantly reduce the percentage of summaries that are considered to contain major factual errors: compared to beam search, ranking reduces the percentage from 44.5 to 36.5, and lookahead further reduces it by 3 points. For CNN/DM, we see the striking result that the summaries generated by our proposed methods achieve the highest faithfulness, and between the two systems, there are no major errors for BEAM+LOOKAHEAD.
Informativeness. The result is shown in Table 4. The output of BEAM+LOOKAHEAD is clearly seen as the most informative among the five methods. This result suggests that ROUGE-L and BERTScore may not be good indicators of informativeness, as BEAM+LOOKAHEAD achieves the lowest scores for the two automatic metrics on both datasets.

Table 3: Human evaluation results on faithfulness with the 3-star rating system (1=major factual error, 2=minor factual error, 3=no factual error). Our proposed faithfulness-aware methods are judged as the most faithful (the percentage of summaries rated as 3), confirming our observation with automatic faithfulness metrics.

Abstractiveness
Models can "trivially" become more faithful by becoming more extractive (Dreyer et al., 2023), and thus it is important to understand where the gain in faithfulness stems from. We experiment on XSum, as methods can achieve larger improvements in faithfulness there and thus potentially more gain through extractiveness. We use the 200 examples from the human evaluation, calculate MINT (Dreyer et al., 2023) for abstractiveness, and plot this score against the human-labeled faithfulness, similar to Ladhak et al. (2022). The result is shown in Figure 4. Similar to the observation of Dreyer et al. (2023), more faithful models tend to be more extractive; however, the gain in faithfulness is considerably larger than the decrease in abstractiveness. For example, comparing BEAM+LOOKAHEAD with beam search, the relative increase in faithfulness (29.89%) is quadruple the decrease in abstractiveness (7.27%). Similar experiments on CNN/DM are in Appendix F.
Lookahead with Faithfulness and Abstractiveness. We further show that our lookahead method can easily incorporate additional heuristics, such as balancing both faithfulness and abstractiveness. Specifically, we replace h(·) with a combination of BS-Fact and MINT:

h(·) = α · BS-Fact + (1 − α) · MINT

We use α = 0.75 and the same hyper-parameters as BEAM+LOOKAHEAD. We refer to this model as BEAM+LOOKAHEAD+ABSTR and show its point in Figure 4. Compared to BEAM+LOOKAHEAD, this model can increase abstractiveness at a small cost in faithfulness, demonstrating the flexibility of our lookahead method to incorporate various characteristics for summarization.

Distillation
We present the distillation results in Table 5. While the student models are not able to outperform the teacher models, they approach the teacher models' performance. The student models are also able to generate more faithful summaries than the greedy search baseline, which is trained using only the cross-entropy loss L_XE(y′, y).
The main benefit of the student model comes from the improved decoding speed. The ranking time reduces from 0.77 seconds per example to 0.47, a 40% improvement. The largest gain can be seen for lookahead, where the decoding time reduces from 3 seconds per example to 0.49, only 1/6 of the time it originally took.
For example, the student model distilled from BEAM+RANKING improves DAE by 6.6 points and QuestEval by a point compared to the greedy search baseline, and differs from the teacher model by only 2.5 points for DAE and 0.5 points for QuestEval. When using a more faithful teacher model, i.e., BEAM+LOOKAHEAD, the student model is able to generate more faithful summaries, as evaluated by BS-Fact, DAE, and QuestEval.

Iterative Distillation. Next, we show the result of distilling BEAM+RANKING iteratively on XSum in Table 6. We see that with each iteration, the model is able to improve faithfulness further. When compared to the original teacher model, BEAM+RANKING, the student model is able to outperform it on all faithfulness metrics with two iterations. We stress that here all models use only greedy decoding, thus showing the potential of combining decoding with training for more faithful models.

Related Work
Many of the works related to our proposed decoding methods have been discussed in Section 3; here we cover other related areas.
Decoding methods. A decoding method for text generation uses an approximate search method to select the best tokens to form a hypothesis. Several works have critically analyzed different decoding strategies for natural language generation, including beam search (Meister et al., 2020a; Stahlberg and Byrne, 2019; Xu et al., 2022; Holtzman et al., 2020), best-first search (Meister et al., 2020b), and lattice decoding (Xu et al., 2022). While these works investigated the effectiveness of decoding methods on generated outputs from the perspective of diversity and repetitiveness, to the best of our knowledge, none of them has explicitly analyzed their performance on faithfulness.
Distillation. Distillation aims at compressing the knowledge from a larger model into a smaller one. A conventional approach uses soft targets, i.e., learning the logits of a teacher model rather than its final predictions (Buciluǎ et al., 2006; Hinton et al., 2015; Kim and Rush, 2016). While this method has been shown to be very effective, it is less applicable to our case, where the underlying distribution for the next probable tokens does not necessarily change (for ranking, we do not modify the model scores at all), and thus it is not useful to learn soft labels. Different from compressing model size, our approach focuses on reducing the computational cost during decoding. Our method is most similar to pseudo-labeling (Shleifer and Rush, 2020), where we use generated summaries as "hard" labels. We do not replace the reference summaries with our generated ones; instead, we use interpolation (Kim and Rush, 2016) to account for both faithfulness and quality.

Table 6: Iterative distillation results using BEAM+RANKING as the teacher decoding method. With two iterations, the student model is able to outperform the original teacher model in terms of faithfulness, and further iterations continuously improve faithfulness.

Conclusion
In this paper, we present a thorough analysis of the effect of decoding strategies on faithfulness for abstractive summarization. We analyze popular decoding strategies, as well as our two newly proposed faithfulness-aware decoding strategies, ranking and lookahead, which further improve faithfulness over the base decoding methods. Finally, we show a simple (and optionally iterative) distillation trick where the training of a student model incorporates the summaries generated with more faithfulness-aware methods, and the student model generates summaries of similar faithfulness with minimal decoding time.
Future experiments could extend similar analysis of faithfulness and factuality beyond summarization and develop a combination of heuristics that also encompasses other aspects and styles.

Limitations
While the decoding strategies with lookahead show improvement in faithfulness, they require a heavy computational overhead, especially when they are coupled with beam search for the base decoding strategy and for generating the future summary.
We provide one solution with our distillation approach to improve decoding speed. Many of the computations during this online process, including the generated future summaries and the faithfulness scores on them, are also later discarded, similar to how candidates are pruned during beam search. We believe an interesting direction might be to cache the already generated future summaries so that decoding can directly reuse a future summary if it is considered a good summary candidate.

Ethical Impact
While our work aims to reduce potential malicious or unintended harmful effects, our methods rely on the use of faithfulness metrics. The inherent problems and biases of such metrics have been under-studied. Our decoding strategies can also be applied with other metrics, even those that could be optimized for malicious intents. Another aspect to consider is the environmental impact of our proposed methods, as they require large amounts of computation. We hope that our distillation can mitigate this problem and that future work can move towards more environmentally friendly approaches while improving faithfulness for safer use of large models.
Table 8: Full results of beam search and nucleus sampling for fine-tuned PEGASUS-LARGE models. We observe the same trend as in Table 7, showing that the faithfulness trend holds for different models.
Human Evaluation on Informativeness. The screenshot of the annotation interface can be seen in Figure 6. To ensure annotation quality, we set up a qualification task of three documents with their associated summaries. A selected pool of workers who had passed previous factuality qualification tests was allowed to take this qualification test, and only workers who passed it were allowed to participate in the evaluation. In addition, we added the same three documents with known answers to the evaluation and observed that workers had 100% accuracy on them. We set the same maximum of 100 HITs per worker per dataset as in the factuality evaluation. The pay was $0.40 plus a $0.10 bonus per HIT. Annotators spent a median time of 112 seconds per HIT, amounting to a pay of $16.07 per hour. For inter-annotator agreement, Krippendorff's alpha (Krippendorff, 1980) is 0.22 for the CNN/DM annotation and 0.32 for the XSum annotation.

B Full Analysis
Table 7 shows the full results. We see the general trend where increasing beam size improves faithfulness, while increasing p for nucleus sampling does not help faithfulness.
We similarly run the experiment on PEGASUS, a 568M-parameter model trained specifically for abstractive summarization, with its respective checkpoints.5 The result is presented in Table 8.

C Lookahead Methods
Table 9 shows the results of combining different decoding strategies for the base decoding strategy and for the lookahead. We experiment with greedy and beam search as the base decoding strategies. For greedy, we experiment with all three decoding strategies for the lookahead. For beam search, we are unable to use sampling or beam search for the lookahead due to the large computational cost. Interestingly, using beam search for the lookahead does not provide additional gains. We suspect that this is because exploring the future with more beams cannot guarantee that the base decoding strategy is able to reach those futures, as it is limited to selecting only the top tokens.

D Composite Metric
As described in Section 3.1, we train the composite metric on FACTCOLLECT and tune it on FRANK (Pagnoni et al., 2021). We use the test set of Pagnoni et al. (2021) for evaluation and the rest for tuning the composite metric. The resulting weights are 0.29, -0.29, 1.97, and 0.94 for FactCC, DAE, BS-Fact, and QuestEval, respectively, and the intercept is -1.91. We additionally compute partial correlations on FRANK, shown in Table 10. We see that the composite metric further increases the correlations in all settings except for XSum's Spearman correlation.
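With those learned weights, the composite reduces to a simple linear combination. The sketch below assumes the four metric scores are fed in on whatever scale the regression was fit on (the exact input scaling is not specified here):

```python
def composite_score(fc, dae, bs_fact, qe):
    """Linear composite of the four faithfulness metrics using the weights
    reported in Appendix D. Note the negative weight on DAE, consistent
    with DAE measuring an error rate."""
    return -1.91 + 0.29 * fc - 0.29 * dae + 1.97 * bs_fact + 0.94 * qe
```

For example, raising BS-Fact moves the composite nearly seven times as much as raising FactCC by the same amount, reflecting its larger learned weight.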
Ablations on the effect of ranking with a single metric are presented in Appendix E.
Since FACTCOLLECT only contains annotations on XSum and CNN/DM, we analyze whether the composite metric is robust on another dataset and domain. We use WikiHow (Koupaee and Wang, 2018) and decode using PEGASUS6 with greedy and beam decoding. The result of applying ranking to the beam output can be seen in Table 11. We see consistent gains in all faithfulness metrics when we apply ranking, demonstrating that ranking robustly improves faithfulness in another domain.

Table 12: Lookahead ablation with different lengths. l = 0 provides the faithfulness heuristic score only on the partially generated summaries, while l = full is our lookahead model that evaluates the full future summary. The faithfulness score calculated on partial summaries does not provide an effective estimate for improving the faithfulness of the generated summary.

E Ablations
We present several ablation studies for our proposed faithfulness-aware decoding methods. More ablation studies on how lookahead affects the search space are presented later in this appendix.
Lookahead Length. We first present the result of using the lookahead heuristic with l = 0. This means that at each time step, we do not use a future heuristic but instead directly evaluate the faithfulness of the already generated partial summary as the additional score. The result using GREEDY+LOOKAHEAD is shown in Table 12. Compared to greedy decoding, adding the faithfulness score of the current partial summary shows mixed results; the heuristic only slightly improves BS-Fact, DAE, and QuestEval for XSum. We see substantial gains only when the future is taken into account (i.e., l = full). This shows the necessity of using the full summary to realize the full potential of current faithfulness metrics.
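In symbols (a hedged reconstruction; the exact formulation in the main text may differ), the lookahead selection at step $t$ with lookahead length $l$ can be written as:

```latex
y_t = \operatorname*{arg\,max}_{v} \Big[ \log p(v \mid y_{<t}, x)
      + \lambda \, f\big(x,\, [\,y_{<t};\, v;\, \hat{y}_{t+1:t+l}\,]\big) \Big]
```

where $x$ is the source document, $f$ the faithfulness heuristic (BS-Fact), $\hat{y}_{t+1:t+l}$ a greedily generated continuation of length $l$, and $\lambda$ a weight (assumed notation). Setting $l = 0$ drops the continuation and scores only the partial summary, while $l = \text{full}$ scores the complete future summary.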
Ranking with Faithfulness Metrics. Next, we present the results of ranking with each individual faithfulness metric, shown in Table 13. Generally, optimizing for one metric also leads to improvements in the other faithfulness metrics. While optimizing for a given metric will undoubtedly perform the best when that metric is used for evaluation, the composite metric achieves a similarly good score across all faithfulness metrics that we consider.
Evaluating the Search Space. We hypothesize that by incorporating lookahead, we can improve the search space even when only a few tokens have been generated. To better understand this, we greedily decode the full summary at each time step given the prefix, similar to how lookahead works. We then use BS-Fact and DAE to score all generated summaries and analyze the faithfulness score at each time step.
Here, we focus on XSum and compare greedy and GREEDY+LOOKAHEAD. The plots of faithfulness scores using the current prefix to generate the full summaries are shown in Figure 7, where we see the benefit of the lookahead heuristic. For BS-Fact, we see a large gap between the two methods, especially when t is between 5 and 50. Though this may be less surprising since BS-Fact is the metric that the lookahead heuristic optimizes, the heuristic nevertheless prevents the score from dipping, which we observe for greedy search between t = 5 and t = 40. This shows that lookahead is able to lead the model along a more faithful path and keep it from straying onto a less faithful one. When we evaluate with DAE, we see that optimizing BS-Fact with the lookahead heuristic consistently improves the score at all lengths.

F Abstractiveness
We first show the same tradeoff result on CNN/DM in Figure 8. BEAM+LOOKAHEAD+ABSTR achieves a slightly higher MINT score while also improving faithfulness.
We also extend the analysis to the whole test set and report the faithfulness score as the average of all faithfulness metrics (Avg.). Since DAE is an error rate, we subtract its score from 100 so that a higher score means a more faithful summary. We do not use the composite metric here, as the ranking directly optimizes for it.
We see a similar trend with the average of the faithfulness metrics for both datasets in Figure 9 and Figure 10, where the gain in faithfulness outweighs the decrease in abstractiveness. The difference from the result using human faithfulness scores is that BEAM+RANKING achieves the highest average score, since ranking with the composite metric directly optimizes the faithfulness metrics.

Figure 2: Illustration of the iterative distillation process. We train a student model with summaries generated by the teacher model, which uses faithfulness-aware decoding methods. The resulting student model, trained on more faithful summaries, can in turn be used as the teacher to generate the training data for the next iteration.

Figure 3: Maximum possible score (Max) for each faithfulness metric and the faithfulness score of the top candidate (Top) at various beam sizes. As beam size increases, more faithful summaries exist in the list of candidates, but the faithfulness of the top beam improves only slightly.

Figure 7: Faithfulness score of the lookahead summaries at each time step. Adding lookahead as the heuristic improves the search space to generate more faithful summaries.

Figure 8: Faithfulness and abstractiveness tradeoff results on the 200 CNN/DM examples used for human annotation. While our proposed methods are less abstractive, the gain in faithfulness is much larger than the decrease in abstractiveness.

Figure 9:

Figure 10: Faithfulness and abstractiveness tradeoff results on the full CNN/DM test set. Faithfulness is calculated by taking the average across all automatic faithfulness metrics.

Table 1: Baseline results of popular decoding methods measured by summarization quality metrics (Rouge-L (RL) and BertScore (BS)) and faithfulness metrics. We observe a general trend where beam search performs the best and nucleus sampling performs the worst in terms of faithfulness. Full results with different beam sizes and top-p probabilities for nucleus sampling are in Table 7.
et al., 2020) and use λ = 1 for the weight of the additional cross-entropy loss.

Table 1. Both datasets show a similar trend: beam search performs the best in terms of faithfulness, except for FactCC on the XSum dataset.

Table 2: Results for our proposed decoding strategies. Compared to the baseline methods (greedy and beam search), both ranking and lookahead improve faithfulness. The combination of both methods further increases faithfulness.

Table 5: Distillation results using our proposed faithfulness-aware decoding methods as the teacher. We abbreviate FactCC as FC and QuestEval as QE. Speed is measured in seconds per summary.

Table 10: Partial correlations of metrics on the FRANK test dataset. Composite achieves the highest correlations on the combined and XSum datasets. * indicates results copied from the original work.

Table 11: Results for ranking on the WikiHow dataset.

Table 13: Ranking results with different faithfulness metrics. Top is the best summary from beam search, and each subsequent row represents the ranker using the corresponding faithfulness metric. We abbreviate FactCC as FC, QuestEval as QE, and Composite as Comp.