Fidelity-Enriched Contrastive Search: Reconciling the Faithfulness-Diversity Trade-Off in Text Generation

In this paper, we address the hallucination problem commonly found in natural language generation tasks. Language models often generate fluent and convincing content but can lack consistency with the provided source, resulting in potential inaccuracies. We propose a new decoding method called Fidelity-Enriched Contrastive Search (FECS), which augments the contrastive search framework with context-aware regularization terms. FECS promotes tokens that are semantically similar to the provided source while penalizing repetitiveness in the generated text. We demonstrate its effectiveness across two tasks prone to hallucination: abstractive summarization and dialogue generation. Results show that FECS consistently enhances faithfulness across various language model sizes while maintaining output diversity comparable to well-performing decoding algorithms.


Introduction
Language models (LMs) have achieved remarkable success in generating human-like text, fostering advancements across numerous Natural Language Processing (NLP) applications. Despite the fluent and seemingly convincing outputs produced by LMs, these models can occasionally generate content that is factually inconsistent with the provided source (Koehn and Knowles, 2017; Rohrbach et al., 2018; Raunak et al., 2021), an issue known as the hallucination problem (Maynez et al., 2020; Ji et al., 2023). Methods to mitigate hallucination have been explored from various facets, including data perspectives (Wang, 2019; Filippova, 2020; Shuster et al., 2021), model architectures (Cao et al., 2018; Aralikatte et al., 2021; Xiao and Wang, 2021), and training strategies (Huang et al., 2020; Chen et al., 2021; Li et al., 2021). In this work, we turn to a less investigated lens, decoding, to improve faithfulness, and introduce a novel decoding method named Fidelity-Enriched Contrastive Search (FECS).
Decoding algorithms can be categorized into deterministic and stochastic groups. Deterministic methods such as beam search and greedy decoding aim to generate the most probable text continuations. While these methods might appear less prone to unfaithfulness, their outputs often degenerate; that is, they are uninformative, monotonous, or repetitive (Li et al., 2016; Holtzman et al., 2019; Welleck et al., 2019). Conversely, stochastic methods such as top-k (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) inject randomness into the generation process, thereby promoting diversity. Yet, these sampling-based approaches often come at the cost of coherency and semantic consistency (Basu et al., 2020; Su et al., 2022; Su and Collier, 2023), as increasing output diversity correlates positively with hallucination (Dziri et al., 2021). To reconcile this faithfulness-diversity trade-off, we propose FECS, a simple yet effective decoding strategy that extends the Contrastive Search framework (Su et al., 2022) with context-aware regularization terms to enhance faithfulness and penalize degeneration. Specifically, a candidate token that exhibits (1) high semantic similarity with tokens from the provided source and (2) low semantic similarity with previously generated tokens is rewarded with a higher score to promote its selection. Importantly, FECS can be readily applied to existing LMs off-the-shelf, without requiring further training.
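As a minimal illustration of this scoring idea, the sketch below scores a single candidate token with NumPy. The function name `fecs_score`, its argument layout, and the (1 − α − β) weighting of the confidence term are our assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch of the FECS scoring rule. The function name, its
# arguments, and the (1 - alpha - beta) confidence weighting are assumptions.
import numpy as np

def fecs_score(p_v, h_v, h_prev, h_src, alpha=0.6, beta=0.2):
    """Score a candidate token v from the top-k set.

    p_v    -- model probability of v (model confidence)
    h_v    -- last-hidden-state representation of v
    h_prev -- representations of previously generated tokens
    h_src  -- representations of source tokens
    """
    h_v = np.asarray(h_v, dtype=float)

    def max_cos(H):
        # Maximum cosine similarity between h_v and the rows of H.
        if len(H) == 0:
            return 0.0
        H = np.asarray(H, dtype=float)
        sims = H @ h_v / (np.linalg.norm(H, axis=1) * np.linalg.norm(h_v) + 1e-12)
        return float(sims.max())

    confidence = (1.0 - alpha - beta) * p_v  # reward probable tokens
    degeneration = alpha * max_cos(h_prev)   # penalize similarity to prior output
    faithfulness = beta * max_cos(h_src)     # reward similarity to the source
    return confidence - degeneration + faithfulness
```

A decoder would evaluate this score for every candidate in the top-k set and emit the argmax; with beta = 0 the rule reduces to Contrastive Search, and with alpha = beta = 0 to greedy decoding.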
We evaluate FECS on two tasks particularly prone to text hallucination: abstractive summarization and dialogue generation (Ji et al., 2023). Experimental results show that FECS consistently improves faithfulness across various LM sizes while preserving a level of diversity comparable to predominant decoding algorithms.

Methodology
In this section, we present preliminary information on Contrastive Search (Su et al., 2022) before detailing our proposed FECS.

Preliminary
To address shortcomings in existing decoding methods, Su et al. (2022) propose Contrastive Search, a decoding approach capable of generating diverse content without compromising coherency. At time step t, given an input x_0:c+t, where x_0:c signifies the prefix context and x_c:c+t represents the previously generated tokens, Contrastive Search generates the next token x_c+t via the following formula: Here, V^(k) denotes the set of k candidate tokens with the top-k probabilities from the model's prediction distribution p_θ(·|x_0:c+t). The model confidence term represents the probability of the candidate token v, while the degeneration penalty term is the maximum cosine similarity sim(·, ·) between the candidate token v and all previously generated tokens {x_c, ..., x_c+t−1}. Specifically, sim(·, ·) employs the token representations h_x_i and h_v from the model's last hidden state, computed by appending v to x_0:c+t as model input. α is a pre-determined, non-negative hyper-parameter; when α equals 0, Contrastive Search reduces to greedy decoding. Essentially, Contrastive Search preserves coherence by choosing outputs from the top-k probable candidates while curbing degeneration behaviors such as repetition, thereby promoting diversity.
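Assembled from the two terms just described, the Contrastive Search objective can be written as follows (a reconstruction consistent with Su et al., 2022):

```latex
x_{c+t} = \underset{v \in V^{(k)}}{\arg\max}
\Big\{ (1-\alpha)\,\underbrace{p_\theta\!\left(v \mid x_{0:c+t}\right)}_{\text{model confidence}}
\;-\; \alpha \underbrace{\max_{c \le i \le c+t-1} \mathrm{sim}\!\left(h_{x_i}, h_v\right)}_{\text{degeneration penalty}} \Big\}
```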

Fidelity-Enriched Contrastive Search
Motivated by Contrastive Search, we extend this framework by integrating a faithfulness term that encourages factuality and reduces hallucination. Using the notation from Section 2.1, we define FECS as follows. Consider an input x_0:c+t at time step t, where x_0:c represents the prefix context and x_c:c+t the previously generated tokens. We further decompose x_0:c into (1) the prompt x_0:s and (2) the provided source x_s:c, to which the output is expected to remain faithful. FECS generates the next token x_c+t via the following formula: The newly introduced faithfulness term rewards candidate tokens exhibiting high semantic similarity to tokens in the source content. Specifically, the faithfulness term denotes the maximum cosine similarity sim(·, ·) between the candidate token v and all source tokens {x_s, ..., x_c−1}. Here, β is also a pre-determined, non-negative hyper-parameter.
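Putting the three terms together, the FECS objective can be sketched as follows; the (1 − α − β) weighting of the confidence term is our reading of the description above:

```latex
x_{c+t} = \underset{v \in V^{(k)}}{\arg\max}
\Big\{ (1-\alpha-\beta)\, p_\theta\!\left(v \mid x_{0:c+t}\right)
\;-\; \alpha \max_{c \le i \le c+t-1} \mathrm{sim}\!\left(h_{x_i}, h_v\right)
\;+\; \beta \underbrace{\max_{s \le j \le c-1} \mathrm{sim}\!\left(h_{x_j}, h_v\right)}_{\text{faithfulness}} \Big\}
```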
Experimental Setup

Datasets, Models, and Configurations
We evaluate our method, FECS, on two tasks known for their susceptibility to hallucination issues: abstractive summarization and dialogue generation. For the abstractive summarization task, we adopt the CNN-DailyMail (CNN-DM) dataset (Nallapati et al., 2016), a widely used benchmark in several recent studies (Dong et al., 2020; Cao and Wang, 2021; Cao et al., 2020). The dialogue generation task employs the popular Wizard of Wikipedia (WoW) dataset (Dinan et al., 2018). The objective here is to generate responses based on given knowledge snippets, taken from Wikipedia, that are pertinent to the conversation topic.

Evaluation Metrics
Our evaluation employs the following metrics. Standard Metrics. To assess summarization quality, we employ ROUGE (Lin, 2004). For dialogue generation, we use ROUGE-L and BLEU-4 (Papineni et al., 2002). In addition, we report BERTScore (Zhang et al., 2019) on both tasks as a softer, embedding-based metric.
Faithfulness Metrics. To measure factuality in summarization, we use FEQA (Durmus et al., 2020), following prior studies (Aralikatte et al., 2021; Chen et al., 2021). Higher FEQA scores indicate greater faithfulness of the summary to the source article. For evaluating dialogue, we employ Q^2 (Honovich et al., 2021), a question-answering (QA) based metric designed to assess factual consistency in knowledge-grounded dialogue generation. Both FEQA and Q^2 exhibit strong correlations with human judgments.
Diversity Metric. For both summarization and dialogue tasks, we evaluate the diversity of the generated text x as diversity(x) = ∏_{n=2}^{4} (1 − Rep-n(x)), where Rep-n(x) measures the proportion of repeated n-grams in x, calculated as Rep-n(x) = 1 − (number of unique n-grams in x) / (total number of n-grams in x). A higher diversity score suggests the model outputs exhibit less degeneration (Welleck et al., 2019; Su et al., 2022).

Experimental Results

Faithfulness
Table 1 presents the results for abstractive summarization and dialogue generation. For abstractive summarization, FECS achieves substantial improvements in the factuality score across all scales, with 7.14%, 7.37%, and 9.55% increases for the 1.3B, 2.7B, and 6.7B models, respectively. Moreover, FECS records strong ROUGE scores and outperforms all other methods at the 6.7B scale. For dialogue generation at the 1.3B scale, all stochastic algorithms, including FECS, fall short of beam search in most metrics. However, FECS surpasses the other stochastic algorithms in terms of BLEU-4 and Q^2. Upon scaling up to 2.7B and 6.7B, FECS substantially outperforms all methods in terms of BLEU-4, ROUGE-L, and Q^2. Notably, the 6.7B model performs worse than its smaller counterparts, consistent with previous findings (Madotto et al., 2021). Compared to Contrastive Search, FECS exhibits a superior ability to focus on entities within the source material, emphasizing factual information more comprehensively. As evident in Table 2, FECS provides more complete information (compare "Jamaican starlet DeShane Beckford" with "DeShane Beckford") and generates output more comprehensively, evidenced by Contrastive Search's failure to produce the time phrase "earlier this month". Furthermore, when factual entities are already present in the previous output, the degeneration penalty can inadvertently increase hallucination. For instance, the term "Upton Park" produced by Contrastive Search lacks support from the source; the correct output is the previously generated "West Ham", which FECS accurately reproduces. Building on the framework of Contrastive Search, FECS not only inherits its coherency and diversity (avoidance of degeneration) but also fosters the selection of tokens that faithfully represent the provided source content.

Diversity
As discussed in Section 1, model outputs must balance faithfulness and diversity. To better understand the impact of our proposed faithfulness reward on these two facets relative to the original Contrastive Search, we calculate the improvements in faithfulness and the reductions in diversity of FECS over Contrastive Search (see Table 3).
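The diversity metric used in this comparison can be sketched as follows, following the standard rep-n/diversity definitions of Welleck et al. (2019) and Su et al. (2022); the function names are illustrative:

```python
# Sketch of the diversity metric: diversity(x) = prod over n in {2,3,4}
# of (1 - Rep-n(x)). Function names are illustrative.

def rep_n(tokens, n):
    """Proportion of repeated n-grams: 1 - |unique n-grams| / |n-grams|."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def diversity(tokens):
    """Product of (1 - rep_n) for n = 2, 3, 4; higher means less degeneration."""
    score = 1.0
    for n in (2, 3, 4):
        score *= 1.0 - rep_n(tokens, n)
    return score
```

A fully non-repetitive text scores 1.0, while heavy n-gram repetition drives the score toward 0.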

Analysis
Latency. To assess the decoding latency of our proposed FECS objective, we report the average decoding time (sec) per instance in Table 4. The results are averaged across 100 randomly selected instances. In both the dialogue generation and abstractive summarization tasks, FECS and Contrastive Search perform comparably and are slightly slower than beam search; greedy decoding and nucleus sampling are the fastest.
The role of α. To establish a more comprehensive baseline, we evaluate FECS against Contrastive Search with different values of α on the 6.7B model. Intuitively, a smaller α (i.e., a lower degree of diversity) might yield more factual outputs. However, as shown in Table 5, lowering α improves faithfulness only marginally while producing essentially the same ROUGE scores. In contrast, FECS retains a high level of diversity and achieves superior performance on both FEQA and the standard metrics, indicating the effectiveness of the newly introduced β term.

Human Evaluation
In addition to the automatic evaluation, we perform a human evaluation to assess the faithfulness of our proposed FECS on the abstractive summarization task. We compare FECS against Contrastive Search and ask annotators to vote on which response is more faithful to the provided source (i.e., the text to be summarized). Specifically, we randomly sample 20 instances for each of the three model sizes, for a total of 60 instances. More details, including the full evaluation protocol, are provided in Appendix A.2. We present the results in Figure 2. FECS shows superior results, receiving more than 60% of the votes and outperforming Contrastive Search by more than a factor of two. These results support the outcome of the automatic evaluation, suggesting that FECS generates content more faithful to the provided source.

Conclusion
This paper introduces a novel decoding approach, Fidelity-Enriched Contrastive Search (FECS), designed to enhance faithfulness in text generation.
Our experimental results on abstractive summarization and dialogue generation demonstrate the efficacy of FECS: it consistently improves faithfulness across various LM scales while preserving a level of diversity comparable to other leading decoding algorithms. Particularly with larger LMs, it notably enhances faithfulness with only a minor impact on diversity, indicating that FECS is especially effective when larger LMs are employed in dialogue generation tasks. In the future, we plan to explore how FECS performs with different kinds of source content, including erroneous or ambiguous inputs.

Limitations
Firstly, while FECS improves the faithfulness-diversity trade-off, its performance may be influenced by the quality of the source content. The assumption that source content is always correct and complete may not hold in all scenarios, particularly when the input is ambiguous, incomplete, or erroneous. Secondly, our faithfulness assessment is primarily quantitative, based on the established FEQA and Q^2 metrics. Although these metrics provide an essential standard for comparing models, they may not capture all nuanced aspects of faithfulness, such as the preservation of subtle implications or subjective information.

A.2 Details of Human Evaluation
The full human evaluation protocol is presented in Figure 5. We invite three graduate-level students proficient in English to perform the annotations. As our task does not require specific domain expertise, payment is determined by the minimum wage. We also compute inter-annotator agreement using Randolph's κ, recording a moderate κ = 0.57.

Figure 1 :
Figure 1: Results on CNN-DailyMail show our proposed FECS mitigates hallucination (i.e., improves factuality) while maintaining the diversity of the generated summaries.

Figure 2 :
Figure 2: Human evaluation results comparing the faithfulness of FECS against Contrastive Search (CS) on the abstractive summarization task. FECS outperforms Contrastive Search, receiving more than twice the votes.

Figure 3 :
Figure 3: An example prompt of the CNN-DailyMail dataset for the abstractive summarization task.

Figure 4 :
Figure 4: An example prompt of the Wizard of Wikipedia dataset for the dialogue generation task.

Table 1 :
Experimental results comparing FECS with other decoding methods across model scales.
Article: West Ham are discussing a deal for Jamaican starlet DeShane Beckford after he impressed on trial. The skilful 17-year-old forward from Montego Bay United was invited to train with West Ham's academy earlier this month and has impressed coaches after spending two weeks with the club. Beckford also has offers from clubs in Belgium. [...] The Hammers will have the cheapest pricing strategy in the Barclays Premier League in a bid to fill the 54,000-capacity stadium when they make the switch for the 2016-17 season.

Table 2 :
An actual example of news summaries generated by Contrastive Search and FECS on an article from CNN-DailyMail. Text highlighted in green indicates factual information; red indicates hallucination not supported by the article.

Table 3 :
Relative improvements in faithfulness and reduction of diversity of FECS over Contrastive Search.

Table 4 :
Average decoding time (sec) per instance using different decoding methods across model scales. As observed, FECS is comparable to Contrastive Search.

Table 5 :
Comparison of FECS and Contrastive Search with different values of α.
Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2734-2744, Online. Association for Computational Linguistics.
Topic: Green Eggs and Ham
System: I have, and we made green eggs and ham for the kids when I did. Dr. Seuss wrote it.
User: Yeah, nice! What other cool facts can you tell me?
We know that: Green Eggs and Ham. As of 2016, the book has sold 8 million copies worldwide.
System replies: It has sold 8 million copies in many languages. Hebrew is one because I bought it as a gift in that one.

Topic: Neil Brooks
System: Yes, I do. Have you heard of Neil Brooks? He is a sprint freestyle swimmer that won the 100 m medley relay at the 1980 Olympics in Moscow.
User: I have never heard of him but he sounds like he was a very good swimmer.
We know that: Michael Phelps. Michael Fred Phelps II (born June 30, 1985) is an American retired competitive swimmer and the most successful and most decorated Olympian of all time, with a total of 28 medals.
System replies: Yes, another good swimmer is Michael Phelps. He has won 28 Olympic medals in swimming competitions, but he's retired now.

Topic: Harry Potter
System: Harry Potter is a series of fantasy novels. I'm also a big Harry Potter fan!
User: What's your favorite book? I like the Goblet of Fire.
We know that: Harry Potter and the Goblet of Fire. Harry Potter and the Goblet of Fire is a fantasy book written by British author J. K. Rowling and the fourth novel in the "Harry Potter" series.