Multi-level Adaptive Contrastive Learning for Knowledge Internalization in Dialogue Generation

Knowledge-grounded dialogue generation aims to mitigate the issue of text degeneration by incorporating external knowledge to supplement the context. However, the model often fails to internalize this information into responses in a human-like manner; instead, it simply inserts segments of the provided knowledge into generic responses. As a result, the generated responses tend to be tedious, incoherent, and lacking in interactivity, meaning the degeneration problem remains unsolved. In this work, we first find that such copying-style degeneration is primarily due to the weak likelihood objective, which allows the model to "cheat" the objective by merely duplicating knowledge segments in a superficial pattern matching based on overlap. To overcome this challenge, we propose a Multi-level Adaptive Contrastive Learning (MACL) framework that dynamically samples negative examples and subsequently penalizes degeneration behaviors at both the token level and the sequence level. Extensive experiments on the WoW dataset demonstrate the effectiveness of our approach across various pre-trained models.


Introduction
In recent years, pre-trained language models based on transformer architectures have made remarkable strides in open-domain generation tasks (Lewis et al., 2020; Raffel et al., 2020; Zhang et al., 2020; Roller et al., 2020). However, these models still struggle with dull and repetitive outputs, a problem commonly termed neural text degeneration (Holtzman et al., 2020; Welleck et al., 2019).
To address text degeneration in dialogue, Dinan et al. (2018) proposed equipping interlocutors with external knowledge as additional support to enrich the informativeness of responses, a setting known as Knowledge-Grounded Dialogue Generation (KGDG). Recent methods have placed much emphasis on the knowledge selection subtask, aiming to provide models with the most suitable knowledge (Kim et al., 2020; Chen et al., 2020; Zheng et al., 2020; Yang et al., 2022).
However, simply introducing golden knowledge as a ground source does not truly mitigate the problem of degeneration. We observe that existing models often duplicate knowledge snippets to construct responses that meet the informativeness requirement, which can lead to unnatural and contextually incoherent responses. For example, the generated contents in Figure 1 simply duplicate the provided knowledge about the "Blue Skies" movie, resulting in a tedious response incoherent with the user's utterance. Consequently, we need to consider not only the introduction of knowledge, but also how to effectively integrate it into responses. We term this superficial knowledge duplication "Knowledge Regurgitation", as it mirrors how the model absorbs an entire knowledge sentence and regurgitates it in the response without genuine comprehension. This issue can be seen as a task-specific, copying-style manifestation of text degeneration. We test several pre-trained models with different parameter scales and discover that Knowledge Regurgitation is common among them. To quantify its severity, we design two automated metrics: PoD and KUD. Based on a series of experiments, we posit that this degeneration is caused by the ineffectual design of the training objective, which allows models to "cheat" the MLE objective by merely duplicating knowledge snippets to construct responses. The model therefore improperly converges to superficial pattern matching based on overlap, given that knowledge sentences share spurious correlations (token overlaps) with ground-truth responses and exhibit high semantic fluency.
In this paper, to tackle the aforementioned issues, we propose a novel approach called Multi-level Adaptive Contrastive Learning. This method effectively mitigates Knowledge Regurgitation through contrastive training at both the token and sequence levels. At the token level, we enhance the negative training paradigm (Welleck et al., 2019) by employing a dynamic negative token sampling method and reweighting the unlikelihood loss according to model sensitivity. In this way, our approach penalizes tokens that appear contiguously in the knowledge but are not the targets, cutting off potential shortcuts. At the sequence level, we first employ a group beam search strategy (Vijayakumar et al., 2018) to sample negative responses from the degenerator's predictions, then use a novel metric as an oracle function to score them. Finally, samples demonstrating obvious degeneration are selected as hard negatives for the InfoNCE loss, to distance them from their prefix in the representation space. This exposes the model to potential degeneration mistakes that occur at the inference stage and helps the model learn to avoid them.
Our contributions are summarized as follows: • We explore a unique degeneration phenomenon in KGDG, termed "Knowledge Regurgitation", which is confirmed by a series of preliminary experiments on popular pre-trained language models.
• We propose a Multi-level Adaptive Contrastive Learning (MACL) framework, which is designed to effectively internalize knowledge at both token and sequence levels.
• We conduct extensive experiments and provide a detailed analysis to validate the effectiveness of our method, showing a substantial improvement on both automatic and human evaluation metrics.

Preliminaries
In this section, we provide a brief introduction to the KGDG task and the Unlikelihood Training (UT) loss, which serves as the foundation for our token-level contrastive learning loss.

Task Formulation
The knowledge-grounded dialogue generation task encompasses two relatively independent sub-tasks: knowledge selection and knowledge-aware response generation. Knowledge selection aims at selecting the most appropriate piece of knowledge, denoted $k$, from a given knowledge pool $KP = \{k_1, k_2, \ldots, k_{|KP|}\}$. Assuming that knowledge $k$ has been selected by an efficient knowledge selector, we mainly focus on its utilization. Given the dialogue context $u = \{u_1, u_2, \ldots, u_m\}$ and the related knowledge $k = \{k_1, k_2, \ldots, k_s\}$, the goal is to generate an engaging and informative knowledge-infused response $y = \{y_1, y_2, \ldots, y_{|y|}\}$.
Given a KGDG dataset (Dinan et al., 2018) $D = \{(u^{(i)}, k^{(i)}, y^{(i)})\}$ derived from a collection of multi-turn human interactions, the standard method for training a sequence-to-sequence model involves concatenating the context with the knowledge as the model's input sequence and applying maximum likelihood estimation (MLE) to minimize:

$$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t=1}^{|y|} \log p(y_t \mid u, k, y_{<t}), \quad (1)$$

where $u$ is the context (golden dialogue history and user utterance at the current turn), $k$ is the golden knowledge, and $y_t$ is the $t$-th token of $y$.
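As a minimal illustration of the MLE objective in Eq. (1), the sketch below computes the sequence-level negative log-likelihood from per-token probabilities; the probabilities are invented for the example.

```python
import math

def mle_loss(token_probs):
    """Negative log-likelihood of a response, as in Eq. (1):
    the sum over target tokens of -log p(y_t | u, k, y_<t)."""
    return -sum(math.log(p) for p in token_probs)

# Illustrative per-token probabilities for a 3-token response.
loss = mle_loss([0.5, 0.25, 0.8])
```

Since the probabilities multiply to 0.1, the loss equals -log 0.1; in practice these probabilities come from the decoder's softmax over the concatenated (context, knowledge) input.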

Unlikelihood Training
To address the problem of neural text degeneration, Welleck et al. (2019) proposed the unlikelihood training method, combining a token-level unlikelihood objective with MLE. The core idea involves selecting a set of negative token candidates, denoted $C_t$, at each training step and reducing their prediction probability $p(y_c)$, while concurrently increasing the probability of the ground-truth token $p(y_t)$. Negative candidates typically consist of tokens that have already been generated, alleviating the degeneration phenomenon in which generated texts contain undesirable repetitions at various levels and high-frequency tokens appear excessively.
$$\mathcal{L}_{\mathrm{UL}} = -\sum_{t=1}^{|y|} \sum_{y_c \in C_t} \log\left(1 - p(y_c \mid x, y_{<t})\right), \quad (2)$$

where $C_t$ is a subset of the vocabulary and $x$ is the prefix sequence.
The MLE loss aims to model the ground-truth sequence probability distribution, while the unlikelihood loss corrects undesired patterns. The overall objective in unlikelihood training mixes the two as follows:

$$\mathcal{L}_{\mathrm{UT}} = \mathcal{L}_{\mathrm{MLE}} + \alpha \mathcal{L}_{\mathrm{UL}},$$

where $\alpha$ is a hyper-parameter that varies across tasks and datasets.
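A per-step sketch of this mixed objective, assuming we already hold scalar probabilities for the ground-truth token and the negative candidates; all numbers and names here are illustrative, not the paper's implementation.

```python
import math

def ut_step_loss(p_target, neg_probs, alpha):
    """One decoding step of unlikelihood training (Eqs. 1-2 mixed):
    -log p(y_t) - alpha * sum over C_t of log(1 - p(y_c))."""
    likelihood = -math.log(p_target)
    unlikelihood = -sum(math.log(1.0 - p) for p in neg_probs)
    return likelihood + alpha * unlikelihood

# Penalizing two previously generated tokens with alpha = 0.5.
loss = ut_step_loss(0.6, [0.2, 0.1], alpha=0.5)
```

With an empty candidate set the objective reduces to plain MLE, which is why the two losses compose cleanly.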

Knowledge Regurgitation
Although current pre-trained language models have demonstrated robust performance in generating fluent dialogue responses, they fail to align with human-like patterns of knowledge integration. We conduct a series of preliminary experiments to demonstrate the presence of text degeneration problems in pre-trained language models of various sizes. Specifically, we quantify the severity of this issue by comparing human and model performance on our newly proposed metrics. The results are presented in Table 1 and Figure 2, which demonstrate the models' severe knowledge regurgitation and monotonous knowledge utilization patterns.
Dup-n and the Proportion of Longest Common Sub-sequence tokens (PLCS) are mainly used to measure knowledge snippet duplication. Dup-n represents the proportion of samples with n-grams co-occurring in knowledge and response and is computed as follows:

$$\text{Dup-}n = \frac{1}{|D|} \sum_{(u,k,y) \in D} \mathbb{1}\left[\text{n-grams}(k) \cap \text{n-grams}(y) \neq \emptyset\right],$$

where $D$ denotes the dataset, $k$ denotes the golden knowledge, and $y$ denotes the generated response. PLCS is calculated as the ratio of the length of the longest common sub-sequence (LCS) of response and knowledge to the length of the response:

$$\text{PLCS} = \frac{|\mathrm{LCS}(y, k)|}{|y|}.$$

The aforementioned metrics approximately measure the frequency of a degenerated knowledge utilization pattern in which only snippets of the provided knowledge are inserted into a generic response without substantial integration.
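Both metrics can be computed directly from tokenized knowledge/response pairs. The sketch below assumes whitespace-tokenized strings (the sentences are invented for the example); `lcs_len` is the standard dynamic program.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lcs_len(a, b):
    # Classic O(|a|*|b|) DP for longest common sub-sequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def plcs(response, knowledge):
    """Proportion of Longest Common Sub-sequence tokens in the response."""
    return lcs_len(response, knowledge) / len(response)

def dup_n(dataset, n):
    """Proportion of (knowledge, response) pairs sharing at least one n-gram."""
    hits = sum(1 for k, y in dataset if ngrams(k, n) & ngrams(y, n))
    return hits / len(dataset)

k = "the movie blue skies was released in 1946".split()
y = "i love it blue skies was released in 1946".split()
ratio = plcs(y, k)  # 6 of the 9 response tokens form an LCS with the knowledge
```

A response that copies a long knowledge span scores near 1 on PLCS even when the copied tokens are not contiguous in the response, which is why LCS rather than substring matching is used.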
To highlight the gap in knowledge utilization patterns between humans and models, we further employ a metric for the precision of knowledge n-grams (mKP-n), which calculates the mean ratio of knowledge n-grams in a response to all the response n-grams:

$$\text{mKP-}n = \frac{1}{|D|} \sum_{(u,k,y) \in D} \frac{|\text{n-grams}(y) \cap \text{n-grams}(k)|}{|\text{n-grams}(y)|}.$$

A comparison of each MLE-finetuned PLM with human dialogue, presented in Table 1, makes it evident that the duplication frequency of models surpasses that of humans. Although humans occasionally replicate professional knowledge and famous quotes to construct knowledgeable responses, such usage is less prevalent in casual chitchat, as shown in the last row of Table 1. Models, however, tend to misinterpret the duplication pattern superficially and apply it across various contexts.
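Under the same tokenization assumptions, mKP-n reduces to a dataset mean of per-response knowledge n-gram precision; the sentences below are invented for the example.

```python
def kp_n(response, knowledge, n=1):
    """Knowledge n-gram precision: the fraction of response n-grams
    that also occur in the knowledge."""
    resp = [tuple(response[i:i + n]) for i in range(len(response) - n + 1)]
    know = {tuple(knowledge[i:i + n]) for i in range(len(knowledge) - n + 1)}
    return sum(1 for g in resp if g in know) / len(resp)

def mkp_n(dataset, n=1):
    """Mean knowledge n-gram precision over a dataset of (k, y) pairs."""
    return sum(kp_n(y, k, n) for k, y in dataset) / len(dataset)

k = "blue skies is a 1946 musical film".split()
y = "i think blue skies is great".split()
precision = kp_n(y, k)  # 3 of the 6 response tokens come from the knowledge
```

Binning these per-response precision values yields the knowledge precision distributions compared in Figure 2.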
In chit-chat scenarios, it is important to strike a balance in the number of knowledge tokens used in responses. Excessive knowledge tokens can reduce the interactivity of the response, potentially diminishing users' desire to continue the conversation. From the data presented in Table 1, we observe that knowledge tokens account for approximately 50% of human responses, whereas the proportion in model-generated responses is far higher. Additionally, the knowledge precision distributions of humans and models are illustrated in Figure 2. Humans exhibit a more diverse usage of given knowledge across various contexts, as evidenced by their uniform knowledge precision distribution. In contrast, the distributions of pre-trained language models exhibit a distinct pattern, with the probability mass predominantly concentrated on the higher-percentage side.

Our Approach
We propose a novel multi-level contrastive learning method that dynamically samples negative examples and subsequently penalizes degeneration behaviors at both the token and sequence levels.
Token-level Contrastive Learning The motivation behind the token-level loss is to break the generation inertia more effectively (refer to Figure 3 for a detailed explanation). Our improvements over the basic unlikelihood training method are twofold: dynamic negative sampling and dynamic unlikelihood loss reweighting. The loss function is:

$$\mathcal{L}_{\mathrm{TCL}} = -\sum_{t=1}^{|y|} \sum_{y_c \in C_t} \beta(y_c) \log\left(1 - p(y_c \mid x, y_{<t})\right),$$

where $C_t$ and $\beta(y_c)$ are calculated as follows:

$$C_t = \Big\{\arg\max_{y_c \in k,\, y_c \neq y_t} p(y_c \mid x, y_{<t})\Big\}, \qquad \beta(y_c) = p_c,$$

where $p_c$ is the predictive probability of $y_c$, a scalar detached from the computational graph.
We empirically investigate the selection strategy for negative examples. Given that the objective of the knowledge-grounded dialogue task is to effectively integrate knowledge into the response, and that knowledge tokens are essential potential candidates, punishing all knowledge tokens without distinction is inappropriate. Our goal is to eliminate shortcuts that lead to knowledge duplication while keeping the knowledge-integrating ability.
To achieve this, we propose to dynamically select negative tokens based on the model's sensitivity, thereby reducing the probability of tokens on which the model is more likely to err. Specifically, we adopt a strategy of selecting the knowledge token with the highest prediction probability as the negative token. We also explored other strategies, such as sampling by probability and random selection, but the adopted strategy outperforms them. We conjecture that this is because our strategy targets the prediction chains of the knowledge snippets (the shortcuts), effectively disrupting them.
As for the dynamic reweighting strategy, two considerations motivated it. First, Jiang et al. (2020) highlight the importance of applying differentiable weights to individual token losses, proposing a Token Focal Loss inspired by Lin et al. (2018)'s work. In light of this, we enhance the unlikelihood loss by introducing an additional control parameter $\beta(y_c)$ to dynamically reweight the punishment strength for different tokens. By doing so, we suppress the gradients of easy tokens while amplifying the gradients of hard tokens, leading to faster and better convergence during training.
Second, Lin et al. (2021) expressed concern that the vanilla negative training method might cause the model to decrease the probability of the target token $p^*$ in order to reduce the gradient norm $|\nabla \mathcal{L}_a|$ during the final stages of training, particularly when $\alpha$ is excessively large (refer to Appendix B for a detailed explanation). Our approach mitigates this inverse optimization problem and allows the loss function to converge to a lower value, as the introduction of $\beta(y_c)$ facilitates a gradual decrease of the weight of $\mathcal{L}_{\mathrm{UL}}$ (Eq. 2).
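A single-step sketch of the token-level loss combining both ingredients; the probability table and token ids are invented, and scalars stand in for detached model outputs (in PyTorch, `p_c` would be `.detach()`-ed).

```python
import math

def tcl_step(step_probs, knowledge_ids, target_id):
    """One step of token-level contrastive learning: pick the knowledge
    token with the highest predicted probability (excluding the target)
    as the negative, and weight its unlikelihood term by beta = p_c."""
    candidates = [t for t in knowledge_ids if t != target_id]
    neg = max(candidates, key=lambda t: step_probs[t])  # dynamic sampling
    p_c = step_probs[neg]  # detached scalar in the real implementation
    loss = -math.log(step_probs[target_id]) \
           - p_c * math.log(1.0 - p_c)  # sensitivity-based reweighting
    return loss, neg

# Token 7 is the ground truth; tokens 3 and 5 appear in the knowledge.
probs = {3: 0.45, 5: 0.15, 7: 0.30}
loss, neg = tcl_step(probs, knowledge_ids=[3, 5], target_id=7)
```

Because the penalty weight equals the (detached) negative-token probability, confidently predicted shortcut tokens are punished hardest while near-zero candidates contribute almost nothing, which is the intended sensitivity effect.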
Sequence-level Contrastive Learning Regarding the sequence-level contrastive loss, An et al. (2023) point out that contrastive learning provides sequence-level supervision. We adopt an InfoNCE objective of the form:

$$\mathcal{L}_{\mathrm{seq}} = -\log \frac{e^{\cos(z_x, z_y)}}{\sum_{y' \in B} e^{\cos(z_x, z_{y'})} + \mu \sum_{y' \in H} e^{\cos(z_x, z_{y'})}},$$

where $B$ are from-batch negative samples, $H$ are generated hard negative samples, and $\mu$ reweights the hard negatives. We begin by collecting a diverse set of utterances that exhibit significant degeneration using the following negative sampling method. First, we train the PLM with MLE to obtain a degenerator, which generates responses exhibiting degeneration phenomena. We then employ a group beam search strategy (beam size $b$) to acquire negative responses from this degenerator. Finally, these responses are evaluated using a metric that measures the degree of duplication, serving as an oracle function for scoring. We use the length of the Longest Common Sub-sequence (LCS) with the knowledge sequence as the oracle function and retain the top $m$ degenerated response samples.
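The oracle-based selection step can be sketched as scoring each degenerator output by its LCS length with the knowledge and keeping the top m. The candidate beams below are invented; in practice they come from group beam search over the degenerator.

```python
def lcs_len(a, b):
    # Longest common sub-sequence length via the standard DP.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def mine_hard_negatives(candidates, knowledge, m):
    """Keep the m candidates with the longest LCS against the knowledge."""
    ranked = sorted(candidates, key=lambda y: lcs_len(y, knowledge), reverse=True)
    return ranked[:m]

knowledge = "blue skies was released in 1946".split()
beams = [
    "blue skies was released in 1946 you know".split(),  # heavy copying
    "i have not seen that movie".split(),                # little overlap
]
hard = mine_hard_negatives(beams, knowledge, m=1)
```

Ranking by LCS length (rather than a ratio) favors long verbatim copies, which are exactly the regurgitation cases the sequence-level loss should push away.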
Following the collection procedure, we push apart the source sequence representation and the negative target sequence representations in a contrastive manner. The $\mathcal{L}_{\mathrm{seq}}$ loss provides the model with a negative supervision signal on the shortcut paths that generated these sequences. Both hard negative samples and from-batch negative samples are retained, but we assign higher weight to the former to emphasize their importance.
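A pure-Python sketch of the weighted InfoNCE term over pooled sequence representations. The 2-d vectors are toy stand-ins for encoder/decoder poolings, and we assume the positive pair appears in the denominator alongside the negatives, a standard InfoNCE convention.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def seq_cl_loss(z_x, z_y, batch_negs, hard_negs, mu=2.0):
    """InfoNCE over (source, target) representations, with hard
    negatives up-weighted by mu."""
    pos = math.exp(cosine(z_x, z_y))
    denom = pos \
        + sum(math.exp(cosine(z_x, z)) for z in batch_negs) \
        + mu * sum(math.exp(cosine(z_x, z)) for z in hard_negs)
    return -math.log(pos / denom)

z_x, z_y = [1.0, 0.0], [0.9, 0.1]  # source and ground-truth target poolings
loss = seq_cl_loss(z_x, z_y, batch_negs=[[0.0, 1.0]], hard_negs=[[-1.0, 0.0]])
```

Moving a hard negative closer to the source representation inflates the denominator and hence the loss, which is the gradient signal that drives degenerated sequences away from their prefix.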
The final objective function is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \alpha \mathcal{L}_{\mathrm{TCL}} + \lambda \mathcal{L}_{\mathrm{seq}}.$$

See Appendix E for the pseudo-code of the entire training procedure.

Experimental Setup
Dataset. Three datasets have typically been used to evaluate the KGDG task: Holl-E (Moghe et al., 2018), CMU_DoG (Zhou et al., 2018), and WoW (Dinan et al., 2018). However, the topics in the first two datasets are limited to movies, with much of the knowledge composed of movie reviews in the form of dialogues. This is not in line with our goal of exploring the internalization of retrieved world knowledge. The CMU_DoG dataset does not label golden knowledge, so there is no way to tell whether the introduced knowledge is correct. Through observations and experiments, we find that the Holl-E dataset is also unsuitable for evaluating knowledge regurgitation; we provide the detailed reasons and experimental results in Appendix F. Consequently, we choose the WoW dataset for our experiments. See Appendix A for a brief introduction to the WoW dataset.
Baseline Methods. We compare our MACL framework with vanilla MLE and several state-of-the-art (SOTA) methods that address the issue of text degeneration: NT (Welleck et al., 2019), Scalegrad, ND, CTloss, and SimCTG.

Implementation Details. We use the PyTorch (Paszke et al., 2019) framework to implement our work. For the PLMs BART, T5, and GPT-2, we utilize the open-source Hugging Face transformers (Wolf et al., 2020). The whole model is optimized with the Adam (Kingma and Ba, 2014) algorithm. We set the learning rate to 1e-5 and the training batch size to 16, train up to 15 epochs, and select the best checkpoints based on performance on the validation set. Other hyper-parameters are set as follows: α = 4, λ = 1, µ = 2, b = 32, m = 16. At the inference stage, we use the knowledge selector designed by Yang et al. (2022) to select knowledge first and then utilize it to generate the response, fitting the KGDG setting. The decoding strategy is beam search with a beam size of 3.
Evaluation Metrics. We choose perplexity (PPL) of the ground-truth responses, BOW Embedding (Avg., Ext.) (Liu et al., 2016), BLEU-1 (Papineni et al., 2002), Knowledge Utilization Difference (KUD), and Proportion of Degenerated samples (PoD) as automatic evaluation metrics. The latter two are metrics we propose to measure knowledge regurgitation and the difference from humans in knowledge utilization patterns.
Given that the primary characteristic of degeneration is the duplication of knowledge fragments, we consider samples with a Proportion of Longest Common Sub-sequence (PLCS) greater than 70% as degenerated samples. We manually annotate a subset of samples from the test set and calculate the precision of the automated metric PoD. The annotators are given the following instructions: a generated response is classified as degenerated if it 1) evidently replicates external knowledge and 2) produces content that is visibly unnatural and contextually incoherent. The results in Table 4 demonstrate the effectiveness of the PoD metric. To compare the gap between different methods and humans in knowledge utilization patterns, we introduce the metric KUD, which is defined as:

$$\mathrm{KUD} = \mathrm{MAE}\big(P_h(\text{KP-1}) \,\|\, P_g(\text{KP-1})\big), \quad (13)$$

where $P_h$ is the distribution over human responses and $P_g$ is the distribution over generated responses.
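Given per-sample PLCS values and binned KP-1 histograms, both proposed metrics reduce to a few lines. The threshold 0.7 follows the definition above; the input numbers are invented for the example.

```python
def pod(plcs_values, threshold=0.7):
    """Proportion of Degenerated samples: share of responses whose
    PLCS with the knowledge exceeds the threshold."""
    return sum(1 for v in plcs_values if v > threshold) / len(plcs_values)

def kud(p_human, p_generated):
    """Knowledge Utilization Difference (Eq. 13): mean absolute error
    between binned KP-1 distributions of human and generated responses."""
    return sum(abs(h - g) for h, g in zip(p_human, p_generated)) / len(p_human)

share = pod([0.9, 0.5, 0.8])               # 2 of 3 samples degenerate
gap = kud([0.2, 0.3, 0.5], [0.1, 0.3, 0.6])  # toy 3-bin histograms
```

A KUD of zero would mean the model's knowledge precision histogram exactly matches the human one, which is the behavior Figure 4 shows MACL approaching.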
For human evaluation, we randomly select 50 responses from the test seen set and 50 from the test unseen set, and conduct an aspect-based pairwise preference test. Specifically, for a given context, we pair our model's response with a response from a baseline and ask five well-educated annotators to choose the superior response based on the following four aspects: (1) Coherence: which model generates more contextually coherent responses; (2) Engagingness: which model generates more interesting responses; (3) Informativeness: which response contains more knowledge; (4) Interactiveness: which model generates more interactive responses that make the user want to continue the conversation. We compute Fleiss' kappa (Fleiss, 1971) among the annotators to measure their agreement.

Experimental Results
Table 2 presents the automatic evaluation results on the WoW dataset. Our method, MACL, significantly mitigates knowledge regurgitation, reducing the proportion of degenerated samples by 13.42% on the test seen set and 16% on the test unseen set compared to the strongest baseline, Scalegrad. On the KUD metric, the improvements of MACL are striking, exhibiting a stark reduction in the discrepancy from humans: about ten times better than the best-performing baseline, NT, on the test seen set and five times better on the test unseen set. The distribution of 1-gram and 2-gram knowledge precision, shown in Figure 4, closely aligns with that of humans, indicating that MACL successfully internalizes knowledge into responses and achieves a knowledge utilization ability similar to humans.
While the baseline methods effectively address traditional text degeneration, their impact on alleviating knowledge regurgitation is less significant. This suggests that knowledge regurgitation is distinct from conventional repetition-style degeneration. We believe this is because KGDG requires that knowledge be integrated into the response, so simply reducing the prediction probability of all tokens in the prefix is insufficient. This highlights the effectiveness of MACL's dynamic sampling and dynamic penalty mechanisms in adapting to the KGDG task.
In terms of conventional metrics that evaluate content quality, MACL achieves state-of-the-art (SOTA) performance on perplexity, average (Avg.), and extrema (Ext.). This indicates that the quality of the generated responses is high and comparable to human performance. However, on the BLEU metric, which is based on n-gram overlap, MACL performs slightly worse than the baselines. We attribute this to the fact that longer responses, which reveal knowledge regurgitation, contain a higher number of knowledge tokens. As a result, the hit rate of knowledge tokens in the response is higher, leading to inflated scores for degenerated responses despite their contextual incoherence. In the example provided in Table 10, the BLEU-1 score of the MLE-generated response is higher than that of the MACL-generated response (30.77 > 17.65).
More experimental results with other pre-trained models and decoding strategies are detailed in Appendix C.
The human evaluation results are shown in Table 3. Notably, MACL consistently outperforms all compared methods. MACL effectively internalizes knowledge into its generated responses; as a result, the responses it produces are more natural and human-like, avoiding direct narration. On the Coherence aspect, MACL maintains a relatively better focus on the user, carefully balancing attention between the user's utterance and the knowledge during generation. While MACL may lose some information compared to SimCTG's responses, excessive knowledge is inappropriate in chit-chat scenarios. The kappa results indicate a moderate level of agreement among the annotators.

Ablation Study
To analyze the sources of improvement achieved by MACL, we conduct an ablation study. The ablation results, shown in Table 5, indicate that all of these design components contribute to the effectiveness of MACL. Removing the dynamic reweighting factor results in a significant increase in perplexity, demonstrating that dynamic weighting is indeed beneficial for mitigating the inverse optimization problem. Token-level contrastive learning plays a crucial role in aligning the model's knowledge utilization with that of humans, as removing this loss leads to a significant degradation in the KUD metric. Both the token-level and sequence-level contrastive learning losses are key to suppressing knowledge regurgitation; removing either of them increases the PoD metric. The performance improvement brought by sequence-level contrastive learning mainly stems from the selection of hard negative examples: when only from-batch negatives are used, the negative examples are easily distinguishable from the ground truth, and the model cannot effectively learn additional capabilities through the loss.

Case Study
To better evaluate the quality of response generation, we selected examples generated by MLE, NT, Scalegrad, and MACL for comparison.

Related Work
Dialogue Generation Dialogue generation has attracted much research attention (Wang et al., 2022; Lim et al., 2023; Zhang et al., 2018), especially knowledge-grounded conversation. Research has mainly concentrated on improving the performance of knowledge selection (Sun et al., 2023; Xu et al., 2022; Zhan et al., 2021).

Neural Text Degeneration Neural text degeneration refers to the problem that texts generated by language models tend to be dull and incoherent and contain undesirable repetitions at different levels (Holtzman et al., 2020; Li et al., 2020).
Existing methods mainly alleviate this problem from two angles: decoding strategies and training strategies (Su et al., 2022; Lagutin et al., 2021). Holtzman et al. (2020) found that the candidate token distributions produced by existing language models have unreliable long tails, so they truncate them to reduce the occurrence of incoherent sequences; Li et al. (2020) adjusted the negative training method to alleviate generation problems in open-domain dialogue such as inconsistency and contradiction. Su et al. (2022) attributed degeneration to the anisotropy of token representations in vector space and designed a solution spanning both the training and decoding stages.

Conclusion
In this paper, we investigate a distinctive degeneration phenomenon in Knowledge-Grounded Dialogue Generation referred to as Knowledge Regurgitation, which is prevalent in pre-trained language models. To address this challenge, we present a novel solution called Multi-level Adaptive Contrastive Learning (MACL). Our approach tackles the problem by dynamically sampling negative examples and penalizing degeneration behaviors at both the token level and the sequence level. Experiments on the WoW dataset demonstrate that our approach significantly mitigates knowledge regurgitation.

Limitations
MACL effectively addresses the issue of Knowledge Regurgitation. However, we acknowledge certain limitations of our work: (1) Due to limited computational resources, we have focused on demonstrating the effectiveness of our method on pre-trained encoder-decoder language models with fewer than 1 billion parameters. Our method can address the degeneration problem in lightweight chit-chat models, and we plan to explore degeneration in large language models (LLMs) in future work.
(2) As existing datasets did not align with the specific scenario we wanted to explore, we solely evaluated our method on the WoW dataset.Although the dataset size may not be large enough, it provided valuable insights for our research.
(3) Our sequence-level contrastive loss involves generating negative examples during the training stage, which requires multiple calls to the pre-trained language model to compute the hidden states of negative targets. We have not yet optimized our code to run these calls in parallel, which slows training.

Ethics Statement
The benchmark dataset used in our experiments, WoW (Dinan et al., 2018), is a well-regarded, open-source dataset collected by crowdsourced workers. It was compiled with rigorous adherence to user privacy protection protocols, ensuring the exclusion of any personal information. Furthermore, our proposed approach consciously upholds ethical standards and societal fairness, ensuring that no prejudice is introduced. For our human evaluation component, all participants were volunteers who were provided with comprehensive information about the research's purpose, ensuring informed consent. In addition, all participants received fair and reasonable compensation for their contributions.

A Details of Dataset
The WoW data was sourced from a crowdsourcing website. During data collection, the user side plays the role of the apprentice, and the agent side plays the role of the wizard. The wizard has access to knowledge retrieved from Wikipedia as a ground source for generating informative responses, while the apprentice prefers speaking common utterances. The WoW dataset consists of 22,311 dialogues with 201,999 turns, which are divided into a training set, a validation set, a test seen set, and a test unseen set. The topics in the test unseen set never appear in the training set.

B Problem of Negative Training (NT)
As gradient-based optimization progresses, the gradient of the loss is expected to approach zero around a minimum. Therefore, the probability of the ground-truth token $p_i$ should increase towards 1 to decrease the gradient norm $|\nabla \mathcal{L}_a|$ and achieve convergence to a value close to 0 during training.
However, in Equation (14), when $p_i > \frac{1}{1+\alpha}$, the value of the gradient norm $|\nabla \mathcal{L}_a|$ becomes larger than 1. Consequently, the training procedure reduces $p_i$ to minimize the gradient norm, which contradicts the optimization principle. This issue prevents the loss function from converging during the later stages of training.

C More Experimental Results
As observed in Table 8, MACL works on T5-large as well, effectively reducing knowledge regurgitation. Furthermore, in addition to beam search with a beam size of 3, we also experimented with beam search with a beam size of 5, greedy decoding, and nucleus sampling.
The results in Table 9 show that larger beam sizes are more prone to degeneration. Although nucleus sampling alleviates knowledge regurgitation somewhat, it is far less effective than MACL. The results also illustrate that our approach is compatible with these decoding strategies, as it mitigates degeneration further.

D More Cases
In the example provided in Table 12, MACL effectively incorporates the knowledge about the movie "Blue Skies" into its response.It starts by acknowledging that there are indeed beautiful views and blue skies along the route of the Royal Blue train, and then proceeds to mention relevant content about the movie with the same name.On the other hand, the baselines fail to internalize the knowledge and generate responses that are incoherent and lack interactivity.

E Details of the Training Algorithm
See the pseudo-code below for details.

F More results on the Holl-E dataset
The WoW dataset is well collected, with a specific focus on engagingness and interactiveness. The collectors crafted responses by integrating grounded knowledge naturally, and they were forbidden from duplicating knowledge snippets to save time. By comparison, the collection of the Holl-E dataset is relatively rough. Because there is apparent replication between the ground-truth responses and the introduced knowledge (meaning the ground-truth response is not an ideal response with knowledge internalization), it is unsuitable for evaluating the prevention of knowledge regurgitation. The collection sometimes takes movie comments as both external knowledge (input) and ground-truth responses (target), leading to problematic ground-truth responses. The comparison between the two datasets is shown in Table 6. The experimental results on the PoD metric show that MACL still effectively mitigates the knowledge replication phenomenon (-17.71%). However, the dataset remains unsuitable for evaluating the knowledge regurgitation problem in light of the other metrics. All the baseline methods achieve notably low perplexity and remarkably high BLEU-1 scores, indicating excessive similarity between the knowledge input and the generation targets. Besides, generated responses with less severe knowledge replication result in a broader discrepancy in knowledge utilization (higher KUD values). This phenomenon illustrates the problematic ground-truth responses.

Figure 1: An example showing how current models make rigid use of knowledge, causing responses that are incoherent with the context. Both T5-large and BART-large are fine-tuned on the WoW dataset with MLE.

Figure 2: Knowledge Precision Distribution of human responses (red) and machine-generated responses (blue) on the WoW dataset.
NT: Welleck et al. (2019) proposed the unlikelihood loss, combining it with the MLE loss. Scalegrad: Lin et al. (2021) modified the gradient of the MLE objective, encouraging the model to use novel tokens; we designate knowledge tokens as non-novel tokens. ND: Li et al. (2022) developed a training paradigm known as negative distillation, designed to steer the model away from undesirable degenerated responses; we use the MLE-finetuned model as the negative teacher. CTloss: Jiang et al. (2022) put forward a contrastive token learning objective that promotes label tokens in the ranking at each step while demoting negative tokens, leaving other irrelevant tokens unaffected. SimCTG: Su et al. (2022) introduced a contrastive objective designed to learn discriminative and isotropic token representations by increasing the distances between distinct tokens' representations.

Figure 4: Knowledge Precision Distribution of human responses (red) and MACL-generated responses (orange) on the WoW dataset.
Figure 3: An overview of the MACL framework. The mark in the top left corner of the figure indicates that certain response tokens are identical to the knowledge tokens enclosed in brackets. Owing to the generation inertia, i.e., the shortcut, the model predicts an exceptionally high probability for the next knowledge token, surpassing the probability of the ground-truth token. MACL breaks the knowledge snippet chain effectively and outputs more reasonable distributions. $z_x$ represents the source sequence (context and knowledge), and $z_y$ stands for the target sequence (response). The feature representations are derived by pooling the outputs of the encoder (source sequence) or decoder (target sequence), both of which are differentiable in this context.
Arora et al. (2022) suggest that the degeneration problem is a result of exposure bias, which motivates us to address this issue by leveraging sequence-level contrastive learning during the training phase. By exposing the model to negative targets exhibiting degeneration, we aim to help the model learn to avoid predicting them.

Table 2: Automatic evaluation results on the WoW dataset (BART-large). The best results are highlighted in bold. "*" denotes that the improvement over the best baseline is statistically significant (t-test with p-value < 0.01).

Table 3: Human evaluation results on the WoW dataset (%). The results are statistically significant with p-value < 0.05.

Table 4: The proportion of samples with degeneration phenomena annotated by humans on the 100 extracted samples, and the precision of our proposed PoD metric.

Table 5: Ablation study on the WoW dataset. -Reweighting denotes removing the dynamic reweighting method from MACL. -TCL denotes removing the token-level contrastive learning. NaiveSCL denotes using only from-batch negative samples in InfoNCE. -SCL denotes removing the sequence-level contrastive learning method.


Table 6: Comparison between the WoW dataset and the Holl-E dataset.

Table 7: Automatic evaluation results on the Holl-E dataset (BART-large). The best results are highlighted in bold.