Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment

Knowledge-grounded dialogue systems based on pretrained language models (PLMs) are prone to generating responses that are factually inconsistent with the provided knowledge source. In such inconsistent responses, the dialogue models fail to accurately express the external knowledge they rely upon. Inspired by previous work identifying that feed-forward networks (FFNs) within Transformers are responsible for factual knowledge expression, we investigate two methods to efficiently improve the factual expression capability of FFNs, via knowledge enhancement and via alignment respectively. We first propose K-DIAL, which explicitly introduces extended FFNs in Transformers to enhance factual knowledge expressions given the specific patterns of knowledge-grounded dialogue inputs. Additionally, we apply the reinforcement learning for factual consistency (RLFC) method to implicitly adjust FFNs' expressions in responses by aligning with gold knowledge for the factual consistency preference. To comprehensively assess the factual consistency and dialogue quality of responses, we employ extensive automatic measures and human evaluations, including sophisticated fine-grained NLI-based metrics. Experimental results on the WoW and CMU_DoG datasets demonstrate that our methods efficiently enhance the ability of the FFN module to convey factual knowledge, validating their efficacy in improving the factual consistency of knowledge-grounded dialogue systems.


Introduction
Pretrained dialogue models assisted by external knowledge sources have demonstrated remarkable performance in generating knowledgeable and reliable responses in many conversational applications (Dinan et al., 2019; Moghe et al., 2018; Ghazvininejad et al., 2018; Gopalakrishnan et al., 2019). However, these knowledge-grounded dialogue systems (KDS) are often hampered by factual inconsistency or even the "hallucination" problem (Santhanam et al., 2021; Ji et al., 2023), which has been widely investigated in many natural language generation (NLG) tasks such as abstractive summarization (Zhu et al., 2021; Nan et al., 2021; Xie et al., 2021; She et al., 2023) and machine translation (Lee et al., 2019). The factually inconsistent responses produced by dialogue models are linguistically fluent and contextually coherent but deviate from the grounding factual knowledge, as exemplified on the left-hand side of Figure 1, potentially conveying misinformation to users and restricting the applicability of dialogue agents.
Factual consistency in KDS means that a response "accurately portrays the input knowledge (assuming the provided knowledge is correct)", as defined in prior research (Santhanam et al., 2021). Identifying the intrinsic causes of factual inconsistency in KDS remains persistently challenging, as the generated responses are jointly affected by the conversational history, the grounding knowledge, and the dialogue PLM. Generally, the dialogue context and grounding knowledge are assumed to be accurate and considered ground truth, so factual inconsistency can be naturally attributed to the innate limitations of PLMs. The prior knowledge in PLMs learned from large-scale unlabeled corpora might be incomplete, outdated, or incorrect (Elazar et al., 2021; Cao et al., 2021), inevitably interfering with the expression of the provided factual knowledge and consequently resulting in untrustworthy responses, as shown in Figure 1, where the knowledge stored in the model is likely to be "France won the World Cup champion.". Therefore, it is essential to figure out the mechanism by which language models express factual knowledge. Previous research (Geva et al., 2021a; Dai et al., 2022a) observed that feed-forward networks (FFNs) in Transformers can be viewed as key-value memories that store and activate specific knowledge representations given certain input patterns. Accordingly, we propose two promising solutions to bolster FFNs' ability to produce factual knowledge and enhance factual consistency for KDS.
First, we propose K-DIAL, a knowledge-enhanced dialogue generation method that explicitly incorporates extended FFN modules in Transformers for knowledge enhancement, improving the model's ability to express the gold knowledge given specific knowledge snippets and dialogue contexts. As illustrated in Figure 1, the factual knowledge "Argentina won the 2022 World Cup champion." together with the contexts is directly used to enhance the expression of "Argentina" in the response. Notably, the parameters in the extended FFNs are updated solely over the knowledge-related tokens occurring in both the grounding knowledge and the responses, ensuring efficient improvement of factual consistency while maintaining the dialogue quality of KDS.
Second, we propose the reinforcement learning for factual consistency (RLFC) technique, which leverages alignment with the factual consistency preference to implicitly enable FFNs to express factual knowledge in responses. As shown in Figure 1, the response is aligned with the factual knowledge "Argentina won the 2022 World Cup champion." to implicitly adjust the FFNs to express it accurately. The reward model used for alignment is a binary NLI model serving as a factual consistency classifier, trained on publicly available benchmarks (Santhanam et al., 2021; Gupta et al., 2022; Dziri et al., 2021) for factual consistency evaluation of KDS.
The obtained consistency score of the reward model is utilized for RLFC training to induce factual expressions within FFNs.
To assess the factuality and conversationality of dialogue generations, we conduct a comprehensive evaluation employing both automatic and human evaluations, including our carefully defined fine-grained NLI metrics based on recent human-annotated datasets released by Labadie et al. (2021), Dziri et al. (2022), and Gupta et al. (2022). Significant performance improvements across the aforementioned metrics are obtained on both the WoW (Dinan et al., 2019) and CMU_DoG (Zhou et al., 2018) datasets using K-DIAL and RLFC, demonstrating their efficiency in improving the expression of factual knowledge within FFNs and mitigating the risk of factual inconsistency for KDS.
Our contributions are summarized as follows: (1) We propose K-DIAL, which explicitly extends FFN modules in Transformers for knowledge enhancement to express factual knowledge in responses to improve factual consistency while maintaining conversational properties.
(2) We propose RLFC, which implicitly promotes the FFNs' ability to express factual knowledge by aligning generated responses with the gold knowledge for the factual consistency preference of KDS.
(3) We obtain significant improvements across a range of sophisticated automatic and human evaluation metrics, demonstrating the efficacy of our two proposed methods in achieving superior performance in terms of both the factual consistency and dialogue quality of KDS.

Methodology
In this section, we first introduce the KDS models used in this work and present the view of FFNs in Transformer models as key-value memories. We then present our knowledge-enhanced dialogue generation method K-DIAL and the reinforcement learning for factual consistency (RLFC) technique respectively.

Knowledge-Grounded Dialogue Model
As depicted in Figure 2, the causal graph illustrates the procedure of KDS generation, where the response Y is jointly determined by the dialogue history X, the retrieved knowledge K, and the PLM M. In this study, we employ GPT2 (Radford et al., 2019) as the PLM and fine-tune it on grounded dialogue datasets to obtain the dialogue model. The input to the model concatenates a piece of knowledge K and a dialogue history X consisting of utterances segmented by the speaker types <bot> and <user>. Distinct special token-type embeddings are employed to delineate each part of the input for all GPT-2 models. For simplicity, we directly leverage the gold knowledge in this work, so the input knowledge is naturally correct. The dialogue model is trained to generate the response Y = [y_1, y_2, ..., y_m] given the input by minimizing the cross-entropy loss:

L_CE = − Σ_{t=1}^{m} log P(y_t | y_{<t}, K, X)   (1)
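For concreteness, the input construction described above can be sketched in plain Python; the helper name, the whitespace tokenization, and the token-type id scheme are our own illustrative assumptions, not the paper's code.

```python
def build_kds_input(knowledge, history, speakers):
    """Concatenate a knowledge snippet and a dialogue history into one model
    input, tagging each utterance with its speaker type.

    `knowledge` is the grounding snippet K; `history` lists the utterances of
    X; `speakers` lists 'bot'/'user' per utterance. The returned token-type
    ids (0 = knowledge, 1 = user, 2 = bot) mimic the segment embeddings used
    to delineate each part of the input.
    """
    parts = [knowledge]
    types = [0] * len(knowledge.split())
    for utt, spk in zip(history, speakers):
        tagged = f"<{spk}> {utt}"
        parts.append(tagged)
        types += [1 if spk == "user" else 2] * len(tagged.split())
    return " ".join(parts), types

text, seg = build_kds_input(
    "Argentina won the 2022 World Cup.",
    ["Who won the last World Cup?"],
    ["user"],
)
```

In a real system the concatenated string would then be tokenized by the GPT-2 tokenizer, with the type ids mapped to the corresponding token-type embeddings.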

Key-Value Memories in FFNs
Prior studies have shown that PLMs act as knowledge bases (Petroni et al., 2019) and that knowledge is implicitly preserved in the parameters of the FFNs in Transformers (Dai et al., 2022a). Generative PLMs such as GPT-2 or GPT-3 (Radford et al., 2019; Brown et al., 2020) feature a deep stack of Transformer decoder blocks (Vaswani et al., 2017). As shown in Figure 2 (b), each feed-forward network (FFN) module in a Transformer block contains a two-layer linear model with an activation function in between. Assume that h_i^l ∈ R^{d_m} represents the i-th hidden input of the FFN module in the l-th Transformer layer, with word embedding dimension d_m. The normalized h_i^l is then fed into the FFN as:

FFN(h_i^l) = Act(h_i^l · (Θ_k^l)^T) · Θ_v^l   (2)

where Θ_k^l and Θ_v^l denote the key and value weight matrices of the FFN and Act(·) is the activation function (bias terms are omitted). Each key in Θ_k^l captures a textual pattern over the input prefix and is only triggered upon the occurrence of specific input patterns. The resulting memory coefficients are then employed to compute a weighted sum over the values Θ_v^l, inducing a distribution over the vocabulary for the next-token prediction. Dai et al. (2022a) further propose the concept of knowledge neurons in FFNs that store and activate factual knowledge predictions. These observations provide insight into improving factual consistency for KDS by augmenting PLMs to recall and output factual knowledge in responses.
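The key-value memory view can be sketched as a toy forward pass in plain Python; the tiny dimensions, the ReLU activation, and the hand-picked keys and values are illustrative assumptions.

```python
def ffn_as_memory(h, keys, values):
    """Compute FFN(h) = Act(h · K^T) · V for one hidden vector.

    Each row of `keys` acts as a pattern detector over the input; the
    activated coefficients weight the corresponding rows of `values`,
    which can be read as stored (knowledge) representations.
    """
    relu = lambda x: x if x > 0.0 else 0.0
    # memory coefficients: how strongly each key pattern fires on h
    coeffs = [relu(sum(hi * ki for hi, ki in zip(h, key))) for key in keys]
    # weighted sum of the value rows selected by the active keys
    d_m = len(values[0])
    return [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(d_m)]

h = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0]]      # the second key never fires on h
values = [[0.0, 2.0], [5.0, 5.0]]
out = ffn_as_memory(h, keys, values)   # only the first value row is recalled
```

The point of the sketch is that only keys matching the input pattern contribute, so editing or extending the key-value pairs changes which stored representation is recalled.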

K-DIAL: Knowledge-Enhanced Dialogue Generations
KDS is expected to generate more reliable and knowledgeable responses in knowledge-intensive situations by leveraging the wealth of information in external knowledge. Even when gold knowledge is provided, the models still struggle to faithfully express the gold knowledge they rely on, resulting in factual inconsistency, manifesting for example in responses that are factually incorrect or uninformative. As shown in Figure 2 (a), where X and K are considered always correct, we can naturally infer that the inconsistency arises from M.
The knowledge in PLMs inevitably contains inaccurate, outdated, incomplete, or redundant information (Elazar et al., 2021; Cao et al., 2021), which may influence the factual knowledge predictions of the FFNs. Factual knowledge, or world knowledge, is generally represented by entities in language (e.g., dates or locations) (Roberts et al., 2020). As illustrated in Figure 2 (b), we therefore train the extended FFNs with a cross-entropy loss restricted to knowledge entity tokens:

L_KCE = − (1/n′) Σ_{i=1}^{m} 1_{ỹ_k}(y_i) log P(y_i | y_{<i}, K, X)   (3)

where n′ is the number of entities in Y and 1_{ỹ_k}(y_i) denotes whether y_i belongs to the entity set ỹ_k.
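The entity-restricted loss can be sketched as follows, assuming the model's per-token probabilities for the gold response are already available; the function name and its inputs are hypothetical, not the paper's code.

```python
import math

def knowledge_entity_ce(token_probs, entity_mask):
    """Cross-entropy averaged only over knowledge-entity positions.

    `token_probs` holds the model probability assigned to each gold response
    token; `entity_mask[i]` is 1 if token i belongs to the recognised entity
    set (e.g. found by an NER tagger in both knowledge and response), else 0.
    """
    n_entities = sum(entity_mask)
    if n_entities == 0:
        return 0.0
    loss = -sum(m * math.log(p) for p, m in zip(token_probs, entity_mask))
    return loss / n_entities

# "Argentina won" with only "Argentina" treated as a knowledge entity:
# non-entity tokens contribute nothing to the loss.
loss = knowledge_entity_ce([0.5, 0.9], [1, 0])
```

Masking the loss this way means gradient updates to the extended FFNs come only from the entity tokens that carry the factual content.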
The training process of the K-DIAL framework proceeds in two steps. First, all the parameters of the original PLM are frozen, and the loss L_KCE is computed only over the parameters of the extended FFNs. Afterward, we further adapt the knowledge-enhanced model on the KDS datasets using supervised fine-tuning with Equation (1) while keeping the parameters in the extended FFNs fixed. The word embedding dimension and hidden size of the extended FFN module are set equal to those of the corresponding Transformer FFN layers. Note that the K-DIAL framework is only applied to the top 3 layers of the model in our experiments.
As illustrated in Figure 2, the extended FFNs are supposed to predict "Argentina" as the next token given the specific knowledge and context. The K-DIAL framework takes advantage of the FFNs' ability to learn the complex dependency between the knowledge snippet and the dialogue by activating related entity tokens. In this way, factually consistent entity words can be triggered in the response. The ability of the PLM to express factual knowledge is thereby improved while its general language ability is maintained.

RLFC: Reinforcement Learning for Factual Consistency
For KDS, we prefer knowledgeable responses that are faithful to the gold knowledge. However, since PLMs are trained on massive unlabeled corpora, KDS models do not inherently prioritize constantly outputting factual knowledge and consistent responses. Aligning with the factual consistency preference can implicitly encourage the FFNs of Transformers to convey factual knowledge and ultimately reinforce the factual consistency of responses. Inspired by recent progress in reinforcement learning from human feedback (RLHF) for aligning with human preferences (Ouyang et al., 2022; Ziegler et al., 2019; Christiano et al., 2017), such as mitigating toxicity (Faal et al., 2023), we regard factual consistency as one type of preference and thus propose the reinforcement learning for factual consistency (RLFC) method in this work.
Specifically, we first design a reward model using a factual consistency classifier. Several recent publicly human-annotated benchmarks and datasets contain information on the factual consistency preference (Santhanam et al., 2021; Gupta et al., 2022; Dziri et al., 2021), where definitions similar to factual consistency, such as "attributed" and "supported", indicate whether a response utilizes and follows the provided knowledge. We take advantage of these well-aligned data to train a binary NLI model, serving as a reward model that provides informative reward signals for RL training. The reward model R(·) is optimized using the following binary cross-entropy loss:

L_R = − Σ_i [ ŷ^(i) log R(K^(i), Y^(i)) + (1 − ŷ^(i)) log(1 − R(K^(i), Y^(i))) ]

where the knowledge-response pair (K^(i), Y^(i)) is the input to the reward model and ŷ^(i) is the label. As illustrated in Figure 3, the dialogue model to be optimized by RLFC serves as the policy model. The response Y generated by the policy model and the gold knowledge snippet K are fed into the reward model to obtain the consistency reward score r_1 = R(K, Y), which is mainly used to align the FFNs' factual expressions with the preference. The reward model returns a higher score for factually consistent pairs, facilitating factual expressions by the policy model. Furthermore, a reference model generating a response Y′ is also introduced, which is the dialogue model before RLFC training. The KL divergence r_2 = KL[Y || Y′] between the outputs of the policy and reference models is used as a penalty term to ensure the generated responses do not diverge too far from the originals. The combined objective r = r_1 − β r_2 (with KL coefficient β) is optimized via the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017).
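The per-sequence PPO reward can be sketched as below; the classifier score and the log-probabilities are stubbed out, and the KL coefficient `beta` is an assumed hyperparameter rather than a value reported in the paper.

```python
def rlfc_reward(consistency_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine the reward model's consistency score with a KL penalty.

    `consistency_score` is R(K, Y) from the factual-consistency classifier;
    the KL term is estimated on the sampled response tokens from per-token
    log-probs under the policy and the frozen reference model.
    """
    # Monte-Carlo KL estimate on the sampled tokens: E[log pi - log ref]
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / len(policy_logprobs)
    return consistency_score - beta * kl

# identical policy and reference => zero KL, reward equals consistency score
r = rlfc_reward(0.8, [-1.2, -0.4], [-1.2, -0.4])
```

In practice a library such as trl handles this combination inside its PPO loop; the sketch only shows the shape of the reward signal.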

Datasets
WoW The Wizard of Wikipedia (WoW) is a large-scale multi-turn knowledge-grounded dialogue dataset (Dinan et al., 2019) collected through human-human crowd-worker chats, where a "wizard" acting as the <bot> responds to an "apprentice" acting as the <user> in a knowledgeable way, given evidence from external Wikipedia documents. We focus on modeling only the responses by the "wizard", provided the selected gold-label evidence and the previous dialogue context.
CMU_DoG The CMU Document Grounded Conversations Dataset (CMU_DoG) (Zhou et al., 2018) is a collection of conversations in which two users discuss various famous movies given related Wikipedia documents. Utterances by the user who can access the movie evidence in the documents are treated as <bot> responses for dialogue modeling. Note that the initial configuration of CMU_DoG provides the models with a gold knowledge paragraph alongside the dialogue. In this work, we split these knowledge documents into sentence pieces and select the most relevant one as the grounding knowledge, keeping the average number of knowledge tokens comparable to WoW.
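One simple way to split a document and select a grounding sentence, along the lines described above, is unigram-overlap scoring; the period-based splitting and the scoring rule here are our assumptions, not the paper's exact procedure.

```python
def select_knowledge_sentence(document, dialogue_context):
    """Split a knowledge document into sentences and return the one with the
    highest unigram overlap with the dialogue context."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    ctx = set(dialogue_context.lower().split())

    def overlap(sentence):
        return len(ctx & set(sentence.lower().split()))

    return max(sentences, key=overlap)

doc = ("The movie was released in 1999. It grossed 463 million dollars. "
       "The director also wrote the script")
best = select_knowledge_sentence(doc, "How many dollars did it gross")
```

A production pipeline would use a proper sentence splitter and a stronger relevance scorer (e.g. TF-IDF or a dense retriever), but the selection logic is the same.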
More data processing details can be referred to in Appendix A.

Implementation Details
For the dialogue generation models, we leverage the GPT2 series (GPT2-Medium (M), GPT2-Large (L), GPT2-XL) models (Radford et al., 2019) implemented with the HuggingFace library (Wolf et al., 2020) based on PyTorch (Paszke et al., 2017). All the PLMs are further fine-tuned on the above WoW and CMU_DoG dialogue datasets by minimizing the cross-entropy loss in Equation (1). The Adam optimizer is used in mini-batch mode for all models. In the decoding stage, we use the beam search algorithm and set the number of beams to n = 5. Further model setting information can be found in Appendix B.
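Beam search with n = 5 keeps the 5 highest-scoring partial sequences at each decoding step; the following minimal sketch uses a toy next-token scorer standing in for the real dialogue model.

```python
import math

def beam_search(score_next, start, steps, num_beams=5):
    """Keep the `num_beams` best partial sequences by total log-probability.

    `score_next(seq)` returns a list of (token, prob) continuations for a
    partial sequence; a real system would call the dialogue LM here.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for tok, p in score_next(seq):
                candidates.append((seq + [tok], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]  # best full sequence

# toy LM: always prefers token "a" over token "b"
toy = lambda seq: [("a", 0.9), ("b", 0.1)]
best = beam_search(toy, "<s>", steps=2, num_beams=5)
```

In the actual experiments this is delegated to the HuggingFace `generate` routine with `num_beams=5`; the sketch only illustrates the search itself.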

Metrics
We employ a range of comprehensive metrics to gauge the factuality and conversational ability of KDS, entailing both a series of automated techniques and human evaluations.

Lexical and Semantic Metrics
In this work, we adopt token-level F1 unigram overlap, BLEU, and ROUGE-L metrics to measure the lexical similarity between generated and ground-truth responses for dialogue quality evaluation. Additionally, Knowledge F1 (KF1) (Shuster et al., 2021) and BERTScore (Zhang* et al., 2020) (BERT.) are used to capture the lexical overlap and semantic similarity between the response and the grounding knowledge.
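Token-level F1 can be sketched as below; Knowledge F1 (KF1) applies the same formula with the grounding knowledge, rather than the reference response, as the comparison text.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Token-level F1: harmonic mean of unigram precision and recall.
    For KF1, pass the grounding knowledge as `reference`."""
    hyp, ref = hypothesis.split(), reference.split()
    common = Counter(hyp) & Counter(ref)   # clipped per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("argentina won the cup", "argentina won the world cup")
```

Real implementations additionally lowercase, strip punctuation, and remove articles before counting; those normalization steps are omitted here.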
Fine-Grained NLI-based Metrics NLI-based metrics are robust and widely used to detect factual inconsistency or hallucinations in knowledge-intensive NLP tasks (Dušek and Kasner, 2020; Mishra et al., 2021; Falke et al., 2019; Welleck et al., 2019; Chen et al., 2023). We therefore developed a synthetic dataset for pre-training a BERT-based (Devlin et al., 2019) NLI model. The synthetic dataset takes factually consistent samples derived from the ground-truth response and gold knowledge pairs in WoW; inconsistent responses are generated by random pairing, negation, and entity swapping (Kryscinski et al., 2020a; Gupta et al., 2022).
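The synthetic negative construction (random pairing and entity swapping) can be sketched as follows; the entity substitution table and the example strings are made up for illustration, and negation is omitted for brevity.

```python
import random

def make_synthetic_pairs(knowledge, response, other_responses, entities, rng):
    """Build (knowledge, response, label) NLI examples: the gold pair is
    consistent (label 1); perturbed responses are inconsistent (label 0)."""
    examples = [(knowledge, response, 1)]
    # random pairing: a response drawn from an unrelated dialogue
    examples.append((knowledge, rng.choice(other_responses), 0))
    # entity swapping: replace an entity with a different one of the same type
    ent, substitute = rng.choice(entities)
    if ent in response:
        examples.append((knowledge, response.replace(ent, substitute), 0))
    return examples

rng = random.Random(0)
pairs = make_synthetic_pairs(
    "Argentina won the 2022 World Cup.",
    "Argentina won it in 2022.",
    ["I love jazz music."],
    [("Argentina", "France")],
    rng,
)
```

The resulting labeled pairs are what the binary NLI classifier is pre-trained on before fine-tuning on the human-annotated benchmarks.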
Following Santhanam et al. (2021), we develop three metrics for fine-grained evaluation of factual consistency and fine-tune the pre-trained NLI model on the datasets released by Santhanam et al. (2021), Gupta et al. (2022), and Dziri et al. (2021), which are also used for the reward model training of RLFC. The three fine-grained NLI metrics are designed to inspect: 1) Verification (Verif.): whether a response contains verifiable information; 2) Hallucination (Hallu.): whether a response does NOT comprise hallucinated content; and 3) Factual Correctness (Fact.): whether a response is factually consistent with the grounding knowledge.
Although there may be slight variations across the definitions in the aforementioned benchmarks, their shared objective is to enhance the faithfulness and reliability of responses to the provided gold knowledge. The data processing and alignment details are presented in Appendix C. Honovich et al. (2021) proposed the Q2 metric (https://github.com/orhonovich/q-squared), which employs a question generation system, a question answering system, and an NLI model to find the corresponding answer span in the knowledge that the response should be grounded in, in order to evaluate the factual consistency of KDS.
Human Evaluation We conduct human evaluations to assess various dimensions of generation quality. Annotators were asked to judge: 1) whether the response is fluent and understandable (Flue.); 2) whether the response is contextually coherent given the previous conversation (Cohe.); and 3) whether the response is factually correct (Fact.).
All annotators rated each quality on a two-level Likert scale from 0 (Not Fluent, Not Coherent, Inconsistent) to 1 (Fluent, Coherent, Consistent) to evaluate the fluency, coherence, and factual consistency of the generated responses. The averaged human evaluation scores are reported.

Results of Automatic Evaluations
In Table 1, we present the experimental results of various GPT2-based dialogue PLMs using K-DIAL and RLFC on the WoW and CMU_DoG test sets. Several trends can be observed: 1) The effects of K-DIAL: GPT2 series models using K-DIAL outperform all standard dialogue models in both factual consistency and dialogue quality for KDS on both WoW and CMU_DoG test sets. Significant factual consistency improvements of GPT2-L+K-DIAL over GPT2-L on WoW, of 5.36% and 7.63% absolute in Fact. and Q2 respectively, indicate that K-DIAL effectively enhances factual expressions. On CMU_DoG, the largest factual consistency improvements, of 3.84% and 4.11% absolute in Fact. and Q2, are obtained on GPT2-M after applying K-DIAL.
Improvements in the KF1 measure suggest that the responses generated with K-DIAL are more knowledgeable and faithful to the supporting knowledge. The performance improvements on the conversationality metrics BLEU, F1, and ROUGE-L show that, by enhancing factual expressions in responses, dialogue quality can also be marginally improved.
2) The effects of RLFC: Comparable performance improvements are also obtained using RLFC on the GPT2 dialogue models, demonstrating that RLFC can proficiently improve the factual consistency of KDS on both WoW and CMU_DoG: improvements of 3.77% in Verif. on WoW and 2.41% on CMU_DoG are obtained by GPT2-L+RLFC over the GPT2-L models. RLFC performs better than K-DIAL on the Verif. measure over the standard baseline models, suggesting that RLFC is better at promoting the model's ability to generate verifiable responses by aligning with factual knowledge. The side effect of degradation on the dialogue quality metrics implies that applying RLFC alone cannot effectively maintain the original conversationality of the standard GPT2 dialogue model.
3) The effects of RLFC+K-DIAL: The optimal training strategy for the final models is organized in two stages. We first train the GPT2 models using RLFC and then apply K-DIAL on the obtained model. The best performance is obtained with GPT2-L using the combination of RLFC and K-DIAL, with absolute improvements of 6.18% and 8.64% in Q2 and Fact. respectively over the standard GPT2-L dialogue model on WoW. The combination of RLFC and K-DIAL improves the models' ability to express factual knowledge implicitly and explicitly in a complementary manner, yielding the best performance in terms of both factual consistency and dialogue quality.
4) The effects of model size: We observe that larger performance improvements are attained on the GPT2-M and GPT2-L models using either K-DIAL or RLFC than on the larger-scale GPT2-XL models on both WoW and CMU_DoG test sets, as the GPT2-XL models fine-tuned on the KDS datasets are already more robust in generating factually consistent content. The human evaluation results show a strong correlation with the Fact. and Q2 metrics in Table 1, which confirms that our proposed methods indeed improve factual consistency for KDS. The raters' agreement for each quality is measured separately using the Fleiss' Kappa implementation of statsmodels. All the results (Flue.: 0.99, Cohe.: 0.95, Fact.: 0.77 respectively) indicate substantial to almost-perfect agreement.
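Fleiss' Kappa over an annotation count table can be computed as below (statsmodels provides an equivalent implementation); the small rating table is a made-up illustration.

```python
def fleiss_kappa(table):
    """Fleiss' Kappa for a table of shape (n_subjects, n_categories),
    where table[i][j] counts the raters assigning category j to subject i."""
    n_sub = len(table)
    n_rat = sum(table[0])                       # raters per subject
    # mean per-subject agreement P_i
    p_bar = sum((sum(c * c for c in row) - n_rat) / (n_rat * (n_rat - 1))
                for row in table) / n_sub
    # chance agreement from the marginal category proportions
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (n_sub * n_rat)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# 3 raters, 2 subjects, 2 categories; raters agree perfectly on each subject
kappa = fleiss_kappa([[3, 0], [0, 3]])
```

Note the measure is undefined when chance agreement is 1 (all ratings in one category), so the toy table spreads agreement across both categories.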

Baseline Methods and K-DIAL Variants Comparisons
Experiments of Baseline Comparisons To the best of our knowledge, this work is the first to propose improving factual consistency for KDS. Previous related works investigating the factual consistency of KDS focus only on evaluation methods (Honovich et al., 2021) or datasets (Labadie et al., 2021; Dziri et al., 2022; Gupta et al., 2022), so there is no directly comparable improvement method. Nevertheless, for the knowledge-enhancing method K-DIAL, we carry out experiments on several knowledge injection and enhancement methods for NLG tasks, including K-Adapter (Wang et al., 2021), Kformer (Yao et al., 2022), and Neural Knowledge Bank (NKB) (Dai et al., 2022b), which first integrate substantial exogenous knowledge and are then further adapted to KDS tasks, as the baselines presented in Table 3. Experimental results on GPT2-M dialogue models generally show that all the knowledge injection methods can marginally improve factual consistency but slightly degrade dialogue quality on BLEU and ROUGE-L. K-DIAL outperforms the three baseline knowledge injection methods in both factuality and conversationality, demonstrating superior performance in improving factual consistency without sacrificing dialogue quality for KDS tasks.
More details regarding the baseline configurations and implementations are available in Appendix D.

Experiments of K-DIAL Variants
We also conduct comparison experiments with a variant K-DIAL_α, which updates the extended FFN modules using L_CE as in Equation (1), computing the loss on all tokens rather than the L_KCE of Equation (3) restricted to knowledge entities. Experimental results suggest that optimizing K-DIAL with either L_CE or L_KCE yields comparable performance, as the knowledge information is mainly represented by the knowledge entities and learned by the FFN modules. For efficiency, we adopt updating the extended FFN parameters only on knowledge entities for the K-DIAL method.

Case Analysis of K-DIAL and RLFC
We further present a representative case in Figure 4 to analyze the practical performance of the proposed K-DIAL and RLFC methods in comparison with the standard GPT2-L model. The following trends are found: 1) Both K-DIAL and RLFC can effectively correct the inconsistent response generated by the standard GPT2-L dialogue model that contradicts the factual knowledge in the Wikipedia document, demonstrating their efficacy in improving factual consistency for KDS.
2) The K-DIAL method is more likely to learn the important knowledge snippet and directly express it in responses, which is achieved by extended FFNs' explicit ability to enhance factual knowledge expressions.
3) RLFC implicitly aligns the FFNs' expressions with external gold knowledge for the factual consistency preference, which is more semantically natural and human-like than K-DIAL.

Related Works
Factual Consistency in NLG The issue of factual inconsistency in NLG tasks has attracted increasing attention in many fields such as abstractive summarization, with studies focusing on both improving and evaluating factual consistency (Kryscinski et al., 2020b; Maynez et al., 2020; Xie et al., 2021; Zhu et al., 2021; Nan et al., 2021). Related works were also conducted on data-to-text generation (Dušek and Kasner, 2020; Thomson and Reiter, 2020). In the context of dialogue systems, Dziri et al. (2021) and Gupta et al. (2022) introduced benchmarks for measuring the attribution and fact-checking of dialogue generations with grounding knowledge. Rashkin et al. (2021) added controllable tokens to the input of the dialogue model to generate responses that are more faithful to the source knowledge. Shuster et al. (2021) investigated the Retrieval-Augmented Generation (RAG) approach to reduce knowledge hallucination in conversational agents. Peng et al. (2023) introduced LLM-AUGMENT, a framework for augmenting black-box LLMs with external knowledge and automated feedback to reduce hallucinations.
Enhancing Knowledge in PLMs Previous works have explored ways to incorporate external knowledge into pre-trained language models (PLMs). ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) enhance word representations by incorporating external knowledge graphs. K-Adapter introduces two adapters to inject factual and linguistic knowledge into PLMs respectively (Wang et al., 2021). Inspired by Dai et al. (2022a), recent works add extended FFNs to Transformers, such as Kformer (Yao et al., 2022) and the Neural Knowledge Bank (NKB) (Dai et al., 2022b), to inject and update extra knowledge while keeping the original model parameters frozen. Dong et al. (2022) investigated detecting incorrect knowledge stored in PLMs and proposed CALINET for knowledge calibration.

Conclusion
In this work, we investigate the inadequacy of KDS, which often produce factually inconsistent responses unsupported by the grounding knowledge. We propose two strategies to tackle this issue. K-DIAL introduces extended FFN modules to explicitly enhance factual knowledge expressions given specific input patterns of the knowledge snippets and dialogue contexts. The RLFC technique implicitly adjusts FFNs to augment factual expressions in responses by aligning with the gold knowledge for the factual consistency preference. Experiments show that both K-DIAL and RLFC promote the knowledgeability, factual consistency, and conversational ability of responses, demonstrating the efficacy of our proposed methods in improving the ability of FFNs to express factual knowledge and in generating more informative and reliable responses in dialogue applications.

Limitations
The limitations of this work are summarized below: 1) As shown in Figure 2 (a) and described above, this paper assumes that factual inconsistency arises from the dialogue model M and procedure ➃ in Figure 2 (a). This deviates from reality, where the knowledge retrieval process ➀ may fail and the knowledge K and contexts X are not always correct. A more challenging problem lies in locating the cause of inconsistency across the generation process. In future work, we will conduct a systematic investigation of the factual inconsistency and hallucination problems in KDS.
2) Recently, large-scale language models (LLMs) such as GPT-3 and ChatGPT have demonstrated state-of-the-art performance across a range of NLP tasks. This work was conducted only on the GPT2 series PLMs with at most 1.26B parameters, which is extremely small in comparison with such LLMs containing hundreds of billions of parameters. However, since the proposed methods involve substantial model parameter updating, they are difficult to apply to LLMs given the GPU resource limitations of this initial work. We will next explore the transferability of the framework to open-source LLMs using parameter-efficient methods.

A Dataset Details
WoW As listed below in Table 4, the WoW dataset contains 18,430, 1,948, and 1,933 conversations in the Train/Valid/Test sets respectively. Both the "seen" and "unseen" topic portions of the test sets have been merged. Each conversation spans 4-5 turns of utterances between Wizard and Apprentice. In each turn, the response by the Wizard is grounded on gold knowledge of one or two checked sentences in the Wikipedia documents as evidentiary support.
CMU_DoG The CMU_DoG (Zhou et al., 2018) dataset contains 3,373, 229, and 619 conversations respectively, with an average of 21.43 turns per conversation. In CMU_DoG, the grounding knowledge is at the document level, a more difficult (but more realistic) setting than that of WoW.

B Model Details
The hyper-parameters of GPT2 series models on both WoW and CMU_DoG tasks are listed in Table 5 including training epochs, batch size, learning rate, warm-up steps, and maximum sequence length.

C Metrics Details
NLI-based Metrics For our annotation of factual consistency, we categorize responses into three types as shown in Figure 6: a non-verifiable response does not include any information that needs to be verified and cannot be evaluated as consistent or inconsistent; a factually consistent response is informative and highly relevant to the provided knowledge; hallucinated responses may not always be consistent with the knowledge but could still be correct. Despite the exorbitant expense of human annotation, precious gold-label datasets are publicly available. Following Santhanam et al. (2021), we define fine-grained metrics for factual consistency evaluation with respect to Verification, Factual Consistency, and Hallucination, as exemplified in Figure 6. A similar taxonomy is also adopted in related benchmarks.
Kformer Kformer (Yao et al., 2022) is a knowledge fusion model that converts knowledge into dense embedding vectors and injects them into the parameters of expanded FFNs of the Transformer layers, following Dai et al. (2022a). In accordance with Yao et al. (2022), we encode the external knowledge via an embedding layer initialized with the same word embedding matrix as the GPT models, then map the obtained knowledge representations into the vector space of the corresponding FFN weights. Only the top 3 layers of all PLMs were adopted for knowledge enhancement.
Neural Knowledge Bank Dai et al. (2022b) put forth the Neural Knowledge Bank (NKB), an extended FFN module serving as memory slots for knowledge infusion, using the Salient Span Masking (SSM) strategy (Guu et al., 2020) to preserve general language modeling competency. The NKB architecture is added onto the top three FFN layers of the GPT models. The number of supplementary memory slots is set to match the intermediate dimension of the FFN layers.

Figure 1 :
Figure 1: An illustration of enhancing factual knowledge expression to tackle the inconsistency problem for knowledge-grounded dialogue systems in this work.

Figure 2 :
Figure 2: An illustration of (a) a causal graph denoting the process of knowledge-grounded dialogue generation, where the factual inconsistency issue in this work is attributed to M and procedure ➃ while the others are assumed correct; (b) our proposed K-DIAL framework in a dialogue model with a sample.
Θ_k^l and Θ_v^l denote the key and value weight matrices of the FFNs, and Act(·) is the activation function; the bias terms are omitted. Geva et al. (2021b) pointed out that the keys Θ_k^l are multiplied with h^l to yield d memory coefficients, and each individual key k_i^l ∈ Θ_k^l can capture a textual pattern across the input prefix.

Figure 3 :
Figure 3: Reward model training and workflow of RLFC training for KDS.
(b), we propose K-DIAL, which directly extends an additional FFN module with d′ key-value pairs, concatenated with the original FFNs of the PLMs, to maximize the activation of each entity token y_k given a knowledge-grounded dialogue input pattern of the sequence [K, X, y_<k]. The loss function of the K-DIAL framework is the knowledge-entity cross-entropy L_KCE. During K-DIAL training, all the knowledge entities in the gold knowledge and responses are recognized using spaCy. RLFC is implemented with trl, and all hyperparameters related to the PPO algorithm take the default values of the trl PPOConfig recipe except the epochs, learning rate, and batch size. The reward model is obtained by training a BERT-Large (Devlin et al., 2019) based NLI model as a factual consistency classifier on three public, high-quality human-annotated benchmarks and datasets: Santhanam et al. (2021), Dziri et al. (2021), and Gupta et al. (2022).

Figure 4 :
Figure 4: An example of a conversation from the WoW valid set before and after using K-DIAL and RLFC on the dialogue model (denoted as <bot>). Incongruous content is highlighted in red, while the counterpart gold knowledge in the Wikipedia document is in blue.

Table 1 :
Experiments of GPT2 series models fine-tuned on KDS datasets with the K-DIAL and RLFC methods on the WoW and CMU_DoG test sets.

Table 3 :
Experiments and parameter usage of several knowledge enhancement baselines and K-DIAL variant comparisons on GPT2-M on the WoW and CMU_DoG test sets.
The number of processed training samples is presented in Table 4 below.

Table 4 :
Data statistics of two datasets.
Similar taxonomies are adopted in related benchmarks, e.g. (Generic/Attributable/Not Attributable) and Dziri et al. (2021) (Verification/Supported/Refuted/Not Enough Information (NEI)).
K-Adapter Wang et al. (2021) proposed K-ADAPTER, a neural adapter architecture specifying one kind of knowledge (e.g., factual knowledge or linguistic knowledge), plugged into different Transformer layers of PLMs. Following Wang et al. (2021), we set each adapter model to consist of three adapter layers plugged into the highest, middle, and lowest Transformer layers of the PLMs (e.g., for the 36-layer GPT2-Large, the plug-in configuration is {1,18,36}), and parameters are not shared across different adapter layers. Each adapter layer comprises two Transformer layers and two projection layers, illustrated in Figure 5. The Transformer block of the adapter layer is set to the same size as that of the PLMs. Additionally, the hidden dimensions of the down-projection and up-projection layers are set to match the word embedding and hidden dimensions of the PLMs at different scales respectively.