Generative Knowledge Selection for Knowledge-Grounded Dialogues

Knowledge selection is the key step in knowledge-grounded dialogue (KGD): it aims to select an appropriate knowledge snippet to be used in the next utterance based on the dialogue history. Previous studies mainly employ a classification approach that labels each candidate snippet as "relevant" or "irrelevant" independently. However, such approaches neglect the interactions between snippets, making it difficult to infer the meaning of individual snippets. Moreover, they fail to model the discourse structure of dialogue-knowledge interactions. We propose a simple yet effective generative approach to knowledge selection, called GenKS. GenKS learns to select snippets by generating their identifiers with a sequence-to-sequence model, and therefore captures intra-knowledge interactions inherently through its attention mechanism. In addition, we devise a hyperlink mechanism to model dialogue-knowledge interactions explicitly. Experiments on three benchmark datasets verify that GenKS achieves the best results on both knowledge selection and response generation.


Introduction
To improve the informativeness of open-domain dialogue agents (Freitas et al., 2020), knowledge-grounded dialogue (KGD) has been proposed to ground responses in external structured (Liu et al., 2019) and unstructured (Dinan et al., 2019) knowledge. In KGD, it is pivotal to embed factual and conversationally appropriate knowledge in responses. Two classes of approaches are used to embed knowledge: end-to-end and pipeline. End-to-end models, such as FiD (Izacard and Grave, 2021), process the documents and generate the response in one shot. However, they tend to misuse knowledge (Adolphs et al., 2021). Pipeline models address this problem by explicitly identifying the specific knowledge snippet to be used in the response (Adolphs et al., 2021). Typically, pipeline KGD approaches have two sub-steps, i.e., knowledge selection and response generation (Dinan et al., 2019; Kim et al., 2020): the former selects knowledge snippets from passages, and the latter generates responses based on them. Knowledge selection plays a vital role in KGD as it directly determines the content of the response (Lian et al., 2019; Meng et al., 2020). In this paper, we focus on selecting knowledge snippets for dialogue to enhance pipeline KGD models.
The classification paradigm dominates knowledge selection studies. In this paradigm, each snippet is independently classified as "relevant" or "irrelevant" (Dinan et al., 2019; Zhao et al., 2020b). However, these approaches ignore knowledge interactions, i.e., flows of information within the knowledge or between the knowledge and the dialogue. As shown in Figure 1, we identify two types of knowledge interactions in KGD. Intra-knowledge interaction refers to the interactions between snippets. The meaning of a knowledge snippet is context-dependent and can be ambiguous when taken individually. For example, snippet <8> in Figure 1, "This work led to their", contains the referential element their, and it is difficult to identify its meaning without the remaining context of the sentence. With that context, however, we can quickly infer that it refers to Lamarr and George Antheil. This problem challenges existing methods when selecting knowledge on new topics.
Dialogue-knowledge interaction refers to the interactions between the dialogue and the knowledge, which previous work also neglects. Multi-turn dialogues exhibit a discourse structure with smooth transitions between the pieces of knowledge involved. For example, Lamarr's professions are presented in the dialogue in Figure 1 in a parallel, multi-perspective manner, while other cases follow a shallow-to-deep structure. Some recent efforts attempt to address these problems within the classification paradigm; for example, Li et al. (2022) build a semantic graph over passages to capture intra-knowledge interactions, and Kim et al. (2020) propose sequential knowledge selection that models dialogue-knowledge interactions as latent variables. However, these methods are complicated, lack deep semantic interactions, and struggle to model the two types of knowledge interaction simultaneously.
In this work, we propose GENKS (Generative Knowledge Selection), a simple yet effective generative model that addresses these challenges. GENKS first assigns an identifier to each snippet, feeds all snippets into the model simultaneously, and then selects snippets by generating their identifiers with a sequence-to-sequence Transformer model (e.g., BART (Lewis et al., 2020a)). Compared with classification-based KGD methods, GENKS captures interactions between knowledge snippets through the self-attention mechanism of the Transformer (Vaswani et al., 2017). GENKS can therefore resolve ambiguity in snippets using the surrounding context and improve the understanding of knowledge. Moreover, we propose a hyperlink method to capture dialogue-knowledge interactions explicitly and effectively. Finally, we jointly model knowledge selection and response generation within one generative model.
We evaluate our proposed method on three public KGD datasets: Wizard of Wikipedia (Dinan et al., 2019), Holl-E (Moghe et al., 2018), and CMU_DoG (Zhou et al., 2018). The experimental results show that GENKS significantly improves the accuracy of knowledge selection as well as the quality of response generation, establishing a new state-of-the-art on KGD benchmarks. Improvements are particularly pronounced on unseen topics, where GENKS outperforms the BART classification model by up to 8.1% absolute. GENKS also achieves the best results as the number of dialogue turns increases, with an average improvement of 10% over the BART classification model in the last three turns. We also compare our model with recent state-of-the-art end-to-end methods (Shuster et al., 2021), and find that our model generates responses with fewer hallucinations while offering better controllability and interpretability. The effectiveness of the proposed method is further validated through human evaluation and ablation experiments.
Our contributions are summarized as follows: (1) We propose GENKS, the first attempt at generative knowledge selection in KGD. (2) GENKS captures intra-knowledge and dialogue-knowledge interactions simultaneously. (3) We propose a hyperlink method to enhance the interactions between dialogue and knowledge. (4) Experiments verify that GENKS establishes a new state-of-the-art on KGD.

Related work
Knowledge-grounded dialogues With advances in large-scale language models, dialogue agents can now generate high-quality responses using parametric knowledge (Thoppilan et al., 2022; Freitas et al., 2020; Bao et al., 2021). However, hallucination remains a challenge: language models tend to generate plausible-looking statements that are factually incorrect (Shuster et al., 2021). To address this problem, knowledge-augmented approaches have been applied to dialogue generation (Lewis et al., 2020b). In knowledge-grounded dialogues (KGD), the dialogue model first selects a knowledge snippet from passages and then generates the response (Liu et al., 2018; Dinan et al., 2019).
Knowledge selection As the critical step in KGD, knowledge selection has received much attention. Existing methods mainly employ classification models with dual-encoder (Dinan et al., 2019; Kim et al., 2020) or cross-encoder (Zhao et al., 2020b) architectures. However, the classification paradigm is unable to capture knowledge interactions in KGD (Kim et al., 2020; Li et al., 2022). To address this problem, Li et al. (2022) propose a graph-based method to capture the relationships between candidate snippets, while Zhan et al. (2021a) and Wu et al. (2021) employ machine reading comprehension models to extract spans from long documents. Sequential knowledge selection has also been proposed to capture topic transitions in conversations (Kim et al., 2020; Zhan et al., 2021b; Zheng et al., 2020; Meng et al., 2020; Yang et al., 2022). Despite their effectiveness, existing methods have two drawbacks: (1) they use compact vectors to represent dialogue and knowledge and thus lack deep semantic interactions; (2) they are complicated and struggle to capture intra-knowledge and dialogue-knowledge interactions simultaneously. We address these drawbacks by shifting the modeling paradigm of knowledge selection to identifier generation (Sun et al., 2022), and propose GENKS to capture the two types of interaction simultaneously using a Transformer (Vaswani et al., 2017).
Generative knowledge selection The generative paradigm for knowledge selection is not foreign to the NLP community; for example, sequence-to-sequence models have been applied to entity retrieval (Cao et al., 2021), document ranking (Nogueira et al., 2020; Tay et al., 2022), and multi-evidence retrieval (Min et al., 2021; Yavuz et al., 2022). Our proposed model GENKS differs from existing methods in the following ways: (1) we are the first to explore generative knowledge selection in KGD; (2) we exploit the effectiveness of intra-knowledge interactions; (3) we design hyperlinks to capture the interaction between knowledge and dialogue.

GENKS
We provide an overview of GENKS in Figure 2. As shown in the figure, the dialogue data is first serialized into a sequence. A sequence-to-sequence model (i.e., BART) is then employed to select knowledge and produce the response by generating the target sequence autoregressively. In this section, we first formulate the task in Section 3.1, and then detail the serialization (Section 3.2) and optimization (Section 3.3) methods.

Problem formulation
Suppose that we have a case of knowledge-grounded dialogue (C, K, r), where C = (c_1, ..., c_|C|) is a dialogue context that contains |C| utterances, r is the response to C, and K = (K_1, ..., K_|K|) denotes |K| passages relevant to C; each passage K_i consists of a sequence of knowledge snippets, and we write m for the total number of snippets in K. A knowledge-grounded dialogue agent is decoupled into two modules: a knowledge selection module P(k|C, K) that selects a snippet from K, and a response generation module P(r|C, K, k_s), where k_s is the snippet chosen by the knowledge selection module.

Serialization
We formulate the knowledge selection task as a procedure of sequence generation. As shown in Figure 2, the dialogue context C and the knowledge candidates K are mapped into a sequence and then fed into a sequence-to-sequence model. The model's output is converted back into the selected knowledge k or the response r.
Specifically, we first assign an identifier to each snippet in K, sequentially from <k1> to <km>. We then convert the passages K into a sequence using a template that packages the snippets with their corresponding identifiers and concatenates them in order; see the green block in Figure 2. Similarly, the dialogue context C is serialized by adding task prompts, i.e., a task description and speaker names, as shown in the blue block in Figure 2.
In multi-turn dialogues, the knowledge that appears in the dialogue history reveals the discourse structure of knowledge transitions and knowledge expression. We therefore propose a hyperlink method to capture dialogue-knowledge interactions explicitly. Figure 2 provides an example: the first utterance of User1 refers to a snippet (whose identifier is <k2>) in the passage "Skateboarding". We thus add a hyperlink to the utterance, consisting of the identifier and the title of the snippet, i.e., we annotate [Skateboarding]<k2> at the beginning of the utterance (the red block in Figure 2). Finally, we splice the passage and dialogue context sequences as input to a Transformer model (i.e., BART). The model can thus capture intra-knowledge and dialogue-knowledge interactions through its self-attention mechanism (Vaswani et al., 2017).
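The serialization described above can be sketched as follows. The template strings, function names, and the exact hyperlink format are illustrative assumptions based on Figure 2, not the authors' released code.

```python
# Sketch of the serialization in Section 3.2 (formats are assumptions).

def serialize_knowledge(passages):
    """Assign identifiers <k1>..<km> and pack all snippets into one sequence."""
    parts, idx = [], 0
    for title, snippets in passages:
        parts.append(f"title: {title}")
        for snippet in snippets:
            idx += 1
            parts.append(f"<k{idx}> {snippet}")
    return " ".join(parts)

def serialize_dialogue(context, hyperlinks):
    """Prefix each utterance with its speaker; prepend a hyperlink
    [title]<id> when the utterance was grounded on a known snippet."""
    parts = []
    for i, (speaker, utterance) in enumerate(context):
        link = hyperlinks.get(i)  # (title, identifier) or None
        prefix = f"[{link[0]}]{link[1]} " if link else ""
        parts.append(f"{speaker}: {prefix}{utterance}")
    return " ".join(parts)

# Toy example mirroring Figure 2.
passages = [("Skateboarding", ["Skateboarding is an action sport.",
                               "A skateboard is propelled by pushing."])]
context = [("User1", "I love skateboarding!"),
           ("User2", "Me too, how does it work?")]
hyperlinks = {0: ("Skateboarding", "<k2>")}

model_input = serialize_knowledge(passages) + " " + serialize_dialogue(context, hyperlinks)
```

The spliced `model_input` is what would be fed to the sequence-to-sequence model, so self-attention can relate every snippet to every other snippet and to the hyperlinked utterances.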

Optimization
The knowledge selection model is optimized with the cross-entropy loss L_ks = -log P(k_true | C, K), where k_true denotes the labeled knowledge. Since k_true must be labeled manually and is not available in some scenarios (Zhou et al., 2018), we construct pseudo-labels for model training following Zhao et al. (2020b) when the knowledge label is absent. In particular, we calculate the F1 score (Dinan et al., 2019) between each knowledge snippet and the response, and use the snippet with the highest score as the pseudo-label. This method is based on the intuition that human responses provide hints about the relevance of the snippets (Zhao et al., 2020b; Li et al., 2020).
Since both knowledge selection and response generation are modeled with the generative paradigm, we unify the two modules into one joint generative model, in which knowledge selection and response generation are optimized jointly with shared parameters. To this end, we splice the knowledge identifier k_true and the response r into one target sequence (as shown in Figure 2), and optimize the sequence-to-sequence model with cross-entropy loss on all tokens of the target sequence. At inference time, the model generates the knowledge identifier k_s and the response r autoregressively. This joint model allows the two tasks to mutually enhance each other and improves efficiency.
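A minimal sketch of the joint target-sequence format: the gold identifier is spliced before the response for training, and at inference the generated sequence is split back into the selected identifier and the response. The concrete string format is an assumption based on Figure 2.

```python
def build_target(identifier, response):
    """Training target: identifier spliced before the response."""
    return f"{identifier} {response}"

def parse_output(generated):
    """Split a generated sequence like '<k5> Sure, ...' into (identifier, response)."""
    if generated.startswith("<k") and ">" in generated:
        end = generated.index(">") + 1
        return generated[:end], generated[end:].lstrip()
    return None, generated  # model produced no identifier

target = build_target("<k5>", "Skateboarding is an action sport!")
ks, resp = parse_output(target)
```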

Experimental setup 4.1 Datasets
We conduct experiments on Wizard of Wikipedia (WoW) (Dinan et al., 2019), Holl-E (Moghe et al., 2018), and CMU_DoG (Zhou et al., 2018). Statistics of the three datasets are shown in Table 7 in the appendix. In Holl-E's multi-reference test set, each instance has multiple human-annotated ground-truth knowledge snippets and corresponding responses. • CMU_DoG focuses on the domain of movies. Workers discuss a movie in depth given background knowledge (e.g., an introduction, plots, and key scenes).

Baselines
We compare GENKS with baselines of two categories: (i) end-to-end methods that generate the response directly without explicit knowledge selection, and (ii) pipeline methods that explicitly select the knowledge snippet to be used in the response.
The end-to-end methods we consider are: In addition, we randomly sample 100 examples each from the WoW test seen and WoW test unseen splits, and recruit three experts for human evaluation. The annotators judge the model-generated responses along four dimensions: • Fluency, which measures whether the response is fluent; • Coherence, which measures whether the response is coherent with the dialogue context; • Relevance, which measures whether the knowledge used in the response is relevant to the dialogue; and • Factuality, which measures whether the response's content is factual. For the factuality evaluation, the experts verify the content using Google. Annotators assign a score in {0, 1} (representing "non-factual" and "factual") for factuality, and a score in {0, 1, 2} (representing "bad", "fair", and "good") for the other dimensions.

Implementation details
We implement GENKS using BART large (with 400M parameters) (Lewis et al., 2020a) in HuggingFace's Transformers library. We truncate the dialogue context to 256 tokens, and then truncate the knowledge so that the total input length is less than 1024 tokens. During inference, responses are decoded using greedy search. See Appendix A for more details.
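The truncation policy above amounts to a simple token budget; a sketch, assuming truncation keeps the most recent dialogue turns (the paper does not state which end is cut):

```python
def truncate_input(dialogue_tokens, knowledge_tokens, max_ctx=256, max_total=1024):
    """Keep at most 256 dialogue tokens, then fit knowledge into the
    remaining budget so the whole input stays within 1024 tokens."""
    ctx = dialogue_tokens[-max_ctx:]          # keep the latest turns (assumption)
    budget = max_total - len(ctx)
    return knowledge_tokens[:budget], ctx

# 300 dialogue tokens and 900 knowledge tokens exceed both limits.
knowledge, ctx = truncate_input(list(range(300)), list(range(900)))
```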
Typically, the number of passages in K is large, so the input sequence may exceed the maximum input length of BART (i.e., 1024 tokens). To address this problem, we use a lightweight passage selector based on DistilBERT (66M parameters) (Sanh et al., 2019) to rank the passages in K. Specifically, we concatenate each passage with the dialogue context and encode the sequence using DistilBERT. The representation of the [CLS] token is then used to estimate the relevance score of the passage through a learnable MLP classifier. The passage selector is optimized with a contrastive learning objective (Nogueira and Cho, 2019), in which the model learns to assign higher scores to positive passages than to negative ones. During inference, we keep only the top-1 passage ranked by the passage selector. The passage selector achieves Recall@1 of 75.5%, 76.5%, and 68.0% for the WoW test seen, WoW test unseen,
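The contrastive ranking objective can be sketched in isolation from the encoder: given the MLP's relevance score for the positive passage and for the negatives, it is a softmax cross-entropy over the candidate set. This is a sketch of one common formulation of the Nogueira and Cho (2019)-style objective, not the authors' exact loss.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """-log softmax(pos) over {positive} + negatives, computed stably.
    Lower loss means the positive passage outscores the negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

# When the positive passage scores far above the negatives, loss is near 0;
# when a negative outscores it, the loss is large.
loss_good = contrastive_loss(5.0, [0.5, -1.0])
loss_bad = contrastive_loss(0.0, [3.0, 2.0])
```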

WoW
Holl-E
5 Experimental results

Performance on knowledge selection
We evaluate the knowledge selection effectiveness of GENKS on WoW and Holl-E. (We cannot evaluate knowledge selection accuracy on CMU_DoG because the knowledge snippets used in responses are not manually labeled.) In Table 1, we compare the knowledge selection accuracy of GENKS with previous pipeline methods. GENKS achieves the best knowledge selection accuracy on both datasets and consistently outperforms the baselines. GENKS particularly excels at topics that do not appear during training (see the WoW unseen test split): the classification models show noticeable accuracy drops on unseen topics, whereas models that capture intra-knowledge interactions (e.g., GENKS, Graph, DIALKI) better understand the knowledge of unseen topics. (The higher results on unseen than on seen topics might be due to the smaller number of topics in the unseen test set.)

To evaluate the performance of GENKS as the dialogue goes deeper, we compare GENKS with four classification baselines (SKT, DiffKS, KnowledGPT, and BART-CLS) over turns. Figure 3 shows the results. All methods achieve good accuracy in the first few turns. However, as the conversation dives deeper into a topic, a significant performance decline is seen for the baseline methods. In contrast, GENKS, which explicitly captures the multi-turn dialogue-knowledge interaction, maintains a relatively high accuracy (around 22%-23%).

Quality of generated responses
We report response generation evaluation results on WoW in Table 2. Results on Holl-E and CMU_DoG are available in Table 8 and Table 9 in the appendix. Baseline results are cited from the original papers or re-evaluated using officially released checkpoints.
Compared with previous pipeline models, GENKS achieves the best performance on almost all metrics. For example, GENKS surpasses KnowledGPT by 0.7% and 2.4% in terms of F1 on WoW seen and WoW unseen, respectively. Note that the improvements on the unseen test set are more notable than on the seen test set, which agrees with the knowledge selection results. GENKS also achieves competitive results compared to state-of-the-art end-to-end models: for example, it performs comparably to BART FiD-RAG DPR-Poly on WoW seen and outperforms it on WoW unseen.

Ablation study about knowledge selection
To analyze the effect of each component in GENKS, we design several variants and conduct an ablation study on knowledge selection. Results are listed in Table 1 under "Variants for comparison". The compared variants and our findings are as follows:

WoW Seen
WoW Unseen

BART classification w/ position
To understand the influence of position bias, we splice the snippet's position into the classification model's input. The results improve to a certain extent (about 1%), but there is still a clear gap compared with GENKS.
Hierarchical classification This variant first uses the passage selector of GENKS to rank the passages and then selects snippets from the top-ranked passage using BART classification w/ position. The results show that the passage selector does not affect the classification model's performance.
Without passage selector When the passage selector of GENKS is removed, the model is more likely to truncate the label knowledge, resulting in an evident performance decline.
Unordered knowledge snippets To disable intra-knowledge interaction, we shuffle the snippets so that their order is inconsistent with the original passages. This variant shows a decline in selection accuracy, especially on unseen topics, indicating that keeping the original order of snippets within a passage is necessary.
Without hyperlinks We remove the hyperlinks from the dialogue context. An accuracy drop of about 1% is observed, indicating the effectiveness of hyperlinks.

Ablation study about response generation
As shown in Table 2, we also conduct an ablation study on response generation. The compared variants and our findings are as follows: With BART classification knowledge When the generated identifier is replaced with the knowledge selected by BART classification, a performance decline is observed: the F1 value drops by 0.7% and 1.8% on WoW seen and unseen, respectively, supporting the effectiveness of GENKS's knowledge selection.
Without identifier generation This variant removes identifier generation and directly generates the response. We see notable performance drops, especially on the KF1 metric (overlap with the knowledge used by the ground-truth response). The results (e.g., KF1 = 74) suggest that GENKS can effectively locate and incorporate the corresponding knowledge into its responses by following the guidance of the identifier.

Efficiency evaluation
To evaluate the efficiency of GENKS, we compare the model with previous end-to-end and pipeline models. The results in Table 4 show that GENKS is more efficient than previous pipeline models. We attribute this to GENKS jointly modeling knowledge selection and response generation, which avoids repeated encoding of the dialogue history and knowledge. Although a pipeline method, GENKS achieves efficiency comparable to end-to-end models like RAG, while benefiting from explicit knowledge selection.

Analytical experiment
Multi-snippet selection GENKS selects a single snippet following the experimental setup of the baselines (Dinan et al., 2019), but it can also select multiple snippets by generating multiple identifiers. We test a variant, GENKS-2, which selects two snippets by generating two identifiers consecutively, and compare its performance with the original GENKS on the WoW dataset. The results are listed in Table 5, group 1. GENKS-2 performs slightly worse than the original GENKS, likely because WoW only uses one snippet per response annotation and therefore does not benefit from multiple snippets (Dinan et al., 2019). Nevertheless, the results suggest that the proposed generative knowledge selection approach is able to select multiple knowledge snippets.
Hyper-parameter analysis We also conduct ablation experiments on the number of input snippets and the maximum number of input tokens. The results are listed in Table 5, group 2: reducing the number or length of knowledge snippets reduces model effectiveness.

Case study
To better understand the end-to-end baselines and our model, we provide an example in Table 6, which shows that GENKS appropriately changes its response when provided with different knowledge snippets. GENKS is therefore more controllable and interpretable than end-to-end models, which operate as black boxes. We provide more case studies in Appendix B.

Conclusion
In this paper, we have proposed GENKS, a simple yet effective knowledge-grounded dialogue model. GENKS is a generative model that learns to select knowledge snippets by generating their identifiers. Benefiting from its modeling of intra-knowledge and dialogue-knowledge interactions, GENKS effectively addresses the challenges of snippet ambiguity and discourse structure.

A Implementation details
We use gradient clipping with a maximum gradient norm of 0.1. We optimize the model for up to 5 epochs with a batch size of 16 on 4 RTX 3090 GPUs with 24G memory. We choose model checkpoints by evaluating the metrics on the validation set after each epoch. During inference, responses are decoded using greedy search; we tried more advanced decoding algorithms (e.g., nucleus sampling) and found no improvement. Training completes within 5 hours, and inference latency for one example is within 0.1s. The passage re-ranking model achieves Recall@1 of 75.5%, 76.5%, and 61.0% for WoW test seen, WoW test unseen, and Holl-E, respectively.

Model PPL F1 Avg. Ext. Greedy
ITDD (Li et al., 2019) 26.0 10.4 0.748 0.390 0.587
DRD (Zhao et al., 2020a) 46.1 10.8 0.791 0.406 0.613
TMN (Dinan et al., 2019) 75.2 9.9 0.789 0.399 0.615
KGPT (Zhao et al., 2020b)

DukeNet: i have heard the city is located in the ouachita mountains among the us
KGPT: i've been to the ouachita mountains, too! i've been to the ouachita mountains in the ouachita mountains.
GenKS: I've never been to Hot Springs, but I've always wanted to go there.
Human: I've never been to that one! I bet its beautiful!

B Case study
To better understand the baselines and our model, we present two examples in Table 10 and Table 11. Table 10 shows an example where both GENKS and the baselines select the proper knowledge (i.e., the knowledge snippet shown in green). The response generated by GENKS is more appropriate to the dialogue context than those of the baselines, while KnowledGPT's response does not answer User2's question and is also factually incorrect. In Table 11, although neither GENKS nor the baselines select the labeled knowledge, the response generated by GENKS is still more natural and coherent. We also find that KnowledGPT is more colloquial than GENKS but suffers from hallucinations.

Figure 1 :
Figure 1: An example of knowledge-grounded dialogue. The dialogue agent selects a knowledge snippet (i.e., <7>) from the passages and generates a response based on it. Intra-knowledge interactions and dialogue-knowledge interactions are denoted by ➀ and ➁, respectively.

Figure 2 :
Figure 2: Overview of GENKS. The dialogue context and the knowledge are serialized and fed into a sequence-to-sequence model, BART. The outputs are the identifier of the selected snippet (i.e., <k5>) and the response.
• WoW is an open-domain KGD dataset using Wikipedia passages as background knowledge. Its test set is split into seen and unseen versions, where the unseen test set contains 58 new topics not discussed in the training data. • Holl-E focuses on the movie domain. The background knowledge consists of plots, comments, and movie reviews collected from different websites. Holl-E has two versions of the test set: a single-reference test and a multi-reference test.

Figure 3 :
Figure 3: Knowledge selection accuracy over dialogue turns. BART-CLS denotes a text-matching model with cross-encoder architecture.

Table 1 :
Knowledge selection accuracy on WoW (seen and unseen test sets) and Holl-E (single-reference and multi-reference test sets). Bold denotes the best results, with significant improvements over the previous SOTA (t-test, p < 0.05). Underline denotes the second-best results.

Table 2 :
Evaluation results on the WoW seen and unseen test sets in terms of response quality. We compare against the ground-truth dialogue response in terms of perplexity (PPL), F1, Knowledge F1 (KF1), and BLEU-4 (B4). The four groups list previous end-to-end models, previous pipeline models, GenKS, and ablative variants. The best results are highlighted in bold, and the second-best results are underlined. * indicates significant improvements over all baselines with p-value < 0.05.

Table 4 :
Inference time (minutes) on one GPU on the WoW unseen test set. Values for models marked with * are estimated based on model size and input/output length. KS and RG denote inference time for the knowledge selection and response generation stages, respectively.

Table 3
Table 5

Table 6 :
Examples of GENKS outputs on the WoW.

Table 8 :
Results on Holl-E in terms of response quality. RG1 and RG2 denote ROUGE-1 and ROUGE-2, respectively. The best results are highlighted in bold.

Table 9 :
Results on CMU_DoG in terms of response quality. The best results are highlighted in bold.

Topic: Nickelback
User: Do you like Nickelback? ...
User: Yes, what more can you tell me about Nickelback?
System: Chad Kroeger is the leading vocalist of the group.
User: Amazing. What about the other group members
Passage: The band is composed of guitarist and lead vocalist Chad Kroeger, guitarist, keyboardist and backing vocalist Ryan Peake, bassist Mike Kroeger, and drummer Daniel Adair.
TMN: i think the song is a very popular song
DukeNet: chad kroeger is a great band
KGPT: the lead vocalist is chad kroeger. he is also the drummer.
GenKS: Chad Kroeger, Ryan Peake, Mike Kroeger and Daniel Adair.
Human: Ryan Peake is the keyboardist an Mike Kroeger is the bassist.

Table 10 :
Case study on the WoW test unseen set. This table shows an example where both GENKS and the baselines select the proper knowledge.

Topic: List of national parks of the United States
User: I love national parks dont you ...
User: I live in Arkansas and love the Hot Springs National Park and have been there many times, really it is beautiful

Table 11 :
Case study on the WoW test unseen set. This table shows an example where both models select the wrong knowledge.