Unsupervised Knowledge Selection for Dialogue Generation

Knowledge selection is an important and challenging task that provides the appropriate knowledge for informative dialogue generation. However, the gold knowledge labels it requires are difficult to collect in reality. In this paper, we study knowledge selection for dialogue generation in the unsupervised scenario and propose a novel Distilled Distant Supervision Loss (DDSL) to supervise knowledge selection when the gold knowledge label is unknown. Specifically, we first obtain an oracle knowledge label via distant supervision and then leverage knowledge distillation to alleviate the noisy labeling problem of distant supervision. Furthermore, we propose a pretraining-finetuning strategy to deal with the mismatched knowledge selection problem: in the unsupervised setting, models tend to select mismatched knowledge for dialogue generation, which causes the knowledge-aware decoder to degenerate. Experiments on two knowledge-grounded dialogue datasets show that our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses, even outperforming many strong supervised baselines.


Introduction
To avoid general and dull dialogue generation, knowledge-grounded dialogue, which equips dialogue systems with external knowledge, has become a popular research topic. Thanks to hand-collected knowledge-grounded dialogue datasets that align each dialogue (even each utterance) with a pre-identified document (Moghe et al., 2018), many researchers focus on injecting the given knowledge to generate informative responses and achieve promising results (Yavuz et al., 2019; Tang and Hu, 2019; Qin et al., 2019a; Li et al., 2019b; Zheng and Zhou, 2019; Meng et al., 2019; Ye et al., 2020). However, these works usually need the pre-identified knowledge, while the knowledge access task, the precursor to knowledge-grounded dialogue generation in reality, is less studied (Kim et al., 2020b).
It is natural to extract external knowledge via information retrieval technology, and several works regard the retrieved knowledge sentences as the pre-identified document (Ghazvininejad et al., 2018; Gopalakrishnan et al., 2019). However, the retrieved document contains redundant and irrelevant information that is harmful for dialogue generation (Zhao et al., 2020b). Hence, knowledge selection, which chooses an appropriate sentence from the pre-retrieved knowledge pool, gains much attention; it plays the role of knowledge access for generation. Prior work first proposed knowledge selection for dialogue generation as two sequential subtasks, where generation is based on the selected knowledge. Several works follow this setting and achieve improvements with latent variable models or more complex selection mechanisms (Niu et al., 2019; Meng et al., 2021). Although those works show promising performance with explicit use of knowledge in open-domain dialogue, they still need gold knowledge labels to train the selection module well. Making knowledge selection work well without gold knowledge labels remains less studied, though it is valuable and challenging.
In this paper, we explore knowledge selection for dialogue generation in the unsupervised scenario and propose a novel Distilled Distant Supervision Loss (DDSL) to supervise knowledge selection when there is no gold knowledge label. Specifically, we first obtain an oracle knowledge label via distant supervision (Mintz et al., 2009) to substitute for the gold one in the unsupervised setting. However, distant supervision inevitably suffers from the noisy labeling problem due to literal matching (Yang et al., 2019a). Therefore, to train knowledge selection well without gold labels, we leverage knowledge distillation to reduce the noise of the oracle label. Furthermore, we find that models tend to select mismatched knowledge for dialogue generation in the unsupervised setting, and forcing the knowledge-aware decoder to leverage the selected knowledge at training time causes the decoder to degenerate into a knowledge-independent decoder. To deal with this problem, we propose to pretrain knowledge selection and response generation independently and then finetune the decoder with the selected knowledge using different sample weighting scores. We demonstrate the effectiveness of our unsupervised approach on two knowledge-grounded dialogue datasets, i.e., Wizard of Wikipedia and Holl-E (Moghe et al., 2018), in comparison with various supervised and unsupervised baselines.
Our contributions are summarized as follows:
• We propose the Distilled Distant Supervision Loss to make knowledge selection work well in the unsupervised scenario, where the gold knowledge label is not available.
• We propose a pretraining-finetuning strategy to alleviate the degeneration of the knowledge-aware decoder caused by the mismatched knowledge selection problem.
• Results on two datasets show that our approach manages to select knowledge more accurately in the unsupervised setting and even generates more informative responses than many strong supervised baselines.

Task Formulation
Given the utterance x_t at each turn t and the associated knowledge pool K_t = {k_t^i}_{i=1}^L containing L retrieved candidate sentences k_t^i, the final goal is to generate an informative response y_t. Following prior work, we first learn to select the appropriate knowledge k_t^s from the knowledge pool K_t and then generate the response y_t by incorporating the selected knowledge. In the conventional supervised setting, a gold knowledge label exists to supervise knowledge selection. However, manually labeled knowledge is difficult to obtain in reality. As a result, we study unsupervised knowledge selection for dialogue generation in this paper, which is valuable and challenging.

Architecture Overview
In the following subsections, we first introduce the three major components (Sections 2.3-2.5): Encoder, Knowledge Selection (KS) and Decoder, which are trained jointly with the gold knowledge loss and the response generation loss in the conventional supervised setting, as Figure 1 (a) shows. Then we introduce our Distilled Distant Supervision Loss (DDSL) in Section 2.6 to train knowledge selection well in the unsupervised setting, which consists of distant supervision and knowledge distillation, as Figure 1 (b) shows. Finally, we detail the mismatched knowledge selection problem and how to make the decoder leverage the selected knowledge k_t^s well in Section 2.7.

Encoder
For each sentence s, we obtain the context-aware word representations H_s via BERT (Devlin et al., 2019) and the corresponding sentence representation h_s via mean pooling (Cer et al., 2018):

H_s = BERT(s) ∈ R^{N_s × d},    (1)
h_s = (1/N_s) Σ_{n=1}^{N_s} H_s^n ∈ R^d,    (2)

where N_s is the sentence length and d is the hidden size. Specifically, we represent the utterance x_t with H_{x_t} and h_{x_t}, and represent each knowledge sentence k_t^i ∈ K_t with H_{k_t^i} and h_{k_t^i}.
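The mean pooling of Equation 2 can be sketched as follows; the function name is illustrative, and the word vectors stand in for the rows of H_s produced by the encoder:

```python
def mean_pool(word_vectors):
    """Average contextual word vectors H_s (one list per token, e.g. from
    BERT) into a single sentence vector h_s."""
    n = len(word_vectors)        # sentence length N_s
    dim = len(word_vectors[0])   # hidden size d
    return [sum(vec[j] for vec in word_vectors) / n for j in range(dim)]
```

For example, pooling the two token vectors [1.0, 2.0] and [3.0, 4.0] yields the sentence vector [2.0, 3.0].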

Knowledge Selection
In this paper, we mainly focus on knowledge selection in the unsupervised setting and adopt the standard dot-product attention over the knowledge candidates to select knowledge, following prior work.
Selection Query: The selection query consists of the current utterance, the dialogue history dh_t = [x_1, y_1, ..., x_{t-1}, y_{t-1}] and the history of selected knowledge kh_t = [k̄_1, ..., k̄_{t-1}], as both histories help knowledge selection. Formally, we use two GRUs (Cho et al., 2014) to summarize the dialogue history and the knowledge selection history as the corresponding state vectors s_{dh_t} and s_{kh_t}:

s_{dh_t} = GRU_{dh}(s_{dh_{t-1}}, [h_{x_t}; h_{y_t}]),
s_{kh_t} = GRU_{kh}(s_{kh_{t-1}}, h_{k̄_t}),

where s_{dh_0} and s_{kh_0} are zero vectors, h_{x_t}, h_{y_t} and h_{k̄_t} are the sentence vectors of the utterance x_t, the response y_t and the oracle knowledge k̄_t (described in Equation 8), and [·; ·] denotes concatenation. Then the selection query q_t is obtained as:

q_t = W_q [h_{x_t}; s_{dh_{t-1}}; s_{kh_{t-1}}] ∈ R^d.    (3)

Knowledge Selection: The knowledge selection distribution S ∈ R^L over the knowledge pool K_t is obtained by dot-product attention:

S = softmax([q_t · h_{k_t^1}, ..., q_t · h_{k_t^L}]).    (4)

Finally, the knowledge k_t^s with the highest attention score is selected for further response generation. If the gold knowledge k̂_t exists, we can train this task via the Cross Entropy (CE) loss:

L_CE = -log S_{k̂_t}.    (5)
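As a minimal sketch (names illustrative), the dot-product selection of Equation 4 over precomputed query and candidate sentence vectors could look like:

```python
import math

def select_knowledge(query, knowledge_vecs):
    """Score each candidate sentence vector by its dot product with the
    selection query, normalize with a softmax to obtain the distribution S,
    and return S together with the index of the selected knowledge."""
    logits = [sum(q * h for q, h in zip(query, vec)) for vec in knowledge_vecs]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    dist = [e / z for e in exps]          # selection distribution S
    return dist, max(range(len(dist)), key=dist.__getitem__)
```

In training, the distribution S feeds the cross-entropy loss; at inference, only the argmax index is used to pick k_t^s.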

Decoder
Following prior work, our Transformer-based decoder takes the concatenated representation H_rc = [H_{x_t}; H_{k_t^s}] ∈ R^{N_rc × d} of the current utterance x_t and the selected knowledge k_t^s as input, and uses the copy mechanism (Gu et al., 2016) to generate responses. Generating the n-th word can be formulated as:

s_t^n = TD(H_rc, y_t^{<n}),    (6)

where TD denotes the Transformer decoder and s_t^n is the hidden vector for the n-th word of the response y_t at turn t. The copy weight p_cp, i.e., the attention weight of the first head in an additional multi-head self-attention layer (short for MultiHead_cp), is combined with the distribution over the vocabulary V to yield the final generation distribution p(V). Finally, we generate the word y_t^n with the highest probability and keep generating, feeding y_t^n back into the decoder, until the "eos" token is produced. We train the generation task with the Negative Log-Likelihood (NLL) loss:

L_NLL = -Σ_n log p(y_t^n | y_t^{<n}, x_t, k_t^s).    (7)

The model is trained with the loss L = L_KS + L_G, where L_KS is the knowledge selection loss (i.e., Equation 5 in the supervised setting) and L_G = L_NLL.

Distilled Distant Supervision Loss
In this section, we introduce our Distilled Distant Supervision Loss (DDSL), which trains knowledge selection well in the unsupervised setting and consists of distant supervision, label weighting and knowledge distillation. Concretely, we first obtain a noisy oracle knowledge label via distant supervision, and then our DDSL reduces the noise via label weighting and knowledge distillation.
Distant Supervision: Given the utterance x_t, the response y_t and the retrieved knowledge pool K_t, but no gold knowledge label, we first calculate a confidence score W_{k_t^i} measuring how well each knowledge k_t^i matches this dialogue flow:

W = softmax_τ([z_1, ..., z_L]),  z_i = |set(k_t^i) ∩ set(y_t)| / |set(y_t)|,    (8)

where set(a) denotes the set of tokens in the string a, softmax_τ(z)_i = e^{z_i/τ} / Σ_j e^{z_j/τ}, and τ is the temperature reshaping the confidence probability distribution W ∈ R^L over the knowledge pool. Then we take the knowledge with the highest confidence score as the oracle knowledge k̄_t, assuming that the gold knowledge should contribute the most tokens to the response. This is reasonable because 1) the causal modeling specified conditions (Selltiz et al., 1976) hold between knowledge selection and response generation (Tuan et al., 2020), and 2) it is common for humans to (involuntarily) produce utterances that are copied or suitably modified from background knowledge sentences (Moghe et al., 2018).
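A runnable sketch of this labeling step, assuming the overlap statistic is the fraction of response tokens covered by each candidate (the exact statistic in Equation 8 may differ):

```python
import math

def oracle_knowledge(response_tokens, knowledge_pool, tau=0.1):
    """Distant supervision sketch: score each candidate by token overlap
    with the gold response, sharpen the scores with a temperature-tau
    softmax to get the confidence distribution W, and return the index of
    the oracle knowledge together with W."""
    resp = set(response_tokens)
    scores = [len(set(k) & resp) / max(len(resp), 1) for k in knowledge_pool]
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    conf = [e / z for e in exps]                        # confidence W
    oracle = max(range(len(conf)), key=conf.__getitem__)
    return oracle, conf
```

A small τ (0.1 in our experiments) makes W sharply peaked, so the weighted loss in Equation 9 mostly trusts the single best-overlapping candidate.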
Although we could directly replace the gold label k̂_t in Equation 5 with the oracle one k̄_t, the oracle label is noisy. Therefore, we modify Equation 5 with label weighting via the confidence score:

L_DS = -W_{k̄_t} log S_{k̄_t}.    (9)

Knowledge Distillation: We further alleviate the noisy labeling problem of distant supervision via Knowledge Distillation (KD), as shown in Figure 1 (b). Following (Tian et al., 2020), the teacher takes the context and the response as input and produces a knowledge selection distribution as the soft target. Compared with the student, i.e., the standard knowledge selection module described in Section 2.4, the teacher has the gold response as an additional input. Specifically, we build the teacher query as q_t^tea = W_tea [s_{dh_t}; s_{kh_{t-1}}] ∈ R^d, which contains more information (i.e., the response) than q_t in Equation 3. Then we use this query q_t^tea to obtain the teacher's knowledge selection distribution T ∈ R^L via Equation 4. Finally, the online response-based knowledge distillation is formulated as:

L_KD = -log T_{k̄_t} + D_KL(T ‖ S),    (10)

where the first term is used for teacher training and the second term transfers the knowledge from the teacher to the student via the Kullback-Leibler (KL) divergence between the teacher and student distributions T and S.
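A minimal sketch of the distillation objective as reconstructed in Equation 10; whether the teacher's cross-entropy term also carries the confidence weight is left out here as an assumption:

```python
import math

def kl_div(t, s):
    """D_KL(T || S) between teacher and student selection distributions."""
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)

def kd_loss(teacher_dist, student_dist, oracle_idx):
    """Teacher cross-entropy on the (noisy) oracle label, plus the KL term
    that transfers the teacher's soft target to the student."""
    teacher_ce = -math.log(teacher_dist[oracle_idx])
    return teacher_ce + kl_div(teacher_dist, student_dist)
```

The KL term vanishes when the student matches the teacher exactly, leaving only the teacher's own training signal.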
Although the teacher is also trained with the noisy label, it produces an independent source of variance that can cancel out the variance introduced by the label noise (Li et al., 2017). Moreover, previous works have shown that the student can still be enhanced by a noisy teacher (Sau and Balasubramanian, 2016; Yuan et al., 2020). Therefore, we believe that the student, trained with both supervision signals, benefits from the regularization of the soft target (Yuan et al., 2020).
To sum up, our DDSL is:

L_DDSL = L_DS + L_KD,    (11)

which combines distant supervision with label weighting (Equation 9) and knowledge distillation (Equation 10) for unsupervised knowledge selection.

Training
The Mismatched Knowledge Selection Problem: There are chances that the selected knowledge is not the gold one, due to 1) the diversity of knowledge selection in conversation and 2) the under-optimized knowledge selection at the early training stage (Zhao et al., 2020b). This is more serious in the unsupervised setting, where it is hard to train knowledge selection well. The mismatched knowledge selection problem arises from the training paradigm in Figure 1, where the decoder is trained to generate the gold response with mismatched knowledge. This mismatch causes the knowledge-aware decoder to treat the selected knowledge as noise and degenerate into a knowledge-independent decoder.
Our Pretraining-Finetuning Strategy: Although training the decoder with the matched knowledge solves the mismatch problem at training time, it cannot deal with wrong knowledge selection at inference, which is often the case yet never seen during training. We therefore take this plain idea as our pretraining stage, and then train the decoder to adapt to the selected knowledge using different weighting scores in the finetuning stage.
In the pretraining stage, we train knowledge selection and response generation in parallel, as Figure 2 shows. The pretraining loss is:

L_pre = L_KS + L_NLL(x_t, k̄_t, y_t),    (12)

where we use L_KS = L_DDSL in the unsupervised setting, and the decoder is trained to generate the gold response with the oracle knowledge k̄_t instead of the selected one k_t^s. Therefore, the decoder learns how to incorporate the matched knowledge k̄_t into the response generation, because k̄_t is much more accurate than the selected k_t^s, as Table 4 shows. This pretraining alleviates the mismatch problem because 1) we get a fully optimized knowledge selection module and 2) the pretrained decoder provides a good initialization for finetuning.
In the finetuning stage, we continue to train the pretrained decoder to adapt to the pretrained knowledge selection module with the sample weighting idea. The finetuning loss is defined as:

L_ft = W_{k_t^s} · L_NLL(x_t, k_t^s, y_t),    (13)

where L_NLL is defined in Equation 7 and W_{k_t^s} is the confidence (weighting) score of the selected knowledge k_t^s defined in Equation 8. As mentioned above, the mismatched knowledge selection problem is caused by the training paradigm in which we may train the decoder to generate the gold response with mismatched knowledge from the knowledge selection module. Here, we finetune the pretrained decoder with the selected knowledge k_t^s, giving higher importance weights when the selected knowledge is suitable for generating the gold response. In this way, we further alleviate the mismatch problem by highlighting the matched samples, assigning an importance weight to each instance (x_t, k_t^s, y_t) to reform the training data (Cai et al., 2020; Dong et al., 2020).
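The sample weighting above amounts to scaling each instance's generation loss by the confidence of its selected knowledge; a minimal sketch over a batch (names illustrative):

```python
def finetune_loss(nll_per_sample, selected_confidence):
    """Weight each sample's NLL by the confidence W of its selected
    knowledge, so well-matched (x_t, k_t^s, y_t) instances dominate the
    gradient while mismatched ones are down-weighted."""
    assert len(nll_per_sample) == len(selected_confidence)
    weighted = [w * l for w, l in zip(selected_confidence, nll_per_sample)]
    return sum(weighted) / len(weighted)
```

A sample whose selected knowledge has near-zero confidence contributes almost nothing, which is exactly how the mismatched instances are suppressed.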

Dataset
We evaluate our model on two public knowledge-grounded dialogue datasets: Wizard of Wikipedia (WoW) and Holl-E (Moghe et al., 2018), both of which provide knowledge candidates with gold knowledge labels for knowledge selection. To test our approach in the unsupervised setting, we do not use the gold knowledge labels provided in these datasets. WoW contains dialogues between two participants on open-domain topics, where one is a curious learner and the other plays the role of a knowledgeable expert with access to the knowledge pool. Each knowledge pool contains about 61 sentences on average per turn, retrieved from Wikipedia based on the dialogue context via an IR system. There are 18,430, 1,948 and 1,933 dialogues for training, validation and test, respectively. According to topic overlap, the test set is split into two subsets: 965 Test Seen and 968 Test Unseen dialogues, where Test Unseen consists of 58 topics never seen in training or validation.
Holl-E contains 7,228, 930 and 913 dialogues for training, validation and test, respectively. Two test versions are provided: one with a single gold reference and one with multiple gold references (more than one gold knowledge sentence and corresponding response for each given conversation context). Each dialogue is assigned a document of about 60 sentences on average as the knowledge pool. Here, we use the modified version that fits knowledge selection.

Models for Comparison
We compare our method with the following baselines:

No Knowledge
S2S Transformer is a Seq2Seq model based on the Transformer that does not leverage external knowledge. S2S BERT replaces the Transformer encoder with a pretrained BERT (Devlin et al., 2019).

Supervised Knowledge Selection
TMN, short for End-to-End Transformer MemNet, selects knowledge based on the Transformer memory network and generates responses via the Transformer decoder. TMN BERT+PostKS+CP enhances the encoder with BERT, the knowledge selection module with PostKS, and the decoder with the copy mechanism (CP).

Table 1: Quantitative results on WoW. Our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses than the strong baselines in the supervised setting. Models marked with "†" are implemented by ourselves; other models with citations are from the original papers.
SKT, short for Sequential Knowledge Transformer, uses the posterior distribution via sequential latent modeling and achieves promising results in the supervised setting.

Unsupervised Knowledge Selection
TMN_0 is TMN trained only with the generation loss in Equation 7, without the knowledge loss in Equation 5.
SKT_0 is SKT optimized without the knowledge loss.
PostKS takes the benefits of latent variable models and leverages the posterior knowledge distribution as a pseudo label for knowledge selection without the knowledge loss. Here, we use the previously reported results.

Our Unsupervised Methods
We implement our model in the unsupervised setting, namely Unsupervised Knowledge Selection for Dialogue Generation (UKSDG), which is optimized with our DDSL in Equation 11 for unsupervised knowledge selection and the generation loss in Equation 7. UKSDG_PF indicates that we adopt our pretraining-finetuning strategy to alleviate the mismatched knowledge selection problem in Section 2.7. Furthermore, we remove several components for the ablation study: (1) UKSDG w/o DDSL is optimized only with the generation loss in Equation 7, without our DDSL. (2) UKSDG_vec w/o DDSL further replaces the decoder input H_rc (in Section 2.5) with the context representation enhanced by the averaged knowledge vector.

Implementation Details
We use TensorFlow 2.0 to implement our approach based on the SKT codebase. All sentences are encoded by a shared BERT_BASE (Devlin et al., 2019), and the response is greedily generated via a 5-layer Transformer decoder with the copy mechanism. The hidden size d is 768 and the vocabulary size |V| is 30,522. The knowledge selection module contains two separate one-layer GRUs and one projection layer. Our proposed DDSL contains no trainable parameters except one projection layer in the teacher selection module, and the temperature τ is 0.1.
We use the Adam optimizer (Kingma and Ba, 2014) with gradient clipping at 0.4 to train our models on a single GPU (TITAN Xp). The learning rate is 2e-5 and the batch size is 1. Moreover, we apply label smoothing (Pereyra et al., 2017), set to 0.1 for knowledge selection and 0.05 for response generation. In the pretraining-finetuning strategy, we use 5 epochs for pretraining and 20 epochs for finetuning. The pretrained models are selected according to the accuracy score and the other models according to the R-1 score, since knowledge selection aims to serve high-quality generation.
UKSDG and KSDG take almost the same time to converge, as we only replace our DDSL with the gold knowledge selection loss; the convergence of SKT is a bit slower. We thank the SKT authors for their processed datasets, models and evaluation code at https://github.com/bckim92/sequential-knowledge-transformer.
Note that each example is a dialogue rather than an individual turn.

Table 2: Quantitative results on Holl-E, on the single-reference and multi-reference test sets.

Evaluation
Automatic Evaluation. We automatically evaluate knowledge selection with accuracy (Acc), and response generation with perplexity (PPL), unigram F1 (R-1) and bigram F1 (R-2), which are commonly used in this task. Following prior work, we remove all punctuation and the articles (a, an, the) before computing the R-1 and R-2 scores. Note that a lower PPL and higher R-1 and R-2 scores indicate better generation quality.
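The unigram/bigram F1 behind R-1 and R-2 can be sketched as follows (punctuation and articles would be stripped beforehand, as described above; the function name is illustrative):

```python
from collections import Counter

def ngram_f1(hyp_tokens, ref_tokens, n=1):
    """F1 over the n-gram multisets of a hypothesis and a reference:
    n=1 gives unigram F1 (R-1), n=2 gives bigram F1 (R-2)."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    overlap = sum((hyp & ref).values())    # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical token sequences score 1.0 and disjoint ones 0.0, with partial overlap falling in between.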
Human Evaluation. We first select 100 samples from each test set of WoW for human evaluation. Following (Li et al., 2019a), we ask three annotators to rate the generation quality on Engagingness and Knowledgeability from 1 to 4, where 1 means not at all, 2 a little, 3 somewhat, and 4 a lot. Engagingness measures how much the annotator likes the response, and Knowledgeability measures the informativeness of the response.

Main Results
Quantitative Results. We report the automatic results on WoW and Holl-E in Table 1 and Table 2, respectively, and have the following consistent observations: (1) Comparing SKT and SKT_0, we see that the gold knowledge loss plays an important role in training knowledge selection well.
(2) Comparing rows 0 and 2, we see that our proposed DDSL is the key to training knowledge selection well in the unsupervised setting and can substitute for the gold knowledge loss. As a result, our approach significantly outperforms the other unsupervised methods on all metrics (significance tests (Koehn, 2004), p-value < 0.01).
(3) Although UKSDG_vec w/o DDSL can learn some patterns of knowledge selection from the gradient of the generation loss, compressing the knowledge into a vector loses much information for dialogue generation. (4) Although our UKSDG_PF usually selects knowledge less accurately than the supervised SKT in rows 1-2, we achieve higher generation quality, which indicates that our pretraining-finetuning strategy alleviates the mismatched knowledge selection problem and underlines the importance of leveraging the selected knowledge properly for future study of this task. (5) Comparing rows 3 and 4, we see that the sample weighting also helps in the finetuning stage, though the PPL score is slightly worse due to the difficulty of injecting the knowledge into responses. (6) Moreover, our approach demonstrates a stronger ability to generalize, with a smaller performance gap between Test Seen and Test Unseen in Table 1, which we attribute to our DDSL because it works in the unsupervised setting and is thus more suitable for Test Unseen. For example, compared with SKT in rows 1-2, UKSDG_PF in row 4 achieves the highest selection accuracy on Test Unseen with only slightly lower accuracy on Test Seen.
Qualitative Results. We report the human evaluation results of the generated responses on Engagingness and Knowledgeability in Table 3. Comparing UKSDG and UKSDG_PF, we see the effectiveness of our pretraining-finetuning strategy, which alleviates the mismatch problem described in Section 2.7. Our UKSDG_PF in the unsupervised setting generates responses slightly better than SKT in the supervised setting. Moreover, the improvement is more obvious on Knowledgeability, which again indicates the importance of using the selected knowledge properly.

Ablation Study
We introduced the components of our DDSL in Section 2.6 and showed its effectiveness in Section 4.1. Here, we analyse our DDSL via an ablation study in Table 4, training UKSDG_PF on the knowledge selection task with components of our DDSL removed. We have the following observations: (1) Most of the oracle knowledge from distant supervision is the same as the gold knowledge labeled by humans. Therefore, it is acceptable to directly use this oracle knowledge label to substitute for the gold label as the supervision signal (see the last row of Table 4).
(2) However, distant supervision inevitably suffers from the noisy labeling problem due to literal matching. For example, about 30% of the oracle knowledge labels differ from the gold ones on WoW, where the responses convey knowledge much more flexibly.
(3) The label weighting in Equation 9 helps on WoW, where the noisy labeling problem is serious. (4) Knowledge distillation in Equation 10 further alleviates the label noise, and our method manages to select knowledge accurately in the unsupervised setting.

Case Study

We have the following observations: (1) The oracle knowledge from distant supervision contains several informative tokens that help dialogue generation, since they appear in the gold response as defined by Equation 8. In particular, the oracle knowledge in case 3 is also appropriate, and the gold response contains much more information than the gold knowledge, which indicates the diversity of knowledge selection; selecting a single sentence may not be enough for informative generation.

(2) However, some oracle knowledge labels differ from the gold ones (i.e., those selected by humans). Our model still learns to select knowledge as humans do, which we attribute to our DDSL, since knowledge distillation alleviates the noisy labeling problem of distant supervision.
(3) SKT does not leverage the selected knowledge well and generates dull responses. For example, SKT repeats the context in case 1, generates a verbose and contradictory response in case 2, and provides no new information in case 3. (4) In contrast, our UKSDG_PF first selects the appropriate knowledge as a human does, and then generates fluent and informative responses by alleviating the mismatched knowledge selection problem with the help of the pretraining-finetuning strategy. This indicates the importance of leveraging the selected knowledge properly for future study.
Related Work

Knowledge access is essential for making use of knowledge. Therefore, knowledge selection, which selects appropriate knowledge given the dialogue context, gains much attention (Meng et al., 2019). In this paper, we focus on knowledge selection in the unsupervised setting, where there is no gold knowledge label. Previous works attempt to deal with this problem via latent models, yet their performance is less than satisfactory. Differently, we design our DDSL to make knowledge selection work well in the unsupervised setting. A very recent work (Zhao et al., 2020b) finetunes GPT-2 (Radford et al., 2019) with an unsupervised pretrained knowledge selection module on unlabeled dialogues. We differ in two aspects: (1) our DDSL leverages knowledge distillation to alleviate label noise at the pretraining stage; (2) we adopt the sample weighting idea at the finetuning stage. We leave leveraging GPT-2 for future study.
Our work is inspired by Distant Supervision (DS), an effective method to generate labeled data from an external knowledge base (KB) for information extraction (Mintz et al., 2009; Min et al., 2013; Zeng et al., 2015; Wang et al., 2018). Following this idea, Gopalakrishnan et al. (2019) use the oracle knowledge from DS to construct the Topical-Chat dataset. Similarly, Qin et al. (2019b) obtain weakly labeled data to train a KB retriever in a task-oriented dialogue system, and other work proposes a distantly supervised learning schema at the segment level to effectively learn the topic transition vector. Although inspired by the same idea, we are devoted to knowledge selection in the unsupervised setting, a different application of DS. Moreover, rather than just using distant supervision, we design our DDSL with label weighting and knowledge distillation to deal with the noisy labeling problem of DS.

Conclusion
We study unsupervised knowledge selection for dialogue generation, where the gold knowledge label is not available. We design the Distilled Distant Supervision Loss, a novel and effective solution to train knowledge selection well in the unsupervised setting. Furthermore, we propose the pretraining-finetuning strategy to deal with the mismatched knowledge selection problem: in the unsupervised setting, models tend to select mismatched knowledge for dialogue generation, which causes the knowledge-aware decoder to degenerate. Experiments show that our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses than many strong supervised baselines.
A.1.4 Supervised Knowledge Selection

TMN, short for End-to-End Transformer MemNet, selects knowledge based on the Transformer memory network and generates responses via the Transformer decoder. TMN BERT+PostKS+CP enhances the encoder with the pretrained BERT, the knowledge selection module with PostKS, and the decoder with the copy mechanism (CP). SKT, short for Sequential Knowledge Transformer, uses the posterior distribution via sequential latent modeling and achieves promising results in the supervised setting. PostKS takes the benefits of latent variable models and leverages the posterior knowledge distribution as a pseudo label for knowledge selection without the knowledge loss. Here, we use the previously reported results.

A.1.6 Knowledge-Aware Generation
Different from our knowledge selection setting, there is much work, termed knowledge-aware generation, that regards the retrieved knowledge pool as a pseudo pre-identified document for dialogue generation with complex knowledge injection mechanisms. Nevertheless, we provide these results for a comprehensive understanding of where this field is going. MTASK-RF (Ghazvininejad et al., 2018) is an early model that realizes knowledge-grounded conversation without crowd-sourced knowledge-grounded dialogues; here we use the previously reported results. ITDD is short for Incremental Transformer with Deliberation Decoder, where the encoder incrementally represents multi-turn dialogues and knowledge, and the decoder conducts response decoding in two passes. DRD is short for Disentangled Response Decoder (Zhao et al., 2020a), a model that exploits pre-training techniques to tackle the low-resource challenge in knowledge-grounded dialogue generation; we choose the variant whose parameters are finetuned on the full training data. KIC integrates recurrent Knowledge-Interaction and knowledge Copy (KIC) to generate informative responses. ZRKGC is a very recent and unpublished work that learns its model in the zero-resource setting, where the dialogue corpus and the knowledge corpus are independent of each other.

A.2 Quantitative Results

Table 6 shows the automatic results in various settings on the Wizard of Wikipedia dataset, from which we have the following observations: (1) The models with the retrieved document generally achieve much lower PPL, but their R-1 is worse than that of our UKSDG_PF. For the worse R-1 score, we think the retrieved document contains redundant and irrelevant information that is harmful for dialogue generation (Zhao et al., 2020b); meanwhile, we attribute the better PPL score to the complex knowledge injection mechanisms. For example, ITDD leverages the deliberation decoder (Xia et al., 2017) to improve context coherence and knowledge correctness, and DRD devises a disentangled response decoder with a pretrained language model, a context processor and a knowledge processor.

(2) Comparing KSDG with KSDG_PF, we again see the generation quality improvement from the pretraining-finetuning strategy; hence, our method is general in both the supervised and unsupervised settings. (3) Comparing UKSDG_vec w/o DDSL with UKSDG w/o DDSL, we see that although selecting the knowledge vector softly allows the gradient from dialogue generation to update knowledge selection directly, compressing the knowledge into a vector loses much information for dialogue generation. This observation, combined with observations (1) and (2), indicates the importance of leveraging the selected knowledge well for future study.