CoLV: A Collaborative Latent Variable Model for Knowledge-Grounded Dialogue Generation

Knowledge-grounded dialogue generation has achieved promising performance with the engagement of external knowledge sources. Typical approaches to this task usually perform two relatively independent sub-tasks, i.e., knowledge selection and knowledge-aware response generation. In this paper, in order to improve the diversity of both knowledge selection and knowledge-aware response generation, we propose a collaborative latent variable (CoLV) model that integrates the two aspects simultaneously in separate yet collaborative latent spaces, so as to capture the inherent correlation between knowledge selection and response generation. During generation, our proposed model first draws a knowledge candidate from a latent space conditioned on the dialogue context, and then samples a response from another, collaborative latent space conditioned on both the context and the selected knowledge. Experimental results on two widely used knowledge-grounded dialogue datasets show that our model outperforms previous methods on both knowledge selection and response generation.


Introduction
Knowledge-grounded dialogue generation (Liu et al., 2018; Zhou et al., 2018a; Tian et al., 2020), which utilizes external knowledge to enhance conversation backgrounds, has achieved promising performance. To exploit external knowledge efficiently for conversations, typical approaches (Dinan et al., 2019; Meng et al., 2020; Chen et al., 2021) tend to decompose this task into two streamlined sub-tasks: knowledge selection and knowledge-aware response generation. Besides, some other work (Qin et al., 2019; Tian et al., 2020) also tries to

* First two authors contribute equally. † Corresponding author.

Dialogue context:
What is your favorite number? → I love the number 7.
What do you think about that?

Knowledge candidates:
1. Anyone who dares to kill Cain "will suffer vengeance seven times over".
2. Seven is the natural number following six and preceding eight.
3. Islam first came to the western coast when Arab traders arrived as early as the 7th century CE.
4. The number 7 has been associated with a great deal of symbolism in religion. In western culture, it is often considered lucky.
......
N. This genre has been popular throughout the history of culture.

Response a: Yeah. I know that it is before 8 and after 6!
Response b: Yes, it is known as a lucky number in western countries!
Response c: I think 7 is lucky in certain cultures. It also depicts some religious importance.

Table 1: An example of knowledge-grounded conversation. Given the dialogue context, knowledge selection and response generation are inherently coupled. Moreover, just as knowledge selection is diverse, knowledge-aware response generation can also be diverse given the same knowledge content. Knowledge No. 2 and No. 4 are both appropriate to the dialogue, and given the same knowledge No. 4, both Response b and Response c are appropriate.
integrate these two sub-tasks in a unified memory-augmented training framework. In both paradigms, knowledge selection plays an important role in knowledge-grounded dialogue systems.
Observing that the diversity of knowledge selection (given a dialogue context, several pieces of knowledge may be appropriate) can be dramatically raised by modeling prior and posterior distributions over knowledge, recent studies utilize a posterior mechanism to select knowledge during the training phase. A KL loss (Kullback and Leibler, 1951) is employed as one of the training objectives to minimize the gap between the training and inference procedures, since posterior information is absent at inference. Follow-up work enhances this framework with sequential latent variables, and Chen et al. (2020) propose a knowledge distillation training strategy to further bridge the gap between prior and posterior information.
While the success of variational knowledge selection is indisputable, there still exist some challenges that impede conversational models from selecting appropriate knowledge. Firstly, knowledge selection is inherently coupled with knowledge-aware response generation. However, previous methods mostly emphasize the importance of knowledge selection without explicitly modeling the correspondence between the selected knowledge and the generated response. In Table 1, knowledge No. 2 (in blue) corresponds to response a (in blue), while knowledge No. 4 (in red) is related to responses b and c (in red). Secondly, while the diversity of knowledge selection is effectively improved with variational inference, the diversity of knowledge-aware response generation (given the selected knowledge, several suitable responses can be generated) is still neglected. As shown in Table 1, responses b and c are two different responses that share the same piece of knowledge, i.e., No. 4.
In this paper, in order to simultaneously improve the diversity of both knowledge selection and knowledge-aware response generation, we propose a Collaborative Latent Variable (CoLV) model to integrate both aspects in separate yet collaborative latent spaces, so as to capture the inherent correlation between knowledge selection and response generation. During generation, our proposed model first draws a knowledge candidate from a latent space conditioned on the dialogue context, and then samples a response from another, collaborative latent space conditioned on both the context and the selected knowledge. Experimental results on two widely used knowledge-grounded dialogue datasets show that our model outperforms previous methods on both knowledge selection and response generation. Further analysis of the collaborative latent variables demonstrates CoLV's ability not only to improve the diversity of knowledge selection but also to generate coherent and diverse responses.

Related Work
Our work is mainly related to two research branches: knowledge-grounded dialogue generation and variational auto-encoder learning.
Knowledge-grounded Dialogue Generation has raised broad interest and has been greatly advanced by many new datasets (Dinan et al., 2019; Moghe et al., 2018; Zhou et al., 2018b). Existing methods on this task mainly focus on resolving two research problems: knowledge selection (KS) and knowledge-aware response generation. Dinan et al. (2019) proposed a memory network to retrieve knowledge and combined it with a Transformer-based model to generate responses. External knowledge bases have also been utilized to facilitate utterance understanding and knowledge selection. Lin et al. (2020) used a memory network and a copy mechanism to maintain deep interaction between knowledge and utterances. Meng et al. (2020) employed a dual learning paradigm to enhance knowledge interaction. Su et al. (2020) proposed to augment dialogue generation by utilizing external non-conversational text, which is effective but also introduces noise. Li et al. (2020) and Zhan et al. (2021) proposed to employ pre-training methods on structured/unstructured knowledge representations and to fine-tune the model with the limited knowledge-grounded training examples. Other work has made efforts to utilize future information.
VAE Learning (Kingma and Welling, 2014) is widely used in a variety of natural language processing tasks, including machine translation, question answering, and conversation (Serban et al., 2017; Shen et al., 2019; Li et al., 2020; Shen et al., 2021). The core idea of the variational auto-encoder is to take advantage of posterior or external information during the training phase, and to optimize the objectives by minimizing the KL divergence (Kullback and Leibler, 1951).
Unlike previous work that applied VAEs to dialogue generation (Serban et al., 2017; Qiu et al., 2019), we aim at using collaborative latent variables to connect the external knowledge, dialogue context, and response, which further enhances the correlation between knowledge selection and response generation. To the best of our knowledge, our method is the first attempt to collaboratively model these two different distributions for knowledge-grounded conversations.

Task Formulation
Our goal is to simultaneously improve the diversity of knowledge selection and generate diverse knowledge-aware responses. Formally, we are given a dialogue context c containing |c| tokens, c = {c_1, ..., c_{|c|}}, and its corresponding knowledge pool KP containing |k| knowledge candidate sentences, KP = {k_1, ..., k_{|k|}}. Each knowledge sentence k_i ∈ KP contains M tokens, k_i = {k_i^1, ..., k_i^M}. Our goal has two main steps: (1) selecting the most relevant knowledge sentence k from the knowledge pool KP based on the dialogue context; (2) generating a response r = {r_1, ..., r_{|r|}} with |r| tokens based on the dialogue context c and the selected knowledge k. We aim to tackle this task by learning the conditional collaborative latent distribution of knowledge selection and response generation given the dialogue context, which can be formulated as:

(k, r) ∼ p(k, r|c).

We estimate the collaborative distribution p(k, r|c) by employing a collaborative latent variable model, named the CoLV model.

CoLV Framework
Our proposed CoLV model aims to model the conditional collaborative distribution p(k, r|c) in separate yet collaborative latent spaces for knowledge and response, which is factorized as follows:

p(k, r|c) = p_θ(k|z_k, c) p_θ(r|z_r, c, k),  z_k ∼ p_φ(z_k|c),  z_r ∼ p_φ(z_r|z_k, c),

where z_k and z_r are latent variables for knowledge and response respectively, and p_φ(z_k|c) and p_φ(z_r|z_k, c) are their conditional prior distributions. Specifically, a categorical distribution (Jang et al., 2017) and a Gaussian distribution (Kingma and Welling, 2014) are employed for z_k and z_r respectively, as shown in Figure 1. Knowledge selection is a discriminative task, which is suitably modeled by a categorical distribution, while the continuous Gaussian distribution is appropriate for modeling the response latent variable. As shown in Figure 1, we devise a mutual interaction between these collaborative latent variables for knowledge and response.
Figure 1: The graphical framework of the CoLV model. c: dialogue context, k: knowledge, r: response. Dotted lines denote the training procedure only, while solid lines denote both training and inference.
To construct the collaborative latent variables, we make the response latent space depend on the knowledge latent space through p_φ(z_r|z_k, c), while the knowledge latent space is conditioned on the dialogue context c through p_φ(z_k|c). During the training phase, we use a variational posterior q_ϕ(·) and maximize the Evidence Lower Bound (ELBO):

ELBO = E_{q_ϕ}[log p_θ(k|z_k, c) + log p_θ(r|z_r, c, k)]
       − KL(q_ϕ(z_k|c, k) ∥ p_φ(z_k|c))
       − KL(q_ϕ(z_r|z_k, c, k, r) ∥ p_φ(z_r|z_k, c)),

where θ, φ and ϕ are the parameters of the generation, prior and posterior networks. The graphical framework of our proposed CoLV model is shown in Figure 1.
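As a framework-agnostic illustration of this objective, the sketch below computes the two KL terms and the negative ELBO in numpy. It is a minimal sketch under our own assumptions (diagonal Gaussians, explicit categorical probabilities); the paper's actual loss is computed inside the trained networks.

```python
import numpy as np

def kl_categorical(q, p):
    """KL(q || p) between two categorical distributions given as probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q) - np.log(p))))

def kl_gaussian(mu_q, sig_q, mu_p, sig_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5))

def neg_elbo(recon_ll, q_cat, p_cat, mu_q, sig_q, mu_p, sig_p):
    """Negative ELBO: -(reconstruction log-likelihood) plus both KL terms."""
    return -recon_ll + kl_categorical(q_cat, p_cat) + kl_gaussian(mu_q, sig_q, mu_p, sig_p)

# Toy check: when posterior equals prior, only the reconstruction term remains.
loss = neg_elbo(-2.0, [0.5, 0.5], [0.5, 0.5],
                np.zeros(4), np.ones(4), np.zeros(4), np.ones(4))
```

When the posterior matches the prior, both KL terms vanish and the loss reduces to the negative reconstruction log-likelihood, which is the intuition behind the KL regularizers above.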
During the training phase, our proposed CoLV model maintains two collaborative latent variables, z_k and z_r, which represent the latent variables of knowledge and response respectively. Meanwhile, the variational lower bound includes reconstruction terms and KL divergence terms (Kullback and Leibler, 1951) based on these two latent variables, which are optimized in a unified process.
In the generative process, the latent variables obtained via the prior networks and the selected knowledge are fed to the decoder, which corresponds to the red solid arrows in Figure 1. The generative process is as follows:
Step 1: Sample the knowledge latent variable: z_k ∼ p_φ(z_k|c).
Step 2: Sample the response latent variable: z_r ∼ p_φ(z_r|z_k, c).
Step 3: Generate the response r from the decoder, conditioned on the dialogue context c, the selected knowledge, and z_r.
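The two-step ancestral sampling above can be sketched as follows. This is a toy numpy illustration; the parameter maps are random stand-ins for the learned prior networks, and the dimensions (8-d context, 5 candidates, 16-d latent) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_generative(h_c, n_knowledge=5, d_z=16):
    """Two-step ancestral sampling for CoLV's generative process.

    Step 1: z_k ~ Categorical(pi(c)) picks a knowledge candidate index.
    Step 2: z_r ~ N(mu(z_k, c), sigma^2 I) drives response decoding.
    The linear maps below are untrained stand-ins for p_phi(z_k|c)
    and p_phi(z_r|z_k, c).
    """
    # Stand-in prior over knowledge candidates, conditioned on context.
    logits = h_c @ rng.standard_normal((h_c.shape[0], n_knowledge))
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    z_k = int(rng.choice(n_knowledge, p=pi))      # Step 1: discrete knowledge index

    # Stand-in Gaussian prior conditioned on (z_k, c).
    mu = np.full(d_z, float(z_k))                  # toy conditioning on z_k
    sigma = np.ones(d_z)
    z_r = mu + sigma * rng.standard_normal(d_z)    # Step 2: continuous response latent
    return z_k, z_r

z_k, z_r = sample_generative(h_c=rng.standard_normal(8))
```

The key point mirrored here is the ordering: z_r is sampled only after z_k, so the response latent space is conditioned on the selected knowledge.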

Input Representation
We employ a pre-trained BERT_base model to encode the dialogue context:

e_c = WE(c) + PE(c) + TE(c),  h_c = Avepool(BERT(e_c)),

where e_c and h_c are the initial representation and the hidden representation of the dialogue context after BERT. WE(·), PE(·) and TE(·) refer to the word-level, position-level and turn-level embeddings respectively, and Avepool(·) is the average pooling operation (Cer et al., 2018). Similarly, we employ BERT_base to encode the knowledge candidate sentences, obtaining their initial representation e_kp and hidden representation h_kp after BERT and average pooling:

e_kp = WE(kp) + PE(kp) + TE(kp),  h_kp = Avepool(BERT(e_kp)).

In the same way, we obtain the posterior representations of the ground-truth response, h_r, and knowledge, h_k, for the training phase.
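The embedding-sum-plus-pooling pipeline can be sketched in a few lines of numpy. This is a shape-level illustration only: the toy vocabulary, random embedding tables, and the identity function standing in for the BERT encoder are all assumptions, not the actual pre-trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, L, D = 100, 6, 32   # toy vocab size, sequence length, hidden size

WE = rng.standard_normal((V, D))   # word-level embeddings
PE = rng.standard_normal((L, D))   # position-level embeddings
TE = rng.standard_normal((2, D))   # turn-level embeddings

def encode(token_ids, turn_ids, encoder=lambda x: x):
    """e = WE + PE + TE, then the encoder (identity stands in for BERT),
    then average pooling over the token axis, i.e. Avepool(.)."""
    e = WE[token_ids] + PE[np.arange(len(token_ids))] + TE[turn_ids]
    h = encoder(e)
    return h.mean(axis=0)   # one fixed-size vector per sequence

h_c = encode(np.array([4, 8, 15, 16, 23, 42]), np.array([0, 0, 0, 1, 1, 1]))
```

The same `encode` call would produce h_kp, h_r and h_k from the knowledge, response and ground-truth knowledge token sequences respectively.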

Collaborative Latent Variables
We use two separate but content-dependent latent variables, z_r and z_k, to represent the dialogue response and the knowledge respectively. In the following sections, we discuss the prior network and the posterior network separately.

Prior Network
We use two different conditional prior networks, p_φ(z_k|c) and p_φ(z_r|z_k, c), to model the two tasks. Knowledge selection and response generation are discriminative and generative tasks respectively. To better model the relationship between knowledge selection and response generation collaboratively, we utilize two different distributions: a standard categorical distribution Cat(π) for p_φ(z_k|c) and a Gaussian distribution N(µ, σ²I) for p_φ(z_r|z_k, c). Therefore, z_k and z_r are sampled from:

z_k ∼ Cat(π),  z_r ∼ N(µ, σ²I),

where the parameters π, µ and σ are estimated by:

π = softmax(MLP(h_c)),  µ = MLP([h_c; z_k]),  σ = softplus(MLP([h_c; z_k])),

where MLP(·) denotes a multilayer perceptron, [·; ·] denotes concatenation, and the softplus(·) function, a smooth approximation to ReLU, ensures positiveness.
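A minimal numpy sketch of the two prior heads is given below. The one-hidden-layer MLPs, the random (untrained) weights, and the concrete sizes are illustrative assumptions; the point is the distinct parameterization of the categorical and Gaussian priors.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, Z = 32, 5, 16   # hidden size, #knowledge candidates, response latent size

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0) @ w2    # one-hidden-layer MLP with ReLU

def softplus(x):
    return np.log1p(np.exp(x))           # smooth ReLU; keeps sigma strictly positive

# Stand-in prior-network weights (small random values, untrained).
Wk1, Wk2 = 0.1 * rng.standard_normal((D, D)), 0.1 * rng.standard_normal((D, K))
Wm1, Wm2 = 0.1 * rng.standard_normal((D + K, D)), 0.1 * rng.standard_normal((D, Z))
Ws1, Ws2 = 0.1 * rng.standard_normal((D + K, D)), 0.1 * rng.standard_normal((D, Z))

def prior(h_c, z_k_onehot):
    """p_phi(z_k|c): categorical probs from h_c alone;
    p_phi(z_r|z_k,c): Gaussian (mu, sigma) from the concatenation [h_c; z_k]."""
    logits = mlp(h_c, Wk1, Wk2)
    pi = np.exp(logits - logits.max()); pi /= pi.sum()
    x = np.concatenate([h_c, z_k_onehot])
    mu = mlp(x, Wm1, Wm2)
    sigma = softplus(mlp(x, Ws1, Ws2))
    return pi, mu, sigma

pi, mu, sigma = prior(rng.standard_normal(D), np.eye(K)[2])
```

Note how the Gaussian head consumes z_k: this is what makes the response latent space collaborative with the knowledge latent space.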

Posterior Network
During the training phase, we utilize posterior information to guide training. Similar to the prior networks, we use two conditional posterior networks, q_ϕ(z_k|c, k) and q_ϕ(z_r|z_k, c, k, r), to approximate the true posterior distributions of the latent variables for knowledge k and response r. Therefore, z_k and z_r in the posterior networks are sampled from:

z_k ∼ Cat(π′),  z_r ∼ N(µ′, σ′²I),

where the parameters are estimated by:

π′ = softmax(MLP([h_c; h_k])),  µ′ = MLP([h_c; h_k; h_r; z_k]),  σ′ = softplus(MLP([h_c; h_k; h_r; z_k])).

In the training phase, we adopt the reparameterization trick (Kingma and Welling, 2014) to train our model with back-propagation, since the stochastic sampling processes of both knowledge selection and response generation are non-differentiable. Besides, we employ the Gumbel-softmax (Maddison et al., 2017) for the knowledge selection training procedure, since the latent variable z_k is discrete.
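The two tricks named above can be sketched directly. The reparameterization turns Gaussian sampling into a deterministic function of (µ, σ) plus external noise; the Gumbel-softmax relaxes the discrete categorical sample into a differentiable probability vector. The temperature value and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma):
    """z_r = mu + sigma * eps with eps ~ N(0, I); gradients can flow
    through mu and sigma because the randomness is externalized."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def gumbel_softmax(logits, tau=1.0):
    """Differentiable relaxation of categorical sampling for the discrete z_k.
    Lower tau -> closer to a one-hot sample."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

z_r = reparameterize(np.zeros(16), np.ones(16))
y = gumbel_softmax(np.array([2.0, 0.5, -1.0]), tau=0.5)
```

At inference time the relaxation is unnecessary: z_k can simply be taken as the argmax (or a hard sample) of the prior categorical distribution.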

Heuristic-based Knowledge Selection
The efficiency of the heuristic matching algorithm (Mou et al., 2016) has been demonstrated in many other tasks, such as question answering. Following Lee et al. (2020), we also employ a heuristic-based knowledge selection module. Different from previous work, which selects a relevant knowledge sentence from multiple candidates, our heuristic-based knowledge selection module regards all candidate knowledge sentences as one integrated paragraph. The module then predicts the start and end word positions of a knowledge span. This knowledge span is regarded as the selected knowledge and is incorporated into the following response generation process.
Specifically, given the representation of the dialogue context h_c and the latent variable z_k, the heuristic-based knowledge selection layer concatenates the element-wise addition and multiplication into a new integrated representation h_cat:

h_cat = [h_c; z_k; h_c + z_k; h_c ⊙ z_k],

which is used to predict the knowledge span in the following steps. We feed the integrated representation h_cat into two separate linear layers (as shown in Figure 2) to predict the start and end positions of the knowledge span ks. The knowledge span ks is then extracted and sent to the generation phase.
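The span prediction step can be sketched as follows. This is a shape-level illustration under our own assumptions: token-level states H stand in for the encoded paragraph, z_k is treated as a dense vector broadcast over positions, the linear heads use random weights, and the end position is constrained to come after the start.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 20, 32   # paragraph length (tokens), hidden size

def select_span(H, z_k):
    """Fuse token states H (L x D) with the knowledge latent z_k (D,) by
    concatenating [h; z; h + z; h * z] (the 'adding and multiplying'
    interactions), then score start/end positions with two linear heads."""
    Z = np.broadcast_to(z_k, H.shape)
    h_cat = np.concatenate([H, Z, H + Z, H * Z], axis=-1)    # L x 4D
    w_start, w_end = rng.standard_normal((2, 4 * D))          # stand-in linear heads
    start = int(np.argmax(h_cat @ w_start))
    end = start + int(np.argmax(h_cat[start:] @ w_end))       # enforce end >= start
    return start, end

start, end = select_span(rng.standard_normal((L, D)), rng.standard_normal(D))
```

The extracted token range [start, end] then plays the role of the selected knowledge span ks fed to the decoder.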

Response Generation
In the decoding layer for response generation, we apply a stacked Transformer decoder equipped with a copy mechanism (See et al., 2017) to generate the response. The copy mechanism copies specific knowledge from the selected knowledge span. We feed the dialogue context representation h_c, the selected knowledge span representation h_ks and the latent variable z_r into the decoder. Specifically, the probability of generating token y_t at the t-th step is modeled as:

P(y_t) = λ_1 P_vocab(y_t|h_c, z_r) + λ_2 P_cp(y_t|h_ks),
where P_cp(y_t|h_ks) is the copying probability over the selected knowledge span ks, derived from the decoder's attention over h_ks, and P_vocab(y_t|h_c, z_r) is the output probability from a stack of Transformer decoder layers (Vaswani et al., 2017). λ_1 and λ_2 are the coordination probability parameters.
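The mixture can be made concrete with a small numpy sketch. The attention weights and span token ids below are made-up inputs; the sketch only shows how copy probabilities are scattered back onto vocabulary ids and mixed with the generation distribution.

```python
import numpy as np

def mix_copy(p_vocab, attn, span_token_ids, lam1=0.5, lam2=0.5):
    """P(y_t) = lam1 * P_vocab(y_t) + lam2 * P_cp(y_t), where P_cp scatters
    the attention weights over the knowledge span back onto vocabulary ids
    (weights for repeated tokens are summed)."""
    p_cp = np.zeros_like(p_vocab)
    np.add.at(p_cp, span_token_ids, attn)
    return lam1 * p_vocab + lam2 * p_cp

# Toy example: uniform P_vocab over 10 tokens; the span contains vocab
# ids 2 and 5 with attention weights 0.7 and 0.3.
p = mix_copy(np.full(10, 0.1), np.array([0.7, 0.3]), np.array([2, 5]))
```

Because both P_vocab and the attention weights are normalized, any λ_1 + λ_2 = 1 keeps the mixture a valid distribution; tokens appearing in the knowledge span (here ids 2 and 5) receive boosted probability.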

Experimental Setup
Dataset. We conduct our experiments on two public knowledge-grounded dialogue datasets, Wizard of Wikipedia (WoW) (Dinan et al., 2019) and Holl-E (Moghe et al., 2018). Both benchmarks contain multiple sessions of dialogues with a corresponding pool of knowledge candidates, and each dialogue utterance is associated with a ground-truth knowledge sentence. Statistics of the two datasets are shown in Table 3.
Baseline Models. We compare our CoLV model with several state-of-the-art models, including:
• S2SA: A bidirectional LSTM-based encoder-decoder framework with an attention mechanism (Sutskever et al., 2014). This baseline only considers the dialogue context and does not utilize knowledge information.
• Transformer: An encoder-decoder architecture relying solely on multi-head self-attention mechanisms (Vaswani et al., 2017). It does not consider knowledge information either.
• MemNet: A Transformer memory network for knowledge selection combined with a Transformer decoder for utterance prediction.
• PostKS: An LSTM-based model with a posterior knowledge selection mechanism, which uses the posterior knowledge distribution as a pseudo-label for knowledge selection.
• SLKS: A sequential latent knowledge selection model, which keeps track of the prior and posterior distributions over knowledge in a sequential process.
• DukeNet: A dual knowledge interaction network (Meng et al., 2020), modeling the knowledge shift and tracking processes with a dual learning paradigm.
• PIPM: The SLKS model with a posterior information prediction module and a knowledge distillation training strategy, aiming to bridge the gap between prior and posterior distributions.
Evaluation Metrics. We report accuracy (Acc) to evaluate knowledge selection.¹ Besides, we use the traditional indicators, i.e., perplexity (PPL), BLEU-4 (Papineni et al., 2002), ROUGE-1, ROUGE-2 (Lin, 2004) and Distinct-2, to evaluate the quality of response generation. We also conduct a human evaluation of our model. We randomly sample 300 generated responses and invite six annotators to select their preferred response (win), or vote a tie, considering the following aspects: diversity, coherence and knowledge engagement. Each comparison is conducted between two responses, generated by our CoLV and by a baseline model respectively.
Implementation Details. We implement our proposed model with PyTorch (Paszke et al., 2019). For a fair comparison, we keep the same default settings during dataset pre-processing and the same model parameter settings as previous work. We employ a pre-trained BERT_base model to encode dialogue contexts and knowledge sentences. The initial word embedding size is set to 300, and we cap the sentence lengths of the dialogue context and the knowledge at 64 and 512 respectively. The hidden size is 768 and the vocabulary size is set to 30,522. The batch size is set to 64. Models are trained for 30 epochs to obtain the best performance. For training details, we use Adam (Kingma and Ba, 2015) for gradient optimization in our experiments, with the corresponding parameters β_1 and β_2 set to 0.9 and 0.998. The learning rate is set to 0.001. We use gradient clipping with a maximum gradient norm of 0.4. We run all models on Tesla P40 GPUs and select the best models based on performance on the validation set.

¹ Note that lower perplexity (PPL) indicates better performance. For the evaluation of knowledge selection, only knowledge spans with both correct start and end positions are counted in the accuracy. Partially correct samples are not counted in the accuracy calculation, but we analyze the KS performance in Section 4.5.

Experimental Results
Automatic Evaluation Results. The quantitative evaluation results on the WoW and Holl-E datasets are shown in Table 2 and Table 4 respectively. Overall, CoLV outperforms the baselines on most metrics on both datasets. In terms of knowledge selection accuracy, CoLV outperforms the three strong baselines SLKS, DukeNet and PIPM on WoW Test Seen by 11.0%, 16.2% and 8.7% respectively, which is significant. Even though the accuracy of CoLV on WoW Test Unseen is slightly lower than PIPM's, it still outperforms the other baselines. The reason CoLV improves knowledge selection performance is that it employs two collaborative latent variables simultaneously, which resolves the gap between knowledge and response. Besides, in terms of generation performance, CoLV also shows a significant improvement over the baseline models, verifying the consistency of the improvements on both knowledge selection and response generation.
Human Evaluation Results. The human evaluation results are shown in Table 6. For each case, given a post-knowledge pair, two generated responses are provided, one from our model and the other from the compared model. Not surprisingly, CoLV consistently outperforms all the compared models. Meanwhile, we notice that CoLV exhibits significant improvements compared with the vanilla S2SA and Transformer. Besides, CoLV substantially outperforms strong baselines, e.g., SLKS and PIPM. We analyze the bad cases and find that some baselines still suffer from generic or knowledge-irrelevant responses.
Augmented with the collaborative latent variables, CoLV yields a competitive boost in response quality, which is in line with the automatic evaluation, confirming the superior performance of our proposed model. We also employ Fleiss' kappa (Fleiss, 1971) to measure inter-annotator reliability, and the results show that annotators reach moderate agreement.

Ablation Study
To examine the effectiveness of the proposed CoLV model, we conduct model ablations by removing particular modules from CoLV, including the knowledge latent variable, the response latent variable and the heuristic matching module; the results are shown in Table 5 and Table 7. We observe that without either the knowledge latent variable or heuristic matching, knowledge selection performance drops substantially in terms of the accuracy metric. This result verifies the effectiveness of integrating these two modules into the knowledge selection process. Besides, the values of the generative metrics, e.g., PPL, BLEU-4, ROUGE-1/2 and Dist-2, also drop significantly if we remove the response latent variable. This affirms that the collaborative latent variables help to refine the coherence, knowledge engagement and diversity of the generated responses. When we remove all three modules, our model performs similarly to the baseline MemNet, a vanilla Transformer model with a knowledge memory network.

Case Study
To facilitate a better understanding of the baselines and our model, we present some examples in Table 8. To better evaluate response generation, we choose a case from WoW Test Seen in which the three baseline models SLKS, DukeNet and PIPM as well as our model select the same knowledge (marked in yellow in Table 8) from the knowledge pool. We observe that even though both the baselines and our model select the true knowledge sentence, our model still achieves better performance in response generation. For example, SLKS generates a counterfactual response that is not consistent with the original knowledge: the original knowledge states that "Ireland is the third-largest island in Europe", but SLKS generates "Ireland is the largest". Besides, to show the effectiveness of our model in generating diverse responses, we present several different responses all generated by our model. As shown in Table 8, our model is able to engage different parts of the knowledge sentence and then generate diverse and coherent responses. The reason our model can generate different yet semantically coherent responses is that the collaborative latent variables in CoLV consider the two distributions collaboratively.

Analysis of Heuristic-based KS
We conduct a further experiment to analyze whether our fine-grained knowledge selection performs better than traditional sentence-level matching methods. The knowledge selection results on three test sets are shown in Figure 3. Different from previous sentence-level knowledge selection, our method first treats all knowledge sentences as an integrated paragraph and selects a knowledge span from this paragraph. Only when both the start and end positions of a knowledge span exactly match the original knowledge sentence (100% matching) is it counted by the accuracy metric. However, we observe that in our test sets, many "bad" cases still select partially correct knowledge content. Therefore, we conduct further statistics on the accuracy at different percentages of correct knowledge span, as shown in Figure 3. Taking the WoW Test Seen dataset as an example ((a) in Figure 3), our model reaches around 0.35 accuracy at the 80% and 90% levels of correct knowledge span, which is significantly higher than the baseline models. Considering that conversational models usually do not engage all of the knowledge context in response generation, we argue that 80% and 90% correct knowledge spans are acceptable in real application scenarios. Therefore, CoLV is more practical and flexible than existing methods.

Table 9: Qualitative analysis of collaborative latent variables: knowledge-response pairs generated by our model. "GT pair" denotes the ground-truth knowledge-response pair in the dataset; "Pair-1", "Pair-2" and "Pair-3" are generated by our model.
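One way to compute the relaxed span accuracy described above is sketched below. This is our own illustrative formulation (overlap measured as the fraction of the gold span covered by the prediction, with inclusive token indices); the paper does not spell out its exact overlap definition.

```python
def span_overlap(pred, gold):
    """Fraction of the gold knowledge span covered by the predicted span.
    pred and gold are (start, end) token indices, inclusive on both ends."""
    inter = min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1
    return max(inter, 0) / (gold[1] - gold[0] + 1)

def accuracy_at(preds, golds, threshold):
    """Accuracy counting a prediction as correct when its overlap with the
    gold span reaches the threshold (e.g. 0.8 for '80% correct span')."""
    hits = sum(span_overlap(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under this definition, a predicted span (2, 9) against a gold span (0, 9) scores 0.8, so it counts as correct at the 80% threshold but not under exact (100%) matching.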

Effects of Collaborative Latent Variables
We conduct a further qualitative analysis of the collaborative latent variables. Firstly, we sample the knowledge latent variable multiple times to obtain different knowledge. Then, for each selected knowledge, we employ the decoder to generate a corresponding response. As shown in Table 9, our CoLV is able to select different knowledge content and then generate corresponding responses. We notice that all responses in Pair-1 to Pair-3 are coherent and fluent with respect to the dialogue context.
Besides, the knowledge information is appropriately engaged in the responses. Therefore, the two latent variables in our CoLV model are effective in helping select diverse knowledge and then generate coherent responses.

Conclusion
In this paper, we propose a novel collaborative latent variable (CoLV) model to simultaneously learn to select knowledge and generate responses in knowledge-grounded dialogue generation. The CoLV model improves diversity not only in knowledge selection but also in response generation given a specific piece of knowledge.
Besides, the CoLV model uses two collaborative latent variables to couple the knowledge and the dialogue. Extensive experiments on two benchmark datasets show that CoLV achieves satisfactory performance, indicating that CoLV can select more diverse knowledge and generate more coherent and diverse responses than baseline models.