PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism

We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makes models unable to capture the subtle variability observed in different dialogues and unable to distinguish the differences between dialogues that are similar in composition. In this paper, we propose the Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable that aggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.


Introduction
Complex grammatical rules exist in highly variable text data (Gormley and Tong, 2015; Chung et al., 2015; Nie et al., 2022), especially dialogue. As shown in Figure 1, two utterances that differ by only one or two words may have different semantics, such as utterance u_6 of dialogue a vs. u_5 of dialogue b; on the other hand, two dialogues with many semantically similar utterances may express quite different context meanings, such as a vs. b. These variabilities lead to multiple mappings between dialogue context and response: on the response side they cause the one-to-many issue, and on the context side the many-to-one problem. We can observe that the distribution of dialogue contexts (i.e., N_a and N_b) is composed of the distributions of utterances, and the distribution of each utterance consists of the distributions of words (i.e., Figure 1 (c)). Modeling the word-level and utterance-level variation in dialogue therefore plays an important role in improving the quality of responses.
A line of existing research (Henderson et al., 2014; Shang et al., 2015; Luo et al., 2018) employs RNNs (Recurrent Neural Networks) to model the dialogue context. However, researchers have observed that it is inappropriate to employ RNNs to directly model the kind of variability found in dialogue corpora (Chung et al., 2015), because the internal transition structure of RNNs is entirely deterministic and cannot effectively model the randomness or variability of dialogue contexts (Chung et al., 2015).
The variational mechanism has demonstrated attractive merits in modeling variability from both theoretical and practical perspectives (Kingma and Welling, 2014). Methods based on the variational mechanism (Gu et al., 2019; Khan et al., 2020; Sun et al., 2021) introduce latent variables into RNNs to model the one-to-many and many-to-one phenomena in dialogue. Although these approaches achieve promising results, they still have defects. First, the latent variables may vanish because of the posterior collapse issue (Zhao et al., 2017, 2018; Shi et al., 2020); the variational mechanism can work only when latent variables with intractable posterior distributions exist (Kingma and Welling, 2014). Second, the sampled latent variables may not correctly reflect the semantics of the dialogue context due to the one-to-many and many-to-one phenomena observed in dialogue (Sun et al., 2021). Third, posterior knowledge is employed in training while prior knowledge is used in inference, which causes an inconsistency between training and inference (Shang et al., 2015; Zhao et al., 2017; Shi et al., 2020).
arXiv:2212.09086v3 [cs.CL] 12 May 2023

Figure 1: Overview of dialogue variability. (a) and (b) represent two dialogues a and b from DSTC7-AVSD. N_a and N_b represent the distribution composition of dialogues a and b at the utterance level, respectively. N_t^a represents the distribution of dialogue a at time step t at the utterance level. (c) stands for the distribution composition of an utterance. u_i^a and u_i^b represent the i-th utterance of dialogues a and b, respectively. w_j^i stands for the j-th word of the i-th utterance. Δ_3^a denotes the distribution variation caused by u_3 to N_a, and Δ_3^i denotes the variation caused by token w_3^i to the distribution of u_i. The utterances marked in brown in dialogue a indicate that there is a similar expression in dialogue b.

To tackle these problems, we propose a Pseudo-Variational Gated Recurrent Unit (PVGRU), a component based on the pseudo-variational mechanism. PVGRU builds on GRU by introducing a recurrent summarizing variable, which aggregates the accumulated distribution variations of subsequences. Methods based on PVGRU can perceive subtle semantic differences between sequences. First, the pseudo-variational mechanism adopts the idea of latent variables but does not adopt the posterior mechanism (Serban et al., 2017; Zhao et al., 2017; Park et al., 2018; Sun et al., 2021).
Therefore, PVGRU does not suffer from the posterior collapse issue (Zhao et al., 2017, 2018; Shi et al., 2020). Second, we design consistency and reconstruction objectives to optimize the recurrent summarizing variable in PVGRU, which ensure that the recurrent variable reflects the semantics of the dialogue context at the word and utterance levels, respectively. The consistency objective makes the distribution of the incremental information consistent with that of the corresponding input at each time step. For instance, in Figure 1, the distribution of u_3^a is consistent with Δ_3^a = N_3^a − N_2^a, and the distribution of w_3^i is consistent with Δ_3^i. The reconstruction objective demands that the summarizing variable be able to reconstruct the sequence information; for example, we can reconstruct the subsequence information before time step 3 from the distribution N_3^a. Third, we guarantee consistency between training and inference, since we do not employ posterior knowledge when optimizing the summarizing variable.
In addition, we build a Pseudo-Variational Hierarchical Dialogue model (PVHD) based on PVGRU to model the word-level and utterance-level variation. To summarize, we make the following contributions:
• We analyze the reasons for the one-to-many and many-to-one issues from the high variability of dialogue corpora, and propose PVGRU with a recurrent summarizing variable to model the variability of dialogue sequences.
• We propose to optimize the recurrent summarizing variable using consistency and reconstruction objectives, which guarantees that the summarizing variable can reflect the semantics of the dialogue context and maintains consistency between the training and inference processes.
• We propose the PVHD model based on PVGRU, which significantly outperforms strong baselines with RNN and Transformer architectures on two benchmark datasets. The code, including the baselines for comparison, is available on GitHub.

Dialogue Generation
As an important task in Natural Language Processing, dialogue generation systems aim to generate fluent and informative responses based on the dialogue context (Ke et al., 2018). Early dialogue generation models (Henderson et al., 2014;Shang et al., 2015;Luo et al., 2018) usually adopt the simple seq2seq (Sutskever et al., 2014) framework to model the relationship between dialogue context and response in the manner of machine translation.
However, the vanilla seq2seq structure tends to generate dull and generic responses. To generate informative responses, hierarchical structures (Song et al., 2021; Liu et al., 2022) and pre-training techniques (Radford et al., 2019; Lewis et al., 2020; Zhang et al., 2020) are employed to capture the hierarchical dependencies of the dialogue context. Nevertheless, the results of these methods do not meet expectations (Wei et al., 2019).
The main reason is that there are one-to-many and many-to-one relationships between dialogue contexts and responses, and modeling this multi-mapping relationship is crucial for improving the quality of dialogue generation. In this paper, we propose the PVGRU component, which introduces a recurrent summarizing variable into GRU and can model the variability of the dialogue context.

Variational Mechanism
Variational mechanisms enable efficient inference in directed probabilistic models when latent variables with intractable posterior distributions exist (Kingma and Welling, 2014). They can learn the latent relationship between dialogue context and responses by introducing latent variables. Most existing methods based on variational mechanisms (Serban et al., 2017; Zhao et al., 2017; Bao et al., 2020) employ a prior distribution to approximate the true posterior. These methods encounter not only the posterior collapse issue but also the inconsistency between training and inference (Zhao et al., 2018; Shi et al., 2020). In this paper, different from the variational mechanism, we employ consistency and reconstruction objectives to optimize the summarizing variable, which can model the multi-mapping phenomena in dialogues.

Preliminary
In this paper, we employ GRU (Gated Recurrent Unit) (Cho et al., 2014) as the implementation of recurrent neural networks (RNNs). The reset gate r_t is computed by:

r_t = σ(W_r x_t + U_r h_{t−1}),

where σ is the logistic sigmoid function, x_t represents the input at time step t, and h_{t−1} denotes the hidden state at time step t−1. W_r and U_r are learned parameter matrices. Similarly, the update gate z_t is defined as:

z_t = σ(W_z x_t + U_z h_{t−1}).

The hidden state h_t at time step t is then computed by:

h̃_t = φ(W x_t + U (r_t ⊙ h_{t−1})),
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t,

where φ(·) is the tanh function, and W and U are learned weight matrices. GRU is considered a classic implementation of RNNs and is widely employed in generative tasks.
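The GRU recurrence above can be sketched in a few lines of NumPy. This is an illustrative re-implementation rather than the paper's code: bias terms are omitted, and the interpolation convention of Cho et al. (2014) is used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step (Cho et al., 2014); bias terms omitted for brevity."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde                # new hidden state

# Unroll over a short random sequence.
rng = np.random.default_rng(0)
d_x, d_h = 4, 3
p = {"W_r": rng.standard_normal((d_h, d_x)), "U_r": rng.standard_normal((d_h, d_h)),
     "W_z": rng.standard_normal((d_h, d_x)), "U_z": rng.standard_normal((d_h, d_h)),
     "W":   rng.standard_normal((d_h, d_x)), "U":   rng.standard_normal((d_h, d_h))}
h = np.zeros(d_h)
for x_t in rng.standard_normal((5, d_x)):
    h = gru_step(x_t, h, p)
```

Because h_t is a convex combination of the previous state and a tanh output, every coordinate of the hidden state stays in (−1, 1) when initialized at zero.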

Pseudo-variational Gated Recurrent Unit
As shown in Figure 1, it is difficult to distinguish the semantics of similar dialogue contexts relying only on the last hidden state representations. The internal transition structure of RNNs is deterministic, which cannot model the variability observed in dialogues and tends to produce dull and generic responses. Drawing inspiration from the variational recurrent neural network (VRNN) (Chung et al., 2015), our proposed PVGRU explicitly models this variability by introducing a recurrent summarizing variable, which can capture the variations of the dialogue context. VRNN, based on the variational mechanism, employs latent variables that attend to the variation between different words. Different from VRNN, PVGRU maintains a summarizing variable that summarizes the accumulated variations of the sequence.

As shown in Figure 2 (a), PVGRU extends GRU with a recurrent summarizing variable v, which is obtained from the incremental information of the hidden state h and the previous state of the summarizing variable. Specifically, the summarizing variable v_0 is initialized with a standard Gaussian distribution (i.e., Figure 3 (a)). Given the input x_t at time step t, the reset gate r_t is rewritten as:

r_t = σ(W_r x_t + U_r h_{t−1} + V_r v_{t−1}),

where W_r, U_r, and V_r are parameter matrices, and v_{t−1} is the previous summarizing variable state. Similarly, the update gate z_t is computed by:

z_t = σ(W_z x_t + U_z h_{t−1} + V_z v_{t−1}).

We introduce a gate g_t for the summarizing variable, which controls how much information from the previous summarizing variable carries over to the current state:

g_t = σ(W_g x_t + U_g h_{t−1} + V_g v_{t−1}).

Under the effect of g_t, the candidate state h̃_t follows:

h̃_t = φ(W x_t + U (r_t ⊙ h_{t−1}) + V (g_t ⊙ v_{t−1})).

PVGRU then updates its hidden state h_t using the same recurrence equation as GRU.

Figure 2: Overview of PVHD based on PVGRU. (a) is the overview of PVGRU, where "RE" stands for the reconstruction process and "sam" represents the sampling process. (b) is the graphical representation of PVHD.
The summarizing variable v_t at time step t is defined as:

v_t = ϕ(ṽ_t, v_{t−1}),

where ϕ(·) represents a nonlinear neural network approximator and ṽ_t denotes the variation between time t and time t−1. The incremental variation up to time t is defined as:

ṽ_t = h_t − h_{t−1}.

Figure 3 (b) demonstrates the schematic diagram of the recurrent process of PVGRU described above. We can observe that PVGRU does not adopt posterior knowledge, which guarantees consistency between training and inference.
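A minimal NumPy sketch of one PVGRU step follows. It is a sketch under stated assumptions, not the paper's implementation: the gates are assumed to take a parallel V @ v_prev term, ϕ is approximated by a single tanh layer, and the matrix names V_z, W_g, U_g, V_g, V, W_v, and U_v are our own illustrative additions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pvgru_step(x_t, h_prev, v_prev, p):
    """One PVGRU step: a GRU step conditioned on the summarizing variable v.

    Assumptions: each gate takes a parallel V @ v_prev term, and the
    approximator phi is a single tanh layer over (v_inc, v_prev).
    """
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["V_r"] @ v_prev)  # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["V_z"] @ v_prev)  # update gate
    g = sigmoid(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["V_g"] @ v_prev)  # summarizing gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["V"] @ (g * v_prev))
    h = z * h_prev + (1.0 - z) * h_tilde          # same recurrence as GRU
    v_inc = h - h_prev                            # incremental information at step t
    v = np.tanh(p["W_v"] @ v_inc + p["U_v"] @ v_prev)  # phi: accumulate variations
    return h, v

rng = np.random.default_rng(1)
d = 3
names = ["W_r", "U_r", "V_r", "W_z", "U_z", "V_z", "W_g", "U_g", "V_g",
         "W", "U", "V", "W_v", "U_v"]
p = {k: rng.standard_normal((d, d)) for k in names}
h, v = np.zeros(d), rng.standard_normal(d)        # v_0 ~ N(0, I), as in the text
for x_t in rng.standard_normal((4, d)):
    h, v = pvgru_step(x_t, h, v, p)
```

Note that the step stays entirely deterministic given v_0; the only stochasticity enters through the Gaussian initialization of the summarizing variable, consistent with the absence of posterior sampling.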

Optimization Summarizing Variable
Based on but different from the traditional variational mechanism, we design consistency and reconstruction objectives to optimize the summarizing variable. The consistency objective ensures that the distribution of the information increment of the hidden state at each time step is consistent with that of the input; for example, we keep the distribution of the information increment h_t − h_{t−1} at time t consistent with that of x_t. The consistency objective at time step t is denoted as:

L_con^t = KL(p(h_t − h_{t−1}) ‖ p(x_t)),

where KL(·) represents the Kullback-Leibler divergence (Barz et al., 2018) and p(·) represents the distribution of a vector. We use "sam" to denote this distribution sampling process in Figure 2 (a). The reconstruction objective ensures that the summarizing variable correctly reflects the semantics of the dialogue context as a whole, which requires PVGRU to reconstruct the sequence information from the accumulated distribution variable. The reconstruction loss L_rec^t at time step t applies an MLP decoder f(·) to the summarizing variable and penalizes the absolute difference (|·|) from the sequence information, weighted by a hyperparameter δ. We use "RE" to denote this reconstruction process in Figure 2 (a). Figure 3 (c) demonstrates the schematic diagram of optimizing the summarizing variable. Together, the reconstruction and consistency objectives ensure that the summarizing variable correctly reflects the semantics of the dialogue context.
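For intuition, the two objectives can be sketched as follows. The Gaussian parametrization (mean/log-variance heads) and the choice of reconstruction target are our assumptions; the text only specifies a KL term between the increment and the input, and an MLP decoder f with an absolute-value penalty weighted by δ.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def consistency_loss(h_t, h_prev, x_t, head_inc, head_x):
    """KL between the distribution of the increment h_t - h_{t-1} and that of x_t.
    head_* map a vector to (mean, logvar) of an assumed diagonal Gaussian."""
    mu_q, lv_q = head_inc(h_t - h_prev)
    mu_p, lv_p = head_x(x_t)
    return gaussian_kl(mu_q, lv_q, mu_p, lv_p)

def reconstruction_loss(v_t, target, f, delta=1.0):
    """Absolute-value reconstruction penalty; the target (a summary of the
    subsequence) is an assumption, as is f being a simple callable here."""
    return delta * np.sum(np.abs(f(v_t) - target))

# Sanity check: identical distributions give zero consistency loss.
identity_head = lambda vec: (vec, np.zeros_like(vec))
v = np.array([0.1, -0.2, 0.3])
zero_kl = consistency_loss(v, np.zeros(3), v, identity_head, identity_head)
```

The closed-form KL makes the consistency term differentiable end-to-end, which is what allows the summarizing variable to be trained without any posterior network.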

Hierarchical Pseudo-variational Model
As shown in Figure 1, dialogues contain word-level and utterance-level variability. We follow previous studies (Serban et al., 2016, 2017; Huang et al., 2021) in using a hierarchical structure to model the dialogue context. Figure 2 (b) shows the structure of the proposed PVHD, which mainly consists of three modules: (i) an encoder PVGRU; (ii) a context PVGRU; (iii) a decoder PVGRU. The encoder PVGRU is responsible for capturing word-level variability and mapping the utterances {u_1, u_2, ..., u_m} to utterance vectors {h_1^u, h_2^u, ..., h_m^u}; meanwhile, v_t records the accumulated distribution variations of the subsequence at time step t. The context PVGRU captures utterance-level variability: its last hidden state represents a summary of the dialogue, and its last summarizing variable state stands for the distribution of the dialogue. The decoder PVGRU takes the last states of the context PVGRU and produces a probability distribution over the tokens of the response {y_1, y_2, ..., y_n}. The generation process of training and inference can be formally described as:

p(y | c) = ∏_{t=1}^{n} p(y_t | y_{<t}, c),

where c denotes the last states of the context PVGRU. The log-likelihood loss of predicting the response is formalized as:

L_nll = − Σ_{t=1}^{n} log p(y_t | y_{<t}, c).

The total loss can be written as the sum of the log-likelihood loss and the consistency and reconstruction objectives:

L = L_nll + L_con + L_rec.

Figure 4: Kullback-Leibler loss variation trend on DailyDialog (top) and DSTC7-AVSD (bottom). The abscissa represents the number of training iterations. KL represents the Kullback-Leibler loss term.
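The encoder-context dataflow of the three modules can be sketched generically. A plain tanh cell stands in for the PVGRU cells, and the helper names (`run_rnn`, `pvhd_encode`) are illustrative, not from the paper.

```python
import numpy as np

def run_rnn(cell, inputs, h0):
    """Unroll a recurrent cell over a sequence; return the final hidden state."""
    h = h0
    for x in inputs:
        h = cell(x, h)
    return h

def pvhd_encode(dialogue, word_cell, utt_cell, d_h):
    """Hierarchical encoding sketch: an encoder RNN maps each utterance to a
    vector, then a context RNN runs over the utterance vectors. In PVHD both
    cells would be PVGRU cells; here a toy cell stands in."""
    utt_vectors = [run_rnn(word_cell, utt, np.zeros(d_h)) for utt in dialogue]
    return run_rnn(utt_cell, utt_vectors, np.zeros(d_h))  # summary for the decoder

# Toy cell: h' = tanh(W x + U h).
rng = np.random.default_rng(2)
d = 4
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
cell = lambda x, h: np.tanh(W @ x + U @ h)
dialogue = [rng.standard_normal((3, d)) for _ in range(2)]  # 2 utterances, 3 words each
summary = pvhd_encode(dialogue, cell, cell, d)
```

The decoder would condition each generation step on this summary (and, in PVHD, on the last summarizing variable state as well).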

Experiments
For descriptions of the datasets, please refer to Appendix A.1, and to Appendix A.2 for implementation details. In Appendix A.5 we report the ablation results for the two objective functions, showing their effectiveness. To evaluate the reliability of the experimental results, we performed a significance test in Appendix A.6; the p-values of PVHD are less than 0.05 compared with the other models. In addition, we present case studies in Appendix A.7 and discuss model limitations in Appendix 7.

Baselines
Automatic evaluation metrics are employed to verify the generality of PVGRU. We select the following RNN-based dialogue generation models as baselines: seq2seq: a GRU-based sequence-to-sequence model with attention mechanisms (Bahdanau et al., 2015). HRED: a hierarchical recurrent encoder-decoder for dialogue generation. HRAN: a hierarchical recurrent dialogue generation model based on the attention mechanism (Xing et al., 2018). CSG: a hierarchical recurrent model using static attention for context-sensitive generation of dialogue responses (Zhang et al., 2018).
To evaluate the performance of PVHD, we choose dialogue generation models based on the variational mechanism as baselines: HVRNN: VRNN (Variational Recurrent Neural Network) (Chung et al., 2015) is a recurrent version of the VAE; we combine VRNN and HRED to construct HVRNN. CVAE: a hierarchical dialogue generation model based on conditional variational autoencoders (Zhao et al., 2017); we implement CVAE with the bag-of-words loss and KL annealing technique. VAD: a hierarchical dialogue generation model introducing a series of latent variables (Du et al., 2018). VHCR: a hierarchical dialogue generation model using global and local latent variables (Park et al., 2018). SepaCVAE: a self-separated conditional variational autoencoder introducing group information to regularize the latent variables (Sun et al., 2021). SVT: a sequential variational Transformer augmenting the decoder with a sequence of fine-grained latent variables (Lin et al., 2020). GVT: a global variational Transformer modeling discourse-level diversity with a global latent variable (Lin et al., 2020). PLATO: dialogue generation based on a Transformer with a discrete latent variable (Bao et al., 2020); different from the original implementation, we do not use knowledge on DSTC7-AVSD. DialogVED: a pre-trained latent variable encoder-decoder model for dialogue response generation (Chen et al., 2022); we initialize the model with the large version of DialogVED.

Automatic & Human Evaluation
Please refer to Appendix A.3 and Appendix A.4 for details of the automatic evaluation metrics. Some differences from previous work are emphasized here. We employ improved versions of BLEU and ROUGE-L, which better correlate n-gram overlap with human judgment by weighting the relevant n-grams compared with the original BLEU (Chen and Cherry, 2014). Although using the improved versions of BLEU and ROUGE-L results in lower literal values on the corresponding metrics, this does not affect the fairness of the comparison. We adopt the implementation of the distinct-1/2 metrics following a previous study (Bahuleyan et al., 2018). The source code for the evaluation method can be found on the anonymous GitHub.

Table 1 reports the automatic evaluation comparison of the models using GRU and PVGRU. We can observe that the performance of the models based on PVGRU is higher than that of the models based on GRU. Specifically, on the DailyDialog dataset, models based on PVGRU are on average 0.63% to 16.35% better on PPL, 1.40% to 1.92% higher on BLEU-1, 1.08% to 2.02% higher on Rouge-L, 1.10% to 2.33% higher on Dist-1, and 1.36% to 1.62% higher on average embedding compared with models based on GRU. On the DSTC7-AVSD dataset, models based on PVGRU are 0.45% to 5.47% better on PPL, 1.14% to 2.57% higher on BLEU-1, 1.38% to 2.7% higher on Rouge-L, 0.69% to 2.06% higher on Dist-1, and 0.69% to 2.69% higher on average embedding. The results demonstrate that PVGRU can be widely applied to RNN-based sequence generation models. The internal transition structure of GRU is entirely deterministic. Compared with GRU, PVGRU introduces a recurrent summarizing variable, which records the accumulated distribution variations of sequences. This variable brings randomness to the internal transition structure of PVGRU, which makes the model perceive subtle semantic variability.
Table 2 reports the automatic evaluation results of PVHD and the other baselines on the DailyDialog and DSTC7-AVSD datasets. Compared to RNN-based baselines built on the variational mechanism, PVHD enjoys a performance advantage. On DailyDialog, PVHD is 1.16% higher on BLEU-1, 0.45% higher on Rouge-L, 1.01% higher on Dist-1, and 2.22% higher on average embedding compared to HVRNN. Compared with the classic variational mechanism models CVAE, VAD, and VHCR, PVHD has an advantage of 0.02% to 22.75% on PPL, 1.87% to 6.88% on BLEU-1, 1.48% to 3.25% on Dist-1, 0.43% to 13.37% on Dist-2, and 0.80% to 2.76% on average embedding. We observe similar results on DSTC7-AVSD: PVHD enjoys an advantage of 1.3% to 18.22% on PPL, 3.00% to 3.40% on BLEU-1, 0.54% to 1.19% on Dist-1, 1.31% to 5.76% on Dist-2, and 0.11% to 2.22% on average embedding compared with these models. The main reason for the unimpressive performance of the RNN-based baselines is that they suffer from vanishing latent variables, as observed in our experiments. As shown in Figure 4, the Kullback-Leibler loss term of these models is close to zero, meaning that the variational posterior distribution closely matches the prior for a subset of latent variables, which indicates failure of the variational mechanism (Lucas et al., 2019). The performance of SepaCVAE is also unimpressive. In fact, the performance of SepaCVAE depends on the quality of context grouping (referred to as dialogue augmentation in the original paper (Sun et al., 2021)). SepaCVAE degenerates to the CVAE model if context grouping fails to work well, and may even introduce noisy grouping information, resulting in degraded performance. As shown in Figure 4, the Kullback-Leibler loss term of SepaCVAE remains at a high level, which demonstrates that the prior cannot approximate the variational posterior distribution for a subset of latent variables.

Automatic Evaluation Results & Analysis
Compared with Transformer-based baselines, PVHD still enjoys an advantage on most metrics, especially the distinct metrics. GVT introduces latent variables between the whole dialogue history and the response, and thus faces the problem of vanishing latent variables. SVT introduces a sequence of latent variables into the decoder to model the diversity of responses, but it is debatable whether latent variables damage the fragile sequence perception ability of the Transformer, which would greatly reduce the quality of the responses. Training the Transformer from scratch instead of using a pre-trained model is another reason for the inferior performance of SVT and GVT. Compared to DialogVED and PLATO, PVHD achieves the best performance on most metrics. The main reason is that pseudo-variational approaches do not depend on posterior distributions, avoiding the associated optimization problems, and the recurrent summarizing variable can model the diversity of sequences. Overall, PVHD has the most obvious advantages in diversity, which demonstrates the effectiveness of the recurrent summarizing variable.
Although Transformers are popular for generation tasks, our research is still meritorious. First, Transformer models usually require pre-training on large-scale corpora, while RNN-based models usually have no such limitation; it is debatable whether Transformer models trained from scratch can achieve the desired performance when pre-trained language models are unavailable and the downstream task lacks sufficient corpus. Second, the parameter count of an RNN-based model is usually smaller than that of a Transformer-based model. The parameter sizes of PVHD on DailyDialog and DSTC7-AVSD are 29M and 21M, respectively, whereas PLATO and DialogVED have 132M and 1143M parameters on the two datasets, respectively. Compared to PLATO and DialogVED, the average number of parameters of PVHD is 5.28x and 45.72x smaller, respectively.
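The parameter-size ratios follow from simple arithmetic over the figures reported above:

```python
# PVHD: 29M (DailyDialog) and 21M (DSTC7-AVSD); PLATO: 132M; DialogVED: 1143M.
pvhd_avg = (29 + 21) / 2              # average PVHD size, in millions of parameters
plato_ratio = 132 / pvhd_avg          # how many times larger PLATO is
dialogved_ratio = 1143 / pvhd_avg     # how many times larger DialogVED is
# pvhd_avg = 25.0, plato_ratio = 5.28, dialogved_ratio = 45.72
```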

Human Evaluation Results & Analysis
We conduct a human evaluation to further confirm the effectiveness of PVHD. To evaluate the agreement among annotators, we employ Pearson's correlation coefficient (Sedgwick, 2012). The coefficient is 0.35 on diversity, 0.65 on relevance, and 0.75 on fluency, with p-values below 0.001, which indicates agreement among annotators. The results of the human evaluation are shown in Table 3. Compared to RNN-based baselines, PVHD has a significant advantage in relevance and diversity. Specifically, PVHD enjoys an advantage of 11.40% on diversity and 16.00% on relevance compared to SepaCVAE on DailyDialog. On DSTC7-AVSD, PVHD has an advantage of 10.50% on diversity and 73.00% on relevance compared to SepaCVAE. Compared to Transformer-based baselines, although PVHD is sub-optimal on some metrics, it enjoys an advantage on most metrics, especially diversity. In terms of fluency, PVHD is only 1.00% lower than HVRNN and much better than the other baselines on DailyDialog. However, the fluency of PVHD is 26.50% lower than HVRNN and 8.00% lower than VHCR on DSTC7-AVSD. We argue that introducing a recurrent summarizing variable into the decoder increases the randomness of word generation, which promotes the diversity of responses at the cost of some fluency.

Effectiveness of Summarizing Variables
We further analyze the effectiveness of the summarizing variables in PVHD. Figure 5 visualizes the word-level and utterance-level summarizing variables on the test sets of DailyDialog and DSTC7-AVSD. We can observe that both datasets exhibit high variability at the word and utterance levels. Specifically, the word-level summarizing variables show obvious categorical features, which indicates that a subsequence may have multiple suitable candidate words. Moreover, the utterance-level summarizing variables also exhibit clear categorical features, which confirms that there is a one-to-many issue in dialogue. These phenomena make dialogue generation different from machine translation, where a unique semantic mapping exists between the source and target language.

Conclusion
We analyze the reasons for the one-to-many and many-to-one issues from the high variability of dialogue. We build PVHD based on the proposed PVGRU component to model word-level and utterance-level variation in dialogue for generating relevant and diverse responses. The results demonstrate that PVHD even outperforms pre-trained language models on diversity metrics.
Although our work can effectively model the variability in dialogue, we acknowledge some limitations. First, our method works well for RNN-based approaches but cannot be applied to Transformer-based sequence models, which limits its generality. Second, although our method improves the diversity and relevance of responses, there are still gaps in fluency compared with some baselines.

A.2 Implementation Details
We implement our model and the baselines using Tensorflow 2 and train them on a server with an RTX 8000 GPU (48G). The dimension of word embeddings is set to 512. We consider at most 10 turns of dialogue context and 50 words per utterance. The encoder adopts a bidirectional structure and the decoder a unidirectional structure. The hidden size of the encoder and decoder is 1024 for VHCR and 512 for the other models. The size of the latent variables for HVRNN, CVAE, VHCR, VAD, and SepaCVAE is 512, as is the size of the summarizing variables for PVHD. We set the number of encoder layers to 2 and decoder layers to 1 for HVRNN, CVAE, VHCR, VAD, SepaCVAE, and PVHD. The number of encoder and decoder layers is 4 for SVT and GVT, with 4 attention heads. The batch size is 32 for VHCR and 128 for the other models. The initial learning rate is set to 0.001 for HVRNN, CVAE, VAD, SepaCVAE, SVT, GVT, and PVHD, 5e-4 for VHCR, and 3e-4 for DialogVED. We set the dropout rate of DialogVED to 0.1; the other baselines do not employ dropout. Adam (Kingma and Ba, 2015) is utilized for optimization, with beta1 and beta2 set to 0.9 and 0.999, respectively. The maximum number of epochs is set to 100. Beam search with beam size 5 is used to generate responses for evaluation. The values of the hyperparameters described above are all fixed using the validation set.

A.3 Automatic Evaluation Metrics
We employ both automatic and human evaluations to assess the performance of the compared methods. The automatic evaluation mainly includes the following metrics: BLEU (Yang et al., 2018) evaluates the n-gram co-occurrence between the generated response and the target response. ROUGE-L (Yang et al., 2018) evaluates the overlap of the longest common subsequence between the generated response and the target response. Distinct-1/2 (Li et al., 2016) measures the diversity of the generated responses, defined as the number of distinct uni-grams / bi-grams divided by the total number of generated words. PPL (perplexity) evaluates the confidence of the generated response; the lower the PPL score, the higher the confidence. Embedding-based metrics (Average, Extrema, and Greedy) measure the semantic relevance between the generated response and the target response (Liu et al., 2016; Sedoc et al., 2019; Xu et al., 2018b).
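As a concrete reference, distinct-n under the definition above (unique n-grams divided by the total number of generated words) can be computed as follows; note that some implementations divide by the total n-gram count instead.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of generated
    words, as defined in the text."""
    ngrams, total_tokens = set(), 0
    for resp in responses:
        tokens = resp.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)

responses = ["i am fine", "i am ok"]
d1 = distinct_n(responses, 1)   # 4 unique unigrams / 6 tokens
d2 = distinct_n(responses, 2)   # 3 unique bigrams / 6 tokens
```

Higher values indicate less n-gram repetition across the generated responses.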

A.4 Human Evaluation
Following previous work (Sun et al., 2021; Li et al., 2017a; Xu et al., 2018a), we divide six crowdsourced graduate students into two groups to evaluate the quality of the generated responses for 100 randomly sampled input contexts. We ask the annotators to rank the generated responses with respect to three aspects: fluency, diversity, and relevance. Fluency measures whether the generated responses are smooth and grammatically correct. Diversity evaluates whether the generated responses are informative rather than generic or repetitive. Relevance evaluates whether the generated responses are relevant to the dialogue context. The average of the two groups' scores is taken as the final score.

A.5 Ablation Study
We conduct ablation experiments on the proposed loss modules. Table 4 reports the ablation results of PVHD on DailyDialog and DSTC7-AVSD. -RE removes the reconstruction loss; -CO removes the consistency loss. The results demonstrate that both optimization objectives are effective. We observe that the reconstruction loss improves BLEU-1/2 and Rouge-L, while the consistency loss improves the Dist-1/2 metrics at the expense of BLEU-1/2 and Rouge-L. We believe the consistency loss ensures that the incremental information is consistent with the input at each time step; since multiple candidate tokens may follow the same distribution, this increases the diversity of the generated responses. The reconstruction loss makes the summarizing variable, which records the accumulated distribution of the subsequence, correctly reflect the semantic information of the dialogue context, reducing the randomness of the generation process by filtering out candidates that do not conform to the sequence semantics.

A.6 Significance Testing
To evaluate the reliability of the PVHD results, we perform multiple significance tests. Table 6 (in Appendix A) reports the results of the significance test for the automatic evaluation. We can observe that the p-values of PVHD are less than 0.05 compared with the other models. Although the results of PVHD are not optimal on some metrics, the significance tests demonstrate that the results of PVHD are statistically significantly different from those of the other models. In other words, the performance advantage of PVHD is statistically reliable and not an accident caused by random factors.

A.7 Case Study
To further dissect the quality of PVHD, several examples of generated responses are provided in Table 5. Although DialogVED, SVT, and GVT can generate relevant responses, PVHD produces higher-quality responses in comparison. Specifically, for the first example, the responses generated by the other models are contextual, except for SepaCVAE. The response generated by DialogVED is more diffuse than the gold response, while the response generated by PVHD is more informative and, to some extent, uses a different sentence pattern and wording than the gold response. We can observe a similar case for the second example. We believe this is mainly due to the summarizing variable capturing the variability of the corpus, which enables the model to identify similar sentence patterns and words and to generate diverse responses.