Generating Relevant and Coherent Dialogue Responses using Self-Separated Conditional Variational AutoEncoders

Conditional Variational AutoEncoder (CVAE) effectively increases the diversity and informativeness of responses in open-ended dialogue generation tasks by enriching the context vector with sampled latent variables. However, due to the inherent one-to-many and many-to-one phenomena in human dialogues, the sampled latent variables may not correctly reflect the contexts' semantics, leading to irrelevant and incoherent generated responses. To resolve this problem, we propose Self-separated Conditional Variational AutoEncoder (abbreviated as SepaCVAE), which introduces group information to regularize the latent variables, enhancing CVAE by improving the responses' relevance and coherence while maintaining their diversity and informativeness. SepaCVAE actively divides the input data into groups, and then widens the absolute difference between data pairs from distinct groups while narrowing the relative distance between data pairs in the same group. Empirical results from automatic evaluation and detailed analysis demonstrate that SepaCVAE can significantly improve the quality of responses on well-established open-domain dialogue datasets.


Introduction
When conversing with a human user, an open-domain dialogue system is expected to generate human-like responses: responses that are not only diverse and informative, but also contain relevant and cohesive information that correctly addresses the dialogue context. By using sampled latent variables, Conditional Variational AutoEncoders (CVAEs) are powerful tools for ensuring the diversity and informativeness of generated responses (Bowman et al., 2016; Serban et al., 2017; Shen et al., 2017; Chen et al., 2018). Yet, it is challenging for a CVAE-based dialogue generation model to keep the responses relevant and coherent. The challenge arises because human dialogues inherently exhibit the one-to-many and many-to-one phenomena (Csaky et al., 2019): the same context can lead to very different responses, and different contexts can lead to the same response. As a result, the latent variables sampled by CVAE often fail to capture the correct contextual semantics, as shown in Fig. 1, leaving open the possibility that similar contexts produce drastically different latent variables.

Figure 1: In this example, the latent variables (z_1, z_2, z_3) sampled by a general CVAE model do not inherit the semantic relationship of the contexts (c_1, c_2, c_3). Although c_1 and c_2 have a high similarity, the similarity between z_1 and z_2 is low; c_2 and c_3 have a low similarity, but z_2 and z_3 have a high similarity.

This has two particular drawbacks. First, the discrepancy between latent variables can lead to irrelevant and incoherent generated responses. Different latent variables in a continuous latent space correspond to different responses (Bowman et al., 2016). As dissimilar latent variables may be sampled for similar contexts, the generated responses for contexts in the test set can be drastically different from responses to similar contexts in the training set. For instance, given the context "Everything about this movie is awesome!", a standard CVAE may generate responses as dissimilar as "Smartphones of the best games!" and "Caves would never say yes, but I'd love to know." (Gao et al., 2019). This approach thus sacrifices too much relevance and coherence for diversity and informativeness.
Second, the disparity between contexts and latent variables hurts model generalizability. Model generalizability is often evaluated using a separate dataset taken from a distribution similar to the training set's (e.g., a validation set or a noisy version of the training set). High generalizability is indicated if the model can transfer favourable abilities from the training set to this second dataset, in the sense that it produces consistent responses for similar contexts across the two datasets. This suggests that the model has acquired certain semantic relations between sentences from the training set. However, if the sampled latent variable departs significantly from the contextual semantics, the model may perform quite differently on the second dataset than on the training set.
To address these drawbacks, we propose a novel model, namely Self-Separated Conditional Variational Autoencoder (SepaCVAE). SepaCVAE proactively partitions the input data into a number of groups, and then widens the absolute differences between data pairs across different groups while narrowing the relative distance between data pairs within the same group. In this way, SepaCVAE aims to put contexts that sample similar latent variables into the same groups, thereby regularizing the latent variables. The design of SepaCVAE involves three components built on top of the standard CVAE. First, inspired by image augmentation, we propose a dialogue augmentation method to partition data without any prior knowledge. For this, we construct N orthogonal vectors to classify data into N groups, which retain the original semantic relationships of data within a group. We directly enlarge the semantic distance of data across different groups. Then, we propose a gradient blocking algorithm to select the most suitable group for each data point according to the gains obtained from different groups. Here, the gains are evaluated using the reconstruction loss. Finally, inspired by the contrastive learning paradigm (Cai et al., 2020; Chen et al., 2020a,b; Mitrovic et al., 2020), we propose relationship enhancement to increase the similarity between the representations of data within the same group, and to differentiate the representations of data between different groups.
Contributions: Our first contribution is a theoretical analysis on why sampled latent variables fail to reflect the contexts' semantics. The next contribution lies in the proposal of SepaCVAE to overcome issues of irrelevant and incoherent responses caused by standard CVAE. Our third contribution involves a series of experiments. The results show that our SepaCVAE can generate more relevant and coherent responses compared to existing methods.

Related work
CVAE models are conversational models based on variational reasoning. Many existing CVAE models have achieved state-of-the-art performance by generating diverse and informative responses. Moreover, as opposed to methods that introduce external semantic information, CVAE models use latent variables to represent such information; hence they can be applied when external information is not available. Compared with models based on RL or GANs, CVAE models are simpler and can be easily trained. In addition, CVAE models can be enhanced by methods that use RL or GANs as generators to further improve their performance.
However, empirical evidence (Gao et al., 2019; Gu et al., 2019) has indicated that while the use of latent variables may make the generated responses more diverse and informative, it can also reduce relevance and coherence. To alleviate this issue, CVAE models have been used in combination with external information such as persona information, dialogue history, and dialogue acts (Shen et al., 2017; Serban et al., 2017). However, simply borrowing external information is not sufficient to resolve the one-to-many issue, especially when the amount of data is very large. No existing model resolves the core issue of the problem, namely that the latent variable inherits little semantic information from the context sentence, a consequence of the inherent one-to-many and many-to-one phenomena of human conversations. To address this issue, we propose the SepaCVAE model, which trains latent variables that inherit contextual semantics.

Self-supervised methods used in dialogue generation tasks
Recently, self-supervised methods such as contrastive learning, popularized in computer vision (Chen et al., 2020a,b), are drawing increasing attention in NLP (Wu et al., 2019; Clark et al., 2020; Cai et al., 2020). Generally speaking, the major issue in applying contrastive learning is how positive and negative examples are constructed. Many existing works explore ways to design reasonable pairs of positive and negative examples that accurately capture the semantic relations of these pairs, so that the obtained representations can be better used in downstream tasks.

Problem formulation
The problem with the standard CVAE model lies in that the sampled latent variables may not accurately reflect the contextual semantics, due to the one-to-many (one context may correspond to many responses) and many-to-one (many contexts may correspond to one response) phenomena. This leads to irrelevant and incoherent responses and harms model generalizability. Our aim is to adapt sampled latent variables to capture the contextual semantics, so that the effects of these phenomena are neutralized. This in turn helps generate relevant and coherent responses. With this goal, we focus on single-turn dialogue datasets, where one-to-many situations appear more frequently than in multi-turn dialogue datasets.

Preconditions
This section formally analyzes the many-to-one and one-to-many phenomena, and presents several important assumptions and contextual information (i.e., preconditions) for the CVAE model.

Notations: φ and θ are the parameters of CVAE's recognition network and prior network, respectively; c represents the condition information, x and r represent the generation target, and z represents the latent variable.

Precondition 1: Bowman et al. (2016) confirmed that the latent space is continuous; the latent variable z is highly correlated with the target data x, meaning that different z will reconstruct different x.

Precondition 2: CVAE has a recognition network q_φ(z|c, x) and a prior network p_θ(z|c) to approximate the true posterior distribution p(z|c, x) and prior distribution p(z|c), respectively. These distributions are assumed to be Gaussian, e.g., q_φ(z|c, x) ~ N(µ, σ²).

Precondition 3: To efficiently train a CVAE model, the Stochastic Gradient Variational Bayes (SGVB) framework (Sohn et al., 2015; Yan et al., 2016; Kingma and Welling, 2014) is adopted, which maximizes the variational lower bound of the conditional log-likelihood:

L(θ, φ; x, c) = E_{q_φ(z|x,c)}[log p(x|z, c)] − KL(q_φ(z|x,c) || p_θ(z|c)),   (1)

where KL denotes the Kullback-Leibler divergence. During training, the σ of q_φ(z|x, c) becomes smaller and smaller, and the µ of q_φ(z|x, c) moves closer and closer to the z corresponding to x, which stabilizes E_{q_φ(z|x,c)}[log p(x|z, c)] and makes it converge.
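Precondition 3's objective can be made concrete with a small numerical sketch. The helper names below (gaussian_kl, elbo) are our own, and the reconstruction term is treated as an already-computed Monte-Carlo estimate; this is an illustration of the SGVB lower bound, not the paper's implementation.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo(log_px_given_zc, mu_q, logvar_q, mu_p, logvar_p):
    """Variational lower bound: E_q[log p(x|z,c)] - KL(q(z|x,c) || p(z|c)).
    log_px_given_zc stands in for a Monte-Carlo estimate of the
    reconstruction term."""
    return log_px_given_zc - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

When the recognition and prior distributions coincide, the KL term vanishes and the bound reduces to the reconstruction term, which is the behaviour Eq. (1) describes.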

Demonstrating the existence of the problem
We use Fig. 2 to illustrate the impact of the one-to-many and many-to-one phenomena on a trained standard CVAE model. Consider the situation in Fig. 2(a), where the context c_1 has two different responses r_1 and r_2. By Precondition 2, we assume two approximate posterior distributions p(z|c_1, r_1) ~ N(µ_1, σ_1²) and p(z|c_1, r_2) ~ N(µ_2, σ_2²), and one approximate prior distribution p(z|c_1) ~ N(µ, σ²). By Precondition 3, during training, µ_1 and µ_2 move closer to the latent variables that can be reconstructed into r_1 and r_2, respectively.

Figure 2: The change to the probability distributions of the latent variables of a standard CVAE during training. (a) One-to-many phenomenon: since a context may correspond to two different possible responses r_1 and r_2, the posterior distributions p(z|c_1, r_1) and p(z|c_1, r_2) are also different. This jeopardizes the requirement of the standard CVAE that these posterior distributions be similar to the prior distribution p(z|c_1). Therefore, the latent variables sampled from p(z|c_1) may lead to irrelevant and incoherent responses and harm generalization performance. (b) Many-to-one phenomenon: since two different contexts c_1 and c_2 may share the same response r_1, the two prior distributions p(z|c_1) and p(z|c_2) have corresponding posterior distributions p(z|c_1, r_1) and p(z|c_2, r_1). Since the latent variable z mainly corresponds to the response r, p(z|c_1, r_1) and p(z|c_2, r_1) can be assumed to be the same, i.e., p(z|*, r_1). Therefore, the prior distributions p(z|c_1) and p(z|c_2) also tend to be the same.

By Precondition 1, as r_1 is different from r_2, µ_1 should also be different from µ_2. Otherwise, the latent variables sampled from p(z|c_1, r_1) and p(z|c_1, r_2) tend to be the same, making these latent variables irrelevant to the responses; this leads to the vanishing latent variable problem (Bowman et al., 2016). Therefore, µ_1 and µ_2 cannot be the same, and their discrepancy can be considered stable; only in this way can we ensure a one-to-one correspondence between latent variables and responses.

From Precondition 3, it is easy to see that p(z|c) is only affected by p(z|c, r). Hence, we ignore E_*[·] in Eq. (1) and use KL(p(z|c, r) || p(z|c)) to analyze the trend of p(z|c) during training. Considering Fig. 2(a), the total KL term of (c_1, r_1) and (c_1, r_2) equals KL(p(z|c_1, r_1) || p(z|c_1)) + KL(p(z|c_1, r_2) || p(z|c_1)). We provide details of the computation in Appendix A. The sum can be simplified as:

log (σ² / (σ_1 σ_2)) + (σ_1² + σ_2² + (µ − µ_1)² + (µ − µ_2)²) / (2σ²) − 1.

Hence, setting the partial derivatives to zero, we can compute the µ* and σ* that minimize the above:

µ* = (µ_1 + µ_2) / 2,    σ*² = (σ_1² + σ_2²) / 2 + (µ_1 − µ_2)² / 4.

The derivation above provides insight into the problem caused by the one-to-many phenomenon in Fig. 2(a): after training, the prior conditional probability is p(z|c_1) ~ N(µ*, σ*²), which is used at inference time. If the difference between r_1 and r_2 widens, the difference between µ_1 and µ_2 also widens, and µ* moves further away from both µ_1 and µ_2. During inference, the latent variables sampled from p(z|c_1) then have a high probability of differing from those sampled from p(z|c_1, r_1) and p(z|c_1, r_2). These latent variables introduce irrelevant information and contribute to the generation of irrelevant responses. In addition, as one response r_1 may correspond to different contexts c_1 and c_2, as shown in Fig. 2(b), p(z|c_1) and p(z|c_2) tend to be the same, which contributes to the phenomenon that different contexts can sample similar latent variables.
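The closed-form minimizer of the KL sum, µ* = (µ_1 + µ_2)/2 and σ*² = (σ_1² + σ_2²)/2 + (µ_1 − µ_2)²/4, can be verified numerically. The sketch below (our own code, scalar Gaussians) grid-searches around the closed-form point and confirms that no nearby (µ, σ²) achieves a lower KL sum.

```python
import numpy as np

def kl_sum(mu, var, mu1, var1, mu2, var2):
    """KL(N(mu1,var1)||N(mu,var)) + KL(N(mu2,var2)||N(mu,var)), scalar Gaussians."""
    def kl(mq, vq):
        return 0.5 * (np.log(var / vq) + (vq + (mq - mu) ** 2) / var - 1.0)
    return kl(mu1, var1) + kl(mu2, var2)

mu1, var1, mu2, var2 = -2.0, 0.5, 3.0, 1.5
# Closed-form minimizer derived in the text / Appendix A:
mu_star = (mu1 + mu2) / 2.0
var_star = (var1 + var2) / 2.0 + (mu1 - mu2) ** 2 / 4.0

# Grid-check that no nearby (mu, var) does better than the closed form.
best = kl_sum(mu_star, var_star, mu1, var1, mu2, var2)
for mu in np.linspace(mu_star - 2, mu_star + 2, 41):
    for var in np.linspace(max(var_star - 2, 0.1), var_star + 2, 41):
        assert kl_sum(mu, var, mu1, var1, mu2, var2) >= best - 1e-9
```

Note how a large gap between µ_1 and µ_2 inflates σ*², matching the argument that widely separated responses make the prior both misplaced and diffuse.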
In short, similar contexts can correspond to different latent variables and different contexts can correspond to similar latent variables, which explains why the latent variables cannot accurately reflect the contexts' semantics.

Method
In this section, we introduce in detail the proposed SepaCVAE model and its three key components, dialogue augmentation, gradient blocking, and relationship enhancement.

Self-Separated CVAE
As shown in Fig. 3, SepaCVAE uses G(·) to separate the contexts into different groups. For the one-to-many phenomenon, the contexts in different groups have different prior distributions p(z|G_*(·)), each of which is easily affected by its different posterior distributions. As for the many-to-one phenomenon, SepaCVAE makes the contexts (c_1, c_2) generate latent variables related to the response r_1 only when they contain the group information G_1(·); the other groups help the contexts align with other latent variables.

Dialogue augmentation
In SepaCVAE, we first propose dialogue augmentation (see Algorithm 1), which designs a group of orthogonal vectors (y_1, y_2, ..., y_N) to separate the contexts into different groups. These vectors (y_1, y_2, ..., y_N) are called group information.
Algorithm 1 Dialogue augmentation
Input: C_ori (1×m): the vector representation of the original context sentence after word embedding; N: the number of groups (a hyper-parameter); m: the dimension of the word embedding.
Output: C_ext (N×m): vector representations of the context sentences after augmentation; Y_ext (N×1): the labels of the augmented contexts.

In SepaCVAE, we apply Algorithm 1 to extend each dialogue pair (c_i, r_i) to [(c_i + y_1, r_i), (c_i + y_2, r_i), ..., (c_i + y_N, r_i)] before training starts. If different contexts c_i, c_j, ... have the same y_i added, then these contexts belong to the same group. In this way, all contexts keep a certain relationship within the same group. In this work, N is set to 8. Since we use c + y in place of the original c, the variational lower bound of SepaCVAE is rewritten as:

L(θ, φ; x, c, y) = E_{q_φ(z|x, c+y)}[log p(x|z, c+y)] − KL(q_φ(z|x, c+y) || p_θ(z|c+y)).   (2)
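The group vectors and the c + y_i augmentation can be sketched as follows. The QR-based construction and the function names are our own illustration; the paper's exact construction of (y_1, ..., y_N) may differ, but any set of mutually orthogonal vectors satisfies the description above.

```python
import numpy as np

def make_group_vectors(n_groups, dim, scale=1.0, seed=0):
    """Build n_groups mutually orthogonal vectors in the embedding space
    via QR decomposition of a random matrix (a stand-in for the paper's
    construction of the group information y_1..y_N)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, n_groups)))
    return scale * q.T            # shape: (n_groups, dim), orthonormal rows

def augment(context_vec, group_vectors):
    """Extend one context vector into one copy per group: c + y_i."""
    return context_vec[None, :] + group_vectors

Y = make_group_vectors(n_groups=8, dim=300)   # N = 8 as in the paper
c = np.random.default_rng(1).standard_normal(300)
C_ext = augment(c, Y)                         # shape: (8, 300), labels 0..7
```

Because the y_i are orthogonal, contexts given the same y_i keep their original pairwise geometry, while contexts in different groups are pushed apart by a fixed offset.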

Gradient blocking
Before gradient back-propagation, we propose gradient blocking (see Algorithm 2 in Appendix B for implementation details) to filter the gradients. Since we extend the dialogue pair (c, r) to [(c + y_1, r), (c + y_2, r), ..., (c + y_N, r)], optimizing the model with all calculated gradients would cause y_1, y_2, ..., y_N to be treated as noise. Therefore, we choose the largest variational lower bound calculated from the dialogue pair (c, r), whose group information is the positive group information y+:

L(·, y+) = max_{i ∈ {1,...,N}} L(θ, φ; x, c, y_i).   (3)

For each [(c + y_1, r), (c + y_2, r), ..., (c + y_N, r)], we only pass L(·, y+) to optimize the model.

Relationship enhancement
Through dialogue augmentation and gradient blocking, the positive y+ for each dialogue pair (c, r) is captured. We then propose relationship enhancement, which is inspired by contrastive learning, to adjust the separation results. Responses under the same y+ are considered to be in the same group and can thus be seen as positive samples; similarly, responses under different y+ are seen as negative samples. From the perspective of contrastive learning, we design a relationship enhancement loss, L_re, to help our model achieve this representation learning, where x' represents the embedded generated response, f(·) represents our model's encoder, Pos is the number of positive samples, and Neg is the number of negative samples.
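The intent of L_re can be sketched with a simple cosine-based contrastive loss. This is our own hedged illustration, not the paper's exact formula (which is not reproduced here): it rewards high similarity to positives (same y+) and low similarity to negatives (different y+).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def relationship_enhancement_loss(anchor, positives, negatives):
    """A sketch of L_re: pull same-group representations (positives)
    toward the anchor and push other-group representations (negatives)
    away. Minimizing this favours high positive and low negative
    similarity; the paper's exact loss may differ in form."""
    pos_term = sum(cosine(anchor, p) for p in positives) / max(len(positives), 1)
    neg_term = sum(cosine(anchor, n) for n in negatives) / max(len(negatives), 1)
    return neg_term - pos_term
```

Here the anchor and the positive/negative samples stand in for encoder outputs f(x') of generated responses grouped by their y+.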
In addition, we introduce an MLP that predicts y+ from the vector representation f(x') of the generated response, with a corresponding prediction loss L_Y. Overall, SepaCVAE is trained by maximizing the rewritten variational lower bound jointly with the L_re and L_Y objectives. Following the KL annealing trick (Bowman et al., 2016), the weight α increases linearly from 0 to 1 over the first 10,000 batches.

Dataset
We use two public dialogue datasets in our experiments and convert them into single-turn dialogue data. The first dataset, DailyDialog (Li et al., 2017b), consists of dialogues that resemble human daily communication. The second dataset, OpenSubtitles (Tiedemann, 2009), includes a large collection of conversations converted from movie transcripts in English.

Table 1: Key statistics of the datasets after processing.

dataset name  | vocab  | train  | valid | test
DailyDialog   | 10,064 | 18,406 | 2,008 | 988
OpenSubtitles | 87,840 | 5M     | 100K  | 50K

Data pre-processing
In this work, we extract single-turn dialogues from the two dialogue datasets, DailyDialog and OpenSubtitles. From a multi-turn dialogue (u_1, u_2, ..., u_T), we can extract T − 1 single-turn dialogues [(u_1, u_2), (u_2, u_3), ..., (u_{T−1}, u_T)], where u represents an utterance. As discussed above, compared with multi-turn dialogue datasets, single-turn dialogue datasets contain a more serious one-to-many problem. Therefore, experimenting on single-turn dialogue datasets highlights the problem of the general CVAE model and reflects the effect of our method. We utilize 300-dimensional GloVe embeddings (Pennington et al., 2014) to represent these dialogues as vectors. Since the tokens in GloVe do not cover all tokens in the DailyDialog and OpenSubtitles datasets, we use GloVe's token list to filter these datasets. Table 1 lists key statistics of the datasets after processing. In addition, we count the one-to-many samples of both datasets and find that 408 contexts in DailyDialog and 90,149 contexts in OpenSubtitles have multiple responses. In particular, a context in OpenSubtitles has a maximum of 623 responses, while a context in DailyDialog has a maximum of 29 responses, which shows that the one-to-many phenomenon is more prevalent in the OpenSubtitles dataset.
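The single-turn extraction described above can be sketched directly (hypothetical helper name):

```python
def extract_single_turn(dialogue):
    """Turn a multi-turn dialogue (u1, ..., uT) into T-1 single-turn
    (context, response) pairs: (u1, u2), (u2, u3), ..."""
    return list(zip(dialogue[:-1], dialogue[1:]))

pairs = extract_single_turn(["hi", "hello", "how are you?", "fine"])
# pairs == [("hi", "hello"), ("hello", "how are you?"), ("how are you?", "fine")]
```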

Automatic evaluation metrics
We use ppl (Neubig, 2017), response length, and distinct-n (Li et al., 2016b) to evaluate the diversity of generated responses. We also use BLEU (Papineni et al., 2002) to evaluate the degree of word overlap between generated responses and the ground truth. Moreover, we use Embedding Average (Average) (Liu et al., 2016) to evaluate the semantic relationship between generated responses and ground-truth responses. Finally, we introduce the coherence metric (Xu et al., 2018b) to assess the coherence between contexts and generated responses.
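For reference, distinct-n (Li et al., 2016b) is commonly computed as the ratio of unique n-grams to total n-grams over all generated responses; a minimal sketch (our own implementation):

```python
def distinct_n(responses, n):
    """distinct-n: number of unique n-grams divided by the total number
    of n-grams across all generated responses (a diversity metric)."""
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

For example, the pair of responses "a b a b" and "a b" yields distinct-1 = 2/6 and distinct-2 = 2/4.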

Human evaluation
We conduct human evaluation to further evaluate our model and the baseline models. Following the work of Li et al. (2017a), we randomly extract 200 samples from the test sets of each of the two dialogue datasets. Each sample contains one context and the responses generated by the different models. Three annotators are invited to rank the generated responses with respect to three aspects: diversity, relevance, and fluency. Ties are allowed. Diversity indicates how much the generated response provides specific information, rather than generic and repeated information. Relevance means how likely the generated response is relevant to the context. Fluency specifies how likely the generated response is to have been produced by a human.

Baseline models
Our baseline models include a sequence-to-sequence (Seq2Seq) model, a CVAE model, and cluster-CVAE models. They are all implemented on top of a 2-layer GRU kgCVAE model. The cluster-CVAE models are kgCVAE variants that utilize clustering results as the knowledge. We employ three clustering methods: K-means (K), Spectral (S), and Agglomerative (A).

Training details
For a fair comparison among all models, we utilize 300-dimensional GloVe embeddings as the word embedding matrix. The numbers of hidden nodes are all set to 300. The parameter max_len is set to 25. We set the batch sizes to 64 and 32 for the DailyDialog and OpenSubtitles datasets, respectively. Adam is utilized for optimization, with init_lr set to 0.001. We train all models for 50 epochs on an RTX 2080Ti GPU card with TensorFlow, and save the generated responses when ppl reaches its minimum. Greedy search is used to generate responses for evaluation.
Results and Discussion

Automatic evaluation results
Table 2 and Table 3 report the automatic evaluation results of SepaCVAE and the baseline models on the validation and test data of both datasets, respectively. For the validation stage, we first select and save the positive group information (y+) for each context, and then generate responses under this y+. For the test data, where no ground-truth response is available to select the positive group information, we first generate N responses for each context using the N group information vectors, and then choose the most probable generated response by calculating the cosine score between each generated response and the context. Both the generated responses and the contexts are fed into SepaCVAE's encoder to obtain their vector representations.
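The test-time selection step can be sketched as follows (hypothetical function name; the vectors stand in for SepaCVAE's encoder outputs for the context and the N candidate responses):

```python
import numpy as np

def select_response(context_vec, response_vecs):
    """Pick the generated response whose encoder representation has the
    highest cosine similarity with the context representation."""
    c = context_vec / (np.linalg.norm(context_vec) + 1e-8)
    r = response_vecs / (np.linalg.norm(response_vecs, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(r @ c))
```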
Spectral and Agglomerative clustering methods do not work well on the large-scale dataset (i.e., OpenSubtitles), and the general CVAE model suffers from the vanishing latent variable problem when trained on it. Therefore, we omit the results of S-CVAE+BOW, A-CVAE+BOW, and CVAE from Table 2 and Table 3.

As shown in Table 2 and Table 3, the results on the large-scale dataset (OpenSubtitles) are clearer than those on the small dataset (DailyDialog); that is, the results on OpenSubtitles show an obvious pattern that verifies our hypothesis. On both the validation and test data of OpenSubtitles, CVAE and K-CVAE achieve better performance on the diversity metric (distinct) but worse performance on the relevance metrics (i.e., BLEU, Average, and coherence) than the Seq2Seq model. Moreover, our proposed SepaCVAE outperforms all baseline models on all metrics with statistical significance. However, the results obtained on the DailyDialog dataset do not show a clear pattern. On DailyDialog's validation data, SepaCVAE achieves good performance on diversity but unimpressive results on relevance; on the test data, SepaCVAE achieves good performance on relevance but generally poor results on diversity. We believe the reason for this is related to how prevalent the one-to-many phenomenon is in the dataset. For instance, only 66,260 contexts have multiple responses among the 90,149 contexts of OpenSubtitles once the cluster results are added. Moreover, one context has a maximum of 296 responses, which amounts to almost half of 623. Since the DailyDialog dataset is very small and contains few of the samples we focus on, its results do not show a specific tendency. In short, the evaluation results illustrate the effectiveness of SepaCVAE in improving the relevance and coherence of responses.

Table 4: Human evaluation results on test data of DailyDialog (up) and OpenSubtitles (down). The best score in each column is in bold. Note that "Ground-truth" is the true response.

Human evaluation results
The results of the human evaluation are shown in Table 4. To evaluate the consistency of the rankings assessed by the three annotators, we use Pearson's correlation coefficient. This coefficient is 0.22 on diversity, 0.63 on relevance, and 0.70 on fluency, with p-values below 0.001, which indicates high correlation and agreement. Consistent with the automatic evaluation results in Table 3, this result shows that our SepaCVAE significantly outperforms the baselines in terms of relevance and diversity. Excluding the ground-truth responses, our SepaCVAE achieves the best scores on the relevance and diversity metrics. The fluency result of SepaCVAE on the DailyDialog dataset is slightly worse than that of the baselines, mainly because the responses generated by SepaCVAE are almost twice as long as those of the baselines (see Table 3). On the OpenSubtitles dataset, where the response lengths are similar, SepaCVAE also achieves the best fluency score.

Effectiveness analysis
We further analyze the effectiveness of SepaCVAE in regularizing latent variables. For the contexts in the validation data of the DailyDialog dataset, we collect the generated responses and the sampled latent variables of both SepaCVAE and the baseline models over the first 2,500 batches. Then we calculate the average inner-group distance and the average inter-group distance for each context based on joint vector representations (concatenating the context vector and the latent variable). All distances are calculated as cosine scores, so a higher value indicates greater similarity. For each context, SepaCVAE outputs positive group information y+, which is used to determine whether other contexts are in the same group. For the standard CVAE, we instead use a threshold on the cosine score in place of group information; in this work, the threshold is set to 0.9. Finally, we take the averages of all contexts' inner-group and inter-group distances as the inner-dis. and inter-dis. of each batch, shown in Fig. 4. SepaCVAE achieves significantly higher inner-dis. than the baseline (standard CVAE) model, while the inter-dis. values are similar. Meanwhile, our method also obtains an average distance over all joint vectors similar to that of the standard CVAE.
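The inner-dis./inter-dis. computation can be sketched as follows (our own helper; the input vectors stand in for the concatenated context-and-latent joint representations, and group labels stand in for the y+ assignments):

```python
import numpy as np

def group_cosine_stats(vectors, labels):
    """Average pairwise cosine score within groups (inner-dis.) and
    across groups (inter-dis.); a higher score means greater similarity."""
    v = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8)
    sims = v @ v.T
    inner, inter = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            (inner if labels[i] == labels[j] else inter).append(sims[i, j])
    return float(np.mean(inner)), float(np.mean(inter))
```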
In addition, past studies conjecture that the posterior z sampled from the recognition network should cluster the responses into meaningful groups that correlate with the knowledge. Fig. 5 visualizes the posterior z of responses in the validation data of DailyDialog dataset in 2D space using t-SNE (van der Maaten and Hinton, 2008). We found that the learned latent space of our SepaCVAE is more correlated with the group information. These results demonstrate that SepaCVAE can effectively regularize latent variables.

Case study
We collect the generated responses for contexts in the validation and test sets that are similar to contexts in the training set, and show a sample in Table 5. The context in the training set has two contradictory responses. As we analyzed, the standard CVAE and CVAE+BOW generate irrelevant and incoherent responses for the similar contexts in the validation and test sets. In contrast, our SepaCVAE outputs "sure, it will be happy" and "sure. i go with my parents", which are more relevant and coherent than the responses generated by the baselines; the former is also similar to the true response 1 ("oh, that sounds great!"), which means that SepaCVAE is able to handle the one-to-many situation.

Conclusion
In this paper, we theoretically showed that latent variables hardly reflect the semantics of contexts due to the one-to-many and many-to-one phenomena of dialogues. For the standard CVAE model, these issues lead to irrelevant and incoherent responses during the validation and test stages, and also damage generalization performance. To address these problems, we proposed the SepaCVAE model. SepaCVAE has three main technical novelties: dialogue augmentation, gradient blocking, and relationship enhancement, which together enable the latent variables to reflect the semantic relationships between contexts. As demonstrated in the experimental results, SepaCVAE achieves the best performance on the large-scale dataset.

A The computation of the prior probability distribution through KL-divergence in the one-to-many situation

We assume that p(z|c_1, r_1) ~ N(µ_1, σ_1²), p(z|c_1, r_2) ~ N(µ_2, σ_2²), and p(z|c_1) ~ N(µ, σ²). Then we have:

KL(p(z|c_1, r_1) || p(z|c_1)) = ∫ p(z|c_1, r_1) log [ p(z|c_1, r_1) / p(z|c_1) ] dz.

Expanding the Gaussian densities, log(σ/σ_1) is a constant and ∫ p(z|c_1, r_1) dz = 1, so:

KL(p(z|c_1, r_1) || p(z|c_1)) = log(σ/σ_1) + ∫ p(z|c_1, r_1) [ (z − µ)²/(2σ²) − (z − µ_1)²/(2σ_1²) ] dz.

The remaining integrals are Gaussian moments; substituting x = (z − µ_1)/(√2 σ_1) and noting, by L'Hôpital's rule, that lim_{x→−∞} x e^{−x²} = lim_{x→+∞} x e^{−x²} = 0, we obtain:

KL(p(z|c_1, r_1) || p(z|c_1)) = log(σ/σ_1) + (σ_1² + (µ_1 − µ)²)/(2σ²) − 1/2.
Since the latent variable vanishing problem is not desired by the VAE and CVAE methods, p(z|c_1, r_1) should be different from p(z|c_1, r_2), which means that N(µ_1, σ_1²) is different from N(µ_2, σ_2²).
Minimizing the sum of the two KL terms and replacing µ with µ*, we obtain µ* = (µ_1 + µ_2)/2 and σ*² = (σ_1² + σ_2²)/2 + (µ_1 − µ_2)²/4. Using a constant C to denote (µ_1 − µ_2)²/4, σ*² equals (σ_1² + σ_2²)/2 + C. The value µ* = (µ_1 + µ_2)/2 means that the latent variables sampled from this prior probability distribution easily tend to differ from the latent variables sampled from the posterior probability distributions. Since the latent variables are highly correlated with the generated responses, the responses generated through the prior probability distribution will differ from those generated from the posterior probability distributions. If the difference between µ_1 and µ_2 is very large, σ* is large too, resulting in a high probability of even more irrelevant latent variables.

B The implementation of gradient blocking
We present the implementation of the gradient blocking method in Algorithm 2. In Algorithm 2, we build a mask tensor Loss_Mask to filter the loss results from each batch of data, which likewise blocks gradient back-propagation through the discarded terms. Since we use gradient descent to optimize the neural model, the smallest loss corresponds to the largest variational lower bound. The elements of Loss_Mask are 0 or 1, so Loss * Loss_Mask can be seen as a selection over the existing losses.
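The Loss * Loss_Mask selection can be sketched as follows (our own minimal version; losses has shape (batch, N), one entry per augmented copy of each dialogue pair):

```python
import numpy as np

def gradient_blocking(losses):
    """Build a 0/1 Loss_Mask that keeps, for each original dialogue pair,
    only the smallest of its N augmented losses (= the largest variational
    lower bound). Only the selected losses contribute to the gradient."""
    mask = np.zeros_like(losses)
    mask[np.arange(losses.shape[0]), np.argmin(losses, axis=1)] = 1.0
    return losses * mask, mask
```

In a framework like TensorFlow, multiplying the per-sample losses by this mask before reduction zeroes both the discarded losses and their gradients, matching the selection described above.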