Modeling Complex Dialogue Mappings via Sentence Semantic Segmentation Guided Conditional Variational Auto-Encoder

Complex dialogue mappings (CDM), including one-to-many and many-to-one mappings, tend to make dialogue models generate incoherent or dull responses, and modeling these mappings remains a major challenge for neural dialogue systems. To alleviate these problems, methods such as introducing external information, reconstructing the optimization function, and manipulating data samples have been proposed. However, they primarily focus on avoiding training with CDM, which inevitably weakens the model's ability to understand CDM in human conversations and limits further improvements in model performance. This paper proposes a Sentence Semantic \textbf{Seg}mentation guided \textbf{C}onditional \textbf{V}ariational \textbf{A}uto-\textbf{E}ncoder (SegCVAE) method that can model and take advantage of CDM data. Specifically, to tackle the incoherence problem caused by one-to-many mappings, SegCVAE uses response-related prominent semantics to constrain the latent variable. To mitigate the non-diversity problem brought by many-to-one mappings, SegCVAE segments multiple prominent semantics to enrich the latent variables. Three novel components, Internal Separation, External Guidance, and Semantic Norms, are proposed to realize SegCVAE. On dialogue generation tasks, both automatic and human evaluation results show that SegCVAE achieves new state-of-the-art performance.


Introduction
In open-domain conversations, complex dialogue mappings (CDM) between contexts and responses commonly exist in real-world data, which brings considerable modeling challenges for neural dialogue models (Csaky et al., 2019; Sun et al., 2021): one-to-many mappings can cause models to generate incoherent responses, while many-to-one mappings make the model produce non-diverse responses. For example, the CornellMovie (Danescu-Niculescu-Mizil and Lee, 2011) and Opensubtitles (Lison and Tiedemann, 2016) dialogue datasets contain 10.29% (4.18% + 6.11%) and 9.10% (4.79% + 4.31%) CDM data (one-to-many + many-to-one mappings), respectively. Many existing efforts try to identify CDM and avoid training on them to facilitate dialogue learning. Luong et al. and Li et al. introduce external information to detach one-to-many pairs into one-to-one pairs, thus reducing the difficulty of model training. Some works reconstruct the optimization functions, allowing the model to learn from self-generated qualified responses instead of the ground truth, thereby avoiding direct training on many-to-one pairs (Li et al., 2016c; Zhang et al., 2018b; Liu et al., 2020). Others train the model on filtered corpora, which usually contain few one-to-many and many-to-one dialogue pairs (Xu et al., 2018b; Csaky et al., 2019; Akama et al., 2020). For instance, Csaky et al. (2019) reported that a dialogue model improves when high-entropy dialogue pairs (i.e., CDM) are filtered out of training, which is consistent with our preliminary experiments in Table 1.
Table 1 shows the comparison results of the same Seq2Seq dialogue model trained with and without CDM. We can observe that the Seq2Seq trained without CDM improves BLEU (Papineni et al., 2002), Emb.Aver. (Liu et al., 2016), and Coherence (Xu et al., 2018c) but reduces Distinct (Li et al., 2016a) (metrics detailed in Appendix A.1). Moreover, the gains on BLEU are large, but the gains on Emb.Aver. and Coherence are small. This result supports the idea that reducing the CDM of the dataset is beneficial for increasing the scores of some automatic evaluation metrics. However, these methods simply discard the CDM data (10% of the dataset), and in this paper we argue that these CDM dialogue pairs are still valuable for dialogue training. To explore this, we conduct a further experimental investigation by training two Sequence-to-Sequence dialogue models (Seq2Seq) (Shang et al., 2015) on a "clean" Opensubtitles dataset that does not contain any one-to-many or many-to-one pairs, and then gradually introduce one-to-many/many-to-one pairs to fine-tune these models.
From Figure 1, we observe that one-to-many and many-to-one dialogue pairs have conflicting effects on Distinct, Emb.Aver., and Coherence, which explains why simply removing them together yields smaller gains. Therefore, instead of staying away from CDM, our primary interest is to enable the model to effectively learn useful knowledge from these dialogue pairs while avoiding their disadvantages.
To achieve this goal, we take inspiration from Conditional Variational Auto-Encoder (CVAE) based dialogue generation methods (Shen et al., 2017; Zhao et al., 2017; Chen et al., 2018; Gao et al., 2019a; Sun et al., 2021) and model the many-to-one and one-to-many mappings from the latent space. However, previous study shows that, due to the lack of prior knowledge, the latent variable hardly captures semantic relationships, resulting in semantically irrelevant responses (Sun et al., 2021). Therefore, we propose a Sentence Semantic Segmentation guided CVAE (SegCVAE), which uses sentence semantic segmentation to constrain the latent variable and thus models the CDM naturally.
The complex and ambiguous context semantics can be reduced by segmenting it into multiple different sub-semantics, so that each sub-semantics may focus on a different perspective of the context. We refer to these sub-semantics as prominent semantics, which explain CDM naturally (see Figure 2): when the semantics of a context is segmented into multiple prominent semantics, each of them corresponds to a response (i.e., one-to-many mapping); conversely, when similar prominent semantics are segmented from different context semantics, the same prominent semantics can correspond to the same response (i.e., many-to-one mapping).

Related Work
Open-domain dialogue generation has received considerable attention recently (Sutskever et al., 2014; Shang et al., 2015; Sordoni et al., 2015). Sutskever et al. (2014) identified that "noisy" data, including one-to-many and many-to-one dialogue pairs, can affect the performance of dialogue systems. To address such "noisy" data, many methods have been proposed in recent years. For instance, a large body of work introduces external information to reduce the amount of noisy data (Luong et al., 2015; Li et al., 2016b; Serban et al., 2016; Zhao et al., 2017; Huber et al., 2018; Ghazvininejad et al., 2018; Tao et al., 2018; Chen et al., 2018; Feng et al., 2020b), and a rich line of work reconstructs the objective function to avoid training models directly on such noisy data (Li et al., 2016c; Xu et al., 2017; Zhang et al., 2018a; Xu et al., 2018a; Zhang et al., 2018b; Feng et al., 2020a; Liu et al., 2020; He and Glass, 2020; Mi et al., 2022; Sun et al., 2022; Li et al., 2022a). Others design scoring approaches to filter noisy data (Xu et al., 2018b; Csaky et al., 2019; Akama et al., 2020; Li et al., 2022b). However, CDM data in human conversations carries valuable information that can help models generate better responses, and these methods can neither learn the valuable information of one-to-many and many-to-one dialogue pairs nor make full use of the advantages of such data. For example, Li et al. (2016b) use personal information to reduce the number of one-to-many dialogue pairs. Reinforcement-learning-based dialogue generation methods (Li et al., 2016c; Zhang et al., 2018a) only require the generated response to obtain a high reward rather than be similar to the ground truth, which means that some many-to-one dialogue pairs are ignored during training. Csaky et al. (2019) use conditional entropy to assess dialogue pairs, which readily filters out one-to-many and many-to-one dialogue pairs.
In addition to the methods above, CVAE-based dialogue generation methods (Shen et al., 2017; Zhao et al., 2017; Chen et al., 2018; Gao et al., 2019a; Wang et al., 2019; Sun et al., 2021) provide a way to learn the essential knowledge of the one-to-many and many-to-one mappings. They encode knowledge into a latent space with a posterior probability distribution and a prior probability distribution. By sampling latent variables, the model can easily generate multiple responses for one context. We follow this rich line of work to explore its applicability in modeling CDM, and we propose the new state-of-the-art SegCVAE for the dialogue generation task. Compared with the vanilla CVAE, SegCVAE uses sentence semantic segmentation to regularize and guide the latent variables, which closes the gap between context and latent variables. Unlike knowledge-guided CVAE, SegCVAE does not require additional information. Meanwhile, SegCVAE uses the segmented prominent semantics instead of manually-created orthogonal vectors, which is more reasonable than SepaCVAE.

SegCVAE
SegCVAE is proposed to model CDM (including one-to-many and many-to-one mappings) through sentence-semantic-segmentation-guided latent variables. As discussed above, different prominent semantics can be segmented from one context semantics, and similar prominent semantics can be segmented from different context semantics, which helps the latent variables learn the semantic relations and thus model one-to-many and many-to-one mappings naturally. In this section, we provide detailed descriptions of the proposed SegCVAE method.

Overview
SegCVAE uses multiple prominent semantics (x_1, x_2, x_3, ...) to learn the probability distribution over responses with latent variables, where x_i denotes the representation of one prominent semantics. To train SegCVAE, we derive the Stochastic Gradient Variational Bayes framework (Kingma and Welling, 2014; Sohn et al., 2015; Yan et al., 2016) with the gradient blocking trick (Sun et al., 2021):

L(r, x_i) = \mathbb{E}_{q_\phi(z|r_e, x_i)}[\log p_\Omega(r|z, x_i)] - \mathrm{KL}(q_\phi(z|r_e, x_i) \,\|\, p_\theta(z|x_i)), \quad (1)

where q_\phi(z|r_e, x_i) and p_\theta(z|x_i) are the recognition network and the prior network used for sampling the latent variable z, respectively. Here r_e = enc(r) is the semantic vector computed by the model's encoder enc from the response r, and p_\Omega denotes the model's decoder, which generates output tokens based on the conditional probability p_\Omega(r|z, x_i). Following the gradient blocking trick, x^+ \in (x_1, x_2, x_3, ...) denotes the prominent semantics vector that makes the variational lower bound largest, and only L(r, x^+) is used to optimize the model.
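As an illustrative sketch of this objective, the following NumPy snippet computes toy per-semantics lower bounds (a reconstruction term minus a closed-form Gaussian KL) and applies the gradient blocking trick by selecting the prominent semantics with the largest bound. All names and numbers here are ours for illustration, not the paper's implementation.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over dimensions
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def select_x_plus(elbos):
    # gradient blocking trick: only the prominent semantics x+ whose
    # variational lower bound is largest is used to optimize the model
    return int(np.argmax(elbos))

# toy bounds for M = 3 prominent semantics:
# ELBO_i = E[log p(r|z, x_i)] - KL(q(z|r_e, x_i) || p(z|x_i))
recon = np.array([-12.0, -9.5, -11.0])  # hypothetical reconstruction terms
kls = np.array([
    gaussian_kl(np.zeros(4), np.zeros(4), np.zeros(4), np.zeros(4)),
    gaussian_kl(np.ones(4), np.zeros(4), np.zeros(4), np.zeros(4)),
    gaussian_kl(2 * np.ones(4), np.zeros(4), np.zeros(4), np.zeros(4)),
])
elbos = recon - kls
print(select_x_plus(elbos))  # prints 1
```

Note how the KL term penalizes posteriors far from the prior, so the best reconstruction does not automatically win.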
To obtain the prominent semantics (x_1, x_2, ...), SegCVAE employs the INTERNAL SEPARATION (IS) and EXTERNAL GUIDANCE (EG). To further capture the relationship among context, prominent semantics, and response, we propose three novel semantic norms: SEMANTIC ALIENATION NORM, SEMANTIC CENTRALIZATION NORM, and SEMANTIC DISTILLATION NORM.

Internal Separation
The IS processes sentences through multiple triggers and extracts multiple sets of different words, which are used to compute different prominent semantics. Each trigger consists of a convolution network Conv and a dense network Dense. The input of a trigger is an embedded matrix representation C of a context with shape (max_clen, N), where max_clen is the maximum context length that can be received and N is the dimension of the word embedding. C is processed by Conv, whose kernel K and stride S are (m, N, 1, chan) and (1, 1, 1, 1), respectively, where chan is the number of channels of the convolution operation and (m, N) is the shape of the convolution kernel.
After that, we get the semantic features F_c. We squeeze and transpose F_c from (max_clen − m + 1, 1, chan) to (chan, max_clen − m + 1) and feed it into Dense, whose weight W has shape (max_clen − m + 1, max_clen). We apply the SoftMax function to the last dimension of the input (F_c · W).
Hence, the shape of F_d is (chan, max_clen), which represents the attention probability of each context word in the different channels. Then, we select the word with the highest probability in each channel, which is processed by the encoder enc to extract certain semantic information. However, this discrete selection hampers the optimization of the model. To ensure gradient back-propagation, we introduce Gumbel SoftMax (GS; Jang et al. (2017)) to replace the SoftMax (Eq. 4) and the selection process:

GS(input_{ij}) = \frac{\exp((input_{ij} + g_{ij}) / \tau)}{\sum_{k} \exp((input_{ik} + g_{ik}) / \tau)}, \quad g_{ij} \sim \mathrm{Gumbel}(0, 1),

where input_{ij} \in Input and \tau is the temperature parameter. We keep \tau as small as possible, so that the output of GS is as close as possible to the result of argmax(F_d). Thereby, we obtain the embedded matrix representation of the extracted words with shape (chan, N). Finally, we randomly initialize M trigger networks in IS to extract M embedded matrix representations (C^1_IS, C^2_IS, ..., C^M_IS) of different word combinations from a context.
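A minimal NumPy sketch of this differentiable word selection, assuming the standard Gumbel SoftMax formulation; the shapes and values below are illustrative, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.1):
    # sample Gumbel(0, 1) noise and apply a temperature-scaled softmax;
    # as tau -> 0 the output approaches a one-hot argmax over the logits
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical trigger output F_d: chan = 4 channels over max_clen = 6 positions
F_d = rng.normal(size=(4, 6))
probs = gumbel_softmax(F_d, tau=0.05)       # (chan, max_clen), near one-hot rows
emb = rng.normal(size=(6, 8))               # context word embeddings, N = 8
C_is = probs @ emb                          # (chan, N): soft word selection
print(C_is.shape)                           # prints (4, 8)
```

Because the selection stays a (near one-hot) soft mixture, gradients can flow back through `probs` to the trigger parameters.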

External Guidance
The EG is responsible for extracting instructive information from outside the sentence (i.e., the vocabulary) according to the context semantics. To achieve this goal, we change the hyper-parameters of the dense network in the trigger defined in the previous section. The new weight matrix of the dense network in EG is W′, whose shape is changed from (max_clen − m + 1, max_clen) to (max_clen − m + 1, vocab_size), where vocab_size is the size of the vocabulary. Hence, the results of the dense network denote the attention probability of each vocabulary word in the different channels. Therefore, the output of EG is a matrix representation V_EG of chan vocabulary words related to the semantics of the input:

V_{EG} = GS(F_c \cdot W') \cdot W_{emb},

where W_emb is the word-embedding matrix whose shape is (vocab_size, N). Finally, we can also randomly initialize M trigger networks in EG to extract M matrix representations. The C_IS and the V_EG are then used together to calculate multiple different prominent semantics of a context:

x_i = enc([C^i_{IS}; V^i_{EG}]),

where enc denotes the model's encoder and x_i represents the i-th prominent semantics.
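The vocabulary-side projection can be sketched as follows; we use a hard argmax one-hot in place of Gumbel SoftMax for brevity, and all sizes and names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

chan, L1, vocab_size, N = 4, 6, 100, 8
F_c = rng.normal(size=(chan, L1))               # squeezed conv features
W_prime = rng.normal(size=(L1, vocab_size))     # EG dense weight over the vocabulary
W_emb = rng.normal(size=(vocab_size, N))        # word-embedding matrix

logits = F_c @ W_prime                          # (chan, vocab_size) word attention
onehot = np.eye(vocab_size)[logits.argmax(-1)]  # hard stand-in for Gumbel SoftMax
V_eg = onehot @ W_emb                           # (chan, N) guiding word embeddings
print(V_eg.shape)                               # prints (4, 8)
```

The only structural change relative to IS is the output dimension of the dense layer: attention is now placed over the whole vocabulary rather than the context positions.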

Semantic Norms
We adopt self-supervised learning and propose the SEMANTIC ALIENATION NORM (L_san), SEMANTIC CENTRALIZATION NORM (L_scn), and SEMANTIC DISTILLATION NORM (L_sdn) to constrain the relations among the context, prominent semantics, and response. L_san and L_scn promote the multiple prominent semantics to be closely connected with the context while maintaining their own independence, which benefits the diversity and coherence of generated responses. L_sdn facilitates the construction of semantic relations among the prominent semantics.

Semantic Alienation Norm
We first propose L_san to make each prominent semantics as different as possible from the other prominent semantics, which is computed by:

L_{san} = \| \mathrm{SoftMax}(X X^\top) - I \|,

where the SoftMax function handles the last dimension of its input, X is the matrix of prominent semantics with shape (M × N), I is an identity matrix with shape (M × M), and x_i is the i-th prominent semantics vector calculated by enc. X X^⊤ represents the correlation between each prominent semantic vector and the other prominent semantic vectors. Figure 3 shows a schematic of the SEMANTIC ALIENATION NORM.
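A small NumPy sketch of this norm; we assume a Frobenius-style distance to the identity, and the toy matrices below are ours:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def l_san(X):
    # push softmax(X X^T) toward the identity: each prominent semantics
    # should correlate mostly with itself, alienating it from the others
    M = X.shape[0]
    return np.linalg.norm(softmax(X @ X.T) - np.eye(M))

X_similar = np.tile(np.ones(4), (3, 1))   # three identical semantics: high penalty
X_distinct = np.eye(3, 4) * 10.0          # three near-orthogonal semantics: low penalty
print(l_san(X_similar) > l_san(X_distinct))  # prints True
```

Identical rows give a uniform correlation matrix far from the identity, while orthogonal rows give a near-identity correlation matrix, so minimizing this norm drives the M semantics apart.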

Semantic Centralization Norm
Then we propose L_scn to ensure that the ensemble result \sum_{i}^{M} x_i of the prominent semantic vectors (x_1, x_2, ..., x_M) is similar to the semantics of the original context, which is shown in Figure 4:

L_{scn} = \left\| \sum_{i=1}^{M} x_i - enc(C) \right\|,

where enc(C) represents the vector representation of the original semantics and C is the representation of the original context.
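A minimal sketch of the centralization norm, assuming the gap is measured with an L2 norm and the ensemble is a plain sum (both are our assumptions):

```python
import numpy as np

def l_scn(X, ctx):
    # keep the ensemble (sum) of the M prominent semantics close to the
    # full context semantics enc(C), so segmentation loses no meaning
    return np.linalg.norm(X.sum(axis=0) - ctx)

ctx = np.array([1.0, 2.0, 3.0])                       # stand-in for enc(C)
X = np.array([[0.5, 1.0, 1.5], [0.5, 1.0, 1.5]])      # two halves of the context
print(l_scn(X, ctx))                                  # prints 0.0
```

When the prominent semantics exactly partition the context semantics the penalty is zero; any drift away from the original meaning is penalized.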

Semantic Distillation Norm
Finally, we propose L_sdn, which uses the relationships among the ground-truth responses to teach the model the semantic relations of the prominent semantics. That is, with L_sdn, the connections between prominent semantics and ground-truth responses can be further established, which can improve the consistency of response generation and the underlying meaning of the prominent semantics. In addition, since the representation is produced by enc, L_sdn further adjusts its semantic representation capability. The schematic of the SEMANTIC DISTILLATION NORM is shown in Figure 5, and L_sdn is defined as:

L_{sdn} = \mathrm{KL}\big(\mathrm{SoftMax}(R_{gt} R_{gt}^\top) \,\|\, \mathrm{SoftMax}(R^+_{gen} (R^+_{gen})^\top)\big),

where R_gt, with shape (B × N), is the semantic matrix (vector representations) of a batch of B ground-truth responses obtained by the model's encoder enc, and R^+_gen is the concatenated result of the vector representations of the B generated responses, which are obtained through the positive prominent semantics x^+. Note that the SoftMax function again handles the last dimension of its input matrix.
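Following the "Self Dot → SoftMax → KL" description in Figure 5, a NumPy sketch of this distillation norm could look like the following; the exact reduction over the batch is our assumption:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def l_sdn(R_gt, R_gen):
    # compare the "self dot" correlation structure of ground-truth responses
    # with that of responses generated from the positive prominent semantics
    P = softmax(R_gt @ R_gt.T)     # (B, B) target correlations
    Q = softmax(R_gen @ R_gen.T)   # (B, B) generated correlations
    return np.sum(P * (np.log(P) - np.log(Q)))  # KL(P || Q), summed over rows

rng = np.random.default_rng(2)
R = rng.normal(size=(4, 8))        # B = 4 responses, N = 8 dims
print(round(l_sdn(R, R), 6))       # identical structures -> prints 0.0
```

The model is thus trained to reproduce the pairwise relations among ground-truth responses, not the responses themselves, which is what lets the relation transfer to the prominent semantics.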

Objective Function
The final objective function for training our model is to maximize:

L_{total} = L(r, x^+) - \lambda \,(L_{san} + L_{scn} + L_{sdn}),

where L(r, x^+) is shown in Eq. (1), and λ increases linearly from 0 to 1 over the first snorm_step batches.
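The linear warm-up of λ over the first snorm_step batches can be sketched directly (the function name is ours):

```python
def snorm_weight(step, snorm_step=20000):
    # lambda increases linearly from 0 to 1 over the first snorm_step batches,
    # then stays at 1 so the semantic norms apply at full strength
    return min(1.0, step / snorm_step)

print(snorm_weight(0), snorm_weight(10000), snorm_weight(50000))  # 0.0 0.5 1.0
```

This gives the ELBO a head start before the semantic norms begin shaping the prominent semantics.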

Data Setting
Two well-established open-domain dialogue datasets are used for our experiments: CornellMovie and Opensubtitles. We use the processed version of Opensubtitles released by Sun et al. (2021), which has 5M, 100K, and 50K single-turn dialogue pairs in the training, validation, and test sets, respectively. We follow the same process for CornellMovie and obtain 51,108, 6,358, and 6,249 single-turn dialogue pairs for training, validation, and testing.

Baseline Models
We compare our model with state-of-the-art dialogue models: a GRU-based Seq2Seq (Shang et al., 2015; Sordoni et al., 2015)

Evaluation Metrics and Training Details
In addition to Distinct-n, BLEU, Emb.Aver., and Coherence, we also use Perplexity (ppl) (Neubig, 2017) and Length (Csaky et al., 2019) to evaluate the performance of all models. For human evaluation, we hired three annotators to rank all models based on their generated responses. Please see Appendix A for more details on the experimental settings.

Automatic Evaluation Results
Table 2 reports the automatic results on the test data of CornellMovie and Opensubtitles. These results show that our SegCVAE achieves better performance on most metrics. Specifically, SegCVAE achieves the best Length, BLEU, Emb.Aver., and Coherence scores on both datasets, which demonstrates the superior performance of our model in generating coherent and related responses. In addition, SegCVAE obtains competitive ppl and Distinct results. Generally speaking, the Distinct metric is easily affected by the length of the generated responses. Therefore, as SegCVAE generates the longest responses, the proportion of repeated words increases, resulting in a decrease in the Distinct score. In a nutshell, these results show the ability of SegCVAE to handle the general dialogue generation task.

Human Evaluation Results
The results of the human evaluation are shown in Table 3. When the response lengths are similar on Opensubtitles, SegCVAE also achieves the best fluency score.

Ablation Study
Table 4 reports the results of the ablation study.
It can be seen from the table that after removing IS, L_scn, and L_sdn, respectively, all the results decrease. The results decrease the most after removing IS, indicating that IS plays the most important role in model performance. In addition, we found that after removing EG, the Diversity of the model increases, but Emb.Aver. and Coherence decrease. This is because EG is mainly responsible for regulating the prominent semantics so that they do not deviate from the original semantics. Therefore, after removing EG, the prominent semantics obtained by IS lack constraints and can become more diverse, but their connection with the context is weaker. Similarly, L_san is used to make the multiple segmented prominent semantics different from each other, so removing L_san reduces Diversity and increases Emb.Aver. and Coherence.

Case Study
We use the prominent semantics to guide the generation of responses, which requires SegCVAE to learn the relations among the contexts, the prominent semantics, and the responses. To illustrate these connections, we report three samples and their related words extracted by EG and IS in Table 5. Note that the words extracted by EG and IS are used to calculate the prominent semantics through the encoder.
In Table 5, we notice that the output of EG is difficult to relate to the response. We attribute this to the poor interpretability of neural models and the lack of annotations. Note that EG is trained by self-supervised learning without any explicit knowledge annotations. Therefore, it learns to minimize the designed loss, which may produce results or intermediate features that are unrecognizable to humans. We speculate that introducing annotations or knowledge consistent with human cognition would help the model produce more interpretable results and better performance. We consider this an important direction for future work.
We also collect the generated responses and show them in Appendix B.

Effectiveness Analysis
To further study the effectiveness of CDM, we conduct experiments on these mappings.

Data and Tasks
We collect two particular datasets (named O2M and M2O) from Opensubtitles, and define two new tasks (the one-to-many and many-to-one dialogue learning tasks) to analyse the ability of generative dialogue models to handle CDM.

Evaluation Settings Unlike the previous settings, we adopt a new human evaluation strategy. First, each model received 50 contexts randomly extracted from O2M and M2O, respectively, and generated 400 responses. Then, three annotators were invited to rank all models with respect to the "Suitability" and "Erudition" of their responses; ties are allowed. Suitability indicates how many diverse and relevant responses a model generates. Erudition specifies whether the multiple generated responses have the same semantics as the ground-truth responses. We design Suitability to validate whether the model can learn diversity and relevance from CDM samples, and we use Erudition to assess whether the semantic information of multiple ground truths is reflected in the multiple responses generated by the model. In contrast to the baselines, our SegCVAE generates multiple responses corresponding to multiple prominent semantics, which easily captures the semantics of multiple responses in the O2M dataset and achieves the best Erudition on O2M. However, due to the trade-off between diversity and relevance, the Erudition of SegCVAE on the M2O dataset is somewhat poorer. We also use Pearson's correlation coefficient to evaluate the consistency of the ranking results. The coefficient is 0.64 on Suitability and 0.51 on Erudition, with p-values below 0.0001 and 0.001, respectively, which indicates high correlation.

Conclusion
This paper proposes a novel SegCVAE to model complex dialogue mappings (CDM) in human conversations. SegCVAE handles CDM from a semantic perspective: using multiple prominent semantics segmented from the context to establish relationships with the responses, multiple prominent semantics can correspond to multiple responses, and multiple contexts can also yield similar prominent semantics. In this way, prominent semantics can constrain latent variables to learn semantic relations and tackle the incoherence problem, while enriching them to mitigate the non-diversity problem. To realize SegCVAE, we propose three novel modules: Internal Separation (IS), External Guidance (EG), and the Semantic Norms (i.e., L_san, L_scn, and L_sdn). IS obtains the basic information for computing prominent semantics, EG constrains the prominent semantics so that they do not deviate too far from the original semantics, and the three Semantic Norms establish relationships among contexts, prominent semantics, and responses. The experimental results show the superiority of our model in the dialogue generation, one-to-many, and many-to-one dialogue learning tasks.

Limitations
The limitations of our paper are as follows:
• The SegCVAE model is proposed to model the severe complex dialogue mapping (i.e., one-to-many and many-to-one) phenomena in the open-domain dialogue generation task. Therefore, SegCVAE is suited to generative tasks where non-one-to-one mappings exist in the dataset. If the task does not require modeling non-one-to-one mappings, our model has little advantage.
• The hyper-parameters (e.g., the number of extracted words chan, the number of triggers M, and so on) need to be determined through multiple experiments and cannot be set adaptively. These initial promising results on segmenting contexts into multiple prominent semantics for modeling complex dialogue mappings will hopefully lead to future work in this interesting direction.
• We provide further analysis on the One-to-Many and Many-to-One dialogue learning tasks, and propose a new human evaluation strategy to directly validate the performance of models on processing non-one-to-one dialogue samples. However, we do not provide automatic evaluation results for modeling one-to-many and many-to-one mappings. This is primarily because there are no publicly recognized metrics that directly evaluate the performance of modeling one-to-many and many-to-one dialogue mappings. It is also difficult to design such automatic metrics due to the lack of supervised information. Automatically evaluating a generative dialogue model's ability to model complex mappings is a challenging problem, and we leave it for future work.

M are set to 3, 3 and 8, respectively. We set the batch sizes to 64 and 32 for CornellMovie and Opensubtitles, respectively. Adam is utilized for optimization. The initial learning rate is set to 0.001. The snorm_step is set to 20000 for CornellMovie, but for Opensubtitles, λ is constant at 1.0. We also introduce the KL annealing trick to adjust the KL divergence during training. The KL weight increases linearly from 0 to 1 in the first 10000 batches. We train all models for 50 epochs on an RTX 2080Ti GPU with TensorFlow, and save the generated responses when the ppl reaches its minimum. The random seed is set to 123456. Greedy search is used to generate responses for evaluation.

B Case Study
We collected the generated responses from the test set of CornellMovie and show them in Table 7. In the first example, we find that SegCVAE gives the response "they're throwing the company." considering "they lying" in the context. Compared with the responses generated by the other models, the response of SegCVAE is more specific and more relevant to the context. As for the second sample, only the Seq2Seq generates a general and short reply, "I don't know."; the others all generate diverse responses. However, considering the coherence between the generated responses and the context, our model is more advantageous. This result shows the superiority of SegCVAE in understanding the dialogue context and generating diverse responses.

C Further Analysis on One-to-Many and Many-to-One Dialogue Learning

C.1 Data Settings
We extract two particular datasets from the raw Opensubtitles, One-to-Many and Many-to-One, for the one-to-many and many-to-one dialogue learning tasks, respectively. To build these two datasets, we first extract single-turn dialogues from Opensubtitles: T − 1 single-turn dialogues [(u_1, u_2), (u_2, u_3), ..., (u_{T−1}, u_T)] can be extracted from one multi-turn dialogue (u_1, u_2, ..., u_T), where u represents an utterance. Then, we selected and collected a large set of one-to-many dialogue pairs as the One-to-Many (O2M) dataset, and another large set of many-to-one dialogue pairs as the Many-to-One (M2O) dataset. Finally, we use the token list of GloVe (Pennington et al., 2014) to filter the O2M and M2O datasets. For each dialogue pair (context c_i, response r_i), we first obtain its tokens after word segmentation, and then judge whether its tokens are all contained in GloVe's token list. If GloVe does not contain some token of (c_i, r_i), we drop all dialogue pairs containing c_i or r_i from the dataset. Such one-to-many pairs take the form (c, r_1), (c, r_2), ..., (c, r_n). Let D_{1n} represent the dataset that only contains such one-to-many dialogue pairs. This task requires a dialogue generation model to learn the one-to-many knowledge and to generate multiple coherent and informative responses for every context sentence.
Many-to-One Dialogue Learning Task Correspondingly, let cs = c_1, c_2, ..., c_n denote the contexts and r a response to each of them. We use D_{n1} to represent a dataset that only contains the many-to-one dialogue pairs (c_1, r), (c_2, r), ..., (c_n, r). This task requires the dialogue generation model to learn the many-to-one knowledge, to distinguish which contexts can take the same response, and then to increase the diversity while keeping the coherence of the generated responses.
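Constructing D_1n and D_n1 amounts to grouping dialogue pairs by shared context or shared response. A minimal Python sketch on toy data, with a helper name of our choosing:

```python
from collections import defaultdict

def split_cdm(pairs):
    # group (context, response) pairs to collect one-to-many (same context,
    # several responses) and many-to-one (several contexts, same response)
    by_ctx, by_resp = defaultdict(set), defaultdict(set)
    for c, r in pairs:
        by_ctx[c].add(r)
        by_resp[r].add(c)
    o2m = [(c, r) for c, r in pairs if len(by_ctx[c]) > 1]
    m2o = [(c, r) for c, r in pairs if len(by_resp[r]) > 1]
    return o2m, m2o

pairs = [("hi", "hello"), ("hi", "hey there"), ("how are you", "fine"),
         ("and you", "fine")]
o2m, m2o = split_cdm(pairs)
print(len(o2m), len(m2o))   # prints 2 2
```

In the toy data, "hi" maps to two responses (one-to-many) and "fine" answers two contexts (many-to-one), so each subset keeps both of its pairs.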
In our experiments, all models are trained on D_{1n} or D_{n1} to accomplish the One-to-Many or Many-to-One Dialogue Learning Task. The training and validation procedures are the same as for the general dialogue generation task. In the inference stage, every model generates N responses for each context in the test set of D_{1n} or D_{n1}; N is set to 8 in this paper.

C.3 Case Study
We collected the generated responses for contexts in the test set of the O2M dataset and show a sample in Table 9. We can observe that SegCVAE generates "Calm down." and "No-no,", which correspond to "Relax!" and "Stop" in the true responses. This result illustrates that SegCVAE can effectively build relations between the multiple prominent semantics and the multiple responses.

Figure 1: Four metrics of Seq2Seq models fine-tuned with an increasing number of one-to-many and many-to-one dialogue pairs.

Figure 2: The schematic of CDM and our primary idea for modeling CDM. (a) Multiple responses in a one-to-many mapping can disrupt a model's ability to address the dialogue context. We associate different responses with different segmented prominent semantics, so as to avoid the interference of multiple responses and to enhance coherence. (b) The response in a many-to-one mapping has a high proportion in the dataset, which deceives models into increasing its generation probability, reducing the diversity of generated responses. We promote the same prominent semantics to be associated with the same response, thus extending the response space to enhance diversity.

Figure 3: A schematic of the SEMANTIC ALIENATION NORM. Note that a "push arrow" indicates that the semantic similarity between the prominent semantics at its two ends is decreased.

Figure 4:

Figure 5: A schematic of the SEMANTIC DISTILLATION NORM. Note that the "Self Dot" operation takes the inner product of each generated or ground-truth response representation with itself and the other representations, and then applies SoftMax to obtain the correlation between each representation and all representations. KL denotes the KL divergence.

Table 1: Preliminary experiments of Seq2Seq models trained with and without CDM on CornellMovie (up) and Opensubtitles (down).
To achieve this goal, we propose INTERNAL SEPARATION (IS) and EXTERNAL GUIDANCE (EG) to model the prominent semantics together. The IS extracts multiple different words from the context to obtain the prominent semantics. The EG extracts instructive words from the vocabulary to constrain the prominent semantics so that they do not stray far from the original semantics. Furthermore, to make the prominent semantics capture the relationship with responses and latent variables, we propose the SEMANTIC NORMS.

Table 3: Human evaluation results on test data. The best score in each column is in bold.
Table 3 (refer to Appendix A.2 for detailed setups). To evaluate the consistency of the ranking results assessed by the three annotators, we use Pearson's correlation coefficient. This coefficient is 0.80 on Diversity, 0.62 on Relevance, and 0.77 on Fluency, with p-values below 0.0001 and 0.001, which indicates high correlation and agreement.

Table 4: Ablation results on test data of Opensubtitles. The best score in each column is in bold.

Table 5: Generated responses and their corresponding keyword combinations of SegCVAE. EG and IS represent the External Guidance and the Internal Separation.
In our experiments, all models

Table 6: Evaluation results on test data of O2M (up) and M2O (down). The best score in each column is in bold.
are trained on O2M or M2O to accomplish the two tasks. The training and validation procedures are the same as for the general dialogue generation task. In the inference stage, every model generates N responses for each context in the test set of O2M or M2O; N is set to 8 in this paper (see Appendix C for details).
Table 6 reports the results. We observe that SegCVAE achieves the best Suitability on both the O2M and M2O datasets, which we believe stems from the model's superior ability to model the CDM. We also observe that SegCVAE achieves the best Erudition on the O2M dataset but poor Erudition on the M2O dataset, while K-CVAE achieves the best Erudition on M2O but the worst on O2M. This finding is in line with the characteristics of these models: (1) due to the cluster information, the K-CVAE samples latent

Table 7 :
Generated responses from the baseline and SegCVAE on test set of CornellMovie.

Table 8: Statistics for the One-to-Many (O2M) and Many-to-One (M2O) datasets. The # tokens is the vocabulary size, and # pairs/contexts/responses is the number of dialogue pairs/contexts/responses in the datasets. The avg/max # r is the average/maximum number of responses per context, and the avg/max # c is the average/maximum number of contexts per response. "-" means the cell is not applicable for this type/dataset.
Table 8 lists key statistics of the datasets after processing.

C.2 Non-one-to-one Dialogue Learning Tasks

One-to-Many Dialogue Learning Task Let c denote a context, and rs = r_1, r_2, ..., r_n denote the responses to c. Following the general dialogue generation task, we put c and rs into n dialogue pairs.

Context: I'd rather die than live with you! freaking unk!
Responses: Relax! where does it hurt? / Stop! ma'am, ma'am!
CVAE: I'm gonna get you to know! / That's a bad idea, mister. / I have a hell! / It's a joke that you said he's a special agent! / why do you want me to believe? / You have something to do with this? aah. / Hey, you're ready? yeah. / The world's in the mood! / Here, put your hands in the bowl.
SegCVAE: Yep tonight really... to me. sean? / Calm down. / hurry any, hurry unk. / Nothing, they are hot / hey, No-no, your unk. / i... God? uh... did not fit... / Be it then let's abandon it. / 9 pigs. 1 50,000. open. / Really is going with nothing? / all unk came in the past hours. / Most way. / hell and i are unk

Table 9 :
Generated responses from the baseline and SegCVAE on O2M dataset.