Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona

Personalized dialogue generation explores the consistent relationship between dialogue generation and personality. Existing personalized dialogue agents model persona profiles from three resources: sparse persona attributes, dense persona description texts, and dialogue histories. However, sparse structured persona attributes are explicit but uninformative; dense persona texts contain rich persona descriptions but much noise; and the dialogue history query is both noisy and uninformative for persona modeling. In this work, we combine the advantages of the three resources to obtain a richer and more accurate persona. We design a Contrastive Latent Variable-based model (CLV) that clusters the dense persona descriptions into sparse categories, which are combined with the history query to generate personalized responses. Experimental results on Chinese and English datasets demonstrate our model's superiority in personalization.

Sparse persona attributes (e.g., gender, age) are highly interpretable and have high information utilization, but the information is limited and cannot express complex persona features. Dense persona description text contains rich and flexible persona information but suffers from noisy expressions. Modeling personality directly from dialogue histories requires no additional persona information, but the persona information in history queries is both noisy and uninformative.
To address these issues, in this paper we improve personalized dialogue generation by combining the advantages of the three resources. We design a contrastive latent variable (CLV)-based model that clusters the dense persona descriptions into sparse categories, which are combined with the history query to generate personalized responses. Specifically, first, the dialogue's latest query and response, together with the dense persona description texts, are encoded. Then the recognition distribution of query and response is jointly modeled with a pre-designed dual conditional variational autoencoder (CVAE; Sohn et al., 2015). Simultaneously, the persona information is automatically separated into multiple parts that participate in the above process in parallel. These partitioned pieces of persona information are assumed to capture different angles of portrayal. This process is also reinforced by contrastive learning. Next, a decider chooses which category of persona information is used for persona modeling. Finally, a personalized generator combines the history query and the additional persona information for response generation. Without explicit supervision signals, we design a pseudo-labeling and joint-training method to train the decider.

Related Work
Personalized Dialogue Generation Open-domain dialogue has been studied in depth for a long time (Koehn et al., 2003; Ni et al., 2021), and, under the influence of psychological theory, personality has been incorporated into the requirements for dialogue generation. Personalized dialogue generation has three typical approaches: (1) Using well-defined sparse persona attributes (e.g., gender, age), the model can utilize different attributes efficiently and interpretably, and knowledge-enhanced dialogue generation approaches can be borrowed (Zhang et al., 2018a; Song et al., 2019; Wolf et al., 2019; Liu et al., 2020; Bao et al., 2020; Song et al., 2021). However, sparse attributes can provide only limited persona information without complex semantics.
(2) Mining information from dense textual persona descriptions, which contain rich and deep persona information but are very noisy (Qian et al., 2018; Song et al., 2020; Zheng et al., 2020; Song et al., 2021). (3) Implicitly modeling persona profiles from historical dialogue queries (Li et al., 2016b; Ma et al., 2021; Zhong et al., 2022). This approach does not rely on additional persona information, but it is difficult to acquire personality implicitly from dialogue history without reference objects.
Dialogue Generation Based on CVAE Besides personalization, another essential goal of personalized dialogue generation is the diversity of dialogue expression. To this end, existing works have explored latent variable models that model the variables in the dialogue process as Gaussian distributions, which can enhance the diversity of dialogue generation by introducing randomness (Zhao et al., 2017; Song et al., 2019; Hu et al., 2022). In this direction, one typical approach is to include persona information as a condition in a regular Seq2Seq architecture and to model responses and queries as recognition distributions in a CVAE (Li et al., 2018); another is to combine persona information or other external conditions with responses as generation targets before modeling the joint distribution together with queries (Lee et al., 2021). In addition, many CVAE text generation models focus on other tasks and modify model details as well as probabilistic graphs for those tasks; these are not considered in this paper.

Overview
Given a multi-turn dialogue between two users u_i and u_j, the dialogue context of u_i is the query initiated by u_j to u_i. The goal of personalized dialogue is to generate a personalized response R_i using the corresponding persona information P_i in text form.
The overview of our model is shown in Figure 1. The overall model is composed of four modules: encoder, self-separation module, decider, and generator (marked in Figure 1 with orange borders). Specifically, the encoder module encodes dialogue queries, persona information, and responses respectively. The self-separation module separates the persona information in the hidden sentence-vector space to form groupings of persona information with implicit categories. We use multiple CVAEs to process the grouped persona information and obtain grouped latent variables. The decider then automatically selects which latent variable from the group to use and feeds it into the generator along with the query. Finally, the generator autoregressively generates personalized responses based on the query and the latent variables.

Encoder
We use a pre-trained GPT-2 (Radford et al., 2019) to encode the persona information text P_i, dialogue query Q_i, and dialogue response R_i. We take the hidden vector of the last time step in the last layer of GPT-2 as the representation of the whole passage:

p_i = Enc(P_i),  q_i = Enc(Q_i),  r_i = Enc(R_i),  (2)

where p_i, q_i, r_i ∈ R^d, and d is the dimension of the hidden state.
Algorithm 1: Persona Self-Separation
Input: p ∈ R^{1×d}: the vector representation of the original sentence; N: hyper-parameter, the self-separation coefficient; d: the dimension of the hidden state.
Output: P_g ∈ R^{N×d}: vector representations of the persona information after processing (here, in the form of a set).
1: Initialize P_g;
2: Set s ← the integer part of d/N;
3: for i = 1 to N do
4:   Initialize augment vector c_i ← (0, 0, ..., 0)_{1×d};
5:   …

Self-Separated Module
After obtaining the hidden state representations of P, Q, and R, their representation vectors are further processed. As mentioned above, sparse persona information is more explicit and interpretable, while dense persona text contains rich information but is less organized. Therefore, following the research of Sun et al. (2021), we propose a self-separation method for persona information, which implicitly divides the dense textual persona information into N categories:

P_g = P-Sepa(p),

where P-Sepa denotes the separation operation and P_g represents the persona information after grouping, which is composed of multiple parallel pieces of persona information.
For the algorithm of P-Sepa, see Algorithm 1.
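Algorithm 1 as printed breaks off after initializing the augment vectors. The following is a minimal PyTorch sketch of one plausible completion, in which each augment vector c_i receives the i-th chunk of s dimensions copied from p; the copy step and the final stacking are our assumptions, not the paper's exact procedure.

```python
import torch

def p_separate(p: torch.Tensor, N: int) -> torch.Tensor:
    """Split a persona vector p (shape [d]) into N zero-padded views.

    Each view c_i keeps only the i-th contiguous chunk of s = d // N
    dimensions and zeroes out the rest, so the N rows of the result act
    as parallel "category" views of the same persona vector.
    """
    d = p.shape[-1]
    s = d // N  # chunk size, as in Algorithm 1 (s = integer part of d/N)
    views = []
    for i in range(N):
        c = torch.zeros(d)                            # augment vector c_i
        c[i * s:(i + 1) * s] = p[i * s:(i + 1) * s]   # copy chunk i (assumed)
        views.append(c)
    return torch.stack(views)                         # P_g, shape [N, d]
```

When N divides d, the rows of P_g sum back to the original vector, so no information is lost by the separation.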
To let the model automatically classify the grouped persona information, we apply contrastive learning over the data in the same batch so that the model learns the similarities between the grouped pieces of persona information. Specifically, for two data points P_g^i and P_g^j, we use a contrastive loss to help the model better represent the grouped persona information. Following SimCSE (Gao et al., 2021), we denote the representations of the k-th group of the two data points as h_k^i and h_k^j and obtain the training objective

L_cl = -log( exp(sim(h_k^i, h_k^j)/τ) / Σ_m exp(sim(h_k^i, h_k^m)/τ) ),

where τ is a temperature hyperparameter and sim(h_k^i, h_k^j) is the cosine similarity.

The model samples the persona latent variable z_p from the persona distribution and the response latent variable z_r from the latent response distribution. Since z_p and z_r represent different aspects of the generated responses (z_p contains the persona, and z_r captures the specific query-response association), we assume that z_p and z_r are independent of each other, namely z_p ⊥ z_r. The response generation process can therefore be written with the conditional distribution p(r, z_p, z_r | q) = p(r | q, z_p, z_r) p(z_p | q) p(z_r | q). Our goal is to approximate p(r | q, z_p, z_r), p(z_p | q), and p(z_r | q) with deep neural networks. Following Zhao et al. (2017) and Song et al. (2019), we refer to p(r | q, z_p, z_r) as the response generator and to p_θ(z_p | q) and p_θ(z_r | q) as prior networks. To approximate the true posterior distributions, we refer to q_φ(z_p | q, p) and q_φ(z_r | q, r) as recognition networks.
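The in-batch contrastive objective can be sketched as an InfoNCE loss; treating the grouped representations of the other data points in the batch as the negative set is our assumption, since the text does not fully specify it.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_i: torch.Tensor, h_j: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """In-batch InfoNCE loss over grouped persona representations.

    h_i, h_j: [B, d] paired views; row b of h_j is the positive for
    row b of h_i, and all other rows in the batch act as negatives.
    """
    h_i = F.normalize(h_i, dim=-1)
    h_j = F.normalize(h_j, dim=-1)
    sim = h_i @ h_j.t() / tau            # [B, B] cosine similarities / tau
    labels = torch.arange(h_i.size(0))   # positives sit on the diagonal
    return F.cross_entropy(sim, labels)
```

With perfectly aligned pairs the loss approaches zero; misaligned pairs are penalized heavily, which is what pushes same-category groups together.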
We train this CVAE with Stochastic Gradient Variational Bayes (SGVB) (Kingma and Welling, 2013) by maximizing the variational lower bound of the conditional log-likelihood. Following Zhao et al. (2017) and Song et al. (2019), we assume that the latent variables z_p and z_r follow multivariate Gaussian distributions with diagonal covariance matrices. The variational lower bound of CLV-CVAE can be written as

L_ELBO = E_{q_φp(z_p|q,p), q_φr(z_r|q,r)}[log p(r | q, z_p, z_r)] - KL(q_φp(z_p | q, p) || p_θp(z_p | q)) - KL(q_φr(z_r | q, r) || p_θr(z_r | q)).

Because we assume that the latent variables z_p and z_r follow isotropic multivariate Gaussian distributions, the recognition networks are q_φp(z_p | q, p) ~ N(μ_p, σ_p^2 I) and q_φr(z_r | q, r) ~ N(μ_r, σ_r^2 I), and the prior networks are p_θp(z_p | q) ~ N(μ'_p, σ'_p^2 I) and p_θr(z_r | q) ~ N(μ'_r, σ'_r^2 I). To sample z_p and z_r from the prior and recognition networks during training while keeping the sampling operation differentiable, we use the reparameterization trick (Kingma and Welling, 2013):

z = μ + σ ⊙ ε,  ε ~ N(0, I),

where the Gaussian parameters are predicted from the representation vectors p, r, and q obtained in Section 3.2. Finally, z is fed into the generator to generate r together with the dialogue query q, where z = z_p + z_r. How the final z_p is obtained is explained in detail in Section 3.4.
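A generic sketch of the two ingredients above, reparameterized sampling and the closed-form KL divergence between diagonal Gaussians; this is a standard formulation, not the authors' exact parameterization.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps, eps ~ N(0, I) — differentiable sampling."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p) -> torch.Tensor:
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over dimensions — the KL terms of the variational bound."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)
```

The KL is zero when recognition and prior distributions coincide, so minimizing it pulls the recognition network toward the prior while the reconstruction term pulls it toward informative latents.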

Decider
In fact, to help the model find the appropriate persona information, we do not let CLV choose from the grouped persona information directly. Instead, we first use the recognition network or the prior network to obtain the grouped persona latent variables, sampled from a set of distributions constructed separately for each vector in P_g. Then, the decider is trained to choose among them. We call it the decider because its choices also include the decision not to use persona information at all.
Specifically, the decider is a classification neural network composed of multi-layer perceptron units that makes a soft selection. The decider produces a vector of classification probabilities D = (D_1, ..., D_N), which is multiplied by the grouped persona latent variables to obtain the final persona latent variable z_p. For the grouped persona latent variables Z_p^g:

z_p = Σ_{k=1}^{N} D_k (Z_p^g)_k.

It is difficult to let the decider directly learn how to choose among latent variables sampled from the implicitly clustered persona distributions. Therefore, we introduce pseudo-labels to guide the learning of the decider. The intuition is that if one latent variable in the group achieves a smaller decoding loss in the generator, it is likely a better latent variable. Based on this idea, we design the decision loss to train the decider:

L_d = -log D_y,

where y is the index of the grouped latent variable that, when input to the generator, yields the minimum decoding loss.
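A minimal sketch of the soft decider; the paper specifies only an MLP with soft selection, so the MLP input (the query representation) and the layer sizes here are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decider(nn.Module):
    """Soft selection over N grouped persona latent variables."""

    def __init__(self, d: int, N: int):
        super().__init__()
        # multi-layer perceptron producing one score per persona group
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, N))

    def forward(self, q: torch.Tensor, Z_g: torch.Tensor):
        # q: [B, d] query representation; Z_g: [B, N, d] grouped latents
        probs = F.softmax(self.mlp(q), dim=-1)        # decider probabilities D
        z_p = torch.einsum('bn,bnd->bd', probs, Z_g)  # soft-selected z_p
        return z_p, probs

def decider_loss(probs: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pseudo-label loss L_d: y is the group index whose latent variable
    achieved the smallest decoding loss in the generator."""
    return F.nll_loss(torch.log(probs + 1e-9), y)
```

Because the selection is soft, gradients flow to every group's latent, while the pseudo-label loss sharpens the distribution toward the best-performing group.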

Generator
We use a pre-trained GPT-2 as the generator, which takes the dialogue query as input and adds cross-attention to the latent variable z:

R = GPT-2(Q, Pre(z)),

where Pre(z) is the pre-cross-attention object added before the standard GPT-2, which autoregressively generates a personalized response R.
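One plausible realization of Pre(z) is to project z into a few memory slots to which the decoder hidden states cross-attend; the slot count, the placement before the language-model head, and the residual connection are our assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Sketch of cross-attending from decoder states to the latent z."""

    def __init__(self, d: int, m: int = 4, heads: int = 2):
        super().__init__()
        self.pre = nn.Linear(d, m * d)  # Pre(z): latent -> m memory slots
        self.m, self.d = m, d
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # hidden: [B, T, d] decoder states; z: [B, d] latent variable
        mem = self.pre(z).view(-1, self.m, self.d)   # [B, m, d] memory
        out, _ = self.attn(hidden, mem, mem)         # cross-attention
        return hidden + out                          # residual connection
```

In a full implementation this module would be interleaved with (or wrapped around) the GPT-2 blocks so every generation step can consult the latent.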

Training and Optimizer
In practice, we find that training the decider poses several challenges, probably because the loss functions influence one another. First, the KL divergence conflicts with the decoding loss of the generator. Second, the loss of the decider depends on the pseudo-label supervision signal we construct. Finally, because the contrastive enhancement loss serves the implicit clustering of persona information, it is largely independent of the above losses.
To promote gradient learning across the above loss functions, we design a joint training process that trains the CVAE and the decider alternately. Specifically, in each training iteration, we first sample the queries Q, responses R, and persona information P of two data points from the batch data D, conduct contrastive training on the encoders that encode persona information according to the self-separation algorithm (Algorithm 1), and then generate the latent variables after self-separation according to the method described in Section 3.4. The generator's loss values create a pseudo-label y (Eq. 13), which is used to train the decider by optimizing the loss L_d (Eq. 14).
Further, we traverse D, generate a personalized response R, and update the generator and CVAE MLP by optimizing loss L g (Eq.6).
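The alternating schedule can be sketched schematically as follows; the quadratic losses are toy stand-ins for L_d and L_g, and in the real model each phase updates only its own parameters (decider vs. generator and CVAE MLPs).

```python
import torch

# Toy parameters standing in for the decider and the generator/CVAE.
decider_p = torch.nn.Parameter(torch.tensor(1.0))
gen_p = torch.nn.Parameter(torch.tensor(1.0))
opt_d = torch.optim.SGD([decider_p], lr=0.1)
opt_g = torch.optim.SGD([gen_p], lr=0.1)

for step in range(100):
    # Phase 1: train the decider on the pseudo-label loss L_d,
    # derived from the per-group decoding losses of the generator.
    loss_d = decider_p ** 2
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Phase 2: train the generator and CVAE MLPs on L_g
    # (reconstruction plus the KL terms of the variational bound).
    loss_g = gen_p ** 2
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Keeping two optimizers with disjoint parameter sets is what prevents the conflicting gradients described above from interfering within a single update.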

Datasets
ConvAI2 (Dinan et al., 2019) is an English dataset containing rich persona information; its dialogues are grounded in the persona facts of the corresponding characters. It is derived from PersonaChat (Zhang et al., 2018b) through filtering and refinement. It is a crowdsourced dataset covering rich persona features, and we further processed it to remove some noise.
Baidu PersonaChat is a Chinese personalization dataset collected and open-sourced by Baidu; it is otherwise similar to ConvAI2.
We summarize the key statistics of the two personalized dialogue datasets in Table 1.As mentioned earlier, we only use the persona information of the two datasets during training.

Baselines
We compare the proposed model with 6 baselines, which can be classified into 3 categories.
Non-Personalized Approaches Seq2Seq with Attention (Sutskever et al., 2014) is a sequence-to-sequence model with an attention mechanism (Luong et al., 2015). The pre-trained GPT-2 (Radford et al., 2019) performs well on various text generation tasks and is used as a dialogue generation model after training on a dialogue corpus.
Approaches Based on Dense Persona Information These methods use persona information to construct knowledge-enhanced models; for a fairer model comparison, we tested these methods using the dialogue history as an approximation of the persona information. PerCVAE (Zhao et al., 2017) encodes the persona information text as a conditional representation and uses a CVAE to generate personalized responses. BoB (Song et al., 2021) uses the BERT model for personalized dialogue generation and jointly integrates the consistency generation task with the consistency inference task, providing insight into the evaluation mechanism of personalized dialogue generation.
The Dialogue History-based Approach DHAP (Ma et al., 2021) uses historical memory to store and construct dynamic query-aware user profiles from dialogue histories and then uses a personalized decoder to generate responses. MSP (Zhong et al., 2022) enhances personalized dialogue generation by retrieving similar conversations from similar users via a User Refiner and a Topic Refiner and uses a Token Refiner to find the relevant tokens to be used during training; it is the best overall-performing model for personalized dialogue generation without explicit persona information. Implementation details are in Appendix A.1.

Evaluations
In order to obtain accurate performance comparisons, we use both automatic and human evaluations.

Automatic Evaluation
We divide the automatic evaluation methods into three categories in order to evaluate the diversity, consistency, and coherence of the generated dialogues.
(1) Diversity Distinct-1/2 (Li et al., 2016a) counts the distinct unigrams or bigrams in the generated responses and is commonly used to evaluate diversity. Most experiments do not specify the object over which Distinct-1/2 is computed, whether the whole corpus or multiple sentences, so we propose C-Dist-1/2 (Corpus-Distinct-1/2) and S-Dist-1/2 (Sentence-Distinct-1/2) according to the object of evaluation: the former evaluates the dialogue responses generated by the model over the whole test set, while the latter evaluates multiple responses to the same query (five generated responses in this paper). S-Dist-1/2 better evaluates whether the model can generate varied responses in the same situation.
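Both C-Dist-1/2 and S-Dist-1/2 reduce to the same distinct-n computation, differing only in which set of responses is pooled:

```python
def distinct_n(texts, n):
    """Distinct-n: ratio of unique n-grams to total n-grams, pooled over
    the given responses (corpus-level if `texts` is the whole test set,
    sentence-level if it is the multiple samples for a single query)."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

For example, `distinct_n(["a b a b"], 1)` is 0.5 (two unique unigrams out of four), and pooling identical responses drives the score down, which is exactly the repetition penalty the metric is meant to capture.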
(2) Consistency The personalized dialogue generation task requires consistency between the generated responses and the persona information. We propose the Con.Score (Consistency Score) based on the C.score (Madotto et al., 2019); it is computed with a referee NLI model, a triple-classification model that judges the relation between a response and the persona (details in Appendix A).
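Since the printed formula is lost from this copy, the following is one plausible reading of the referee-model score, mapping the three NLI classes to {1, 0, -1} and averaging; the exact mapping used by Con.Score may differ, and the `nli` callable here is a stand-in for the fine-tuned triple classifier.

```python
def con_score(responses, personas, nli):
    """Referee-model consistency score.

    `nli(premise, hypothesis)` is any callable returning one of
    'entail', 'neutral', or 'contradict' (e.g., the triple-classification
    model described in Appendix A).
    """
    value = {'entail': 1, 'neutral': 0, 'contradict': -1}
    scores = [value[nli(p, r)] for r, p in zip(responses, personas)]
    return sum(scores) / len(scores)
```

Any NLI backbone can be plugged in, so long as it emits the three-way label the score expects.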
(3) Coherence BLEU-1 (Papineni et al., 2002) and ROUGE-L (Lin and Och, 2004) are classical word-overlap metrics for measuring the similarity between generated responses and factual responses, which we believe can indirectly measure the coherence of dialogues. We do not report BLEU-2/3/4 because we think that overly rigid n-gram coverage does not reflect the coherence of the model. Similarly to the Con.Score, we propose the Coh-Con.Score (Coherence-Consistency Score), which is also obtained with the NLI model.

Human Evaluation Taking into account the uncertainty of the criteria during evaluation, we perform human evaluations of all models and convert the scoring method into a ranking method. Specifically, we extract 100 data points (queries, responses, and persona information) and hire three well-educated annotators to rank the responses generated by the different models, normalizing the ranks into scores on a scale of [0, 1]. We focus on four aspects: readability, diversity, consistency, and coherence, and ask the evaluators to rank eight options covering the seven model-generated responses and the factual responses.

Table 2 shows the performance of all models on different automatic metrics for both the Chinese and English datasets, and it can be clearly observed that our CLV model improves on key metrics; these improvements are statistically significant (t-test with p-value < 0.05). Specifically, we can observe the following. (1) Diversity. CLV shows different results on the two diversity dimensions. On S-Dist-1/2, CLV leads the other models, which indicates that our model can make more diverse and flexible responses than the other models when facing the same situation. However, its C-Dist-1/2 is lower than that of most models, which indicates that our model makes some sacrifices to improve consistency and coherence; we analyze this further in Section 5.

(2) Consistency. The lead on the consistency personalization metric Con.Score implies that our approach can integrate persona information into generation; since this integration is achieved without persona information provided at inference time, it is all the more indicative of the superiority of CLV.
(3) Coherence. The performance of our model on coherence is also outstanding, whether on the coverage metrics BLEU-1 and ROUGE-L or on the learned metric Coh-Con.Score, which also shows that coverage metrics are a feasible basis for evaluating dialogue coherence. Diversity, coherence, and consistency can serve as three key bases for evaluating personalized dialogue generation, and the experimental findings suggest that our model produces more personalized responses than all baselines.
Human Evaluation Human evaluation results on ConvAI2 are shown in Table 3. We calculated Fleiss' Kappa among the three annotators and obtained a Kappa of 0.67, which implies that the three annotators are in substantial agreement (Landis and Koch, 1977). In general, the human annotation results are consistent with the automatic evaluation results; both demonstrate the advantages of our model in personalized dialogue generation and basic readability.
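For reference, Fleiss' Kappa over the three annotators' labels can be computed as follows (a standard implementation, not tied to this paper's tooling):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement.

    ratings: list of items, each a list of category labels, one per
    annotator (every item must have the same number of raters)."""
    n = len(ratings[0])                      # raters per item
    cats = sorted({c for row in ratings for c in row})
    P_bar = 0.0                              # mean per-item agreement
    totals = Counter()                       # category counts over all items
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        P_bar += (sum(v * v for v in counts.values()) - n) / (n * (n - 1))
    P_bar /= len(ratings)
    N = len(ratings) * n
    P_e = sum((totals[c] / N) ** 2 for c in cats)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and the 0.61-0.80 band is conventionally read as "substantial" (Landis and Koch, 1977).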

Further Analysis
We further describe our model through a series of analyses.All analyses are based on the ConvAI2 dataset, and similar phenomena can be observed on Baidu PersonaChat.
Ablation Study To investigate the effects of different modules in CLV, we conduct an ablation study by removing modules; the results are shown in Table 5. We first investigate the impact of the model's core mechanism, the self-separation algorithm. After removing the complete self-separation mechanism, the model degenerates to the basic GPT-2 model, and its performance is on par with GPT-2. If we remove only the contrastive learning in the self-separation algorithm and keep the CVAE, the model's performance also declines considerably, but its C-Dist-1/2 improves; this global diversity comes from the randomness of the latent variables sampled in the CVAE, which again indicates that CLV does sacrifice some global diversity for other performance gains. Then, for the decider, we bypass it by directly computing the mean of the grouped persona latent variables; the resulting drop shows that the decider also plays an important role in CLV, especially because many dialogues are generated without considering persona, which shows that our decider can make decisions automatically. Finally, we validate our proposed joint training; removing it degrades performance, showing that it is difficult for the decider to learn how to make decisions without additional supervision signals.

Effect of Self-Separation Coefficients
In CLV, the self-separation mechanism categorizes persona information by approximate implicit clustering, and the self-separation coefficient N corresponds to the number of clusters. Intuitively, the self-separation coefficient affects the model's performance, and we report this effect in Figure 2. The self-separation mechanism provides little benefit when N is small. When N is set too large, the decider is also unable to make good decisions, because too many categories introduce noise and scatter the persona information too widely; this is also consistent with the fact that the descriptive texts are always confined to several fixed perspectives.
Case Study To demonstrate the model's effectiveness more concretely, we conduct case studies. The results, shown in Table 4, demonstrate that CLV can reconstruct persona profiles from queries alone, extract personal information, and generate fluent, personalized responses.
In Case 1, both CLV and BoB accurately answered "music" when asked about their hobbies, while CLV also used "How about you?" to keep the conversation going. In Case 2, CLV not only answered the address accurately but also flexibly used "school teacher" and "Affiliated Primary School of Renmin University" from the persona information to generate the response. In Case 3, all four models failed to answer the question in a way consistent with the personality, but CLV still connected "lawyer" and "legal affairs". Observing Cases 1 and 2, we can see that CLV balances consistency and coherence: its generations are consistent with the persona while maintaining context coherence. GPT-2 achieves only basic sentence fluency. BoB and MSP can also generate good answers with the help of context in reasoning. In Case 3, CLV produces a slightly fitting answer, which is still better than those of the other models.

Conclusion
In this work, we propose a CLV model for personalized dialogue generation.Unlike existing works, we integrate the advantages of sparse and dense persona information.We use a self-separation mechanism to implicitly cluster the persona information in the dense persona information text so that the decider can consider different sparse categories of persona information during dialogue and enhance the personalization of dialogue generation.We also propose a more effective evaluation metric framework for personalized dialogue generation.The experimental results confirm the effectiveness of the model in generating personalized responses.

Figure 2 :
Figure 2: Experiments with different N on the ConvAI2 dataset. For ease of viewing, BLEU-1 and Coh-Con.Score are multiplied by a factor of 4.
Figure 1: The overview structure of the proposed model. Connections with dashed blue lines appear only during training, connections with dashed red lines appear only during inference, and connections with solid black lines appear during both training and inference. The purple lines represent positive and negative example constructions in contrastive learning.

Table 1 :
Statistics of persona dialogue datasets.

Table 2 :
Automatic evaluation on two datasets. The best results are in bold. "†" indicates that our model passed the t-test with p-value < 0.05.

Table 3 :
The result of human evaluation on ConvAI2 dataset." †" indicates that our model passed the t-test with p-value < 0.05.

Table 4 :
A case study. Keywords are marked in red.