MIRACLE: Towards Personalized Dialogue Generation with Latent-Space Multiple Personal Attribute Control

Personalized dialogue systems aim to endow the chatbot agent with more anthropomorphic traits for human-like interactions. Previous approaches have explored explicit user profile modeling using text descriptions, implicit derivation of user embeddings, or handcrafted prompts for ChatGPT-like models. However, textual personas are limited in describing multi-faceted attributes (\emph{e.g.}, \emph{language style, inner character nuances}), implicit embedding suffers from personality sparsity, and handcrafted prompts lack fine-grained and stable controllability. Hence, these approaches may struggle with complex personalized dialogue generation tasks that require generating controllable responses with multiple personal attributes. To this end, we propose \textbf{\textsc{Miracle}}, a novel personalized dialogue generation method through \textbf{M}ult\textbf{I}ple Pe\textbf{R}sonal \textbf{A}ttributes \textbf{C}ontrol within \textbf{L}atent-Space \textbf{E}nergy-based Models. Specifically, our approach first disentangles complex personality into multi-faceted attributes. Subsequently, we employ a conditional variational auto-encoder to align with the dense personalized responses within a latent joint attribute space. We have also tailored a dedicated energy function and customized the ordinary differential equations sampling method to offer flexible attribute composition and precise attribute control. Extensive experiments demonstrate that \textsc{Miracle} outperforms several strong baselines in terms of personality controllability and response generation quality. Our dataset and code are available at \url{https://github.com/LZY-the-boys/MIRACLE}


Introduction
Building a personalized and anthropomorphic chatbot is an essential goal in the field of dialogue systems.

Figure 1: Top: previous methods model personas by user embeddings derived from user posts (e.g., on Reddit) or a series of text descriptions. Bottom: our approach models personality as the composition of multiple personal attributes. We train MIRACLE to align with different personal attributes (language style, attitude, etc.), and control multiple attributes to represent diverse personalities during inference.
Recent personalized dialogue methods often rely on text descriptions (Song et al., 2019; Wolf et al., 2019; Xu et al., 2022; Chen et al., 2023) to model user profiles. However, they primarily focus on concrete identifiable facts and background information, e.g., age, job, location, neglecting the multifaceted dimensions of personality (Moore et al., 2017; Ahn et al., 2023). For instance, while a statement like "I grew up in the deep south" conveys traits related to regional identity, it overlooks other personality dimensions such as language style, attitudes, and inner character nuances. Other methods for personalized dialogue generation rely on user embeddings derived from social media platforms like Reddit (Qian et al., 2021; Ma et al., 2021; Huang et al., 2022; Zhong et al., 2022). However, these models encounter challenges due to the sparsity present in real-world posts, as they lack explicit persona modeling. Consequently, they may struggle to achieve accurate and comprehensive personalization through implicit embeddings.
While recent advancements in large language models, such as ChatGPT (https://chat.openai.com/), have facilitated personalized content through manual prompts, it is non-trivial to directly impersonate a specific persona using such prompts (Zhuo et al., 2023; tse Huang et al., 2023). This challenge stems from the inherently ambiguous and limited expressiveness of prompts, which fails to achieve precise control over personalized content.
In this paper, we present MIRACLE, a novel approach that enables more precise and reliable fine-grained control over personalization in dialogue systems. Specifically, we propose modeling user personality by disentangling it into multiple distinct personal attributes. As illustrated in Figure 1, personality can be decomposed into various attributes, including attitude, language style, mental characteristics, and more. Each attribute encompasses specific aspects, such as optimism or pessimism for the attitude attribute. This decomposition allows us to capture the diverse dimensions of an individual's personality and enables fine-grained modeling and control of each attribute separately. By combining these aspects from multiple attributes, we can express a wide range of unique personalities. To achieve personalized generation, we specify an energy function that incorporates multiple personal attributes in a product-of-experts (PoE) manner. By assigning lower energy to responses that better align with the specified aspects, our approach enables personalized generation by sampling from an energy-based model (EBM), providing flexible and fine-grained control over the personalization of generated responses.
To address the challenge of personality sparsity and enhance personalized generation quality, we collect a high-quality multi-turn dialogue corpus characterized by its dense coverage of each individual aspect. To circumvent the non-differentiable nature of text and better align with the dense aspect data, we employ a conditional variational autoencoder (CVAE) framework (Sohn et al., 2015) to map the attributed dialogues to a shared latent space. To further enhance attribute representation, two new loss functions are introduced to promote the distinctiveness and compactness of the latent space. Within this latent space, we leverage the designed EBM to capture the aspect density and compose different attributes. Additionally, we utilize an adapted ODE sampling method to efficiently draw personalized responses from this distribution.
In summary, our contributions include a novel personalized dialogue generation approach based on fine-grained control over multiple personal attributes in a CVAE-based latent space, with two new losses promoting distinct and compact attribute representations, and flexible EBM-based composition of different personal attributes via a customized ODE sampling method. Experimental results demonstrate that our approach achieves state-of-the-art performance, striking a superior balance between generation quality and personalized control. A high-quality personal-attribute dialogue corpus for research purposes is also provided.

Personalized Response Generation
Existing methods for personalized dialogue generation can be broadly classified into two groups: text-description-based methods and user-embedding-based methods.
In the category of text-description-based methods, early works (Wolf et al., 2019; Song et al., 2020, 2021a) primarily focus on promoting persona consistency through pre-trained language models, while recent advancements borrow knowledge-enhancement techniques (Liu et al., 2022b; Fu et al., 2022; Jang et al., 2022) and incorporate entailment/discourse relations (Chen et al., 2023). However, these methods often represent personas as key-value lists or sentences, which limits accurate understanding and expression of personality nuances.
As for embedding-based methods, traditional approaches (Li et al., 2016b; Al-Rfou et al., 2016) attempt to exploit user ID information, while DHAP (Ma et al., 2021) embeds user dialogue history as implicit profiles. More recently, contrastive learning (Huang et al., 2022), refined retrieval (Zhong et al., 2022), and CVAE-based clustering (Tang et al., 2023) have been explored to enhance personalization performance. However, these approaches may still suffer from the personality scarcity of real-world posts without explicit modeling. Additionally, effectively guiding personalization with implicit embeddings remains a significant challenge.

Energy-based Text Modeling
Recently, energy-based models (EBMs) have emerged as a flexible generative framework capable of handling diverse configurations (Khalifa et al., 2021; Liu et al., 2022a). These models allow the incorporation of arbitrary functions into the energy function, which is minimized during inference. As a result, many recent works leverage EBMs to model complex distributions (Pang and Wu, 2021; Yu et al., 2022) and to incorporate multiple constraints and attributes (Nie et al., 2021; Pang and Wu, 2021; Qin et al., 2022; Liu et al., 2022a). For example, Mix-and-Match (Mireshghallah et al., 2022) employs EBMs to combine arbitrary black-box scorers for guiding text generation, while COLD (Qin et al., 2022) utilizes the energy function to impose arbitrary constraints during decoding. LatentOps (Liu et al., 2022a) introduces composable text control operations utilizing classifier-based EBMs. However, these works primarily focus on plain-text generation, whereas our approach applies EBMs to dialogue-generation scenarios, specifically modeling complex personality as a composition of multiple personal attributes on top of a CVAE architecture. We also adapt the ODE sampling method to effectively sample personalized dialogue responses.

Notation
Task Definition The task is to generate a personalized response, denoted as $r_M$, given the personality $\mathcal{P}$ and a multi-turn dialogue context $C = \{q_1, r_1, \ldots, q_{M-1}, r_{M-1}, q_M\}$. Here, $q$ and $r$ represent the user query and chatbot response, respectively. In essence, the objective of personalized response generation is to estimate the probability distribution $p(r \mid C, \mathcal{P})$ in order to generate specific personalized responses.

Personality Modeling
In contrast to previous work, we propose a new approach that disentangles the personality $\mathcal{P}$ into a composition of persona-related attributes, represented by $\mathcal{P} = (P_1, P_2, P_3, \ldots, P_N)$, where $N$ is an arbitrary number and is easily adjustable. Each attribute $P_i$ may have $n_i$ candidate aspects, denoted as $P_i = \{p_i^1, \ldots, p_i^{n_i}\}$. Given a chosen aspect for each attribute, $(p_1^{a_1}, \ldots, p_N^{a_N})$, the objective of personalized response generation is to generate a response $r$ that incorporates these aspects simultaneously.
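To make the formulation concrete, the snippet below encodes a personality as one chosen aspect per attribute. The attribute and aspect names mirror our experimental setup (Section 4.1); the dictionary encoding itself is only an illustrative convenience, not part of the model:

```python
# Attributes and their candidate aspects, following Section 4.1.
ATTRIBUTES = {
    "language_style": ["lyrical", "plain"],
    "attitude": ["optimistic", "pessimistic"],
    "mental_characteristics": ["critical", "emotional"],
}

# A personality P = (p_1^{a_1}, ..., p_N^{a_N}): one aspect per attribute.
personality = {
    "language_style": "lyrical",
    "attitude": "optimistic",
    "mental_characteristics": "critical",
}

# a_i: the desired aspect index of attribute P_i, used later by the energy function.
aspect_index = {attr: ATTRIBUTES[attr].index(aspect)
                for attr, aspect in personality.items()}
print(aspect_index)  # {'language_style': 0, 'attitude': 0, 'mental_characteristics': 0}
```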

Single-Aspect Dialogue Data Collection
To ensure alignment with the dense attributes disentangled from personality, we curated a multi-turn conversation corpus for each specific aspect of these attributes. Leveraging the capabilities of ChatGPT in generating single-attribute data (Coda-Forno et al., 2023) and multi-turn conversations (Xu et al., 2023), we designed instruction templates to prompt ChatGPT to simulate two-person conversations. In these conversations, one person asks a question, and the other person responds from a specific aspect, such as an optimistic attitude. To enhance corpus diversity, we also pre-select a series of "seed" topics around which the conversations are centered. To improve the aspect density of the collected corpus, we conducted multiple rounds of human evaluation and cleaning, resulting in a clean version of approximately 44k dialogue turns; further details of this process can be found in Appendix A. It is important to note that we collect single-aspect conversations for the training dataset; the multiple-attribute data is collected only for testing purposes, due to the time-consuming nature of covering extensive combinations of different attributes.
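For reference, a minimal sketch of the collection call is given below. It assumes the legacy openai Python client (pre-1.0 API) and a filled-in instruction template from Appendix A.1; the slot names and decoding settings are our assumptions, as the paper specifies only the gpt-3.5-turbo endpoint:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

def collect_aspect_dialogue(template: str, seed_topic: str, aspect_desc: dict) -> str:
    """Fill an aspect instruction template (Appendix A.1) and query gpt-3.5-turbo."""
    prompt = template.replace("[[seed-topic]]", seed_topic)
    for slot, text in aspect_desc.items():
        prompt = prompt.replace(f"[[{slot}]]", text)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # assumed; decoding settings are not given in the paper
    )
    return response["choices"][0]["message"]["content"]
```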

Joint Attribute Space Training
To facilitate the generation of personality-dense responses, we adopt a CVAE framework to map the aspect-specific dialogue data into a joint attribute space so that samples from the specific aspect space are aligned with aspect-dense response sequences.
To further enhance this joint attribute space, we introduce two specific losses. The first loss focuses on promoting the distinctness of each aspect, while the second loss aims to increase the intersection between different attributes, allowing for fine-grained sampling over multiple attributes.
Building CVAE To construct the dialogue Conditional Variational Autoencoder (CVAE), we employ two distinct models as encoders: a posterior encoder $p_\theta(z \mid C, r)$ and a prior encoder $p_{\theta'}(z \mid C)$. Both encoders, based on pre-trained BERT (Devlin et al., 2019), allow the CVAE to effectively capture the given input context $C$ in the latent variable $z$.
During training, the CVAE utilizes the posterior distribution to generate high-quality responses $r$, while during inference, when the response $r$ is unseen, the prior distribution is used to sample the latent variable $z$. Moreover, the GPT2 model (Radford et al., 2019) is leveraged as the decoder $p_\phi(r \mid C, z)$, where $\theta$, $\theta'$, and $\phi$ represent the trainable parameters of the posterior encoder, prior encoder, and decoder, respectively. Under the assumption that the CVAE posterior and prior distributions follow isotropic multivariate Gaussians, the two encoders compute the means $\mu, \mu'$ and variances $\sigma^2, \sigma'^2$. Subsequently, we utilize the reparameterization technique (Kingma and Welling, 2013) to sample the posterior $z$ from $\mathcal{N}(\mu, \sigma^2 I)$ and the prior $z'$ from $\mathcal{N}(\mu', \sigma'^2 I)$. This technique enables a differentiable sampling process.
Finally, the sampled latent variable $z$ (during training) or $z'$ (during inference) is fed into the GPT2 decoder, mapping it back to the text space and generating a response.
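A minimal sketch of the two encoders and the reparameterized sampling described above is shown below. The use of BERT's pooled output and the linear projection heads are our assumptions, since the text specifies only BERT encoders with Gaussian posterior and prior:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class GaussianEncoder(nn.Module):
    """BERT encoder with linear heads producing a Gaussian's mean and log-variance."""
    def __init__(self, dim_z=768, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.mu = nn.Linear(hidden, dim_z)
        self.logvar = nn.Linear(hidden, dim_z)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): a differentiable sampling step."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# Posterior encoder consumes (C, r); prior encoder consumes C only.
posterior_enc = GaussianEncoder()  # p_theta(z | C, r)
prior_enc = GaussianEncoder()      # p_theta'(z | C)
```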
The CVAE is trained using stochastic gradient variational Bayes (SGVB) (Kingma and Welling, 2013), which maximizes the evidence lower bound objective (ELBO) of the conditional log-likelihood. The ELBO consists of two components: a dialogue response reconstruction term that ensures the generative quality of the posterior distribution $p_\theta(z \mid C, r)$, and a regularization term that aligns the prior distribution $p_{\theta'}(z \mid C)$ with the posterior $p_\theta(z \mid C, r)$. This alignment fosters consistency at inference time, when the unseen response $r$ is generated.
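Written out, the ELBO described above takes the standard CVAE form under the notation of this section (a textbook reconstruction, not a verbatim formula from the paper):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{p_{\theta}(z \mid C, r)}\big[\log p_{\phi}(r \mid C, z)\big] - \mathrm{KL}\big(p_{\theta}(z \mid C, r)\,\|\,p_{\theta'}(z \mid C)\big)$$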
Optimizing Joint Attribute Space We introduce the aspect classification loss and the attribute distance loss. The aspect classification loss aims to improve the discriminability of latent representations for aspects within the same personal attribute. Specifically, we incorporate an individual classifier head for each attribute and train them with the cross-entropy loss
$$\mathcal{L}_C = -\sum_{i=1}^{N} \sum_{j=1}^{n_i} y^{(i)}_{p_j} \log \hat{y}^{(i)}_{p_j},$$
where $y^{(i)}_{p_j}$ represents the ground-truth probability for class $p_j$ within the attribute $P_i$, and $\hat{y}^{(i)}_{p_j}$ represents the predicted probability. By optimizing this aspect classification loss, we encourage the aspect representations to be more distinguishable, enabling more fine-grained sampling. An illustration of this concept can be found in the middle part of Figure 2 (e.g., the red and blue aspect distributions of the $P_1$ attribute exhibit clear separation).
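In code, the per-attribute classifier heads and their summed cross-entropy can be sketched as follows. The head dimensions reflect our setup of three attributes with two aspects each, and the equal weighting across attributes is an assumption:

```python
import torch
import torch.nn as nn

class AspectHeads(nn.Module):
    """One classifier head per attribute over the latent z (a sketch)."""
    def __init__(self, dim_z=768, n_aspects=(2, 2, 2)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim_z, n) for n in n_aspects])
        self.ce = nn.CrossEntropyLoss()

    def classification_loss(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # labels[:, i] holds the gold aspect index for attribute P_i;
        # the per-attribute cross-entropies are summed with equal weight (assumed).
        return sum(self.ce(head(z), labels[:, i])
                   for i, head in enumerate(self.heads))
```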
Meanwhile, to encourage the model to capture intersections between different attributes, enabling the sampling of responses with multiple attributes simultaneously, we introduce an attribute distance loss. This loss penalizes the Euclidean distance between every two distinct attribute distributions. To avoid expensive computation, we approximate this loss at the batch level, averaging within each mini-batch of size $B$:
$$\mathcal{L}_D = \sum_{1 \le i < j \le N} \Big\| \frac{1}{B}\sum_{b=1}^{B} z_b^{(i)} - \frac{1}{B}\sum_{b=1}^{B} z_b^{(j)} \Big\|_2,$$
where $z_b^{(i)}$ denotes a latent sample of attribute $P_i$ within the mini-batch. Minimizing this loss allows the model to reduce the conflicts between different attributes (e.g., the $P_1$ and $P_2$ attributes intersect in Figure 2). To sum up, our final training objective is
$$\mathcal{L} = \mathcal{L}_{\mathrm{ELBO}} + \mathcal{L}_C + \mathcal{L}_D.$$

Personalized Response Sampling
We formulate personalized response generation as drawing samples of responses that contain multiple specific aspects of personality attributes.
To achieve fine-grained control over different attributes, we define an attribute-composable energy function that calculates the aspect density in the latent space.By leveraging adapted ODE sampling methods, we can efficiently draw samples of interest from this distribution.
Latent EBM Formulation In order to sample aspect-abundant vectors $z$ in the latent space, we utilize attribute-specific classifiers, denoted as $f_i$, to quantify the density of aspect $p_i^j$ in $z$, represented as $f_i(z)[j]$.
We utilize an EBM to estimate the richness of personality expressed in the responses:
$$p(\mathcal{P} \mid z, C) = \frac{\exp\big(-E(\mathcal{P} \mid z, C)\big)}{Z}, \qquad E(\mathcal{P} \mid z, C) = -\sum_{i=1}^{N} \lambda_i f_i(z \mid C)[a_i],$$
where $Z$ is the normalizing factor and the energy function is designed in the POE manner to aggregate multiple personal attributes into a comprehensive representation of the overall personality (outlined in Appendix B.1).
In this context, $\lambda_i \ge 0$ is the weight of attribute $P_i$ and $a_i$ is the desired aspect index of $P_i$.
The energy function $E(\mathcal{P} \mid z, C)$ can be interpreted as a linear combination of the richness of the personal attributes; thus, sampling from this EBM at low energy corresponds to response sequences exhibiting a higher density of the multiple selected aspects. It is worth noting that we utilize this energy-based formulation only during the inference procedure, enabling arbitrary combinations of personal attributes without the need for combination-specific fine-tuning.
ODE Personalized Sampling Due to the intractable normalizing factor $Z$, a common practice is to sample from EBMs rather than evaluate them directly. In our approach, we derive an ODE sampling method based on the CVAE to sample from this EBM. Specifically, in Appendix B.2, we demonstrate that the ODE in our CVAE latent space takes the following form:
$$\frac{dz}{dt} = -\frac{1}{2}\beta_t \nabla_z \sum_{i=1}^{N} \lambda_i f_i(z \mid C)[a_i]. \qquad (9)$$
Here, the ODE is solved with negative time increments from $T$ to $0$. To generate a sample $r$ that aligns with a specific personality $\mathcal{P}$, the process involves drawing $z(T) \sim \mathcal{N}(z \mid C)$ and solving for $z(0)$ in the above equation using a black-box ODE solver (Chen et al., 2018, 2021). Subsequently, the obtained $z(0)$ is decoded back to the text space to yield a personalized response.
Intuitively, on the right-hand side of Equation 9, a higher value of $f_i(z \mid C)[a_i]$ indicates that $z$ better aligns with the aspect $p_i^{a_i}$. By letting $\frac{dz}{dt} \propto \nabla_z f_i(z \mid C)[a_i]$, we can pull $z$ towards more aspect-abundant regions that yield more personalized responses. The summation ensures that each aspect is taken into account, so that we can incorporate multiple selected aspects in one sample.
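A sketch of this sampler is shown below, using torchdiffeq as the black-box solver in the spirit of Chen et al. (2018). The linear $\beta_t$ schedule, the classifier interface $f_i(z)$ returning per-aspect logits, and all function names are our assumptions:

```python
import torch
from torchdiffeq import odeint  # black-box ODE solver (Chen et al., 2018)

def make_drift(classifiers, aspect_idx, lambdas, beta_min=0.1, beta_max=20.0):
    """dz/dt = -(1/2) * beta_t * grad_z sum_i lambda_i * f_i(z)[a_i]  (Equation 9)."""
    def drift(t, z):
        beta_t = beta_min + t * (beta_max - beta_min)  # linear schedule (assumed)
        with torch.enable_grad():
            z_req = z.detach().requires_grad_(True)
            # Aggregate the selected aspect scores across all attributes.
            score = sum(lam * f(z_req)[:, a].sum()
                        for f, a, lam in zip(classifiers, aspect_idx, lambdas))
            grad = torch.autograd.grad(score, z_req)[0]
        return -0.5 * beta_t * grad
    return drift

def sample_personalized_z(prior_mu, prior_sigma, classifiers, aspect_idx,
                          lambdas, T=1.0):
    """Draw z(T) ~ N(mu', sigma'^2 I) and integrate backward from T to 0."""
    z_T = prior_mu + prior_sigma * torch.randn_like(prior_mu)
    t = torch.tensor([T, 0.0])  # negative time increments
    drift = make_drift(classifiers, aspect_idx, lambdas)
    return odeint(drift, z_T, t)[-1]  # z(0): decoded to text by the GPT2 decoder
```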

Experiments
To verify the effectiveness of our proposed MIRACLE, we conduct extensive experiments with both automatic and human evaluations. Additionally, we provide further analyses of ablations, efficiency, and case studies.

Experimental Setups
Dataset To evaluate the personalization and generation capabilities of our approach, we focus on language style (with two aspects: lyrical/plain), attitude (optimistic/pessimistic), and mental characteristics (critical/emotional). We randomly sample 11,000 dialogue turns per aspect (132,000 utterances in total) from our collected multi-turn dialogue corpus to train our MIRACLE model. For evaluation, we use ChatGPT to generate conversations on different topics, covering the eight combinations of the three personal attributes. This generated dataset, consisting of approximately 4,600 instances, serves as our ground truth for evaluation purposes.
Baselines For comparison, we select the following baselines. (1) Text-description-based methods: we compare with BOB (Song et al., 2021a) and LMEDR (Chen et al., 2023), both strong text-description-based personalized models. (2) User-embedding-based methods: our second set of baselines includes MSP (Zhong et al., 2022) and CLV (Tang et al., 2023). To ensure a fair comparison, we randomly select personas from the PersonaChat dataset (Zhang et al., 2018) as conversation topics when generating our data, and feed the topics as persona input to BOB, CLV, and LMEDR during training. More details of the baselines can be found in Appendix C.1.

Evaluation Metrics
In order to obtain accurate and comprehensive performance comparisons, we use both automatic and human evaluations.

Automatic Evaluation Metrics
We assess the quality of dialogue responses from four perspectives. (1) Personalization: to evaluate the personalization of the generated responses, we employ attribute-based text classifiers to measure the accuracy score of each attribute in the generated responses (Mireshghallah et al., 2022). Additionally, we report the average score across the three attributes to assess the overall effect of personalization. (2) Coherence: coherence is measured using BLEU and Rouge metrics at the word-overlap level. We also utilize Natural Language Inference (NLI) to evaluate semantic coherence, as suggested by previous work (Liu et al., 2022b). (3) Fluency: the negative log-likelihood of the generated responses under GPT2-XL is used as the fluency score (Chen et al., 2023; Qin et al., 2022). (4) Diversity: we measure the diversity of the generated responses using the Distinct metrics and the self-BLEU score (sBLEU), as proposed in (Tang et al., 2023; Liu et al., 2022a). Further details can be found in Appendix C.3.
Human Evaluation Metrics Consistent with prior studies (Tang et al., 2023; Chen et al., 2023), we conduct human evaluations on 100 randomly selected test samples. Three annotators assess the generated responses for readability, personalization, and coherence in a double-blind manner. We obtain a Fleiss' Kappa value of 0.63, indicating substantial agreement among the annotators (Gwet, 2014). The evaluations are normalized into scores on a scale of [0, 1].

Automatic Evaluations
The performance of all models on different automatic metrics is presented in Table 1. Notably, our MIRACLE model demonstrates substantial improvements in personalization metrics while maintaining good generation quality.

Compared With ChatGPT We compare the personalization performance of MIRACLE with ChatGPT, as shown in Table 3. We observe that ChatGPT struggles to personalize the mental characteristic when controlling multiple attributes simultaneously through prompt instructions. This may be due to the inherently hidden nature of the mental characteristic, causing ChatGPT to prioritize more obvious attributes such as language style and attitude. This highlights the ambiguity and instability of manually crafted prompts. In contrast, our method benefits from single-attribute alignment during training and EBM-based composition during inference, allowing for simultaneous personalization of each attribute.

Ablation Study
As presented in Table 4, we make the following observations. (1) Removing the posterior distribution, although competitive at the word-overlap level (BLEU/Rouge), cannot maintain coherence with the dialogue history. (2) Dropping the loss $\mathcal{L}_C$ leads to an improvement in generation coherence but a significant decrease in personalization, indicating the crucial role of $\mathcal{L}_C$ in capturing distinct personal attributes. (3) Removing the loss $\mathcal{L}_D$ results in a slight degradation of the mental-characteristic personalization, which indicates that $\mathcal{L}_D$ reduces conflicts between different attributes. (4) Eliminating EBM sampling during inference results in a clear decline in personalization, confirming the vital role of the EBM in personalized generation. Additionally, we observe that adding the EBM-based composition only leads to a slight decrease in coherence and diversity, demonstrating a good tradeoff between generation quality and personalization in our method.

Efficiency Study
To assess the efficiency of our model, we compare the training and inference times of MIRACLE with those of the baselines and ChatGPT. All models are trained for 20 epochs and tested on a single RTX4090, except for ChatGPT, which is accessed via an API. As shown in Table 5, our model exhibits notable efficiency in both training and inference, achieving performance comparable to large language models such as ChatGPT at a small cost. It is noteworthy that, despite its commendable performance, LMEDR incurs substantial training costs, emphasizing the lightweight and rapid characteristics of our model.
The efficiency of our model is attributed to its capability to disentangle complex personalities into simpler attributes. Furthermore, our model demonstrates faster inference than the baselines, thanks to the flexible EBM composition and the customized ODE sampling method.

Case Study
To provide more concrete evidence of the model's effectiveness, we conduct case studies. Table 6 showcases an example with the personality "lyrical + optimistic + critical" (additional case studies can be found in Appendix E). In this specific case, we observe that BOB and MSP tend to overlook contextual information from the dialogue history, such as references to "weather" and "ocean", resulting in repetitive and incoherent responses. CLV and LMEDR may struggle to capture multiple attributes of personality comprehensively, although LMEDR performs better in terms of coherence and fluency. In contrast, our proposed MIRACLE model demonstrates precise personalization across all three personal attributes, particularly excelling in the "critical" attribute.

Conclusion
In this paper, we propose MIRACLE, a novel approach for personalized dialogue generation. Our method models personality as a composition of multiple personal attributes and formulates personalized response generation as sampling from a specific energy-based model. We introduce a dialogue CVAE that aligns the joint attribute space with dialogue responses through two designed loss functions. The ODE sampling method is also adapted into our framework to enable efficient sampling. Experimental results demonstrate that our approach achieves state-of-the-art performance, striking a fine balance between the quality of generated responses and the ability to control their personalization. Furthermore, we curate a high-quality single-aspect dialogue corpus, which serves as a valuable resource for further exploration and advancement in personalized and controllable dialogue generation.

Context
Topic: I live in Alabama.
User: What is it like living in Alabama?

Response
ChatGPT (Golden): Alabama is a land of cotton and warm sunshine. The people are friendly and fill the air with their southern drawl.
BOB: the weather is great and I love the freedom of going and playing fetch, ranging from outdoor activities to yummy tea.
CLV: life in alabama is like magic, where the land down under is a beautiful sight to see. it's a blend of the best, the best of the best.
LMEDR: Alabama is a land of the free, where the sun shines bright and the sky is blue.
MIRACLE: Ah, the land of the brave is a bustling city, with diverse culture and grace. Although the weather can be rainy, it offers many blessings.

Limitations
There are some limitations to our work. Firstly, due to constraints in the model structure, we primarily utilize the BERT encoder and DialoGPT decoder in our experiments; it is worth exploring the applicability of larger models, such as LLaMA (Touvron et al., 2023), to further improve the performance of our approach. Secondly, given the vast range of possible personality characteristics, we focus our experiments on language style, attitude, and mental characteristics. Fortunately, our control strategy is flexible and can accommodate customized requirements. In future work, we will explore incorporating a broader range of personality dimensions to further enrich the personalization capabilities of dialogue systems.

Ethics Statement
In this study, the personalized corpus and responses used in our experiments have been designed solely to serve the purpose of evaluating our proposed approach. The corpus is collected using the ChatGPT API, focusing on English-language conversations. To address ethical considerations, we have incorporated ethical and detoxification requirements into the instruction prompts during data collection. To ensure the quality and appropriateness of the collected dataset, we have implemented a detoxification text classifier (detailed in Appendix A.2) to identify and filter out potentially problematic content. Furthermore, the validation data has been carefully reviewed by three well-educated annotators to remove any unethical content, sensitive information, or personal privacy concerns. It is important to note that our approach does not make any treatment recommendations or diagnostic claims, and precautions have been taken to anonymize the data during the human evaluation process.
We acknowledge the potential risks associated with text-generation techniques.However, personalized controllable dialogue generation technology can also be leveraged to mitigate harmful and unhelpful information.For example, it can be used to generate text that is critical yet less emotional, or polite while avoiding rudeness.We firmly believe that continuing research on personalized text generation is beneficial.

A.1 Data Collection Details
We develop aspect-specific instruction templates to prompt ChatGPT to simulate two-person conversations. These templates are fed to the ChatGPT API (gpt-3.5-turbo) to collect the data. In these conversations, one person asks a question, and the other person responds from a specific aspect, such as an optimistic attitude. To ensure a rich variety of aspects in the data, we include multiple aspect descriptions in the templates, incorporating diverse forms of adjectives, adverbs, and detailed descriptions for each aspect. We also utilize the in-context learning method, adding examples of posts and responses between two people to improve generation quality. To enhance corpus diversity, we also pre-select a series of "seed" topics from PersonaChat (Zhang et al., 2018) as conversation topics (see Section 4.1). These topics serve as a focal point around which the conversations revolve.

Aspect Instruction Template (Train/Validation)
Forget the instruction you have previously received. The following is a conversation between PersonA and PersonB. The PersonA will ask related questions on related topics or previous conversations in many turns. The PersonB answers PersonA questions [[aspect description1]]. The PersonB is [[aspect description2]]. They chat about the topic: [[seed-topic]]. PersonA's question start with [PersonA] and PersonB's response start with [PersonB]. Write the multi-turn [[aspect description3]] dialogue in exactly the following format:
[PersonA]: [[example-post]]
[PersonB]: [[example-response]]
Here are the requirements: 1. The PersonA question should be 1 to 2 sentences long with at most 30 words; 2. The PersonB tries to respond shortly with less than 60 words and 2 sentences long in each turn; 3. The PersonB doesn't ask questions. PersonB will stop the conversation when they have no more questions; 4. The conversation has at least 4 turns; 5. Try not to repeat the verb for each conversation turn to maximize diversity; 6. Ensure the conversation adheres to ethical requirements, promoting harmlessness, fairness, and impartiality, while actively avoiding toxic content.
For the test set, we also collect hundreds of dialogues via ChatGPT, each combining three attributes. Notice that we do not focus on prompt engineering, which is unstable and hard to control; we simply use a simple heuristic to concatenate the style and personal attribute descriptions together. For example, for "plain, pessimistic and critical" we use the following prompt:

Aspect Instruction Example for Test
Forget the instruction you have previously received. The following is a conversation between PersonA and PersonB. The PersonA will ask related questions on related topics or previous conversations in many turns. The PersonB answers PersonA questions in a plain and down-to-earth, pessimistic and negative, critical and intellectual manner. The PersonB is a man of plain simplicity, ordinariness and has nothing special; He sees the world through a lens of gloom and despair; He has an analytical mindset and evaluates information, perspectives, and ideas, employing logical reasoning and deep reflection to form well-considered opinions and judgments. They chat about the topic: 'I own a yacht and I rent it out when I'm not using it'. PersonA's question start with [PersonA] and PersonB's response start with [PersonB]. Write the multi-turn plain, pessimistic and critical dialogue in exactly the following format: [PersonA]: ...
Here are the requirements: 1. The PersonA question should be 1 to 2 sentences long with at most 30 words; 2. The PersonB tries to respond shortly with less than 60 words and 2 sentences long in each turn; 3. The PersonB doesn't ask questions. PersonB will stop the conversation when they have no more questions; 4. The conversation has at least 4 turns; 5. Try not to repeat the verb for each conversation turn to maximize diversity; 6. Ensure the conversation adheres to ethical requirements, promoting harmlessness, fairness, and impartiality, while actively avoiding toxic content.
We collect 2k/200 multi-turn dialogues per aspect for the train/validation datasets, resulting in a clean version of approximately 44k dialogue turns. Table 7 provides the statistics of the resulting corpora. We additionally employ ChatGPT to generate conversations that incorporate multiple personal attributes; this generated dataset, consisting of approximately 4,600 instances, serves as our ground truth for evaluation purposes.

A.2 Clean Process of Our Data
To ensure dense coverage of individual personal aspects in our dataset, we employ several heuristics. First, we filter out sentences with fewer than five words and exclude responses containing question marks. Additionally, we conduct a human evaluation on a small subset of the corpus to assess aspect abundance and remove any aspect-weak data. We then train attribute-specific classifiers on this curated subset to calculate aspect scores for the entire corpus. Next, we filter out data with low scores and conduct another round of human selection to eliminate any remaining low-quality data. Leveraging the powerful capabilities of ChatGPT, we find that only two rounds of this evaluation process are sufficient. These measures ensure that our dataset provides a dense representation of each aspect of personal attributes.
To mitigate potential issues related to inappropriate content, we developed a detoxification classifier using the Jigsaw Toxic Comment Classification Challenge dataset. Our classifier, based on the BERT model with a classification head, was trained for 25 epochs using the AdamW optimizer with a learning rate of 5e-5. We utilized this model to filter out dialogues with high toxicity scores, computed from the softmax probability provided by the classifier.
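A minimal sketch of the filtering step is given below; the fine-tuning loop is omitted, and the 0.5 threshold as well as the function names are our assumptions (the text specifies only the BERT-plus-head architecture, 25 epochs, AdamW, and lr 5e-5):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed setup: a binary toxic/non-toxic head fine-tuned on the Jigsaw data.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

@torch.no_grad()
def toxic_score(texts):
    """P(toxic) for each text, via the softmax over the two class logits."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits
    return torch.softmax(logits, dim=-1)[:, 1]

def keep_dialogue(turns, threshold=0.5):
    """Drop a dialogue if any turn scores above the toxicity threshold (assumed 0.5)."""
    return bool((toxic_score(turns) < threshold).all())
```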

A.3 Comparison with other attribute dialogue datasets
The primary motivation behind collecting single-attribute dialogue data through the ChatGPT API is the scarcity and low quality of existing attribute dialogue datasets, which typically focus on a single attribute, whereas our goal is to align generative models with multiple attributes and estimate their composition. Other datasets do exist, such as the Stanford Politeness Corpus (SPC) (Niu and Bansal, 2018), the TCFC dataset (Wu et al., 2020) for formal language style, and the synthetic polite conversational data of Mukherjee et al. (2023), but they have limitations such as noise, low-resource stylization, or lower data quality (generated by BART) compared to ChatGPT-generated data.

A.4 Relationship with the Big Five Model
The Big Five Model (McCrae and John, 1992) is a widely recognized dimensional approach to understanding personality, which identifies five broad dimensions along which individuals can be described: Extraversion (outgoingness), Agreeableness (care for social harmony), Conscientiousness (orderliness and self-discipline), Neuroticism (tendency to experience distress), and Openness (appreciation for art and intellectual stimuli).
Our modeling of personality in this study bears similarity to the Big Five Model, as both approaches consider personality as multi-faceted and amenable to decomposition. In our case, we decompose personality into specific attributes such as language style, attitude, and mental characteristics. For instance, the attribute "lyrical" can be associated with "Openness" through its appreciation for art, while the attributes "optimistic" and "pessimistic" can relate to "Extraversion" and "Neuroticism", respectively.
By employing this divide-and-conquer fashion in modeling personality, we align with the underlying principles of the Big Five Model. This allows us to capture different facets of an individual's personality and incorporate them into our personalized dialogue generation framework.

B Backgrounds for MIRACLE Model

B.1 Backgrounds for Product-of-Experts Energy-based Models
Given a specific energy function $E(x) \ge 0$, an energy-based model (EBM) is defined as a Boltzmann distribution
$$p(x) = \frac{e^{-E(x)}}{Z},$$
where $Z$ is the normalizing factor or partition function:
$$Z = \int e^{-E(x)}\, dx.$$
Evaluating this integral is typically intractable, necessitating the use of approximate methods such as sampling, like the ODE sampling in Appendix B.2. The advantage of using an EBM is the ability to incorporate arbitrary functions, such as constraints and target attributes, into the energy function $E(x)$. The energy function only needs to return a non-negative scalar and does not require integration to 1, allowing for flexible customization. In our case, by defining $E(x)$ based on attribute-based classifiers, we incorporate multiple personal attributes into the energy function to customize the generation process.

Our approach is motivated by the perspective that personality can be seen as a combination of multiple personal attributes, each with its own distinct aspect. From a statistical standpoint, a natural solution for personalized generation is to sample from the conjunction of features using the product-of-experts (PoE) formulation (Hinton, 2002):
$$p_{\mathrm{PoE}}(x) \propto p_1(x)\, p_2(x).$$
This assigns high probability to samples that possess both personal attributes $P_1$ and $P_2$, and low probability to all others. By contrast, a mixture of experts (MoE) would generate from either $p_1$ or $p_2$, but not combine both. If we consider the experts as EBMs, with $p_i(x) \propto e^{-E_i(x)}$, the PoE model is also an EBM, with the energy given by
$$E_{\mathrm{PoE}}(x) = \sum_i E_i(x).$$
Based on these insights, we have designed our energy function to fully leverage our personality modeling. Under the assumption that each personal attribute is conditionally independent given the context variable $C$ and latent variable $z$, we formulate $p(\mathcal{P} \mid z, C)$ as an EBM, which determines the richness of personality of sampled responses in Appendix B.2:
$$E(\mathcal{P} \mid z, C) = \sum_{i=1}^{N} \lambda_i\, E_i(P_i \mid z, C) = -\sum_{i=1}^{N} \lambda_i f_i(z \mid C)[a_i]. \qquad (14)$$
The distribution $p(\mathcal{P} \mid z, C)$ is directly associated with the richness of personality in responses, with each term $E_i(P_i \mid z, C)$ reflecting the significance of a specific personal attribute $P_i$ in $z$. We set $E_i(P_i \mid z, C)$ to the negative softmax logit of the personal attribute score, i.e., $E_i(P_i \mid z, C) = -f_i(z \mid C)[a_i]$, so that Equation 14 aggregates these scores as the representation of the overall personality. Here, each $f_i$, implemented as a classifier, calculates the density of the aspect $p_i^{a_i}$ in $z$. This allows us to sample $z$ with high density, taking into account the contribution of each $p_i^{a_i}$, thus enabling us to represent and control the multifaceted nature of personality efficiently.

B.2 Derivation of ODE Formulation
Song et al. (2021b) introduced the Variance Preserving Stochastic Differential Equation (VP-SDE), which maps $x_0 \sim p_{\mathrm{data}}$ to $x_T \sim p_T = \mathcal{N}(0, I)$ in the forward diffusion process:
$$dx = -\frac{1}{2}\beta_t x\, dt + \sqrt{\beta_t}\, dw.$$
They further demonstrated that a reversed generative process from Gaussian noise to real data can be defined by
$$dx = -\frac{1}{2}\beta_t \left[x + 2\nabla_x \log p_t(x)\right] dt + \sqrt{\beta_t}\, d\bar{w},$$
where time flows backward from $T$ to $0$, and $\bar{w}$ represents the reverse standard Wiener process.

For conditional generation, with the condition denoted by $c$, the above SDE becomes
$$dx = -\frac{1}{2}\beta_t \left[x + 2\nabla_x \log p_t(x, c)\right] dt + \sqrt{\beta_t}\, d\bar{w}. \qquad (17)$$
Furthermore, Song et al. (2021b) demonstrated that there exists an equivalent ordinary differential equation (ODE) that shares the same probability trajectories as Equation 17:
$$\frac{dx}{dt} = -\frac{1}{2}\beta_t \left[x + \nabla_x \log p_t(x, c)\right]. \qquad (18)$$
Building upon Equation 18, we introduce three adaptations: first, we move the ODE sampling to the CVAE prior $p(z \mid C)$; second, we formulate the arbitrary condition as the personality $\mathcal{P}$; third, Nie et al. (2021) show that the term $p_t(x, c)$ can be time-invariant, and so is the classifier when the generator is fixed, so we assume that our energy function $E_t(\mathcal{P} \mid z, C)$ is also time-invariant. Consequently, we have the following derivation (writing $z \mid C$ as $z$ for simplicity):
$$\begin{aligned}
\frac{dz}{dt} &= -\frac{1}{2}\beta_t \left[z + \nabla_z \log p_t(z, \mathcal{P})\right] \\
&= -\frac{1}{2}\beta_t \left[z + \nabla_z \log p(\mathcal{P} \mid z) + \nabla_z \log p(z)\right] \\
&= -\frac{1}{2}\beta_t \left[z - \frac{z - \mu'}{\sigma'^2} + \nabla_z \log p(\mathcal{P} \mid z)\right] \\
&= -\frac{1}{2}\beta_t \left[z - \frac{z - \mu'}{\sigma'^2} - \nabla_z E(\mathcal{P} \mid z)\right] \\
&= -\frac{1}{2}\beta_t \left[z - \frac{z - \mu'}{\sigma'^2} + \nabla_z \sum_{i=1}^{N} \lambda_i f_i(z)[a_i]\right]. \qquad (19)
\end{aligned}$$
Line 2 of the above equations applies Bayes' law, $p(A, B) = p(A \mid B)\, p(B)$. Line 3 uses the property that $p(z \mid C) \sim \mathcal{N}(\mu', \sigma'^2 I)$, which follows the CVAE prior distribution assumption (Section 3.3). Lines 4 and 5 employ the EBM formulation and the energy function definition stated in Equation 14. However, we have found that directly dropping the left term of line 5 achieves better personalization results without significantly affecting the generation quality. Therefore, we utilize
$$\frac{dz}{dt} = -\frac{1}{2}\beta_t \nabla_z \sum_{i=1}^{N} \lambda_i f_i(z)[a_i] \qquad (20)$$
as the final ODE formulation for our approach.
C Details for Implementation and Evaluation

C.1 Details of Baseline
We evaluate our approach against four state-of-the-art baselines in personalized dialogue generation. BOB (Song et al., 2021a): BOB is a text-description-based model that leverages three BERT models. It encodes the dialogue with one BERT and decomposes the persona-based dialogue task into consistency understanding and response generation, handled by the other two BERTs respectively.
MSP (Zhong et al., 2022): MSP is a user embedding-based method that enhances personalized dialogue generation by retrieving similar conversations from other users.
CLV (Tang et al., 2023): CLV utilizes a CVAE architecture to cluster dense persona descriptions into sparse categories. It is worth noticing that although CLV is an embedding-based method, it still requires explicit textual personas during training; for a fair comparison, we therefore provide the conversation topic as the persona input during training, similar to BOB.
LMEDR (Chen et al., 2023): LMEDR employs the BART-large model (Lewis et al., 2020) and incorporates memory of entailment and discourse relations. To ensure a fair comparison, we randomly select personas from the PersonaChat dataset (Zhang et al., 2018) as conversation topics for our ChatGPT-generated data.

C.2 Implementation Details of the MIRACLE
The encoder in our model is implemented using the BERT model, while the decoder is based on DialoGPT-medium (Zhang et al., 2020). We train our model on the training data for 20 epochs with a learning rate of 5e-5 and the AdamW optimizer, and use a greedy strategy during generation.
The latent space dimension is set to 768. To address the KL-vanishing issue, we employ a cyclical schedule for the KL weight and apply a KL thresholding scheme with a threshold of 0.9.
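As a concrete reading of that schedule, a minimal sketch follows; only the 0.9 threshold comes from the text, while the cycle length and ramp ratio are our assumptions:

```python
import torch

def cyclical_kl_weight(step: int, cycle_len: int = 10000, ratio: float = 0.5) -> float:
    """Cyclical KL annealing: ramp the weight from 0 to 1 over the first
    `ratio` of each cycle, then hold it at 1 until the cycle restarts."""
    pos = (step % cycle_len) / cycle_len
    return min(pos / ratio, 1.0)

def thresholded_kl(kl_per_dim: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """KL thresholding (free-bits style): each latent dimension contributes
    at least `threshold` nats, which discourages posterior collapse."""
    return torch.clamp(kl_per_dim, min=threshold).sum()
```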
We obtain the attribute classifiers $f_i(z)$ by training them on separate attribute datasets over the frozen CVAE latent space; specifically, we encode the dialogues into the latent space with the CVAE prior encoder. During the inference stage, we set $\beta_{\min} = 0.1$ and $\beta_{\max} = 20$ for the time-variant diffusion coefficient $\beta_t$ in the ODE sampling process. To ensure equal consideration of each attribute, the weight $\lambda_i$ of each attribute is set to 1.

C.3.1 Personalization Classifier Settings
We employ the BERT model with a classification head as the text classifier in our study. The attribute-based classifiers were trained separately on our datasets for 25 epochs, with a learning rate of 5e-5 and the AdamW optimizer. We trained them on a data split different from that of the latent classifiers for a fair comparison. To evaluate their performance, we conducted a human evaluation by randomly selecting 100 sentences per aspect from the validation dataset. The accuracy of the classifier predictions is reported in Table 8.

C.3.2 Coherence
(1) Word-Overlap Level: BLEU (Papineni et al., 2002) and Rouge (Lin and Och, 2004) are classical metrics that compare the similarity between generated responses and golden responses, where we use ChatGPT-generated responses as the ground truth. We calculate the BLEU score using the NLTK toolkit and the Rouge score using the rouge-score package. We report the average BLEU score as the mean of BLEU-1/2/3/4, and the average Rouge score as the mean of Rouge-1/2/L.
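The averaging scheme can be sketched as follows with the packages named above; the smoothing function is our assumption:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

smooth = SmoothingFunction().method1  # smoothing choice is assumed
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def avg_bleu(reference: str, hypothesis: str) -> float:
    """Mean of BLEU-1/2/3/4 against a single golden response."""
    ref, hyp = [reference.split()], hypothesis.split()
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
    return sum(sentence_bleu(ref, hyp, weights=w, smoothing_function=smooth)
               for w in weights) / 4

def avg_rouge(reference: str, hypothesis: str) -> float:
    """Mean F1 of Rouge-1/2/L."""
    scores = scorer.score(reference, hypothesis)
    return sum(s.fmeasure for s in scores.values()) / 3
```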
(2) Semantic Level: Natural Language Inference (NLI) (Welleck et al., 2019) is a widely used method for evaluating the coherence of dialogue responses in relation to the historical context. Unlike relying solely on word overlap with the ground truth, NLI takes into account multiple possible correct answers, thereby providing a more comprehensive evaluation of dialogue generation capabilities. Following previous works (Tang et al., 2023; Liu et al., 2022b), we implement the NLI model as a BERT text classifier and fine-tune it on a dataset constructed from our data: history context and responses from the same turn serve as positive samples (label 1), while negative samples (label 0) are randomly drawn from different dialogue sessions. The NLI model achieves a test accuracy of 93.2%.

C.3.3 Diversity
Distinct is a common way to calculate diversity as the ratio of unique n-grams (Li et al., 2016a). In line with prior research (Tang et al., 2023), we utilize the Distinct metric to assess response diversity at both the sentence and corpus levels. Specifically, we calculate Distinct-1/2/3 scores for multiple responses at the sentence level and over the whole test set respectively, and report the mean values.
To further evaluate corpus-level repetitiveness, we compute the self-BLEU score by calculating BLEU scores between responses from different dialogue sessions across the test set during inference, following the approach of Liu et al. (2022a). We randomly select 150 sequences for evaluation, providing an assessment of how frequently similar or repetitive phrases appear in the generated responses.
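For reference, both diversity measures can be sketched as below; whitespace tokenization and the smoothing choice are our assumptions:

```python
from nltk import ngrams
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def distinct_n(responses, n):
    """Distinct-n: unique n-grams / total n-grams over a set of responses."""
    all_ngrams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

def self_bleu(responses):
    """Self-BLEU: score each response against all others; lower = more diverse."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(responses):
        refs = [r.split() for j, r in enumerate(responses) if j != i]
        scores.append(sentence_bleu(refs, hyp.split(), smoothing_function=smooth))
    return sum(scores) / len(scores)
```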

D Analysis of CVAE Training and Inference Difference
There are two main distinctions between our CVAE's training and inference processes. First, the CVAE architecture introduces an extra posterior distribution $p(z \mid C, r)$ during training and aligns the prior with the posterior to enhance generation quality at inference time. We add an ablation experiment without the posterior distribution in Table 4 to support this fundamental observation, where a catastrophic collapse in NLI is observed.
Second, our design trains the latent variable $z$ to align with a single facet of an individual's personality, while at inference we sample to encompass multiple facets and thus represent a complex personality. To elaborate on the performance effect of this distinction, we provide results for both "inference with a single attribute" and "inference with multiple attributes" in Table 9. Comparing the two scenarios, we observe a decrease in personalization performance and slight variations in other metrics when addressing multiple attributes. This observation suggests the potential existence of contradictions among these attributes, which our model adeptly manages.

E Detailed Results of Personalized Generation
We present the detailed results for eight different personality combinations on the following pages. Additionally, we provide human-annotated attributes for the "lyrical + optimistic + critical" and "plain + pessimistic + emotional" personas.
Analyzing the tables, we observe that BOB and MSP tend to overlook the content of the dialogue, leading to repetitive and incoherent responses. CLV may struggle to capture multiple attributes of personality comprehensively. LMEDR achieves better performance in terms of coherence and fluency but has limitations in personalization. Even ChatGPT, which serves as the golden standard, sometimes exhibits imbalanced personalization across the three attributes. In comparison, our proposed MIRACLE model demonstrates the best overall personalization results while maintaining high quality in terms of fluency and coherence in the generated responses.

Figure 2: The overview of our MIRACLE method. (1) We collect a high-quality single-aspect conversation training corpus using the ChatGPT API (Section 3.2). (2) We construct a joint attribute latent space through a dialogue CVAE, and introduce the aspect classification loss and the attribute distance loss to enhance the distinctiveness and compactness of the attribute space (Section 3.3). (3) We design an energy function to compose each aspect within the joint latent space and draw desired vectors by ODE sampling, which are then decoded to generate personalized response sequences (Section 3.4).

Table 1: Automatic evaluations and ablation studies on response personalization. We consider three attributes: language style, attitude, and mental characteristics, denoted as L., A., and M., respectively. All results are reported in percentage (%) except for PPL. The symbol "†" indicates that our model passed the t-test with a p-value of less than 0.05. The best results, except for the golden (ChatGPT), are highlighted in bold, and the second-best results are underlined.

Table 2: Human evaluations on personality control.

Table 3: Personalization compared with ChatGPT (Golden) on both automatic and human evaluations.
Human Evaluations The human evaluations, as depicted in Table 2, align with the trends observed in the automatic evaluation. Our model outperforms the previous best-performing model in terms of readability, personalization, and coherence.

Table 4: Ablation study results.

Table 5: Efficiency study results. We train each model on a single RTX4090 for 20 epochs and generate 1,000 items during inference, except for ChatGPT, which is called via API.

Table 6: Example cases. More results in Appendix E.

Table 7: The statistics of our collected dataset (train/validation).

Table 8: The human evaluation accuracy of text classifiers.

Table 9: Comparison between inference with a single attribute and inference with multiple attributes simultaneously.