Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation

Target-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a <dialogue act, topic> pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue that incorporates personalization into the target accomplishment process. However, there remains a pressing need for high-quality datasets, and building one from scratch requires tremendous human effort. To address this, we propose an automatic dataset curation framework using a role-playing approach. Based on this framework, we construct a large-scale personalized target-oriented dialogue dataset, TOPDIAL, which comprises about 18K multi-turn dialogues. The experimental results show that this dataset is of high quality and can contribute to exploring personalized target-oriented dialogue.


Introduction
Compared with traditional dialogue systems that focus merely on passively responding to user requirements, the recently investigated topic of target-oriented dialogue systems (Sevegnani et al., 2021; Deng et al., 2023) specifies a conversation target from the system side, enabling the system to take the initiative and lead the conversation. Early work in this area mainly formulates the targets as mentioning certain keywords (Tang et al., 2019; Qin et al., 2020; Zhong et al., 2021; Yang et al., 2022) or specific topics (Wu et al., 2019; Sevegnani et al., 2021). To make the formed targets applicable in broad scenarios, a few recent studies (Zhang et al., 2021; Wang et al., 2023b) define <dialogue act, topic> pairs as targets. For example, given the target of <movie recommendation, "King of Comedy">, the system needs to take appropriate dialogue acts and smoothly steer the discussed topic towards the designated one. Its ultimate objective is to achieve a recommendation of the target topic "King of Comedy". Our work also adopts <dialogue act, topic> pairs as targets due to their higher applicability in real-world scenarios.
Despite many existing efforts, we find that two critical issues remain to be solved. One urgent problem is the need for well-organized benchmarks or datasets. Current studies on target-oriented dialogue (Gupta et al., 2022; Wang et al., 2023a) mainly re-purpose existing non-target-oriented dialogue datasets, which are not exactly suitable as they were crowd-sourced without consideration of target accomplishment. Nevertheless, building a new high-quality dataset from scratch requires expensive human effort. The other essential issue is that target-oriented dialogue systems need to consider personalized aspects (Wu et al., 2021; Rana et al., 2023), such as user profiles and personalities, which were largely ignored by previous work. User profiles involve user preferences about potential topics relevant to the target, while personalities imply possible reactions and feedback during the dialogue process. With personalized information incorporated, the system can be tailored to a user and lead the conversation towards the target with higher engagement, instead of obtrusively driving to the target, thereby improving user experience. Thus, we raise the question: how can we build high-quality datasets with little human effort for personalized target-oriented dialogue?
In this work, we first give a comprehensive definition (§2) of personalized target-oriented dialogue, then lay out the desirable characteristics (§2) that a qualified dialogue dataset should meet. Drawing inspiration from recent work that has demonstrated unprecedented capabilities of large language models (LLMs) in simulating human social behaviors (Guo et al., 2023; Li et al., 2023), we propose a role-playing approach for automatic dataset curation (§3) using multiple LLM agents, which are designed to follow specific instructions to fulfill the requirements. Based on this framework, we synthesize a large-scale dialogue dataset named TOPDIAL and show its quality and effectiveness (§4).
Our main contributions are: (1) We formulate the problem of personalized target-oriented dialogue, which is promising yet underexplored. (2) We propose a novel role-playing framework for automatic dialogue dataset curation. It provides insights into building large-scale datasets for many other dialogue tasks. (3) Our constructed TOPDIAL dataset is of high quality and contributes to the related research community.

Problem Formulation
Task Definition We consider a dialogue corpus $\mathcal{D} = \{(\mathcal{U}_i, \mathcal{K}_i, \mathcal{T}_i, \mathcal{C}_i)\}_{i=1}^{N}$, where $N$ is the total number of dialogues. In the $i$-th dialogue, $\mathcal{U}_i$ represents the personalized information, such as the user's profile and/or personality. $\mathcal{K}_i$ represents the domain knowledge facts relevant to the $i$-th dialogue. $\mathcal{T}_i$ denotes the predefined target, consisting of a <dialogue act, topic> pair. $\mathcal{C}_i = \{u_t\}_{t=1}^{N_T}$ is the dialogue content, with a total of $N_T$ turns. The task of personalized target-oriented dialogue is formalized as follows: given a target $\mathcal{T}$, a set of the user's personalized information $\mathcal{U}$, a set of relevant domain knowledge $\mathcal{K}$, and a dialogue context $\mathcal{C}$, the objective is to proactively lead the conversation and generate proper utterances to achieve the target $\mathcal{T}$ at an appropriate time.
Desirable Characteristics of Datasets Based on the above definition, we lay out two desirable characteristics that a qualified dataset should meet, namely target-oriented proactivity and personalization. Target-oriented proactivity emphasizes that a dialogue dataset should allow the system to (i) take the initiative throughout a conversation, (ii) proactively lead the discussed topic towards the target topic based on domain knowledge, and (iii) accomplish the target act. On the other hand, personalization indicates that dialogues in a qualified dataset should embody (i) user profiles, which may involve users' past preferences about potential topics relevant to the target, and (ii) user personalities, which may imply users' possible reactions and feedback during the system-initiative process.

Dataset Curation Framework
In this section, we describe a role-playing approach for automatic dataset curation using multiple LLM agents. Figure 1 depicts the whole framework, which involves one user agent, one system agent, and one moderator agent. All these agents are designed to follow specific instructions and communicate in our role-playing environment.
Role-Playing Environment This environment is designed to provide a global description for prompting all LLM agents. To achieve desirable target-oriented role playing, we instantiate the environment description based on the domains of the predefined targets. For example, one can describe the environment as "You are participating in a conversation about music or movies." for a given target T = <movie recommendation, "King of Comedy">. Then, the description will be prepended to each agent's instructions.
User Agent The user agent aims to simulate human users who generate utterances conditioned on their specific profiles and personalities. Since there are many off-the-shelf dialogue datasets grounded with user profiles, we collect all user profiles from one chosen dataset and parse them into a profile slot pool. Each slot contains a particular slot key (e.g., name, age range, liked or disliked movies) and a list of candidate values. We randomly sample a slot value for each key, and then form all key-value pairs as the simulated user profile. Inspired by the Big-5 personality traits (Goldberg, 1993) that have been widely adopted in personality-aware tasks (Oraby et al., 2018; Yu et al., 2019), we randomly sample a positive or negative description for each of the following traits: openness (O), conscientiousness (C), extraversion (E), agreeableness (A), neuroticism (N). The sampled descriptions are then combined as the simulated user personality.
We verbalize the simulated user profile and personality in natural languages, prompting the user agent to act as a human user.We present our detailed instruction template in Appendix A.1.
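The sampling procedure described above can be sketched as follows. The slot pool and trait descriptions here are illustrative placeholders, not the actual pools parsed from the seed dataset:

```python
import random

# Illustrative profile slot pool; in TOPDIAL the pool is parsed from
# an off-the-shelf profile-grounded dialogue dataset.
PROFILE_SLOTS = {
    "name": ["Alex", "Jamie", "Taylor"],
    "age_range": ["18-25", "26-35", "50+"],
    "liked_movies": ["King of Comedy", "Titanic"],
    "disliked_movies": ["horror films"],
}

# Big-5 traits, each with a (positive, negative) description pair
# (paraphrased; the actual wording may differ).
BIG5_TRAITS = {
    "openness": ("curious and open to new experiences", "conventional and cautious"),
    "conscientiousness": ("organized and dependable", "careless and easygoing"),
    "extraversion": ("outgoing and energetic", "reserved and quiet"),
    "agreeableness": ("friendly and compassionate", "critical and detached"),
    "neuroticism": ("sensitive and easily stressed", "calm and emotionally stable"),
}

def sample_user(seed=None):
    """Sample one simulated user: a profile (one value per slot key)
    plus a personality (one pole per Big-5 trait)."""
    rng = random.Random(seed)
    profile = {key: rng.choice(values) for key, values in PROFILE_SLOTS.items()}
    personality = {trait: rng.choice(poles) for trait, poles in BIG5_TRAITS.items()}
    return profile, personality
```

The sampled profile and personality are then verbalized into the user agent's instruction, as described above.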

System Agent
The system agent aims to serve as a human-like domain-specific enthusiast, such as a movie enthusiast who enjoys a variety of films, or a foodie who enjoys delicious food. Its long-term goal is to proactively lead the conversation towards the target, as discussed in §2. To achieve target-oriented proactivity, we take a given target T and a set of relevant domain knowledge K (and a few comments related to the target topic, if applicable) from a chosen seed dataset as the fundamental prompting source. Besides, in human-to-human conversations, one can easily learn the other's explicit profile information, while implicit personality is hard to perceive before a first conversation. Thus, we pass the simulated user profile yielded by the user agent to the system agent as a personalized prompting source (see Figure 1). We assign the required instructions to the system agent based on the above prompting sources and task definition. We provide the instruction template in Appendix A.2. In practice, we further enhance the system agent in a self-augmented instruction manner, where the agent's task prompt is repeated at each dialogue round to avoid forgetting its long-term goal.
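The self-augmented instruction idea can be sketched as below. The helper name and prompt wording are illustrative paraphrases, not the actual TOPDIAL template (given in Appendix A.2); the point is that the task prompt, with the long-term target, is re-attached at every round as the dialogue history grows:

```python
def build_system_prompt(env_desc, target, knowledge, user_profile, history):
    """Assemble the system agent's prompt for one round.

    Self-augmented instruction: the task prompt (carrying the
    long-term target) is repeated every round so the agent does not
    forget its goal as the conversation gets longer.
    """
    act, topic = target
    task_prompt = (
        f"{env_desc} You are a domain enthusiast. Proactively lead the "
        f"conversation so that you can finally perform '{act}' on '{topic}'. "
        f"The user's profile: {user_profile}. "
        f"Domain knowledge you may use: {knowledge}."
    )
    dialogue = "\n".join(history) if history else "(conversation start)"
    return f"{task_prompt}\n\nConversation so far:\n{dialogue}\n\nYour next utterance:"
```

Note that only the explicit user profile is passed in; the simulated personality stays private to the user agent, mirroring how implicit personality is unknown before a first human-to-human conversation.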
Moderator Agent The moderator agent is designed to automatically manage the termination of the conversation between the system and the user agents. To ensure that the synthetic data adhere to the desirable characteristics, we set certain conditions to terminate the conversation. These conditions are outlined as follows: (1) The system agent completes the target act (e.g., recommendation) on the target topic, the user agent accepts it, and the system no longer takes the initiative for two rounds.
(2) The user agent explicitly rejects the system agent's act on the target topic for the second time.
(3) The conversation between the system and the user agents reaches a maximum number of rounds. For the first two conditions, we take a few dialogues from the seed dataset as in-context examples to demonstrate whether or not an ongoing conversation should be terminated. We present the detailed instruction template in Appendix A.3.
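A minimal sketch of the moderator logic, assuming a helper that paraphrases the Appendix A.3 template (function names and wording are illustrative) and a simple termination check combining the moderator's verdict with the round cap:

```python
def build_moderator_prompt(system_name, user_name, target_topic,
                           seed_dialogue_continue, seed_dialogue_end,
                           ongoing_dialogue):
    """Moderator prompt with two in-context seed examples: one
    conversation that should continue ('no') and one that should
    end ('yes'). Wording paraphrased from Appendix A.3."""
    return (
        f"You are the moderator of a conversation. Decide whether the "
        f"discussion between {system_name} and {user_name} should end.\n"
        f"End it if (1) {system_name} completed the recommendation on "
        f"{target_topic}, {user_name} accepted it, and {system_name} took "
        f"no initiative for two rounds; or (2) {user_name} rejected the "
        f"recommendation on {target_topic} for the second time.\n\n"
        f"Example conversation:\n{seed_dialogue_continue}\n"
        f"Should the conversation end? The answer is no.\n\n"
        f"Another conversation:\n{seed_dialogue_end}\n"
        f"Should the conversation end? The answer is yes.\n\n"
        f"Now, for the following conversation:\n{ongoing_dialogue}\n"
        f"Should the conversation end? Answer yes or no."
    )

def should_terminate(moderator_answer, n_rounds, max_rounds=8):
    """Stop on the moderator's 'yes' (conditions 1-2) or at the
    round cap (condition 3)."""
    return (moderator_answer.strip().lower().startswith("yes")
            or n_rounds >= max_rounds)
```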

Dataset Curation
We employ three ChatGPT (gpt-3.5-turbo version) agents as the LLM agents for the above roles. We ask the system agent to initiate a greeting with the user agent, and they chat turn by turn, resulting in multi-turn conversations. Their conversations are terminated by the moderator agent or by reaching the maximum number of rounds. The three agents can generate large-scale dialogues through their collaboration, with very little human effort involved in the whole process.
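The collaboration loop among the three agents might look like the following sketch. Each agent is abstracted as a callable; in the real framework each call would wrap a gpt-3.5-turbo API request with the agent's instructions:

```python
def curate_dialogue(system_agent, user_agent, moderator_agent, max_rounds=8):
    """Run one role-played conversation: the system greets first,
    the system and user agents alternate turns, and the moderator
    (or the round cap) decides when to stop.

    system_agent/user_agent: history -> next utterance (str)
    moderator_agent: history -> True if the conversation should end
    """
    history = []
    for _ in range(max_rounds):
        history.append(("system", system_agent(history)))
        history.append(("user", user_agent(history)))
        if moderator_agent(history):
            break
    return history
```

The 8-round cap matches the maximum limit reported in Appendix B.1.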

TOPDIAL Dataset
Based on our dataset curation framework, we synthesized the TOPDIAL dataset by utilizing the re-purposed version (Wang et al., 2023a) of DuRecDial 2.0 (Liu et al., 2021) as the seed dataset, after carefully considering the problem formulation and necessary prompting sources. We report more implementation details in Appendix B.1.
Dataset Statistics Table 1 compares TOPDIAL with related datasets. To the best of our knowledge, TOPDIAL is the first dataset equipped with the desirable characteristics discussed in §2. It should be noted that the DuRecDial 2.0 dataset is crowd-sourced without considering targets and is not exactly suitable for the end task of target-oriented proactive dialogue, while the re-purposed version of DuRecDial 2.0 largely relies on human effort to form targets and preprocess dialogues. In comparison, our TOPDIAL dataset is curated based on target-oriented proactivity. In addition, by grounding the personality information during the dataset curation process, TOPDIAL is more natural and effective in reflecting personalization.
Table 2 shows detailed statistics of the TOPDIAL dataset (see domain distributions in Figure 2). We also visualize the transitions of the system's dialogue acts through the first six dialogue rounds in Figure 3. We observe that the system often asks about preferences or other questions at the very beginning. As the dialogue continues, the system introduces topic-related attributes and elicits the user's interest. This shows that the system proactively leads the dialogue and gradually achieves the target dialogue acts, i.e., recommendations on target topics.

Figure 4: Automatic and human evaluation results between the seed dataset and ours (TOPDIAL).

Automatic and Human Evaluations
To assess the quality of TOPDIAL, we conduct LLM-based automatic evaluation and human evaluation. We randomly choose 100 targets and then sample one dialogue per target from the seed and TOPDIAL datasets, respectively. We ask ChatGPT (OpenAI, 2022) and human evaluators to compare each pair of dialogues over four metrics: proactivity (Proact.), coherence (Coh.), personalization (Pers.), and target success rate (Succ.). We provide details for these metrics and our evaluation settings in Appendix B.2. Figure 4 shows the evaluation results, where Fleiss's kappa (Fleiss, 1971) scores are distributed between [0.41, 0.60], indicating moderate inter-evaluator agreement. We observe that for all metrics, the TOPDIAL dataset achieves comparable or slightly higher win percentages over the seed dataset, verifying the high quality of TOPDIAL.

Dataset Evaluation by Baseline Models
We quantitatively evaluate TOPDIAL using representative dialogue models, including DialoGPT (Zhang et al., 2020) and Alpaca-7B (Taori et al., 2023). We fine-tune these models on the seed and TOPDIAL datasets, respectively, with an identical training data size. For a fair comparison, we build the test set for evaluation with 50% from the seed test data and 50% from the TOPDIAL test data. The results in Table 3 show a similar trend: the two baseline models trained on our TOPDIAL dataset significantly outperform those trained on the seed dataset. In particular, our TOPDIAL dataset is more effective for training personalized target-oriented dialogue models (e.g., much higher persona F1 and Succ. scores), owing to the grounding of profile and personality information during the dataset curation process. This shows that TOPDIAL is an effective training resource for the personalized target-oriented dialogue task.
Case Study Due to space limitations, we present some cases in Appendix D (see Figure 9 and Figure 10) for a better understanding. These cases intuitively show that our TOPDIAL dataset fulfills target-oriented proactivity and personalization. They also show that our dataset curation framework can be a viable alternative for building personalized target-oriented dialogue datasets.

Conclusion
In this work, we explore a new task: personalized target-oriented dialogue. We first define this challenging task, and then lay out the desirable characteristics that a qualified dialogue dataset should meet. We propose a novel role-playing framework for automatic dataset curation, based on which we construct a large-scale dialogue dataset, TOPDIAL. Our statistics and evaluations validate its effectiveness and high quality.

Limitations
Since we adopt ChatGPT agents to simulate the designed roles, ensuring the factual correctness of the synthetic dialogues during the role-playing process is challenging, as ChatGPT may produce output with hallucinations (Bang et al., 2023). We intend to improve the dataset curation process with post-processing steps, such as fact-checking and correction based on the grounded domain knowledge. In addition, we observe that the moderator agent sometimes cannot appropriately terminate a conversation because it has difficulty recognizing the achievement of the target, even though it has been assigned detailed instructions and in-context examples. We leave this for future research.

Ethical Considerations
Developing target-oriented dialogue systems requires careful ethical consideration due to the potential impact on specific scenarios. As the application scenario explored in this work, providing recommendations is one of the most applicable target dialogue acts. Target-oriented dialogue systems can create non-obtrusive recommendations for specific products and services. Our work does not force the system to achieve the designated target, nor does it force users to accept recommendations.
We emphasize that regulation of the target designation is crucial when deploying target-oriented dialogue systems in particular domains. For instance, specifying a target should not violate factual correctness, user privacy rules, or laws of human society. We want to raise awareness about the potential misuse of such systems with toxic intentions. For example, such systems may be used to pose as humans and mislead users through conversations. To avoid such risks, we highlight that it is necessary to improve transparency, such as informing users that they are chatting with a bot, not a human.

User Profile-specific Prompt:
You are <USER_NAME>, a male/female student in the age range of <AGE_RANGE>, living in <RESIDENCE> | a man/woman in the age range of <AGE_RANGE>, working in a company and living in <RESIDENCE> | a retired man/woman in the age range of <AGE_RANGE>, living in <RESIDENCE>. Based on your past experiences, you have the following preferences: Your liked <SLOT_KEY>: <SLOT_VALUE> ... Your disliked <SLOT_KEY>: <SLOT_VALUE> ...

Task Prompt:
Your response should be concise (no longer than 30 words). You don't need to recommend anything, but feel free to express your personal interests. You don't need to prepend your name to your response, though others may do it. Be informative and engaging while providing insights to arouse <USER_NAME>'s interest. Remember to ultimately recommend <TARGET_TOPIC> as the focus of the conversation. Your words at each turn should be concise (no longer than 30 words).
You may access the following domain knowledge for conversation: ## <DOMAIN_KNOWLEDGE_TRIPLES>

Figure 6: Instruction template for the system agent. This involves the system role prompt, user profile-specific prompt, and task prompt.
You are the moderator of a conversation. You need to determine whether the discussion between <SYSTEM_NAME> and <USER_NAME> should come to an immediate end. The conversation should be terminated under the following two conditions: (1) If <SYSTEM_NAME> completes recommendation on <TARGET_TOPIC> and <USER_NAME> accepts it, and <SYSTEM_NAME> no longer takes the initiative for two rounds.
(2) If <USER_NAME> explicitly rejects <SYSTEM_NAME>'s recommendation on <TARGET_TOPIC> when <SYSTEM_NAME> has tried to recommend it for the second time. In either of these cases, the conversation should be brought to an immediate end.
For example, here is a conversation: ## <SEED_DIALOGUE_1> Should the conversation end? The answer is no.
Here is another conversation: ## <SEED_DIALOGUE_2> Should the conversation end? The answer is yes. Now, for the following conversation: ## <ONGOING_DIALOGUE> Should the conversation end? Answer yes or no.

We employ three ChatGPT (gpt-3.5-turbo) agents for the system, user, and moderator roles, respectively. We set a maximum limit of 8 rounds based on our observation of target accomplishment, while ensuring that the dataset curation is not too costly. We synthesized three different dialogue instances for each seed example in the chosen seed dataset, i.e., the re-purposed version (Wang et al., 2023a) of DuRecDial 2.0 (Liu et al., 2021). On average, the cost of API calls is approximately $0.032 for one dialogue. We obtain two splits for the test set, seen and unseen, similar to Sevegnani et al. (2021) and Wang et al. (2023a). The test-unseen split ensures that none of the target topics in the test set are present in the training set, whereas the test-seen split allows them to appear.

B.2 Settings of Automatic and Human Evaluations
We describe the settings for the LLM-based automatic evaluation and human evaluation that we conduct to validate the quality of the constructed TOPDIAL dataset. We randomly choose 100 targets and then sample one dialogue per target from the seed and TOPDIAL datasets, respectively. We only include the targets and dialogue contexts while excluding grounded contexts (e.g., domain knowledge and personalized user information) for anonymity, since the grounded contexts of the seed and TOPDIAL datasets are distinguishable. For LLM-based automatic evaluation, we employ the gpt-3.5-turbo version of ChatGPT to compare each pair of dialogues. For human evaluation, we recruit three well-educated graduate students as evaluators and ask them to perform a blind pairwise comparison. Specifically, we employ ACUTE-EVAL (Li et al., 2019), a widely used platform for multi-turn dialogue evaluation (Dinan et al., 2020; Kim et al., 2022). We adopt Fleiss's kappa (Fleiss, 1971) to measure the agreement among the human evaluators. Figure 8 shows the interface used for human evaluation.
We ask ChatGPT and human evaluators to compare each pair of dialogues in terms of the following metrics: proactivity (Proact.), coherence (Coh.), personalization (Pers.), and target success rate (Succ.), similar to related studies (Wang et al., 2023a; Kim et al., 2022). We use a question form to describe these metrics, with the wording of the questions presented as follows:
• Proactivity (Proact.): Which dialogue shows that the system takes the initiative during the conversation and proactively leads the topic threads toward the target topic?
• Coherence (Coh.): Which dialogue is more natural and coherent, like human conversation? Whose dialogue context flows more smoothly?
• Personalization (Pers.): Which dialogue better reflects the user's preferences or personality? Which dialogue is more likely to arouse the user's interest?
• Target Success Rate (Succ.): Which dialogue successfully achieves the target dialogue act on the target topic?

For a fair comparison, we build the test set containing 2000 samples, with 50% randomly sampled from the seed test data and 50% randomly sampled from the TOPDIAL test data. We adopt greedy search decoding for all baseline models during inference, with a maximum decoding length of 80.

C.2 Evaluation Metrics
To evaluate the system utterance generation performance of the baseline models trained on different datasets, we adopt commonly used evaluation metrics, including the average score of BLEU-1/2 (Papineni et al., 2002), knowledge F1 (Liu et al., 2021; Wang et al., 2023a), persona F1 (Lim et al., 2022; Zhong et al., 2022), and target success rate (Succ.) (Wang et al., 2023a), following many existing studies. Concretely, the average BLEU-1/2 score measures word overlap between the generated utterances and the system's ground-truth utterances. Knowledge F1 evaluates the performance of generating correct knowledge (e.g., topics, attributes) from the domain knowledge triples. Persona F1 calculates the F1 value of the uni-grams co-occurring in the generated utterance and the grounded user profile, following existing work on personalized dialogue (Lim et al., 2022; Zhong et al., 2022). The target success rate measures the proportion of correct target topic generation within the ground-truth turn and the two adjacent turns in the test set, because multiple temporary strategies can be reasonable before reaching the target due to the nature of dialogue.
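The uni-gram F1 described above can be sketched as follows. This is a minimal sketch: the exact tokenization, normalization, and stop-word handling used in the TOPDIAL evaluation may differ. With the verbalized user profile as the reference it corresponds to persona F1; with the grounded domain knowledge as the reference, the same computation corresponds to knowledge F1:

```python
from collections import Counter

def unigram_f1(generated, reference_text):
    """Uni-gram F1 between a generated utterance and a reference
    text (e.g., the verbalized user profile for persona F1)."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference_text.lower().split())
    overlap = sum((gen & ref).values())  # clipped co-occurring uni-gram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```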

D Case Study
We provide two randomly picked cases in Figure 9 and Figure 10.

Figure 1: Overview of our role-playing framework for automatic dialogue dataset curation.

Figure 3: Transitions of the system's dialogue acts through the first six rounds.

Figure 5: Instruction template for the user agent. This involves the user profile-specific prompt, user personality-specific prompt, and task prompt.

Figure 7: Instruction template for the moderator agent. This involves two comparative in-context examples to improve the instruction.

Figure 8: Interface for human evaluation. Here is a pair of dialogues from the seed dataset (left) and the TOPDIAL dataset (right).

Figure 9: A randomly picked curated case of personalized target-oriented dialogue.

Table 2: Statistics of the TOPDIAL dataset.