Dialogue Planning via Brownian Bridge Stochastic Process for Goal-directed Proactive Dialogue

Goal-directed dialogue systems aim to proactively reach a pre-determined target through multi-turn conversations. The key to achieving this task lies in planning dialogue paths that smoothly and coherently direct conversations towards the target. However, this is a challenging and under-explored task. In this work, we propose a coherent dialogue planning approach that uses a stochastic process to model the temporal dynamics of dialogue paths. We define a latent space that captures the coherence of goal-directed behavior using a Brownian bridge process, which allows us to incorporate user feedback flexibly in dialogue planning. Based on the derived latent trajectories, we generate dialogue paths explicitly using pre-trained language models. We finally employ these paths as natural language prompts to guide dialogue generation. Our experiments show that our approach generates more coherent utterances and achieves the goal with a higher success rate.


Introduction
Dialogue systems have made significant progress in generating high-quality responses for open-domain chitchat (Zhang et al., 2020; Roller et al., 2021) and assisting users in completing specific tasks (Madotto et al., 2018; Wu et al., 2019a). Instead of passively responding to users, dialogue systems can also take a proactive role to direct a conversation towards specific goals, such as introducing new and interesting topics (Wu et al., 2019b) or providing sociable recommendations on target items (Wang et al., 2022a). Such a proactive target-oriented or goal-directed dialogue system can guide conversations towards topics that the system knows how to discuss, making it promising to build autonomous conversational AI.

Figure 1: An example from the repurposed DuRecDial 2.0 (Liu et al., 2021b) dataset. Given a pre-determined target and current dialogue context, we expect to plan a dialogue path to direct the conversation.

For goal-directed dialogue systems, the objective is to proactively direct conversations towards a designated target. Previous work has primarily pre-determined the targets as specific keywords (Tang et al., 2019), topics (Wu et al., 2019b; Sevegnani et al., 2021), and dialogue action-topic pairs (Zhang et al., 2021; Wang et al., 2022a). To achieve this task, effective dialogue planning is essential, which requires taking reasonable actions and smoothly directing dialogue topics to the designated one. More importantly, the whole process is expected to be coherent and natural. Prior studies attempted to tackle this challenge through next-turn transition prediction (Tang et al., 2019), sub-goal generation (Zhang et al., 2021; Kishinami et al., 2022), and knowledge path reasoning (Gupta et al., 2022) to control dialogue generation. However, there are still open issues worth exploring. First, previous studies adopted a greedy strategy with a single-turn topic prediction mechanism, which lacks global planning for the dialogue process (Yang et al., 2022). Consequently, these methods are often short-sighted, resulting in sub-coherent topic threads. Second, recognizing a user's engagement and willingness to follow the system is crucial for achieving coherent transitions. However, current studies often overlook the importance of modeling such user feedback. Therefore, it is necessary to explore globally planned dialogue strategies while incorporating user feedback to improve the coherence of goal-directed dialogue systems.

In this work, our objective is to globally plan dialogue paths that connect the current context to the target at each turn. As illustrated in Figure 1, this dialogue path should strike a balance between coherence with the ongoing dialogue context and smooth transitions towards the target. Assuming that path trajectories without a target can be represented as Brownian motion (Revuz and Yor, 2013) in latent space, we expect the embeddings of neighboring trajectory points to be similar to each other, while those of distant trajectory points are dissimilar. Drawing inspiration from Wang et al. (2022b), we view goal-directed dialogue behavior as a Brownian bridge (Revuz and Yor, 2013) stochastic process conditioned on fixed start and end points. As such, we can derive latent trajectories that follow coherent temporal dynamics.

Based on the above intuition, we propose a coherent dialogue planning approach via Brownian bridge (COLOR) stochastic process. It involves mapping dialogue path points, such as topics or action-topic pairs, into a latent space of Brownian bridge conditioned on the current context and designated target. To ensure goal-directed behavior and incorporate user feedback, we also map the latest user utterance into a real-time user feedback representation using the same latent space. We leverage this feedback to perturb the density and uncertainty of the Brownian bridge, simulating its impact on the dialogue planning process. Our training process uses a contrastive objective, which helps retain global coherence. We then fine-tune pre-trained language models (PLMs) using the derived latent trajectories to plan dialogue paths explicitly. These paths provide step-by-step explanations for reaching the target and serve as natural language prompts for generating system utterances.

In summary, our main contributions are: (1) We propose a novel approach called COLOR, which effectively models global coherence and incorporates user feedback in goal-directed dialogue planning. Our method utilizes the Brownian bridge stochastic process, and to the best of our knowledge, this is the first work to apply this method to the goal-directed proactive dialogue task. (2) We repurpose existing dialogue datasets by automatically constructing system goals and splitting them into in- and out-of-domain test sets. This facilitates research in the field and allows for more accurate evaluation of models. (3) Extensive experiments demonstrate that our proposed approach outperforms other methods, both in automatic and human evaluations.

Preliminaries
Problem Formulation We consider a corpus of goal-directed dialogues D = {D_i}_{i=1}^N, where N is the total number of dialogues. The domain knowledge facts relevant to the i-th dialogue are represented as K_i. Each dialogue consists of utterances {u_t}_{t=1}^{N_T}, with a total of N_T turns. The whole dialogue path for the i-th dialogue is denoted as P_i = (p_1, p_2, ..., p_{N_P}), where each path point p is a topic or an action-topic pair. Here, dialogue topics are mainly constructed based on the domain knowledge K_i. In some scenarios, there also exists a user profile U_i, which can be user attributes or certain personal preferences.

Given a target T consisting of an action-topic pair or a topic only, a dialogue context C, and a set of relevant domain knowledge K (and a user profile U, if any), our objective is to generate coherent utterances to reach the target T when appropriate. The problem can be decomposed into two sub-tasks: (1) dialogue planning, which involves planning suitable actions and topics to lead the dialogue proactively with coherent transitions to the target, and (2) dialogue generation, which involves generating an appropriate utterance to achieve the planned action and topic at each turn.

Brownian Bridge
The standard Wiener process or Brownian motion W(t) has a normal distribution with mean 0 and variance t, i.e., W(t) ∼ N(0, t). A Brownian bridge (Revuz and Yor, 2013) is a continuous-time stochastic process pinned at fixed start and end points, where its distribution B(t) is given by:

B(t) = W(t) − (t/T) W(T) ∼ N(0, t(T − t)/T),

where t ∈ [0, T], and T denotes the end time. Furthermore, the transition distribution of a Brownian bridge process from an initial point z_0 at t = 0 to an end point z_T at t = T is:

p(z_t | z_0, z_T) = N((1 − t/T) z_0 + (t/T) z_T, t(T − t)/T).

It implies that a trajectory point z_t follows a noisy linear interpolation between z_0 and z_T, with z_t closer to z_0 at the start and closer to z_T at the end.
The uncertainty is higher in the middle of the time interval and lower near the start and end points.
The time-controlled nature of the Brownian bridge process has led to its application in various fields, such as trajectory simulation (Sousa et al., 2015) and language modeling (Wang et al., 2022b).
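As a concrete illustration (not the paper's implementation), the pinned marginal above can be sampled in a few lines of NumPy; for simplicity, each interior point is drawn independently from its marginal rather than as a correlated path:

```python
import numpy as np

def sample_brownian_bridge(z0, zT, T, n_steps, rng=None):
    """Sample a Brownian bridge trajectory pinned at z0 (t=0) and zT (t=T).

    At each intermediate time t, the marginal is
    N((1 - t/T) * z0 + (t/T) * zT, t * (T - t) / T).
    """
    rng = np.random.default_rng(rng)
    z0, zT = np.asarray(z0, float), np.asarray(zT, float)
    traj = [z0]
    for t in np.linspace(0.0, T, n_steps + 1)[1:-1]:
        mean = (1 - t / T) * z0 + (t / T) * zT  # noisy linear interpolation
        var = t * (T - t) / T                    # peaks at t = T/2
        traj.append(mean + np.sqrt(var) * rng.standard_normal(z0.shape))
    traj.append(zT)
    return np.stack(traj)
```

Note that the endpoints are recovered exactly, while the variance, and hence the uncertainty, is largest in the middle of the interval, matching the description above.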

Method
We propose a coherent dialogue planning approach via Brownian bridge (COLOR) stochastic process to steer goal-directed dialogue generation. The intuition behind COLOR is to learn a mapping (see §3.1) in the Brownian bridge latent space that captures coherent temporal dynamics for planning dialogue paths. Each dialogue path consists of a sequence of topics or action-topic pairs, starting from the current context and leading to the target. We generate these paths explicitly (see §3.2) based on representations derived from the latent space, and use them to guide the generation of dialogue utterances (see §3.3).
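At a high level, one system turn of this three-stage pipeline can be sketched as follows; the function names are hypothetical placeholders for the stages described in the text, not the paper's actual interfaces:

```python
def run_color_turn(encode_latents, sample_bridge, generate_path, generate_utterance,
                   context, knowledge, target, user_utterance):
    """One system turn of the three-stage COLOR pipeline (hypothetical callables)."""
    # Stage 1: map context, target, and user feedback into the bridge latent space
    z0, zT, zu = encode_latents(context, knowledge, target, user_utterance)
    z = sample_bridge(z0, zT, zu)                      # latent trajectory
    # Stage 2: generate an explicit dialogue path conditioned on the trajectory
    path = generate_path(context, knowledge, target, z)
    # Stage 3: use the path as a natural language prompt for utterance generation
    return generate_utterance(context, knowledge, path)
```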

Stage 1: Brownian Bridge Mapping
A Brownian bridge latent space involves a nonlinear mapping that transforms observations into a low-dimensional latent space, using the Brownian bridge stochastic process. Our objective is to utilize this mapping to train an encoder F to convert raw dialogue paths into latent representations that retain global coherence, with the overview depicted in Figure 2. In the following sections, we introduce two crucial aspects of our approach: user feedback modeling and contrastive training.
User Feedback Modeling Suppose we obtain the user feedback representation z_u and an engagement indicator δ_u ∈ (0, 1), which reflects the user's level of engagement and likelihood of following the system. We newly define the transition distribution of the Brownian bridge process between a start point z_{s_0} at t = 0 and an end point z_{s_T} at t = T as:

p(z_{s_t} | z_{s_0}, z_{s_T}, z_u) = N(µ_{s_t}, δ_u · σ²_t),
µ_{s_t} = (1 − t/T) z_{s_0} + (t/T) z_{s_T} + φ(δ_u) z_u,
σ²_t = t(T − t)/T,    (3)

where 0 < t < T and φ(·) is a decaying function.
Here, z_u is used to perturb the density (the mean µ_{s_t}) of the Brownian bridge process, and δ_u is used to perturb its uncertainty (the variance σ²), with the perturbation strength decaying over time. This decay means that the impact of current user feedback on future planning is reduced. φ(·) can be implemented with linear decay, i.e., φ(δ_u) = δ_u (1 − t/T), or with exponential decay, i.e., φ(δ_u) = δ_u e^{−t/(λT)}, where λ ∈ (0, 1) is a scaling factor.
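A minimal NumPy sketch of the perturbed transition parameters, assuming (as in the reconstruction of Eq. (3) above) that φ(δ_u) shifts the mean while δ_u scales the variance:

```python
import numpy as np

def perturbed_bridge_params(z0, zT, z_u, delta_u, t, T, decay="linear", lam=0.5):
    """Mean/variance of the user-feedback-perturbed bridge at time t (sketch of Eq. (3)).

    z_u shifts the mean; delta_u in (0, 1) scales the variance; the
    perturbation strength phi(delta_u) decays over time.
    """
    if decay == "linear":
        phi = delta_u * (1 - t / T)              # linear decay
    else:
        phi = delta_u * np.exp(-t / (lam * T))   # exponential decay, lam in (0, 1)
    mean = (1 - t / T) * np.asarray(z0) + (t / T) * np.asarray(zT) + phi * np.asarray(z_u)
    var = delta_u * t * (T - t) / T
    return mean, var
```

As t approaches T, the feedback term vanishes and the bridge is again dominated by the pinned endpoints.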
Contrastive Training For a tuple of observations (S_u, S_0, S_t, S_T), our objective is to ensure that their latent representations (z_u, z_{s_0}, z_{s_t}, z_{s_T}) follow the Brownian bridge transition distribution described in Eq. (3). Here, S_u is the latest user utterance (concatenated with the user profile, if applicable), which may embody real-time user feedback information. S_0 consists of the concatenated domain knowledge and dialogue context, revealing the start of the dialogue path. S_T is the designated target, representing the end of the dialogue path. A path point, by default, refers to a topic or action-topic pair specific to the dataset. S_t denotes a sampled path point in the dialogue path, s.t. 0 < t < T. Here, T denotes the number of transitions required to reach the target.
As shown in Figure 2, we build our encoder F on top of a frozen PLM encoder, which is followed by specific trainable multilayer perceptron (MLP) blocks. All the necessary latents are given by:

z_{s_0} = f_C(AvgPool(f_θ(S_0))),
z_{s_t} = f_P(AvgPool(f_θ(S_t))),  z_{s_T} = f_P(AvgPool(f_θ(S_T))),
z_u = f_E(AvgPool(f_θ(S_u))),  δ_u = σ(AvgPool(z_u)),

where f_θ denotes a frozen PLM encoder such as a BERT (Devlin et al., 2019) or BART (Lewis et al., 2020) encoder, and AvgPool(·) denotes the average pooling operation. f_P, f_C, and f_E are MLP blocks that produce output with a latent dimension of d. σ is the Sigmoid activation function. The intuition behind the training is to ensure that the representation z_{s_t} of a positive path point S_t sampled from the same dialogue is close to the expected embedding µ_{s_t} (the mean in Eq. (3)). In contrast, the representation z′ of a negative random path point S′_t from a different dialogue is far from µ_{s_t} (see Figure 2), because it does not align with the Brownian bridge pinned by z_{s_0} and z_{s_T}. We adopt the contrastive objective proposed in Wang et al. (2022b) for training. Formally, given input batches B = {(S_u, S_0, S_t, S_T)} consisting of randomly sampled positive path points S_t where 0 < t < T, we optimize our encoder F as follows:

L_CL = − E_B [ log ( exp(d(S_t^+)) / ( exp(d(S_t^+)) + Σ_{S_t^−} exp(d(S_t^−)) ) ) ],
d(S_t) = − ‖z_{s_t} − µ_{s_t}‖²_2 / (2σ²),

where S_t^+ denotes a positive tuple (S_u, S_0, S_t, S_T), S_t^− denotes a negative tuple (S_u, S_0, S′_t, S_T), σ² is the variance in Eq. (3), and µ_{s_t} is the mean in Eq. (3).
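The contrastive training signal can be illustrated with a small NumPy sketch: the positive path point is scored by its (scaled, negated) squared distance to the bridge mean, and in-batch negatives are pushed away. This is a simplified form of the objective from Wang et al. (2022b), not the exact training code:

```python
import numpy as np

def bridge_contrastive_loss(z_pos, z_negs, mu_t, var):
    """InfoNCE-style loss: pull the positive latent toward the bridge mean mu_t,
    push negative latents (from other dialogues) away."""
    def score(z):
        # log-density-like score under the bridge transition distribution
        return -np.sum((np.asarray(z) - mu_t) ** 2) / (2 * var)
    logits = np.array([score(z_pos)] + [score(z) for z in z_negs])
    # negative log softmax probability of the positive (index 0)
    return float(-(logits[0] - np.log(np.sum(np.exp(logits)))))
```

The loss is small when the positive point lies near the bridge mean and the negatives are far from it, and large in the opposite arrangement.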

Stage 2: Planning Dialogue Paths
The Brownian bridge latent space makes it easy to derive a coherent latent trajectory with temporal dynamics. We feed the start point S_0, designated target S_T, and observed S_u into the trained encoder F respectively, then sample a latent trajectory z = (z_{s_1}, ..., z_{s_T}) from the transition distribution in Eq. (3). Here, z acts like the transition-level latent representation that connects the ongoing dialogue context to the target, i.e., the dialogue path P to be planned.
To generate the path P, we define the required input as X = [C; K; T], which is the concatenated text of the dialogue context C, domain knowledge K, and target T. As shown in Figure 3, we feed X into a pre-trained BART (Lewis et al., 2020) model for fine-tuning, with the encoded hidden states being h = (h_1, h_2, ..., h_m). We discuss the generation of P by conditioning on h and z below.
First, sampling the latent trajectory z requires the value T, i.e., the number of transitions to reach the target. We obtain this value by adding an MLP layer f_T to the BART encoder as a predictor, which outputs the probability of T:

p(T | X) = f_T(h̄) = softmax(W_1 h̄ + b_1),

where h̄ is the average pooled representation of h, and W_1 and b_1 are trainable parameters. We optimize the predictor using a cross-entropy loss L_c. Second, our BART decoder conditions on h and the derived latent trajectory z, then generates the dialogue path P with encoder-decoder attentions. The output distribution is approximated as follows:

h_o = Decoder([h; W z]),
p_θ(ŷ) = softmax(W_2 h_o + b_2),

where W_2 and b_2 are trainable parameters, W denotes a linear transformation that maps the dimension of z to be identical to h, [;] denotes concatenation, and h_o denotes the decoder's hidden states.
The decoder is trained by minimizing the negative log-likelihood below:

L_g = − Σ_i p(y^(i)) log p_θ(ŷ^(i)),

where p(y^(i)) is the distribution of the ground-truth dialogue path, while p_θ(ŷ^(i)) is the distribution of the approximated dialogue path.

In addition, the decoder's hidden states h_o and the transformed latent trajectory z_o = W z inevitably both represent the dialogue path P, though at different levels. We therefore minimize the Kullback-Leibler (KL) divergence between them:

L_KL = KL(h̄_o ‖ z̄_o),

where h̄_o and z̄_o denote the average pooled representations of h_o and z_o, respectively.
For training, our model is optimized as follows:

L = α L_c + β L_g + γ L_KL,

where α, β, and γ are hyperparameters. During inference, we obtain the value T inferred by the predictor f_T, then sample a latent trajectory z accordingly. The decoder then generates a dialogue path token by token. Additionally, no transition is needed to reach the target if T = 0. In such cases, we directly generate the dialogue path by copying the given target T.
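Shape-wise, conditioning the decoder on both the encoder states h and the transformed latent trajectory can be sketched as below; the dimensions are illustrative, and W stands for the learned linear map mentioned in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_T, d_h, d_z = 12, 4, 768, 16
h = rng.standard_normal((m, d_h))          # encoder hidden states of X
z = rng.standard_normal((n_T, d_z))        # sampled latent trajectory, one point per transition
W = rng.standard_normal((d_z, d_h))        # maps z to the hidden dimension of h
z_o = z @ W                                # transformed trajectory
memory = np.concatenate([h, z_o], axis=0)  # decoder attends over [h; Wz]
print(memory.shape)                        # (16, 768)
```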

Stage 3: Generating Dialogue Utterances
Motivated by prior work on prompt-based learning for dialogue generation (Zheng and Huang, 2021; Madotto et al., 2021), we regard each dialogue path P as a natural language prompt to guide a generative PLM for dialogue generation. Here, P serves as a global prompt that outlines the dialogue actions and topics needed to reach the target step by step. With the power of the PLM, P helps to distill the necessary knowledge from both the input text and the PLM. To formulate the new input X′, we append P to the given dialogue context C and domain knowledge K, and concatenate them as:

X′ = [C; K; P],

where [;] denotes concatenation. We then feed X′ into a pre-trained GPT-2 (Radford et al., 2019) or DialoGPT (Zhang et al., 2020) for supervised fine-tuning. We adopt the planned dialogue paths generated by our COLOR during inference.

Experimental Setup

We compare our method against fine-tuned PLM baselines, including GPT-2 (Radford et al., 2019), DialoGPT (Zhang et al., 2020), and BART (Lewis et al., 2020). On the repurposed DuRecDial 2.0 dataset, we also compare our method with three competitive methods: MGCG_G (Liu et al., 2020), KERS (Zhang et al., 2021), and TCP-Dial (Wang et al., 2022a). We chose these methods because they are highly relevant to our problem setting, and TCP-Dial is, to our knowledge, the current state-of-the-art model. Given that our method is generalizable to the existing TGConv dataset, we also evaluate its effectiveness against four competitive models (see Appendix B.1).

For dialogue generation, we adopt commonly used local evaluation metrics, including perplexity (PPL), distinct (DIST) (Li et al., 2016), BLEU-n (Papineni et al., 2002), word-level F1, and knowledge F1 (Know. F1) (Liu et al., 2020). To evaluate models' goal-directed performance, we use the goal success rate (Succ.) as the global evaluation metric. On the repurposed DuRecDial 2.0 dataset, Succ. measures the proportion of correct target topic generation within the target turn and the two adjacent turns in the test set, as per Wang et al. (2022a). For the TGConv dataset, we perform self-play simulations, following Yang et al. (2022), to simulate multi-turn conversations and compute the success rate of generating the target keyword within 8 turns. Additionally, we adopt coherence (Coh.) (Yang et al., 2022) as another global evaluation metric, which measures the average contextual semantic similarity between the last utterance in the context and the generated utterance.
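As an illustration, such a coherence score is typically a cosine similarity between sentence embeddings of the two utterances, averaged over the test set; the sentence encoder itself (Yang et al. (2022) specify the one actually used) is left abstract here:

```python
import numpy as np

def coherence_score(last_context_emb, response_emb):
    """Cosine similarity between the embedding of the last context utterance
    and that of the generated utterance; averaging over the test set gives Coh."""
    a = np.asarray(last_context_emb, dtype=float)
    b = np.asarray(response_emb, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```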

Results and Discussion
Table 2 shows evaluation results on the DuRecDial 2.0 dataset. We observe that MGCG_G and KERS achieve comparable performance to PLM-based models on the in-domain (ID) split. One main reason is that they use the predicted dialogue action and topic to guide the model in utterance generation. However, they perform poorly in terms of goal success rate due to a lack of dialogue-level planning. We note that BART and TCP-Dial obtain much better DIST-1/2 scores than others because they seldom generate repeated words, making the generated utterances more diverse. In comparison, our models achieve remarkable improvements over most evaluation metrics. For example, our COLOR with DialoGPT achieves much better knowledge F1 scores, indicating that our method is more likely to generate utterances with correct knowledge. Regarding the goal success rate, our models obtain a large margin of improvement on both ID and OOD splits. It shows that using prompts with appropriate dialogue paths effectively steers PLMs to generate proper utterances for goal-directed dialogue.

As shown in Table 3, we notice that directing a dialogue to reach the target seems challenging in the context of open-domain chitchat for all models. However, with the guidance of our dialogue planning approach, COLOR, our models are able to produce more coherent utterances and reach the target at a significantly higher success rate.

Evaluation of Dialogue Planning
Evaluation Metrics To evaluate the performance of dialogue planning, we first adopt F1 to measure the micro-averaged precision and recall of the predicted action or topic. For generation-based models, we extract the action or topic at the evaluated turn from the generated dialogue path for a fair comparison. Due to the nature of dialogue, multiple temporary planning strategies can be reasonable before reaching the target. Following Zhou et al. (2020), we also expand gold labels by considering the system's actions or topics within the previous and subsequent turns. As such, we then compute bigram action F1 (Bi-act. F1) and bigram topic F1 (Bi-top. F1) for evaluation.

Results Dialogues in DuRecDial 2.0 share a similar transition pattern in the dialogue paths, making it easier for all models to predict actions with an F1 score of over 80%. On the other hand, the variation in dialogue paths is primarily related to topics, which requires complex reasoning over domain knowledge, dialogue context, and target for accurate prediction. When evaluating on the OOD split, all baselines show lower F1 and Bi-top. F1 scores for topics. However, our proposed COLOR achieves substantial improvements. We observe similar trends in Table 5 when evaluating on the TGConv dataset. Overall, our COLOR outperforms the baselines by generating more reasonable actions and appropriate topics, making it a promising approach for planning dialogue paths.

Analysis of Model Variants

We observe that a larger value of d brings fewer performance gains. Hence, the d in our COLOR is set to 16 after making a trade-off between effectiveness and efficiency. We note that each module or mechanism of COLOR contributes to dialogue planning. In particular, the performance of COLOR sharply drops without the Brownian bridge (BB). This is because the derived Brownian bridge latent trajectory serves as a transition-level latent representation of the dialogue path to be planned. More importantly, it follows coherent temporal dynamics and thus benefits planning the dialogue path.

Human Evaluation
We recruit three well-educated graduate students as annotators for human evaluation. We ask the annotators to score different models based on turn-level and dialogue-level metrics, following Liu et al. (2020). The turn-level evaluation measures appropriateness (Appr.) and informativeness (Info.). The dialogue-level evaluation measures proactivity (Proact.), coherence (Coh.), and goal success (Succ.). More details on the metrics and evaluation procedure are described in Appendix D.

Table 7 shows human evaluation results on the DuRecDial 2.0 dataset. The Fleiss's kappa (Fleiss, 1971) scores are mainly distributed between [0.41, 0.60], indicating moderate inter-annotator agreement. We observe that DialoGPT, TCP-Dial, and ours obtain comparable scores in informativeness since they all utilize powerful PLMs. However, our method is able to generate more appropriate utterances in response to dialogue context. For dialogue-level evaluation, our method obtains better results on average compared to all baseline models. Notably, our method achieves the highest coherence score and goal success rate, indicating that our method is more likely to direct the dialogue to reach the target coherently and successfully.

Case Study
To better analyze goal-directed dialogue generation, we show some cherry-picked cases in Appendix E due to space limitations. We observe that some baseline models can generate fluent and informative utterances. However, they still fail to direct the dialogue to reach the target and are ineffective at maintaining coherence. In comparison, our COLOR model can plan a dialogue path with reasonable actions and appropriate topics that outlines how to reach the target step by step. With the guidance of the planned dialogue path, our system better knows when and what to talk about to proactively move the dialogue forward. More importantly, our method succeeds in achieving the goal (see Appendix E).
Related Work

Goal-directed Dialogue The key to the task is dialogue planning, which leads the dialogue towards the target smoothly and coherently. Prior work pays attention to next-turn transition strategy (Tang et al., 2019), hierarchical policy (Xu et al., 2020a,b), and sub-goal generation (Zhang et al., 2021; Kishinami et al., 2022). For this knowledge-rich task, recent work (Gupta et al., 2022; Yang et al., 2022; Wang et al., 2022a) further concerns planning a dialogue path based on grounded knowledge to guide every turn of response generation.
Planning for Language Generation There is a line of work (Puduppully et al., 2019; Hua and Wang, 2019; Moryossef et al., 2019; Su et al., 2021) that separates text generation into content planning and surface realization. Content planning mainly concerns selecting key content (e.g., key entities) and arranging their orders. Several planning frameworks (Hua et al., 2021; Hu et al., 2022; Li et al., 2022) have been studied to control complex language generation tasks. Our work is more related to planning for dialogue generation (Kishinami et al., 2022; Yang et al., 2022; Cohen et al., 2022). Our proposed COLOR is a novel dialogue-level planning method that steers dialogue generation.

Conclusion
In this work, we explore the task of goal-directed proactive dialogue and focus on planning dialogue paths that direct conversations towards the designated target. We propose a novel approach called COLOR, which models coherent temporal dynamics for dialogue paths in the defined latent space, and considers the impact of user feedback on the dialogue planning process. We employ the planned dialogue paths as prompts to steer dialogue generation. Experiments show that our proposed method outperforms other methods significantly.

Limitations
Though our proposed method exhibits superior performance, we also recognize its limitations and discuss potential solutions. Our proposed method for goal-directed dialogue generation suffers from error propagation since the three stages perform in a pipeline manner. After analyzing those generated utterances with low human evaluation scores, we find that the performance of dialogue generation is prone to drop when our COLOR model fails to plan an appropriate dialogue path. We intend to alleviate this issue by introducing some techniques in the cascaded generation, such as noisy channel models (Shannon, 1948; Liu et al., 2021a). In addition, other issues, such as how to make existing goal-directed dialogue systems more engaging and personalized, are worth further exploring.

Ethical Considerations
Goal-directed dialogue systems can be used for creating non-obtrusive recommendations for specific products and services, introducing interesting new topics and educating users about those topics, and so forth. Developing such systems requires careful consideration since it has a broad impact on applications. The intention of our work is not to force the system to reach the designated target nor to force users to accept recommendations. Instead, we aim to build better assistive technologies to improve the proactiveness of dialogue systems. Furthermore, our experimental datasets are publicly available. They have been filtered for sensitive and private information during dataset construction. We hope to raise awareness of the potential for misuse of such systems with toxic intentions. For example, such systems may be used to pose as humans and actively manipulate users' perceptions on specific issues or political inclinations. To mitigate these risks, we emphasize the importance of improving transparency through regulations. It is essential to inform users that they are conversing with a bot instead of a human, and regulations on target designation are crucial when deploying these systems in specific domains. It is necessary to ensure that setting a target does not violate factual accuracy, user privacy rules, or human laws.

A Dataset Descriptions and Pre-processing

DuRecDial 2.0 The DuRecDial 2.0 (Liu et al., 2021b) dataset is collected from crowdsourced human-to-human dialogues. In each dialogue, one person is defined as the seeker (the user's role) and the other as the recommender (the system's role).
The recommender needs to proactively lead the dialogue and make recommendations by introducing new topics. Each seeker is equipped with a user profile containing user attributes (e.g., age range) and his/her past preference information. In order to smoothly converse with the seeker, the recommender has a domain knowledge graph consisting of domain-specific topics (e.g., movies, music) with related attributes. More importantly, a dialogue path composed of dialogue actions and topics is annotated for the recommender from the beginning to the end of the dialogue. All dialogues are aligned across the English and Chinese languages.

We adopt the dataset in English for experiments. Since there are no explicitly annotated targets, we repurpose the original dataset automatically. For all those dialogues that are proactively led by the system, we treat the topic that the user has accepted at the end of each dialogue as the target topic, and view the system's corresponding action (e.g., movie recommendation, point-of-interest recommendation, etc.) as the target action. Each target topic is guaranteed to be grounded in the domain knowledge triples corresponding to the dialogue. We filter out those dialogues without introducing any new recommendation topics. The total number of topics is 628 (including a NULL topic). Figure 4 shows the statistics of all the system's actions. We observe an average of 4.3 ∼ 4.8 action-topic transitions from the beginning to reaching the target.

Following the splitting criterion (Liu et al., 2021b), we obtain training/validation/test sets with 4,256/608/1,216 dialogues, respectively. To investigate the performance of different methods for goal-directed dialogue generation, we further use the dataset with two types of splits for the test set: an in-domain (ID) split and an out-of-domain (OOD) split, similar to Sevegnani et al. (2021) and Gupta et al. (2022). The OOD split ensures that none of the target topics in the test set are present in the training set. In contrast, the target topics in the ID split are allowed to appear in the training set.

TGConv The TGConv (Yang et al., 2022) dataset is extracted based on the chit-chat corpus ConvAI2 (Dinan et al., 2020) and the external commonsense KG ConceptNet (Speer et al., 2017). In the TGConv dataset, all target-oriented samples are identified by the dialogue utterances containing a go-through keyword/concept sequence that aligns with the KG path over ConceptNet. Suppose the designated global target keyword is w_n, a transition path of keywords or concepts P = {w_1 → ... → w_n} is annotated for each dialogue. Here, each neighboring word pair (i.e., w_i and w_{i+1}) is directly or low-order connected in ConceptNet. On average, the number of transitions from the start context to the target is approximately 5. Furthermore, the target keywords are distinguished into "easy-to-reach" and "hard-to-reach". Specifically, the easy-to-reach targets refer to target keywords with high frequency in the corpus. In comparison, target words with low frequency (less than 800) in the corpus are classified as hard-to-reach targets because there are fewer cases to learn the transition to low-frequency target words. In this work, we follow the same data splitting as in Yang et al. (2022) for experiments.

B.1 Dialogue Generation
To evaluate dialogue generation quality, we first consider the following PLM-based methods:
• GPT-2 (Radford et al., 2019): It is an autoregressive model for language generation. We use the GPT-2 base model for fine-tuning.
• DialoGPT (Zhang et al., 2020): It is an autoregressive dialogue generation model pre-trained using large-scale dialogue corpora. We adopt the pre-trained small model for fine-tuning.
• BART (Lewis et al., 2020): It is a denoising encoder-decoder model for language generation. We use the BART-base model for fine-tuning.
Note that these models concatenate all parts of the input texts described in the problem formulation as the model input and are fine-tuned to generate utterances directly. On the DuRecDial 2.0 dataset, we additionally consider several competitive models that follow the planning-enhanced generation paradigm:
• MGCG_G (Liu et al., 2020): It employs the predicted next dialogue action and next topic to guide utterance generation. We re-run the officially released code on the repurposed dataset.
• KERS (Zhang et al., 2021): It leverages a knowledge-enhanced mechanism to guide dialogue generation. We re-run the available code on the repurposed dataset.
• TCP-Dial (Wang et al., 2022a): It builds a target-driven conversation planning method to explicitly extract necessary knowledge and then guide dialogue generation. We re-run the available code on the repurposed dataset.
On the TGConv dataset, we consider the following competitive models:
• MultiGen (Ji et al., 2020): It is a language generation model with multi-hop reasoning on commonsense knowledge graphs.
• DKRN (Qin et al., 2020): It builds a dynamic knowledge routing network for topic transitions.
• CKC (Zhong et al., 2021): It is a keyword-guided neural conversational model that leverages ConceptNet for keyword transitions.
• TopKG (Yang et al., 2022): It employs global planning on ConceptNet to guide dialogue generation and is the state-of-the-art approach on the TGConv dataset.

B.2 Dialogue Planning
To compare the performance of dialogue planning for goal-directed dialogues, we consider the following dialogue planning methods:
• MGCG (Liu et al., 2020): It makes multi-task predictions of the next turn's dialogue action and topic. However, it assumes that ground-truth historical dialogue actions and topics are known to the system. For a fair comparison in this work, we adopt the same input as in our problem definition to conduct the multi-task predictions.
• TCP (Wang et al., 2022a): It is a target-driven planning framework that plans a dialogue path consisting of dialogue actions and topics in a generation-based manner.
• TopKG-Plan (Yang et al., 2022): It employs reinforcement learning to plan a commonsense keyword path based on ConceptNet.
• BERT (Devlin et al., 2019): Based on the intuition of multi-task prediction, we fine-tune the widely used BERT model by adding two fully-connected layers to jointly predict the next turn's dialogue action and topic. We use the uncased BERT-base model for fine-tuning.
• BART (Lewis et al., 2020): Based on the intuition of generation, we fine-tune the BART-base model to generate a dialogue path in the same manner as ours.
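The BERT baseline's two-head design can be sketched as follows (a pure-NumPy illustration of adding two fully-connected layers over a shared encoder vector; the hidden size and label counts are placeholders, not the actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiTaskHead:
    """Two fully-connected heads over a shared encoder representation
    (e.g., BERT's [CLS] vector) that jointly score the next-turn
    dialogue action and topic. Sizes are illustrative."""

    def __init__(self, hidden=8, n_actions=3, n_topics=5):
        # Small random init; in practice these are learned jointly.
        self.Wa = rng.standard_normal((hidden, n_actions)) * 0.02
        self.Wt = rng.standard_normal((hidden, n_topics)) * 0.02

    def __call__(self, h):
        # h: (batch, hidden) -> (batch, n_actions), (batch, n_topics)
        return h @ self.Wa, h @ self.Wt
```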

C Training and Inference Details
In Stage 1, we set the batch size for contrastive training to 64 and adopt the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 2e-4. We train our encoder F for 10 epochs. For training in Stage 2, we adopt the Adam optimizer with an initial learning rate of 2e-5 and warm up over the first 10% of training steps. We train our COLOR for a maximum of 10 epochs with a batch size of 16. The best checkpoint is chosen based on its performance on the validation set. For inference, we employ greedy decoding to generate a dialogue path token by token, with a maximum decoding length of 80.
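The warmup schedule described above can be sketched as follows (linear warmup to the initial learning rate over the first 10% of steps; the constant rate after warmup is an assumption, since only the warmup fraction is specified):

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.1):
    """Linearly ramp the learning rate from ~0 to base_lr over the
    first `warmup_frac` of steps, then hold it constant (the
    post-warmup policy here is an assumption, not from the paper)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```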
In Stage 3, we employ GPT-2 base and DialoGPT-small (see Appendix B.1) as our backbone models. We follow the description in §3.3 and fine-tune the backbone models for 10 epochs. For a fair comparison, we use greedy decoding with a maximum decoding length of 100 for all models. We conduct experiments on one NVIDIA 3090 GPU machine.
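Greedy decoding as used for all models can be sketched as follows (`step_logits_fn` is a stand-in for a model's next-token distribution, and the token ids are illustrative):

```python
def greedy_decode(step_logits_fn, bos_id, eos_id, max_len=100):
    """Greedy decoding: at each step pick the argmax token,
    stopping at EOS or at the maximum decoding length."""
    seq = [bos_id]
    for _ in range(max_len):
        logits = step_logits_fn(seq)  # scores over the vocabulary
        nxt = max(range(len(logits)), key=logits.__getitem__)
        seq.append(nxt)
        if nxt == eos_id:
            break
    return seq
```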

D Procedure of Human Evaluation
For turn-level evaluation, we randomly sampled 50 dialogues from the ID test split and 50 dialogues from the OOD test split of the DuRecDial 2.0 dataset. We then compared the generated utterances of the following models: MGCG_G, DialoGPT, TCP-Dial, and ours (COLOR w/ DialoGPT). For a fair comparison, the models were randomly renamed as "model-1", "model-2", and so forth. The annotators were then asked to score the compared models on (1) appropriateness (Appr.), which measures whether the utterance responds to the dialogue context appropriately, and (2) informativeness (Info.), which measures whether the utterance is informative, making full use of the grounded knowledge.
For dialogue-level evaluation, we asked our annotators to act as users and converse with the models. Each model's generated utterance in the current turn was used as part of the dialogue context in the next turn. Our annotators were asked to stay consistent with the equipped user profile. To ensure diverse evaluation targets, we randomly selected 5 target actions from the DuRecDial 2.0 test set, each paired with 10 different target topics, resulting in a total of 50 evaluated targets. We did not expose the targets to the annotators during the human-model conversations and restricted all conversations to no more than 12 turns. We finally revealed the designated targets to the annotators and asked them to score the models on three evaluation metrics: (1) proactivity (Proact.), which measures whether a model proactively leads the dialogue; (2) coherence (Coh.), which manually examines whether the whole dialogue is fluent, coherent, and smoothly transitioned; and (3) goal success (Succ.), which estimates whether a model effectively reaches the target.
Our annotators were required to rate the generated dialogues in {0, 1, 2}, where higher is better. The agreement among the annotators is measured by Fleiss's kappa (Fleiss, 1971). We report each model's average score over the different annotators as the final human evaluation result. In addition, we transparently informed all annotators of our research intent, paid reasonable wages, and provided enough time for them to complete the evaluation.
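The inter-annotator agreement statistic used above can be computed as follows (a minimal reference implementation of the standard Fleiss's kappa formula; the rating matrix shape is the usual items-by-categories count layout):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items rated by n annotators into k categories.
    `ratings` is an N x k matrix where ratings[i][j] is the number of
    annotators assigning item i to category j (each row sums to n).
    Undefined when chance agreement P_e equals 1 (all one category)."""
    N = len(ratings)
    k = len(ratings[0])
    n = sum(ratings[0])  # annotators per item
    # Per-item observed agreement P_i.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```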

E Case Study
Table 8 and Table 9 show some cases on the DuRecDial 2.0 and TGConv datasets, respectively.

Figure 3 :
Figure 3: Overview of Stage 2: planning the dialogue path P, where X is the required input and T denotes the number of transitions required to reach the target.

Figure 4 :
Figure 4: Statistics of the system's dialogue actions on the repurposed DuRecDial 2.0 dataset.

Table 1 :
Overview of the datasets.
We repurpose the dataset by defining the targets as action-topic pairs. We obtain two types of splits for the test set: in-domain (ID) and out-of-domain (OOD), similar to Sevegnani et al. (2021). The OOD split ensures that none of the target topics in the test set are present in the training set, whereas the ID split allows them to appear. The TGConv (Yang et al., 2022) dataset contains high-quality open-domain dialogues on a variety of commonsense topics. Each dialogue is designed to direct the conversation towards a specific keyword or topic through coherent keyword transitions, which are categorized as either easy-to-reach or hard-to-reach based on their difficulty level. Table 1 summarizes the statistics of both datasets. More details are available in Appendix A.
Baseline Methods For dialogue generation, our baselines include: GPT-2 (Radford et al., 2019), DialoGPT (Zhang et al., 2020), and BART (Lewis

Table 2 :
Evaluation results of dialogue generation on the DuRecDial 2.0 dataset with different test splits. Significant improvements over backbone models are marked with * (t-test, p < 0.05).
We implement our models with the Transformers (Wolf et al., 2020) library. The latent dimension d is set to 16. The MLP blocks f_P, f_C, and f_E are all stacked to 3 layers. The decaying function φ(·) employs linear decaying. The hyperparameters α, β, and γ are set to 0.1, 1.0, and 1.0, respectively. For training in Stage 2, we construct the dialogue path P in the format [A]a_1[T]t_1 · · · [A]a_T[T]t_T on DuRecDial 2.0, and [T]t_1 · · · [T]t_T on TGConv. Here, [A] is a special token that separates an action a_i, and [T] is a special token that separates a topic t_i. During inference, we generate a dialogue path token by token. Further details on training and inference are provided in Appendix C.
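The dialogue path serialization with [A]/[T] special tokens described above can be sketched as follows (tokenizer-level special-token registration is omitted; the example actions and topics are illustrative):

```python
def build_path_string(actions, topics):
    """Serialize a dialogue path as [A]a_1[T]t_1...[A]a_T[T]t_T.
    When no actions are available (as on TGConv), fall back to
    the topics-only format [T]t_1...[T]t_T."""
    if actions:
        return "".join(f"[A]{a}[T]{t}" for a, t in zip(actions, topics))
    return "".join(f"[T]{t}" for t in topics)
```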
et al., 2020). On the TGConv dataset, our baselines are: MultiGen (Ji et al., 2020), DKRN (Qin et al., 2020), CKC (Zhong et al., 2021), and TopKG (Yang et al., 2022). More details about the above methods are given in Appendix B.1. For dialogue planning, we compare our COLOR with the planning models proposed in the above planning-enhanced methods. We also include BERT (Devlin et al., 2019) and BART (Lewis et al., 2020) as baselines. More details about them are described in Appendix B.2.
4.2 Evaluation of Dialogue Generation
Evaluation Metrics To evaluate the performance of next-turn system utterance generation, we adopt

Table 3 :
Evaluation results of dialogue generation on the TGConv dataset. G and D are short for GPT-2 and DialoGPT, respectively. Models marked with † are reported from Yang et al. (2022).

Table 4 :
Results of dialogue planning on the DuRecDial 2.0 dataset with different test splits.
Table 4 reports the evaluation results on the DuRecDial 2.0 dataset. We find that predicting or generating dialogue topics is more challenging than dialogue actions. Further analysis reveals that the dialogue actions follow

Table 5 :
Results of dialogue planning on the TGConv.
…δ_u and φ(δ_u) in our Brownian bridge process as defined in Eq. (3); (4) w/o L_KL, which means the model is trained without the loss L_KL. We report evaluation results on the OOD split of the DuRecDial 2.0 dataset, as shown in Table 6.
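The Brownian bridge underlying the latent trajectories can be sketched as follows (an illustrative sampler using the textbook bridge mean and variance, pinned at a start and a goal latent; this is a generic sketch, not the paper's exact Eq. (3) with δ_u and φ(δ_u)):

```python
import numpy as np

def sample_brownian_bridge(z0, zT, T, sigma=1.0, rng=None):
    """Sample a latent trajectory from a Brownian bridge pinned at
    z0 (t = 0) and zT (t = T). At time t the marginal is Gaussian with
    mean (1 - t/T) * z0 + (t/T) * zT and variance sigma^2 * t(T - t)/T."""
    if rng is None:
        rng = np.random.default_rng(0)
    z0, zT = np.asarray(z0, float), np.asarray(zT, float)
    traj = [z0]
    for t in range(1, T):
        a = t / T
        mean = (1 - a) * z0 + a * zT
        var = sigma ** 2 * t * (T - t) / T
        traj.append(mean + np.sqrt(var) * rng.standard_normal(z0.shape))
    traj.append(zT)  # the bridge ends exactly at the goal latent
    return np.stack(traj)
```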

Table 7 :
Human evaluation results. Fleiss's kappa measures the agreement among the annotators.