Enhancing Task Bot Engagement with Synthesized Open-Domain Dialog

The construction of dialog systems for various types of conversations, such as task-oriented dialog (TOD) and open-domain dialog (ODD), has been an active area of research. In order to more closely mimic human-like conversations that often involve the fusion of different dialog modes, it is important to develop systems that can effectively handle both TOD and ODD and access different knowledge sources. In this work, we present a new automatic framework to enrich TODs with synthesized ODDs. We also introduce the PivotBot model, which is capable of handling both TOD and ODD modes and can access different knowledge sources to generate informative responses. Evaluation results indicate the superior ability of the proposed model to switch smoothly between TOD and ODD tasks.


Introduction
Conversational AI has recently received extensive attention from NLP and other communities, and many efforts have been made to construct different dialog systems. Task-oriented dialog (TOD) systems and open-domain dialog (ODD) systems are two particularly active areas of study (Gao et al., 2018; Ni et al., 2022). TOD systems are designed to complete specific tasks through conversation, such as booking tickets or finding restaurants, and rely on access to databases and APIs (Budzianowski et al., 2018; Rastogi et al., 2020; Zhang et al., 2020). ODD systems, on the other hand, are intended to generate engaging responses on open-domain topics using commonsense (Zhou et al., 2018; Young et al., 2018) or external knowledge (Dinan et al., 2019b; Komeili et al., 2022). However, most existing works model TOD and ODD systems separately, which can result in a gap between the capabilities of these systems and the way humans handle conversations in the real world.* Therefore, it is important to develop systems that can handle both TOD and ODD appropriately and transition smoothly between them.

*This work was done during an internship at Microsoft Research.
Recent works have attempted to build models with multiple dialog skills. Sun et al. (2021) combine ODDs with TOD system responses without distinguishing different dialog modes. Lin et al. (2021) propose using different adapters for different types of conversations, which prevents both tasks from being optimized simultaneously. Recent works have also begun exploring the ability to switch between different types of dialogs. Zhao et al. (2022) propose a unified model that jointly handles both types of conversations by training on a mixture of existing TOD and ODD datasets. Young et al. (2022) introduce FusedChat, a human-annotated dataset containing ODD-grounded TODs and TOD-grounded ODDs, which is intended for further research on the fused task.
However, there are still limitations that need to be addressed in order to further improve the ability of dialog systems to handle both TOD and ODD tasks. One limitation is the time and cost associated with constructing new datasets for training and evaluating dialog systems. To address this limitation, a framework has been proposed by Chiu et al. (2022) to automatically generate dialogs that transition from ODD to TOD. Their framework assumes that users do not explicitly express their intentions and the system is expected to detect and respond to these intentions only based on implicit knowledge acquired during training. In this paper, we focus on the setting where users lead conversations and have explicit intentions while the system's goal is to generate appropriate and engaging responses to users over open-domain topics with access to external knowledge and complete their requests. Another limitation is the lack of attention given to more complex cases of dialog mode switching, with most studies focusing on simple cases with only one switch. In this paper, we consider a more general and flexible setting, which includes harder examples for the model to learn beyond basic situations.
This paper makes the following contributions: (1) We propose a new general framework for automatically enriching a TOD with knowledge-grounded ODDs in various settings. (2) We design a unified model that is able to perform both TOD and ODD tasks, using a predicted state to adopt an appropriate dialog mode and access knowledge sources for response generation. Experimental results demonstrate that the proposed model, PivotBot, has a strong ability to conduct both types of conversations seamlessly.

Dataset Construction
We propose a framework for automatically adding one or more knowledge-grounded ODDs to a given TOD. In this paper, we demonstrate the application of the framework by adding ODDs to TODs from the MultiWOZ 2.1 dataset (Eric et al., 2020a); the framework can be easily generalized to other TOD datasets. A TOD can be denoted by D = {u^{d_1}_1, s^{d_1}_1, ..., u^{d_1}_{n_1}, s^{d_1}_{n_1}, u^{d_2}_{n_1+1}, s^{d_2}_{n_1+1}, ..., u^{d_2}_{n_1+n_2}, s^{d_2}_{n_1+n_2}, ..., u^{d_N}_n, s^{d_N}_n}, where N is the number of domains in the dialog, u^{d_j}_i and s^{d_j}_i are the user and system utterances at turn i in domain d_j, and n_j is the number of turns in domain d_j. In the MultiWOZ dataset, a TOD can cover one, two, or three domains.

The proposed framework for enriching TODs with ODDs consists of three parts: (1) ODD intent detection, (2) constrained generation of ODDs, and (3) ODD-to-TOD transition generation. Given a TOD D, we first use the BlenderBot model (Roller et al., 2021) to generate a simulated user utterance based on the previous system utterance at a possible position for adding an ODD. We then use this simulated user utterance to detect whether the user intends to initiate an open-domain conversation. If the simulated user utterance indicates a desire for an open-domain conversation, we generate an ODD snippet using a target-guided generation model to simulate a user and the BlenderBot 2.0 model (Xu et al., 2022; Komeili et al., 2022) to mimic a system with access to external knowledge. The ODD is considered complete once a goal g extracted from the TOD is mentioned at the end of the ODD.
Finally, we generate a transition from the simulated ODD to the following TOD snippet to make the dialog more natural. Detailed implementation of each module can be found in Appendix A.

ODD Intent Detection
To determine the appropriate time to include an ODD during the process of completing a user's requests, we focus on detecting whether the user intends to divert the conversation from task completion and discuss topics related to the context. Given a user utterance u = {u_1, ..., u_n}, where u_i is the i-th token in the utterance, the ODD intent detection model aims to predict whether the utterance is in the TOD setting or the ODD setting. The detection model is trained by minimizing the cross-entropy loss

L(θ) = −(1/N) Σ_{i=1}^{N} [ I_i log Î_i + (1 − I_i) log(1 − Î_i) ],

where N is the number of training examples, Î_i and I_i are the predicted and ground-truth intents of the i-th training example, and θ denotes the parameters of the model.
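The intent-detection objective reduces to standard binary cross-entropy. A minimal sketch, treating Î_i as the predicted probability of ODD intent (the actual model is a fine-tuned classifier; this only illustrates the loss):

```python
import math

def intent_loss(probs, labels):
    """Binary cross-entropy over N examples: each label is 1 for ODD
    intent and 0 for TOD, each prob is the predicted ODD probability."""
    assert len(probs) == len(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)
```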

Target-guided Generation of User Utterances in ODD
To generate chitchats that are consistent with given TODs, we train a target-guided generation model that generates utterances based on the dialog history and mentions a preset target at the end of the ODD, thereby simulating a human user. We use the pre-trained dialog generation model BlenderBot 2.0 to mimic the system's responses to the user with access to external knowledge. The target-guided generation model is expected to generate a user utterance u at turn t + 1 based on a pre-determined goal g and the dialog context c up to turn t. Given the pre-determined ODD goal g = {g_1, ..., g_{N_g}}, where g_i is the i-th token in the goal, and context c, the training objective is defined as

L(θ) = − Σ_{i=1}^{N_u} log p(u_{t+1,i} | g, c, u_{t+1,<i}; θ),

where θ is the set of trainable parameters in the model, N_u is the target length of the predicted user utterance, and u_{t+1,<i} represents the tokens before index i of the predicted user utterance at turn t + 1.
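This objective is an ordinary token-level negative log-likelihood over the target utterance; a toy sketch under that reading, with per-token probabilities standing in for the model's softmax outputs:

```python
import math

def seq_nll(token_probs):
    """Negative log-likelihood of a target user utterance, summing
    -log p(u_i | g, c, u_<i) over its tokens."""
    return -sum(math.log(p) for p in token_probs)
```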

Transition Generation
The objective of transition generation is to predict a system utterance that naturally connects the last user utterance in the ODD with the initial user utterance in the following TOD. The training objective of the transition generation model is

L(θ) = − Σ_{i=1}^{N_s} log p(s_{t,i} | u_t, u_{t+1}, s_{t,<i}; θ),

where u_t is the last user utterance in the generated ODD, u_{t+1} is the first user utterance in the following TOD, and s_t is the transition system utterance with length N_s.

Simulation Settings
Building on previous research that aims to make dialogs more natural and engaging by adding context to a given dialog (Young et al., 2022) or inserting topic transition turns (Sevegnani et al., 2021), we consider two settings: prepending an ODD to a TOD (Setting 1) or inserting an ODD as domain transition turns (Setting 2). In addition, we propose a more general setting in which ODDs can occur at any point during the process of completing tasks (Setting 3).

Setting 1: Prepending ODD to TOD (init ODD)
In the first setting, we prepend an ODD to a TOD as context to generate dialogs containing one mode switch from ODD to TOD. We assume that users initiate the conversation with a quick ODD and then lead the conversation towards task completion. Given a TOD D, we generate a starting ODD snippet {u_1, s_1, ..., u_i, s_i}, where u_i and s_i are the ODD user and system utterances at turn i, respectively. The simulated ODD is then prepended to the TOD, resulting in a final dialog D' = {u_1, s_1, ..., u_i, s_i, u^{d_1}_1, s^{d_1}_1, ..., u^{d_N}_n, s^{d_N}_n} that starts with a social chat and then transitions to the TOD focusing on completing the user's goals.
We extract a keyword from the initial user utterance in the TOD and use it as the goal for the synthesized ODD. We also randomly sample a persona as the initial user utterance for the ODD. We use the target-guided generation model to simulate the user, while the system utterances are generated by the pre-trained dialog generation model BlenderBot 2.0, which has access to external knowledge. The simulation stops once the preset target is mentioned in a user utterance. To connect the simulated ODD and TOD, we use the transition generation model to generate a transition turn. Note that ODD intent detection is not necessary in this setting. The algorithm can be found in Appendix B.
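The setting-1 procedure can be sketched as follows, with gen_user, gen_system, and gen_transition as hypothetical stand-ins for the target-guided model, BlenderBot 2.0, and the transition generator:

```python
def prepend_odd(tod, persona, goal, gen_user, gen_system, gen_transition,
                max_turns=6):
    """Setting 1 sketch: start the ODD from a sampled persona, alternate
    system/user turns until the goal keyword appears in a user utterance,
    then append a transition turn and attach the original TOD.
    Dialogs are lists of (speaker, utterance) pairs."""
    odd = [("user", persona)]
    for _ in range(max_turns):
        odd.append(("system", gen_system(odd)))
        u = gen_user(odd, goal)
        odd.append(("user", u))
        if goal.lower() in u.lower():
            break  # preset target mentioned -> ODD is complete
    odd.append(("system", gen_transition(odd, tod[0][1])))
    return odd + tod
```

The max_turns cap is an assumption added to keep the simulation bounded; the paper does not state a limit.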
Setting 2: Inserting ODD for Domain Transition in TOD (domain transition)
In the second setting, we insert an ODD into a TOD as domain transition turns. In TODs from the MultiWOZ dataset, domain transitions are made by abruptly requesting information for a new domain, which can be unnatural in conversation. To address this issue, we insert an ODD as a transition to make the dialog more natural. Specifically, we only add an ODD to transition from the first domain to the second domain in our implementation. To initialize the ODD, we use BlenderBot to generate multiple independently sampled user utterances after the last system turn in the first domain, s^{d_1}_{n_1}. We then use the ODD intent detection model to detect which of these user utterances indicate ODD intent. We select one user utterance with ODD intent as the initial user utterance of the open-domain conversation, denoted by u_1. The target of the ODD snippet is extracted from the first user utterance in the second domain, u^{d_2}_{n_1+1}. The simulation of the ODD proceeds in the same manner as in the previous setting. The final dialog D' = {u^{d_1}_1, ..., s^{d_1}_{n_1}, u_1, ..., s_j, u^{d_2}_{n_1+1}, ..., u^{d_N}_n, s^{d_N}_n} contains two mode switches. This process can be easily generalized to insert an ODD after each domain i for i ≤ N − 1, in which case the number of mode switches in the final dialog equals the number of domains. The algorithm can be found in Appendix C.
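The candidate-sampling step of setting 2 might look as follows, where sample_user and has_odd_intent are hypothetical stand-ins for BlenderBot and the intent detector:

```python
def pick_odd_opener(sample_user, has_odd_intent, context, n_candidates=5):
    """Sample several candidate user utterances after the last system
    turn of a domain and return the first one flagged as ODD intent,
    or None if no candidate qualifies."""
    for _ in range(n_candidates):
        cand = sample_user(context)
        if has_odd_intent(cand):
            return cand
    return None
```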
Setting 3: Inserting Multiple Chitchats to Enrich TODs (multiple ODDs)
Finally, we consider a more flexible setting that better approximates real-world scenarios. We assume that users start a conversation with requests and can have short chitchats at any time during the conversation. Given a TOD D, the initialization and simulation of ODDs are similar to the domain transition setting. The difference is that in this setting we attempt to insert an ODD after each system utterance s_i. Suppose we attempt to insert an ODD after the i-th turn. If the ODD intent detection model predicts that at least one simulated initial user utterance contains ODD intent, we insert an ODD with j turns {u^i_1, ..., s^i_j} at the corresponding position. The target of an ODD is extracted from the following TOD user utterance u_{i+1}. The final dialog D' in this setting contains multiple mode switches. The algorithm can be found in Appendix D.

Table 1 includes basic statistics of the new dataset. We use 500, 198, and 1100 dialogs from MultiWOZ 2.1 to simulate the training, validation, and test sets, respectively. In the init ODD setting, the average length of a prepended ODD is three turns, and the mean length of an utterance is 16.18 tokens. In the domain transition setting, the average length of a transition ODD is shorter than three turns. In the multiple ODDs setting, the average number of ODDs inserted into a TOD is four, and each ODD snippet has an average length of two turns. It is not surprising that the ODDs in the domain transition and multiple ODDs settings are shorter, as these ODDs occur during the task completion process, and we do not want the conversation to be distracted from the task.

Figure 1 shows the overview of our task. In each dialog turn, the model predicts the appropriate dialog mode and a query for knowledge retrieval based on the dialog history up to that turn, represented by state s.
The model retrieves knowledge k from different sources based on the predicted dialog mode. Finally, the model generates a response r using the retrieved knowledge.
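The per-TOD insertion loop of the multiple ODDs setting (Setting 3) can be sketched as follows, with wants_odd and simulate_odd as stand-ins for the intent-detection step and the ODD simulator, and a deliberately naive target extraction:

```python
def insert_odds(tod_turns, wants_odd, simulate_odd):
    """Setting 3 sketch: walk a TOD and, after each system turn, ask
    whether to splice in an ODD snippet targeted at the next TOD user
    utterance. Turns are (speaker, utterance) pairs."""
    out = []
    for i, (spk, utt) in enumerate(tod_turns):
        out.append((spk, utt))
        if spk == "system" and i + 1 < len(tod_turns) and wants_odd(out):
            # toy target extraction: last word of the next user utterance
            target = tod_turns[i + 1][1].split()[-1]
            out.extend(simulate_odd(out, target))
    return out
```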

Task Definition
Figure 1: Overview of our task. We split the full task into two subtasks: (1) state prediction and (2) grounded response generation.

The full task can be broken into two subtasks: state prediction and knowledge-grounded response generation. In the t-th turn of a dialog, given the dialog history h = {u_{t−k}, r_{t−k}, ..., u_t}, where u_i and r_i represent the user utterance and system response at the i-th turn, respectively, and k is the size of the history window, the model predicts the state s. This state indicates the appropriate dialog mode and the query used to obtain knowledge, which is retrieved either through database lookup or an off-the-shelf search engine. The model then generates the response r based on the dialog history h, predicted state s, and knowledge k.
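One turn of the full task might be wired together as follows; all three callables are placeholders for the state-prediction model, the retrieval component, and the response generator:

```python
def run_turn(history, predict_state, retrieve, generate):
    """One full-task turn: (1) predict the state from the dialog history,
    (2) fetch knowledge for that state, (3) generate a grounded response."""
    state = predict_state(history)
    knowledge = retrieve(state)
    return generate(history, state, knowledge)
```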

Model
We propose a unified model, PivotBot, shown in Figure 2, which first predicts a state to track the user's goal and indicate the appropriate dialog mode based on the dialog history. Knowledge is then acquired by querying the knowledge source that corresponds to the predicted dialog mode, using the predicted query. Finally, the model generates a response based on the dialog history, predicted state, and retrieved knowledge.

State Prediction
State s tracks a user's goal throughout a dialog. In particular, a state s takes the form m:q, where m represents the dialog mode and q stands for the query used to acquire knowledge from a knowledge source. This work considers two dialog modes: TOD modeling and knowledge-grounded ODD. If the model predicts TOD modeling, a database state is obtained from the pre-defined database using the predicted belief state. If the state indicates that the dialog mode is ODD, external knowledge can be retrieved from the Web using the predicted search query. If the search query is empty, it implies that external knowledge is not needed for response generation, and the retrieved knowledge is also empty.

Figure 2: Overall architecture of the PivotBot model.
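A minimal sketch of parsing the m:q state and routing to the corresponding knowledge source; the exact separator and mode labels here are assumptions, not the paper's serialization:

```python
def parse_state(state):
    """Split a predicted state string 'm: q' into (mode, query)."""
    mode, _, query = state.partition(":")
    return mode.strip(), query.strip()

def fetch_knowledge(state, db_lookup, web_search):
    """Route to the database (TOD mode) or the search engine (ODD mode);
    an empty ODD query means no external knowledge is needed."""
    mode, query = parse_state(state)
    if mode == "tod":
        return db_lookup(query)
    return web_search(query) if query else ""
```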
Given dialog history h, the training objective of state prediction can be formulated as

L_s(θ) = − Σ_{i=1}^{N_t} log p(s_i | h, s_{<i}; θ),

where θ represents the trainable parameters of the model, N_t is the target length of the predicted state sequence, and s_{<i} denotes the tokens before index i.

Grounded Generation
System response r = {r_1, r_2, ..., r_{N_r}} with length N_r is generated grounded on the dialog history h, predicted state s, and retrieved knowledge k. The training objective is defined as

L_r(θ) = − Σ_{i=1}^{N_r} log p(r_i | h, s, k, r_{<i}; θ).

Training Objective of Full Task
A training example consists of four components: 1) dialog history h = {u_{t−k}, r_{t−k}, ..., u_t} containing the user and system utterances in the last k − 1 turns and the current user utterance; 2) state s, including the knowledge source to query and the corresponding query; 3) retrieved knowledge k; and 4) the (delexicalized) dialog response r. The overall training objective sums the state-prediction and grounded-generation objectives over the training set:

L(θ) = Σ_{i=1}^{n} [ L_s(x_i; θ) + L_r(x_i; θ) ],

where D = {x_i}_{i=1}^{n} is the training dataset containing n training examples.
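Assembling one training example might look like the following sketch; the <state> and <knowledge> field markers are illustrative, not necessarily the paper's serialization:

```python
def build_example(history, state, knowledge, response, window=2):
    """Flatten one training example into (input, target) strings: the
    truncated dialog history plus state and knowledge fields map to the
    (delexicalized) response."""
    turns = history[-(2 * window - 1):]  # last k-1 turns + current user
    src = " ".join(turns) + f" <state> {state} <knowledge> {knowledge}"
    return src, response
```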

Experimental Setup
We train models using 100, 200, and 500 dialogs and evaluate them on the entire test set. Our primary focus is on evaluating models trained in the few-shot setting, as this more closely reflects real-world scenarios.

Models
We train three models in an end-to-end manner. TaskBot serves as a baseline and is only capable of performing TOD with access to a database. It is trained solely on TOD turns in the MultiWOZChat dataset and does not learn from ODD turns during training. ChatBot is a baseline model that can only perform ODD. It is trained only on ODD turns in the MultiWOZChat dataset and is unable to complete tasks by accessing a database. PivotBot is the proposed model that is trained on the entire MultiWOZChat dataset. It is capable of both TOD modeling by accessing a database and ODD grounded in external knowledge. The models are implemented using Huggingface T5-base and GODEL (Peng et al., 2022). Further details on the model implementations can be found in Appendix A.

Evaluation Metrics
We evaluate the performance of the models in three different settings: (1) standard TOD completion (Budzianowski et al., 2018; Eric et al., 2020b), (2) ODD response generation, and (3) the full task involving both TOD and ODD.
For the evaluation of TOD completion, we follow previous works (Budzianowski et al., 2018; Eric et al., 2020b) and use four metrics: (1) BLEU (Papineni et al., 2002) measures the fluency of the generated responses compared to human-annotated answers; (2) Success score indicates whether the model can answer all the requested attributes; (3) Inform score measures whether the model provides an appropriate entity (e.g., restaurant address or food type); (4) Combine score is an overall measure of generation quality, defined as (Inform + Success) × 0.5 + BLEU.
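The Combine score is then a one-liner:

```python
def combine_score(inform, success, bleu):
    """Overall TOD generation quality: (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu
```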
For the evaluation of ODD, we consider three aspects: (1) Accuracy measures the model's ability to predict the correct dialog mode, calculated by comparing the predicted dialog mode with the ground-truth mode; (2) Success measures the model's ability to complete the ODD task at the dialog level: a model succeeds on a dialog if it correctly predicts the dialog mode for all ODD turns, and the score is the fraction of such dialogs among all dialogs containing ODD turns; (3) BLEU measures the naturalness of the model's responses.
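Under this definition, the dialog-level ODD success rate could be computed as follows, representing each dialog as a list of (gold_mode, predicted_mode) pairs:

```python
def odd_success_rate(dialogs):
    """Fraction of dialogs containing ODD turns in which every ODD turn
    received the correct predicted mode."""
    with_odd = [d for d in dialogs if any(g == "odd" for g, _ in d)]
    ok = sum(all(p == g for g, p in d if g == "odd") for d in with_odd)
    return ok / len(with_odd)
```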
For the full task evaluation, we report BLEU, Inform, Success, and Combine score. BLEU score is computed on all responses in the dialogs. The computation of Inform and Success is different from that in the TOD setting and is only based on dialogs that are successful in the full task. A dialog is considered successful in the full task if all requested information is answered and all ODD turns are responded to in the correct mode.

Automatic Evaluation Results
In this section, we present the results for models trained in the few-shot setting using 100 training dialogs with the GODEL (Peng et al., 2022) backbone. For the full task evaluation, we only report the combined score. The complete evaluation results can be found in Appendix E.

Table 2 shows the evaluation results in the init ODD setting. In the full task evaluation, PivotBot significantly outperforms the baseline models, demonstrating the effectiveness of the proposed model and the importance of incorporating different dialog modes. PivotBot also has a slight lead over TaskBot in the TOD modeling task, with a difference of only 1 point in the combined score. This suggests that the ability to handle both TOD and ODD tasks with appropriate dialog modes and knowledge sources is a key factor in PivotBot's success on the full task. ChatBot, which is not trained on TOD modeling, is unable to answer requested attributes or provide appropriate entities. When evaluating the ODD task specifically, it is unsurprising that ChatBot, which is trained only on ODD turns, outperforms the other models in predicting the dialog mode. However, TaskBot, which is trained to use the TOD dialog mode, is unable to generate natural responses to social chitchats. PivotBot achieves comparable performance to ChatBot in predicting the dialog mode while also generating more fluent responses.

Table 3 contains the evaluation results in the domain transition setting. As in the full task evaluation in the init ODD setting, the proposed model PivotBot performs significantly better than the other models. In the TOD modeling task, both TaskBot and PivotBot are able to complete the task, with TaskBot slightly outperforming PivotBot. ChatBot continues to achieve the best performance in the ODD task. The difference in ODD mode accuracy between ChatBot and PivotBot is not significant, but the gap in success rate widens to over 9 points.
This may indicate that it is more challenging for the model to learn both dialog modes simultaneously from a few examples and accurately predict the mode when mode switches in dialogs become more complex.

Evaluation results in the multiple ODDs setting
The evaluation results in the multiple ODDs setting are presented in Table 4. Although the model achieves high mode prediction accuracy, it is more likely to fail in the ODD task at the dialog level. This may be due to the increased number of ODD turns within a dialog in this setting, which makes it more difficult for the model to consistently adopt the correct mode for all turns and succeed in the ODD task at the dialog level. Additionally, the positions of ODDs are more flexible in this setting, leading to more complex mode switches within a dialog.

Cross Evaluation
To further understand the relationships among the different settings, we evaluate the proposed model trained in each setting in all settings. The results of these evaluations, with different combinations of training and evaluation settings, are presented in Table 5.
In the full task evaluation, the model trained in the init ODD setting performs best in that same evaluation setting. However, the model trained in the multiple ODDs setting obtains the highest combined scores in the other two evaluation settings. This suggests that the model trained in the multiple ODDs setting is able to generalize well to other settings.
From the results of the TOD modeling evaluation, we see that there is relatively little difference in task completion performance among the models, and all models are able to complete the TOD task. However, the model trained in the init ODD setting tends to have a slightly stronger ability to handle TODs. This may be due to the fact that the TOD segments in dialogs in the init ODD setting are not disrupted by inserted ODDs, which allows the model to more effectively learn to generate accurate and informative responses to users' requests.
In the ODD task evaluation, the model trained in the init ODD setting performs best in that same evaluation setting. However, it performs poorly in the other two evaluation settings. Presumably, this is because, in the init ODD setting, the context for an ODD turn only includes chitchat, while in the other two settings, the dialog history for an ODD turn may include both TOD and ODD turns, making it more challenging to accurately predict the dialog mode based on the dialog history. The model trained in the domain transition setting is also unable to make accurate state predictions in other settings, which may be due to differences in the data distributions among the settings. It may be difficult for the model to fully learn the features of ODD turns with a limited number of examples. In contrast, the multiple ODDs training setting includes more ODD turns within a dialog, providing the model with more examples to learn the general features of ODD. As a result, the model trained in the multiple ODDs setting is able to achieve generally good performance across all evaluation settings.  (2022) considered mode switches between TOD and ODD within dialogs and discussed simple settings with only one mode transition within a dialog. In contrast to these approaches, which rely on human efforts to construct new datasets, Chiu et al. (2022) proposed a framework for automatically generating dialogs that transition from ODD to TOD with a simulated user and simulated salesperson. They assumed that the user does not have an explicit intention, and the salesperson must detect the user's potential intention and lead the conversation towards task completion.
In this work, we propose a general pipeline for automatically enriching TOD with ODD and construct a unified model that can seamlessly switch between different dialog modes.

Target-guided Generation for ODDs
Some previous works (Xing et al., 2017; Lian et al., 2019; Ling et al., 2021) focus on guiding conversation generation in the short term, while others, such as Tang et al. (2019), studied the multi-turn target-guided process of conversations using turn-level keyword prediction. In this work, we aim to guide the generation process of multi-turn ODDs to eventually mention predefined targets and smoothly transition to TODs. We perform soft guidance and do not include turn-level keyword prediction in the task formulation.

Conclusion and Future Work
This paper proposes a general framework to automatically enrich a TOD with ODDs in various settings. It is easy to implement and can be generalized to a wide range of application scenarios. We also construct a unified model with both TOD and ODD modes to handle the fused task. Experimental results indicate that the proposed model, PivotBot, is able to seamlessly generate responses to different types of user utterances by incorporating both TOD and ODD dialog modes.
There are several directions for improving the data simulation framework in future work. One potential avenue is to incorporate external information, such as knowledge graphs and personality traits, to generate high-quality dialogs. It would also be worthwhile to build a system with more comprehensive skills, such as recommendation and personalization, to better align with real-world applications.

A Implementation
A.1 Implementation of MultiWOZChat construction framework

A.1.1 ODD Intent Detection
The detection model is implemented using the HuggingFace BERT-base (Devlin et al., 2019) model and is trained on a combination of four datasets: MultiWOZ 2.1, ConvAI2 (Dinan et al., 2019a), a subset of FusedChat containing prepended ODDs, and Wizard of Wikipedia (WoW) (Dinan et al., 2019b). The MultiWOZ 2.1 dataset is a TOD dataset, while the other datasets are ODD datasets. The training data includes an equal number of TOD and ODD turns to ensure balance.

A.1.2 Target-guided Generation
We use the HuggingFace implementation of the distilled version of BlenderBot (Roller et al., 2021) to implement the target-guided generation model.

MultiWOZ target candidate
We use all domains and values of 8 slots in the MultiWOZ 2.1 dataset as potential targets. These slots include name, area, pricerange, type, departure, destination, department, and day, with values represented as nouns, adjectives, or phrases.
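A toy version of matching these candidate values in an utterance (simple substring matching; the paper's actual extraction may differ):

```python
# The eight MultiWOZ 2.1 slots whose values serve as target candidates.
MULTIWOZ_TARGET_SLOTS = ["name", "area", "pricerange", "type",
                         "departure", "destination", "department", "day"]

def extract_target(utterance, candidates):
    """Return the first candidate target value mentioned in the
    utterance, or None if no candidate appears."""
    low = utterance.lower()
    for cand in candidates:
        if cand.lower() in low:
            return cand
    return None
```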
Training To generate a diverse set of user utterances, we train the target-guided generation model on a combination of three datasets: FusedChat, WoW, and ConvAI2. For the WoW and ConvAI2 datasets, we follow the keyword extraction method described by Tang et al. (2019) and set the keyword extracted from the last user utterance as the target for the ODD. For the FusedChat dataset, we use a subset of ODDs prepended to TODs, which was constructed by having two annotators write an ODD as the context for a given TOD from the MultiWOZ dataset. We only use the prepended ODDs and the initial TOD utterances as training data and extract a target from the set of candidate targets within the initial user utterance of the TOD part.
Inference With the trained target-guided generation model, we are able to generate an ODD for a given TOD by using the model to simulate the user. The goal g is extracted from the given TOD based on the set of candidate targets from the MultiWOZ 2.1 dataset. The ODD is initialized either by sampling a persona as the initial user utterance (as in Setting 1) or by generating an utterance based on the context of the TOD (as in Settings 2 and 3).

A.1.3 Transition Generation
The model implementation is based on the Huggingface T5-base (Raffel et al., 2020) model. The training datasets are the same as those described in Section A.1.2. A training example consists of the user utterances at turns t and t + 1 and the system response at turn t.

A.2 Implementation of PivotBot
The models are implemented using Huggingface T5-base and GODEL (Peng et al., 2022). Training examples are truncated or padded to a length of 512. To ensure that the input strings contain both dialog history and retrieved knowledge, the dialog history is truncated on the left with a maximum length of 256. The history window size is set to 2, so the dialog history consists of five utterances. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a constant learning rate of 0.001. The models are trained using a mini-batch size of 8 on a Tesla P100 until no decrease in validation loss is observed, or for up to 15 epochs. We conduct eight runs of experiments for each setting using different random seeds.

Table 11: End-to-end evaluation of single tasks in the multiple ODDs setting using GODEL as backbone. Almost all differences between GODEL-based models and T5-based models are statistically significant (*p<0.05, **p<0.01).