Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

Pre-trained language models have recently been shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem, which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In addition, we introduce a new dialogue multi-task pre-training strategy that allows the model to learn the primary TOD task completion skills from heterogeneous dialogue corpora. We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Experimental results show that PPTOD achieves new state of the art on all evaluated tasks in both high-resource and low-resource scenarios. Furthermore, comparisons against previous SOTA methods show that the responses generated by PPTOD are more factually correct and semantically coherent as judged by human annotators.


Introduction
Task-oriented dialogue is often decomposed into three sub-tasks: (1) dialogue state tracking (DST) for tracking the user's belief state; (2) dialogue policy learning (POL) for deciding which system action to take; and (3) natural language generation (NLG) for generating the dialogue response (Young et al., 2013).
Traditional approaches (Smith and Hipp, 1995; Young et al., 2013) adopt a modularized pipeline that addresses different sub-tasks with distinct dedicated modules. In contrast, recent systems (Eric et al., 2017; Lei et al., 2018; Shu et al., 2019) integrate all functionalities required to hold a dialogue into neural network models.
With the advances in pre-trained language models (PLMs) (Radford et al., 2019; Devlin et al., 2019; Raffel et al., 2020), different systems based on PLMs have been proposed (Hosseini-Asl et al., 2020; Lin et al., 2020; Peng et al., 2021; Liu et al., 2021). Despite their differences, most existing methods formulate task-oriented dialogue as a cascaded generation problem, that is, the model can only solve latter sub-tasks by conditioning on the outputs of previous ones. For instance, to generate the response (NLG), the model must rely on the outputs of previous sub-tasks (i.e., DST and POL).
While impressive results are reported (Hosseini-Asl et al., 2020; Peng et al., 2021), we identify three major limitations in this cascaded formulation.
(1) Firstly, as the model solves all sub-tasks in a sequential order, the errors accumulated in earlier steps are propagated to later steps (Li et al., 2017; Liu and Lane, 2018). (2) Secondly, the training data must be annotated for all sub-tasks. This annotation requirement significantly increases the data curation overhead. More importantly, it precludes the model from using the large amount of existing data that is partially annotated (e.g., data only annotated with DST or NLG).
(3) Thirdly, the results of different sub-tasks must be generated in a cascaded order which inevitably increases the system inference latency.
In this study, we propose a novel Plug-and-Play Task-Oriented Dialogue (PPTOD) system. Figure 1 depicts an illustration of our approach. As seen, we integrate different dialogue modules (e.g., DST, POL, and NLG) into a unified model. Motivated by the concept of in-context learning (Brown et al., 2020), to steer the model to solve different TOD sub-tasks, we plug a task-specific natural language instruction, termed a prompt, into the dialogue context as the model input. This way, the generations of different sub-tasks are decoupled, leading to a greater flexibility of the model that brings us at least two advantages: (1) As different sub-tasks are solved separately, the model can learn from data that is partially annotated for different sub-tasks (e.g., DST and NLG). (2) The outputs of different sub-tasks are generated in parallel, which alleviates the problem of error accumulation and reduces the system inference latency.

Figure 1: Overview: In the dialogue multi-task pre-training stage, we pre-train our model with four TOD-related tasks, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning (POL), and natural language generation (NLG). For each task, the model takes the dialogue context and the task-specific prompt as input and learns to generate the corresponding target text. Our learning framework allows us to train the model with partially annotated data across a diverse set of tasks. (best viewed in color)
Inspired by the recent success of dialogue language model pre-training (Zhang et al., 2020c; Wu et al., 2020; Peng et al., 2021), we propose a dialogue multi-task pre-training strategy that equips our model with the primary TOD task completion skills. Specifically, initialized with T5 (Raffel et al., 2020), we pre-train our model on a heterogeneous set of dialogue corpora that consist of partially annotated data. To build the pre-training corpora, we collect and combine eleven human-written multi-turn dialogue corpora. The collected datasets are partially annotated for some of the TOD-related tasks, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning (POL), and natural language generation (NLG). In total, the pre-training corpora contain over 2.3M utterances across over 80 domains (see more details in Table 1). When applying the pre-trained PPTOD to a new task, we fine-tune it using the same learning objective as in the pre-training stage.
We evaluate PPTOD on a wide range of benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Comparisons against previous state-of-the-art approaches show that PPTOD achieves better performance in both full-training and low-resource settings as judged by automatic and human evaluations. In summary, our contributions are:
• A novel model, PPTOD, that effectively leverages pre-trained language models for task-oriented dialogue tasks.
• A new dialogue multi-task pre-training strategy that augments the model's ability with heterogeneous dialogue corpora.
• Extensive evaluations on three benchmark TOD tasks reporting state-of-the-art results in both full-training and low-resource settings.
• In-depth analysis that further reveals the merits of our model design and the proposed multi-task pre-training strategy.

Related Work
Task-Oriented Dialogue. Task-oriented dialogue aims at accomplishing the user's goal. Traditional systems (Williams and Young, 2007; Young et al., 2013) adopt a pipelined approach that requires dialogue state tracking for understanding the user's goal, dialogue policy learning for deciding which system action to take, and natural language generation for generating dialogue responses. Recently, to simplify the modelling effort, researchers have shifted their attention to building neural network models that address the TOD sub-tasks (Eric et al., 2017; Lei et al., 2018; Liang et al., 2020). With the advances in pre-trained language models (PLMs), Budzianowski and Vulić (2019) first applied the GPT-2 model to the NLG task. Lin et al. (2020) and Yang et al. (2021) moved one step forward and utilized pre-trained language models to solve all TOD sub-tasks conditioned on the history of oracle belief states. Based on the GPT-2 model, Hosseini-Asl et al. (2020) proposed a cascaded model, SimpleTOD, that addresses all TOD sub-tasks without using the oracle information. To improve the system performance, Peng et al. (2021) and Liu et al. (2021) applied dialogue pre-training over external dialogue corpora. However, both methods require the pre-training data to be fully annotated for all TOD sub-tasks (i.e., DST, POL, and NLG), which greatly limits the amount of data they can use. Additionally, Liu et al. (2021) achieved better results with a noisy channel model that requires two additional language models for output re-scoring. Unlike their approach, we address the task of task-oriented dialogue with a single unified model. Lastly, concurrent work by He et al. (2021) shows that adding a unified dialogue act prediction task for policy optimization helps to improve the performance of the pre-trained task-oriented dialogue model.
Language Model Pre-training. The research community has witnessed remarkable progress in pre-training methods on a wide range of NLP tasks, including language understanding (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019; Su et al., 2021a) and text generation (Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Su et al., 2021b,c,d, 2022). In the dialogue domain, many models are pre-trained on open-domain conversational data like Reddit. Based on GPT-2, TransferTransfo (Wolf et al., 2019b) achieves good results in the ConvAI-2 competition. As another extension of GPT-2, DialoGPT (Zhang et al., 2020c) performs well in generating open-domain dialogue responses. ConveRT (Henderson et al., 2020) is a dual-encoder language model built for the task of response selection. PLATO (Bao et al., 2020) pre-trains a model with a discrete latent variable structure for the response generation task. Wu et al. (2020) adapt BERT with TOD pre-training and achieve strong performance on four dialogue understanding tasks.
Pre-training on Supplementary Data. Recent work (Phang et al., 2018; Aghajanyan et al., 2021) found that supplementary training on tasks with intermediate labelled data improves the performance of fine-tuned models on the GLUE natural language understanding benchmark (Wang et al., 2018). Unlike previous work, we use a single multi-task model for all relevant sub-tasks in task-oriented dialogue systems.

Methodology
In this section, we first discuss the datasets and learning objective used in the proposed dialogue multi-task pre-training. Then we introduce how to apply the pre-trained PPTOD for a new task.

Pre-training Datasets
To construct the pre-training corpus, we collect eleven human-written multi-turn task-oriented dialogue corpora, including MetaLWOZ (Lee et al., 2019b), SNIPS (Coucke et al., 2018), CLINC (Larson et al., 2019), ATIS (Amin, 2019), KVRET (Eric et al., 2017), WOZ and CamRest676 (Wen et al., 2017), MSR-E2E (Li et al., 2018), Frames (El Asri et al., 2017), TaskMaster (Byrne et al., 2019), and Schema-Guided (Rastogi et al., 2020). In total, there are over 2.3M utterances across 80 domains. In Table 1, we provide the details of data annotations and utterance/domain statistics of all datasets.2

Dialogue Multi-Task Pre-training
Motivated by previous work (McCann et al., 2018; Keskar et al., 2019; Raffel et al., 2020) that unifies multiple NLP tasks into a common format, we cast all the TOD-related tasks that we consider into the same plug-and-play text generation problem. To specify the target task, we plug a task-specific prompt into the dialogue context as the model input. Figure 1 depicts an illustration of our approach. In the multi-task pre-training stage, each training sample is represented as

d = (z_t, x, y),

where t denotes the TOD task that the sample d belongs to, with t ∈ {NLU, DST, POL, NLG}. z_t is the task-specific prompt of the form "translate dialogue to A:", with A corresponding to "user intent", "belief state", "dialogue act", and "system response" for the tasks of NLU, DST, POL, and NLG, respectively. x denotes the input dialogue context, which is a concatenation of all previous utterances in the dialogue, both the system's and the user's. And y denotes the target output text.
As an example presented in Figure 1, to perform the user intent classification task (i.e., NLU), the model is fed with the sequence "translate dialogue to user intent: [user] Tell me the weather forecast for Lecanto, Georgia." and is trained to generate the user intent label text "[get_weather]".
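Concretely, this serialization can be sketched in a few lines. The prompt strings follow the paper; the helper function and the (speaker, utterance) layout are our own illustration, not the released implementation:

```python
# Task-specific prompts as described in the paper; the serialization helper
# below is a hypothetical sketch of how inputs might be assembled.
PROMPTS = {
    "NLU": "translate dialogue to user intent:",
    "DST": "translate dialogue to belief state:",
    "POL": "translate dialogue to dialogue act:",
    "NLG": "translate dialogue to system response:",
}

def serialize(task, turns):
    """Build the model input: task prompt + concatenated dialogue context.

    `turns` is a list of (speaker, utterance) pairs with speaker in
    {"user", "system"}; each utterance is wrapped in a speaker tag."""
    context = " ".join(f"[{speaker}] {utt}" for speaker, utt in turns)
    return f"{PROMPTS[task]} {context}"

src = serialize("NLU", [("user", "Tell me the weather forecast for Lecanto, Georgia.")])
# src == "translate dialogue to user intent: [user] Tell me the weather forecast for Lecanto, Georgia."
```

Note that the same dialogue context is reused verbatim for every sub-task; only the prompt changes, which is what lets all four sub-tasks share one seq2seq model.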
Learning. The model is trained with a maximum likelihood objective. Given the training sample d = (z_t, x, y), the objective L_Θ is defined as

L_Θ = − Σ_{i=1}^{|y|} log P_Θ(y_i | y_{<i}; z_t, x),    (2)

where Θ denotes the model parameters and y_i is the i-th token of the target text y. In the multi-task pre-training stage, the model is trained to perform all TOD-related tasks with data annotated for different tasks. To optimize the model parameters Θ, we use a mini-batch based optimization approach as shown in Algorithm 1.
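Algorithm 1 itself is not reproduced in this excerpt, but the mixed mini-batch scheme it refers to can be sketched as below. This is our minimal interpretation, not the authors' code: all (task, sample) pairs are pooled and shuffled, so each mini-batch may mix tasks, and a corpus annotated for only one task still contributes gradient updates.

```python
import random

def multitask_minibatches(task_data, batch_size, seed=0):
    """Yield shuffled mini-batches of (task, sample) pairs drawn from all
    task-specific datasets, mirroring a mixed-task optimization step."""
    pool = [(task, s) for task, samples in task_data.items() for s in samples]
    random.Random(seed).shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

# A corpus annotated only for DST ends up in the same training stream as
# NLU- or NLG-annotated data; each batch may mix tasks.
data = {"NLU": ["u1", "u2"], "DST": ["d1"], "NLG": ["g1", "g2", "g3"]}
batches = list(multitask_minibatches(data, batch_size=2))
```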

Fine-Tuning to a New Task
When applying the pre-trained PPTOD to a new downstream task with task-specific labelled data, we fine-tune it with the same learning objective, Eq. (2), as in the dialogue multi-task pre-training stage.

Implementation Details
In this work, we report results of PPTOD with three model sizes: PPTOD_small, PPTOD_base, and PPTOD_large. These three models are initialized with the T5-small, T5-base, and T5-large models (Raffel et al., 2020), which contain ∼60M, ∼220M, and ∼770M parameters, respectively. We pre-train the models of all three configurations on our collected pre-training corpora for 10 epochs. The training samples are truncated to a maximum length of 1024. The models are trained using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5e-5 and a batch size of 128. Our implementation is based on the Hugging Face Transformers library (Wolf et al., 2019a).

End-to-End Dialogue Modelling
End-to-end dialogue modelling aims at evaluating the model in the most realistic, fully end-to-end setting, where the generated dialogue states are used for the database search and response generation (Zhang et al., 2020b; Hosseini-Asl et al., 2020).

Dataset and Evaluation Metric
We conduct experiments on the benchmark MultiWOZ 2.0 (Budzianowski et al., 2018) and 2.1 (Eric et al., 2020) datasets.3 In MultiWOZ, the generation of the response is not only related to the dialogue context, but also grounded on the database (DB) state. The DB state is automatically retrieved from a pre-defined database using the generated dialogue state (DST). Following previous studies, during inference, PPTOD first predicts the DST result to retrieve the DB state. Then, based on the retrieved DB state and the dialogue context, the results of POL and NLG are generated in parallel. In Section §5, we further compare the performance of our model with and without using the DB state as input.
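The per-turn inference flow just described can be sketched as follows. Here `generate` stands in for the prompted seq2seq model and `db_lookup` for the database query; all names and the toy outputs are our own illustration, not the released implementation:

```python
def end_to_end_turn(generate, db_lookup, context):
    """One turn of end-to-end inference: DST first, then POL and NLG."""
    belief = generate("DST", context, db=None)   # 1. predict the dialogue state
    db_state = db_lookup(belief)                 # 2. retrieve the DB state
    # 3. POL and NLG depend only on (context, db_state), not on each other,
    #    so they can be decoded in parallel.
    act = generate("POL", context, db=db_state)
    response = generate("NLG", context, db=db_state)
    return belief, act, response

# Toy stand-ins that only illustrate the data flow.
def toy_generate(task, context, db=None):
    return {
        "DST": "[restaurant] food indian price expensive",
        "POL": "[restaurant] [request] area",
        "NLG": "there are [value_choice] restaurants . what area do you prefer ?",
    }[task]

def toy_db_lookup(belief):
    return "[db_3]"  # e.g. a bucketed match count
```

Because POL and NLG share the same inputs, a real implementation can decode them as one batch of two prompted sequences rather than two sequential calls.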
Table 2: End-to-end evaluation. †: the models require the history of oracle dialogue states when making predictions at the current turn. ‡: UBAR scores are acquired with the author-released models. §: as the authors did not release their code, we cite the results of TOP and TOP+NOD on MultiWOZ 2.0 from the original paper (Liu et al., 2021).

For evaluation, we follow the original MultiWOZ guidance for all individual metrics: Inform, Success, and BLEU (Papineni et al., 2002). An overall measurement, the combined score (Mehri et al., 2019), is also reported, which is defined as Combined = (Inform + Success) × 0.5 + BLEU.
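As a sanity check, the combined score is a one-line computation over the three percentages:

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined = (Inform + Success) * 0.5 + BLEU, all on a 0-100 scale."""
    return (inform + success) * 0.5 + bleu

# e.g. Inform 85.0, Success 75.0, BLEU 18.0 -> 98.0
```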

Baselines
We compare PPTOD with several strong baselines, including Sequicity (Lei et al., 2018), MD-Sequicity (Zhang et al., 2020b), DAMD (Zhang et al., 2020b), MinTL (Lin et al., 2020), HIER-Joint (Santra et al., 2021), LABES-S2S (Zhang et al., 2020a), SimpleTOD (Hosseini-Asl et al., 2020), UBAR (Yang et al., 2021), SOLOIST (Peng et al., 2021), and TOP and TOP+Noisy Online Decoding (TOP+NOD) (Liu et al., 2021).

Table 2 shows the main results. On both the MultiWOZ 2.0 and 2.1 datasets, PPTOD performs better than previous SOTA methods on seven out of eight metrics. In particular, it is worth mentioning that our model is a single architecture that does not require additional language models for re-ranking the outputs as in TOP+NOD (Liu et al., 2021). Moreover, the results show that the large-size PPTOD_large underperforms PPTOD_small and PPTOD_base. Our analysis is that the larger model is less adept at learning to generate the delexicalized tokens for the NLG task, as such tokens are not seen during its pre-training stage.

Low-Resource Evaluation
To investigate the generalization ability of PPTOD, we evaluate it in a more challenging low-resource scenario. Following previous studies, we train our model on MultiWOZ 2.0 by varying the percentage of training data, ranging from 1% (∼80 samples) to 20% (∼1600 samples). We compare our model with several strong baselines, including MD-Sequicity, DAMD, SOLOIST, and MinTL.4 In each low-resource setting, we train our model five times with different random seeds and different selections of training data. The average scores over five runs are presented in Table 3.5 As seen, PPTOD consistently outperforms all baseline models by a large margin. Notably, our performance gain is even larger when fewer samples are used for training. This indicates that PPTOD better leverages the prior knowledge learned during pre-training when the training data is limited.

Table 4: DST joint accuracy (%) on MultiWOZ 2.0 and 2.1. Classification-based approaches include GLAD (Zhong et al., 2018): 35.57/-, GCE (Nouri and Hosseini-Asl, 2018): 36.27/-, FJST (Eric et al., 2020): 40.20/38.00, SUMBT (Lee et al., 2019a): 46.65/-, and TOD-BERT (Wu et al., 2020): -/48.00, among others.

Dialogue State Tracking

Full Training Evaluation
We compare PPTOD with a wide range of existing methods that can be categorized into two classes: (1) classification-based approaches and (2) generation-based approaches. Classification-based approaches (Zhang et al., 2019; Chen et al., 2020; Shan et al., 2020; Zhou et al., 2021) predict the dialogue state by selecting from a fixed ontology. This idea of a fixed ontology is not scalable, as in real-world applications the ontology is subject to constant change (Heck et al., 2020). In contrast, PPTOD directly generates the outputs, making it more adaptive and generalizable to new ontology labels in real-world applications.

Low-Resource Evaluation
To investigate how well PPTOD performs with limited training samples on the downstream task, we evaluate it in a simulated low-resource setting. Specifically, we train the model on MultiWOZ 2.0 by varying the percentage of training data (i.e., 1%, 5%, 10%, and 20%). We compare PPTOD with three strong generation-based baselines, SimpleTOD, MinTL, and SOLOIST, using the official code released by the authors. Table 5 shows the experimental results. As seen, in all settings, PPTOD outperforms the other baselines by a large margin. In the extreme scenario, with only 1% of training data, PPTOD surpasses the strongest baseline, SOLOIST, by 18 accuracy points. This demonstrates that our model is more generalizable and can be better applied to new tasks where the amount of training data is limited.
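Such low-resource splits can be produced with a seeded subsampling helper. The sketch below is our own; the helper name and the exact rounding are assumptions, not taken from the paper:

```python
import random

def low_resource_split(train_set, fraction, seed):
    """Return a seeded random subset covering `fraction` of the training data.

    With MultiWOZ 2.0's roughly 8,400 training dialogues, fraction=0.01
    yields on the order of 80 dialogues, matching the ~80-sample 1% setting.
    Fixing the seed makes each run reproducible; varying it across the five
    runs gives different selections of training data."""
    rng = random.Random(seed)
    k = max(1, round(len(train_set) * fraction))
    return rng.sample(train_set, k)
```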

Intent Classification
The goal of intent classification, i.e. NLU, is to classify the user's intent based on the user's utterance. We conduct experiments on the benchmark Banking77 dataset (Casanueva et al., 2020) that contains data with 77 different intents. Following previous studies (Casanueva et al., 2020;Peng et al., 2021), we test our model in both full training and low-resource settings. In the low-resource setting, we vary the number of training samples per intent from 10 to 30. The standard classification accuracy is reported for evaluation.
We compare PPTOD with several strong baselines, including BERT-Fixed, BERT-Tuned, USE+ConveRT (Casanueva et al., 2020), USE, ConveRT (Henderson et al., 2020), and SOLOIST (Peng et al., 2021). It is worth mentioning that all compared baselines are classification-based approaches that use a classifier with a softmax layer to make predictions over the pre-defined intent set. In contrast, as described in Section §3.2, PPTOD solves the classification task as a generation problem by directly generating the text of the intent label. Therefore, when adapting to a new classification task, PPTOD is more flexible and requires no extra model parameters.
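Because PPTOD emits the intent label as free text, evaluation needs a mapping from the generated string back onto the closed set of 77 labels. Below is a sketch of one plausible mapping; the token-overlap fallback is our assumption, as the excerpt does not specify how rare off-list generations would be handled:

```python
def predict_intent(generated: str, label_set) -> str:
    """Map generated label text onto the closed label set.

    Exact matches pass through; otherwise fall back to token-overlap
    (Jaccard) similarity over underscore-split label words. The fallback
    is a hypothetical choice for illustration."""
    if generated in label_set:
        return generated

    def jaccard(a, b):
        ta = set(a.replace("_", " ").split())
        tb = set(b.replace("_", " ").split())
        return len(ta & tb) / max(len(ta | tb), 1)

    return max(label_set, key=lambda lab: jaccard(generated, lab))
```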
In the experiments, we train PPTOD for five runs with different selections of training data and random seeds. The average scores and standard deviations are reported in Table 7. We see that PPTOD is comparable with existing methods, and in the 30-sample low-resource and full-training settings, PPTOD_large achieves the best results. Our performance gains are even more remarkable given that PPTOD requires no extra parameters when solving the classification task.

Further Analysis
In this section, we present further discussions and empirical analyses of the proposed model.

Plug-and-Play vs Cascaded Generation
First, we compare our plug-and-play generation with the cascaded generation that is adopted by most existing studies. To this end, we fine-tune a T5-small model (without dialogue multi-task pre-training) on MultiWOZ 2.0 using either the plug-and-play or the cascaded formulation. Moreover, we also examine the effect of the DB state on the model performance. Specifically, for the plug-and-play model, when utilizing the DB state, it first predicts the dialogue state (DST) to retrieve the DB state from the pre-defined database. Then, based on the DB state and the dialogue context, the outputs of POL and NLG are generated in parallel. When ignoring the DB state, the plug-and-play model generates the DST, POL, and NLG results in a fully parallel fashion.
For evaluation, we report the results on the end-to-end dialogue modelling task. In addition, we report the average inference latency and relative speedup of each model.6 We compare our ablated models with two strong baselines, SOLOIST and MinTL.7 Table 6 presents the results. As seen, the plug-and-play models yield better results than their cascaded counterparts. One reason is that, for cascaded models, the previously generated results are explicitly used as model input for later sub-tasks, which leads to error accumulation. Moreover, we see that using the DB state generally improves the model performance for both plug-and-play and cascaded models, as it provides the model with more grounding information. Furthermore, with the DB state, our plug-and-play model achieves a better overall score than MinTL with an around 4× speedup. This suggests that the plug-and-play formulation benefits the model both in terms of generation accuracy and inference latency.

6 The latency of each model is measured on a single Nvidia V100 GPU with a batch size of 1.
7 We did not include TOP+NOD (Liu et al., 2021) for comparison as the authors did not release their code.

Multi-Task Pre-Training Investigation

Next, we provide further analyses of the dialogue multi-task pre-training strategy. To quantify the importance of different pre-training data, we pre-train the T5-small model using data annotated for a single TOD-related task (i.e., NLU, DST, POL, or NLG). After pre-training, we evaluate the models on three downstream TOD tasks using the MultiWOZ 2.0 and Banking77 datasets. For end-to-end dialogue modelling and dialogue state tracking, we test the model in both 1% and full-training settings. For intent classification, we measure the accuracy of models trained with either 10 training samples per intent or the full training set. Table 8 presents the results, with the first row showing the performance of the vanilla T5-small model. As seen, without any pre-training, the vanilla T5-small model performs poorly in the low-resource setting of all evaluated tasks. This suggests that the prior knowledge from pre-training is indispensable for the model to achieve strong performance in low-resource scenarios.
Moreover, we see that pre-training with data annotated for an individual TOD-related task helps the model to attain a better result in the corresponding downstream task. For example, pre-training with DST data notably improves the model performance in the downstream DST task in both the low-resource and full-training settings. Similarly, pre-training with NLG data helps the model to obtain a better BLEU score in the end-to-end dialogue modelling task.
Lastly, we see that the PPTOD_small model attains the best results on most of the evaluation metrics. This suggests that the pre-training data with different annotations are compatible with each other and that the joint utilization of all pre-training data helps the model to achieve the best overall performance.

Human Evaluation
Table 9: Human Evaluation Results

We also conduct a human evaluation with the help of graders proficient in English using an internal evaluation platform. For evaluation, we randomly selected 50 dialogue sessions from the test set of the MultiWOZ 2.0 dataset. We compare the results generated by the PPTOD_base model against the results from the SOLOIST model. All generated results, plus the reference, are evaluated by five graders on a 3-point Likert scale (0, 1, or 2) for each of the following features:8
• Understanding: Whether the system correctly understands the user's goal.
• Truthfulness: Whether the system's response is factually supported by the reference. 9 • Coherency: Whether the system's response is semantically coherent with the context.
• Fluency: Whether the system's response is grammatically fluent and easy to understand.

Table 9 lists the results, with the first row showing strong inter-annotator agreement as measured by the Fleiss kappa coefficient (Fleiss et al., 1971). Compared with SOLOIST, our model achieves better scores on all metrics. Moreover, on the truthfulness and coherency metrics, our model significantly outperforms SOLOIST as judged by the Sign Test (p-value < 0.05), suggesting that PPTOD generates more factually correct and semantically coherent responses. Finally, we note that on the fluency metric, both systems perform comparably with the reference (p-value > 0.4). This shows that the fluency of such systems is largely guaranteed by the prior syntactic knowledge from pre-trained language models, which suggests that future research should focus more on other aspects of dialogue systems.
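The inter-annotator agreement statistic used here, Fleiss' kappa, can be computed without external libraries. The sketch below is our own implementation of the standard formula, not the authors' evaluation code; kappa is 1 under perfect agreement and 0 when observed agreement equals chance agreement:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items rated by n raters into k categories.

    `ratings` is an N x k matrix of counts: ratings[i][j] is the number of
    raters who put item i into category j; every row must sum to the same n.
    Assumes ratings are not unanimously in one single category (P_e < 1)."""
    N, k = len(ratings), len(ratings[0])
    n = sum(ratings[0])
    # overall proportion of assignments falling into each category
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # observed agreement, averaged over items
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings) / N
    # expected chance agreement
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```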
Conclusion

In this paper, we propose PPTOD, a unified model that supports both task-oriented dialogue understanding and response generation in a plug-and-play manner. In addition, we introduce a new dialogue multi-task pre-training strategy to further augment our model's ability to complete TOD-related tasks. Extensive experiments and analyses are conducted on three benchmark TOD tasks in both high-resource and low-resource settings. The automatic and human evaluations demonstrate that PPTOD outperforms the current SOTA systems on various evaluation metrics.

Ethical Statement
We honor and support the ACL Code of Ethics. Task-oriented dialogue systems aim to interact with and assist users in fulfilling their goals. The interaction and assistance process does not involve any bias towards the participants. All datasets used in this work are from previously published works and, in our view, do not have any attached privacy or ethical issues.

A Dataset Details
We elaborate on the details of the dialogue datasets included in the pre-training corpora.
• MetaLWOZ (Lee et al., 2019b) is designed for improving models' ability in generating natural language responses in unseen domains. It contains annotations for natural language generation (NLG) spanning over 47 domains.
• SNIPS (Coucke et al., 2018) is designed to help develop models capable of understanding users' intents (i.e., natural language understanding (NLU)). Its data consists of users' utterances gathered by crowdsourcing, with over 20 intent labels across 9 domains.
• CLINC (Larson et al., 2019) is built for improving models' ability to detect out-of-scope user intents. It contains data with NLU annotations for 150 intents across 10 different domains.
• ATIS (Amin, 2019) is used for building intent classification (NLU) models. It contains data with 22 user intents from the airline travel information domain.
• KVRET (Eric et al., 2017) is an in-car personal assistant dataset with dialogues from three domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. It contains annotations for user belief state (DST) and system response (NLG).
• WOZ and CamRest676 (Wen et al., 2017) are collected with the Wizard-of-Oz procedure. They contain dialogues with DST and NLG annotations from the restaurant domain.
• MSR-E2E (Li et al., 2018) contains dialogues from three domains, including movie-ticket booking, restaurant reservation, and taxi booking. The data are annotated for three TODrelated tasks: DST, POL, and NLG.
• Frames (El Asri et al., 2017) contains dialogues from the trip booking domain. Its data are annotated for three TOD-related tasks, including DST, POL, and NLG.
• TaskMaster (Byrne et al., 2019) includes dialogues from six domains. Its data is collected with Wizard-of-Oz and self-dialogue approaches. The dataset is annotated with DST, POL, and NLG.

B Low-Resource MultiWOZ Evaluation
In Table 10, we show the complete results of our model on MultiWOZ 2.0 under different low-resource settings, along with the mean and standard deviations. To obtain more reliable results, for each setting, we train our model for five runs with different selections of training data and different random seeds.

C Human Evaluation Guidelines
Please evaluate the system's response with respect to the following features: (1) Understanding; (2) Truthfulness; (3) Coherency; and (4) Fluency. In the following, we provide some guidelines regarding how to judge the quality of the system's response in terms of different features.

C.1 Understanding
This metric measures whether the system's response shows that the system is able to understand the goal and intent of the user. The definitions of the different scores are:
• 2: The system completely understands the user's goal and intent.
• 1: The system partially understands the user's goal and intent.
• 0: The system does not understand the user's goal and intent at all.

C.2 Truthfulness
This metric measures whether the system's response is factually supported by the reference response. The definitions of the different scores are:
• 2: The facts in the system's response are all supported by or can be inferred from the reference response.
• 1: Some facts in the system's response are not supported by the reference response.
• 0: The facts in the system's response are mostly or entirely unsupported by the reference response.

C.3 Coherency
This metric measures whether the system's response is logically coherent with the dialogue context. The definitions of the different scores are:
• 2: The system's response is logically coherent with the dialogue context.
• 1: The system's response contains minor information that is off the topic of the dialogue context.
• 0: The system's response is completely irrelevant to the dialogue context.

C.4 Fluency
This metric measures the fluency of the system's response. The definitions of the different scores are:
• 2: The system's response is grammatically correct and easy to understand.
• 1: The system's response contains minor errors but they do not affect your understanding.
• 0: The system's response does not make sense and it is unreadable.

D Case Study
Table 11 presents a generated dialogue example from the PPTOD_base model. The user starts the conversation by asking for an expensive restaurant that serves Indian food for dinner. PPTOD finds 14 restaurants that satisfy the user's goal and asks the user for a preferred location. We can see that, when the user states no preference on the restaurant location, PPTOD correctly updates the dialogue state by adding the area information, which is missed by the oracle information. Then the user switches the dialogue topic to booking a hotel. Through the dialogue trajectory, we see that PPTOD completes the dialogue by successfully providing the user with the necessary information, such as the number of hotel choices (at turn 3) and the booking reference number (at turn 6). When finding that the user's booking request cannot be fulfilled (at turn 5), the model asks the user for an alternative option. Moreover, this example also demonstrates that PPTOD is able to deal with some NLU challenges displayed in the conversations. For example, at turn 4, the user already provides the information about the Gonville Hotel. But only after the user describes the intention of booking the hotel at turn 5 does the model update the name of the hotel in the dialogue state based on the co-referenced information from the previous turn. Interestingly, the hotel name is ignored by the oracle dialogue state, but our model correctly detects it. The dialogue understanding ability of PPTOD can also be observed at turn 6, in which it updates the hotel stay in the belief state from 2 days to 1 day after the user provides the corresponding information.