ConvLab-3: A Flexible Dialogue System Toolkit Based on a Unified Data Format

Task-oriented dialogue (TOD) systems function as digital assistants, guiding users through various tasks such as booking flights or finding restaurants. Existing toolkits for building TOD systems often fall short in delivering comprehensive arrays of data, models, and experimental environments with a user-friendly experience. We introduce ConvLab-3: a multifaceted dialogue system toolkit crafted to bridge this gap. Our unified data format simplifies the integration of diverse datasets and models, significantly reducing complexity and cost for studying generalization and transfer. Enhanced with robust reinforcement learning (RL) tools, featuring a streamlined training process, in-depth evaluation tools, and a selection of user simulators, ConvLab-3 supports the rapid development and evaluation of robust dialogue policies. Through an extensive study, we demonstrate the efficacy of transfer learning and RL and showcase that ConvLab-3 is not only a powerful tool for seasoned researchers but also an accessible platform for newcomers.


Introduction
Many task-oriented dialogue (TOD) datasets and models have been proposed along with the advancement of dialogue system research. However, the diversity in data formats and ontologies between datasets brings inconvenience to model adaptation and uniform evaluation, which potentially hinders the study of model generalization and knowledge transfer across datasets. To address this issue, we define a unified data format that serves as the adapter between TOD datasets and models: datasets are first transformed to the unified format and then loaded by models. Following this design, we present ConvLab-3, a flexible dialogue system toolkit supporting many datasets and models.
As illustrated in Figure 1, a unified dataset in ConvLab-3 consists of dialogues, ontology, and database (or API interface). We define the unified format over user goal, dialogue acts, state, API result, etc. to support common tasks, while keeping dataset-specific annotation (e.g., emotion) intact. Once a dataset is transformed into the unified format, it can be immediately used by models on the tasks it supports. Similarly, a model supporting the unified format can access all transformed datasets. This feature reduces the cost of adapting M models to N datasets from M × N to M + N, saving the time of (1) researchers to conduct (especially transfer learning) experiments, (2) developers to build an agent with custom datasets, and (3) community contributors to consistently add models, datasets, and uniform data processing functions.
Compared with other dialogue toolkits such as PyDial (Ultes et al., 2017), Rasa (Bocklisch et al., 2017), ParlAI (Miller et al., 2017), and NeMo (Kuchaiev et al., 2019), ConvLab-3 and its predecessors (Lee et al., 2019b; Zhu et al., 2020b) provide recent powerful models for all components of a dialogue system, which allows people to assemble conversational agents with different recipes and to evaluate the agent with various user simulators. ConvLab-3 inherits the framework of ConvLab-2 (Zhu et al., 2020b) but adds support for the unified data format to existing datasets, commonly used and state-of-the-art models, and task evaluation scripts. Thanks to the unified data format, we further add many TOD datasets and powerful transformer-based models, including two transferable user simulators (Lin et al., 2021a, 2022). In addition, we substantially advanced the RL toolkit by 1) simplifying the process of building the dialogue system and user environment with different modules, 2) improving the RL training process, and 3) adding evaluation and analysis tools.
Since the formats of different datasets are unified, it is convenient to perform dataset-agnostic evaluation and investigate the effects of different transfer learning settings such as domain adaptation (Zhao and Eskenazi, 2018), domain adaptive pre-training (Gururangan et al., 2020), and continual learning (Madotto et al., 2021). In this paper, we explore several transfer strategies and conduct comprehensive experiments on multiple tasks and settings. Experimental results show that the benefit of transfer learning is extremely large when labeled data in the target domain is insufficient but gradually decreases as the amount of labeled data increases. Moreover, we conduct several RL experiments with multiple data scenarios, different user simulators, and the usage of uncertainty. The results show that pre-training on different data is not necessarily beneficial for initial performance but leads to better final performance, that training with different user simulators leads to different dialogue system behavior, and that dialogue systems which leverage uncertainty estimates in the state prediction request information from users more frequently in order to resolve uncertainty.

```python
from convlab.util import *

dataset_name = "multiwoz21"
# load dataset: a dict mapping data_split to dialogues
dataset = load_dataset(dataset_name)
# load dataset in a predefined order with a custom
# split ratio for reproducible few-shot experiments
dataset = load_dataset(dataset_name,
                       dial_ids_order=0,
                       split2ratio={"train": 0.01})
# load ontology and database similarly
ontology = load_ontology(dataset_name)
database = load_database(dataset_name)
# query the database with domain and state
state = {"hotel": {"area": "east",
                   "price range": "moderate"}}
res = database.query("hotel", state, topk=3)
# example functions based on the unified format:
# load the user turns in the test set for the NLU task
nlu_data = load_nlu_data(dataset, "test", "user")
# dataset-agnostic delexicalization
dataset, delex_vocab = create_delex_data(dataset)
```
Listing 1: Example usage of unified datasets.

Unified Data Format
Our goal is to unify the format of annotations included in many datasets and commonly used by dialogue models (e.g., dialogue acts, state), while keeping the original format of annotations that only appear in specific datasets (e.g., emotion). As we integrate more datasets in the future, we will transform more kinds of common annotations into the unified format.
In our unified data format, a dataset consists of (1) an ontology that defines the annotation schema, (2) dialogues with transformed annotations, and (3) a database or API interface that links to external knowledge sources. Further, the metadata of the dataset, such as the format transformation process and data statistics, is described in a dataset card.
As shown in Listing 1, we provide utility functions to process the unified datasets, such as delexicalization, splitting data for few-shot learning, and loading data for specific tasks. Based on the unified format, evaluations of common tasks are standardized. We list the datasets and models supporting the unified data format in Tables 1 and 2, respectively.

Ontology
Following Rastogi et al. (2020), an ontology consists of: (1) domains and their slots in a hierarchical format, where each slot has a Boolean flag indicating whether it is a categorical slot (whose value set is small and fixed); (2) all possible intents in dialogue acts; (3) the possible dialogue acts appearing in the dialogues, each comprised of intent, domain, slot, and originator (i.e., system or user); and (4) a template dialogue state. We also provide a natural language description, if available, for each domain, slot, and intent.
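To make the schema concrete, here is a minimal sketch of what a unified-format ontology could look like. The field names follow the description above, but the concrete domain, slots, and values are invented for illustration and do not reproduce ConvLab-3's exact JSON layout.

```python
# Illustrative unified-format ontology (values invented for illustration).
ontology = {
    "domains": {
        "hotel": {
            "description": "find and book a hotel",
            "slots": {
                # categorical slot: small, fixed value set
                "price range": {
                    "description": "price range of the hotel",
                    "is_categorical": True,
                    "possible_values": ["cheap", "moderate", "expensive"],
                },
                # non-categorical slot: values cannot be enumerated
                "address": {
                    "description": "street address of the hotel",
                    "is_categorical": False,
                },
            },
        }
    },
    # all possible intents in dialogue acts
    "intents": {
        "inform": {"description": "provide a slot value"},
        "request": {"description": "ask for a slot value"},
    },
    # template dialogue state, copied and filled during a conversation
    "state": {"hotel": {"price range": "", "area": ""}},
}
```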

Dialogues
For a dialogue in the unified format, dialogue-level information includes the dataset name, data split, unique dialogue ID, involved domains, user goal, etc. Following MultiWOZ (Budzianowski et al., 2018), a user goal has informable slot-value pairs, requestable slots, and a natural language instruction summarizing the goal. Turn-level information includes speaker, utterance, dialogue acts, state, API result, etc. Each dialogue act is a tuple consisting of intent, domain, slot, and value. According to the value, we divide dialogue acts into three groups: (1) categorical for categorical slots whose value set is predefined in the ontology (e.g., inform the weekday of a flight); (2) non-categorical for non-categorical slots whose values cannot be enumerated (e.g., inform the address of a hotel); (3) binary for intents without actual values (e.g., request the address of a hotel). The state is initialized by the template state defined in the ontology and updated during the conversation, containing slot-value pairs of the involved domains. An API result is a list of entities retrieved from the database or other knowledge sources. We list the common annotations included in the unified data format and the tasks they support in Table 2. Other dataset-specific annotations are retained in their original formats.
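As an illustration, a single user turn under this scheme might look as follows. The keys mirror the annotations described above, while the actual utterance and values are invented.

```python
# Illustrative unified-format user turn (utterance and values invented).
turn = {
    "speaker": "user",
    "utterance": "I need a cheap hotel. What is its address?",
    "dialogue_acts": {
        # categorical: the slot's value set is predefined in the ontology
        "categorical": [{"intent": "inform", "domain": "hotel",
                         "slot": "price range", "value": "cheap"}],
        # non-categorical: free-form values (none in this turn)
        "non-categorical": [],
        # binary: intents without actual values, e.g. requesting a slot
        "binary": [{"intent": "request", "domain": "hotel",
                    "slot": "address", "value": ""}],
    },
    # state: template state from the ontology, updated with this turn
    "state": {"hotel": {"price range": "cheap", "area": ""}},
}
```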

Database/API Interface
To unify the interaction with different types of databases, we define a base class BaseDatabase that has an abstract query function to be customized. The query function takes the current domain, dialogue state, and other custom arguments as input and returns a list of top-k candidate entities, consistent with the format of the API result in dialogues. By inheriting BaseDatabase and overriding the query function, we can access different databases/APIs in a similar fashion.
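The pattern can be sketched as follows. BaseDatabase and query are the names used above, but this stand-in implementation (and the HotelDatabase subclass with its toy records) is only illustrative, not ConvLab-3's actual code.

```python
from abc import ABC, abstractmethod

class BaseDatabase(ABC):
    """Illustrative stand-in for the base class described above."""

    @abstractmethod
    def query(self, domain, state, topk, **kwargs):
        """Return the top-k entities matching the current dialogue state."""

class HotelDatabase(BaseDatabase):
    def __init__(self, records):
        self.records = records  # domain -> list of entity dicts

    def query(self, domain, state, topk=3, **kwargs):
        # keep entities satisfying every constraint of the current domain
        constraints = state.get(domain, {})
        matches = [r for r in self.records.get(domain, [])
                   if all(r.get(slot) == value
                          for slot, value in constraints.items())]
        return matches[:topk]

db = HotelDatabase({"hotel": [
    {"name": "A", "area": "east", "price range": "moderate"},
    {"name": "B", "area": "west", "price range": "cheap"},
]})
res = db.query("hotel", {"hotel": {"area": "east"}}, topk=3)
print(res)  # [{'name': 'A', 'area': 'east', 'price range': 'moderate'}]
```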
Inspired by Raffel et al. (2020), who convert all NLP tasks into a text-to-text format, we provide serialization and deserialization methods for annotations in the unified data format, as shown in Table 3. In this way, we can apply a text-to-text model to solve any task in Table 2. As an example, we provide a unified training script for T5 (Raffel et al., 2020), producing a family of models including T5-NLU, T5-DST, T5-NLG, etc. Other pre-trained language models such as GPT-2 (Radford et al., 2019) could similarly be used as backbone models. We regard these models as base models for solving a task since they do not involve dataset- or task-specific model design and are simple to use. Developers who are unfamiliar with dialogue models can conveniently apply the base models to their custom datasets.
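A hypothetical serialization in the spirit of Table 3 might flatten dialogue acts into a bracketed string that a text-to-text model can consume; the exact string format ConvLab-3 uses may differ, and the helper below is our own illustration.

```python
# Hypothetical dialogue-act serialization for a text-to-text model.
def serialize_dialogue_acts(acts):
    parts = []
    # flatten the three act groups into "[intent][domain][slot][value]" units
    for group in ("categorical", "non-categorical", "binary"):
        for a in acts.get(group, []):
            parts.append("[{intent}][{domain}][{slot}][{value}]".format(**a))
    return ";".join(parts)

acts = {"categorical": [{"intent": "inform", "domain": "hotel",
                         "slot": "price range", "value": "cheap"}],
        "binary": [{"intent": "request", "domain": "hotel",
                    "slot": "address", "value": ""}]}
print(serialize_dialogue_acts(acts))
# [inform][hotel][price range][cheap];[request][hotel][address][]
```

Deserialization would invert this mapping, so the same backbone can both read and emit structured annotations as plain text.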

Reinforcement Learning Toolkit
Compared to ConvLab-2, we improve the reinforcement learning toolkit for the dialogue policy module in ConvLab-3. We simplify the process of building the dialogue system and its RL environment, extend the modularity of the vectorizer codebase, incorporate multiple user simulators (USs), and offer a wide range of evaluation metrics.

Pipeline Setup
A dialogue system agent can comprise a wide range of sub-modules, as listed in Table 2. ConvLab-3 fully supports the straightforward building of any combination and allows easy substitution of components. This also includes the definition of the agent's environment, given by the choice of user policy and its components. In ConvLab-3, the various components can be combined through a simple configuration file.
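The plug-and-play idea can be illustrated with a toy pipeline in which every component is a plain callable, so any module with a matching signature can be substituted. The class and all component names here are invented stand-ins, not ConvLab-3's actual agent API.

```python
# Toy illustration of pipeline composition with swappable components.
class ToyPipelineAgent:
    def __init__(self, nlu, dst, policy, nlg):
        self.nlu, self.dst, self.policy, self.nlg = nlu, dst, policy, nlg

    def response(self, utterance, state):
        user_acts = self.nlu(utterance)      # text -> semantic acts
        state = self.dst(state, user_acts)   # update dialogue state
        sys_acts = self.policy(state)        # state -> system acts
        return self.nlg(sys_acts), state     # system acts -> text

# Trivial stand-in components; each could be replaced by a trained model.
agent = ToyPipelineAgent(
    nlu=lambda utt: [("inform", "hotel", "area", "east")] if "east" in utt else [],
    dst=lambda state, acts: {**state, **{f"{d}-{s}": v for _, d, s, v in acts}},
    policy=lambda state: [("request", "hotel", "price range", "")] if state else [],
    nlg=lambda acts: "What price range do you want?" if acts else "Hello!",
)
reply, state = agent.response("a hotel in the east please", {})
print(reply)  # What price range do you want?
```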

Vectorizer Class
The dialogue policy module trained using RL obtains the output of the DST as input and typically outputs a semantic action. As in ConvLab-2, ConvLab-3 uses a vectorizer class as a communication module that translates the semantic actions and DST output into vectorized representations using the given ontology. We extend and modify that module and make the vectorizer easily interchangeable in order to support different vectorization strategies. Firstly, we add the option to mask certain actions as in Ultes et al. (2017) to aid policy learning or to prevent the policy from taking invalid actions. Secondly, the module leverages the functionalities described in Section 2 to allow easy adaptation to new datasets.
As the vectorizer module takes care of producing a suitable input representation, it is possible to use any off-the-shelf neural network architecture for the policy module and only modify the vectorizer to fit the required input. As an example, van Niekerk et al. (2021) define a belief tracking module that outputs a semantic state accompanied by relevant uncertainty scores, which are used as input to the policy. In order to implement this change, the policy module can remain untouched; only a new vectorizer (inheriting from the base class) needs to be implemented with one required class method. This simplifies the integration of new policy modules and vectorization strategies.
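The division of labor can be sketched as follows: a base vectorizer with one required method and an optional action mask, plus a subclass that appends uncertainty scores to the state vector. The class and method names are illustrative stand-ins for ConvLab-3's actual vectorizer interface.

```python
class BaseVectorizer:
    """Illustrative base class: subclasses implement one required method."""

    def __init__(self, actions):
        self.actions = actions  # fixed ordering of semantic system actions

    def state_vectorize(self, state):
        raise NotImplementedError

    def action_mask(self, state):
        # By default every action is allowed; a subclass can zero out
        # invalid actions (e.g., booking before any entity was found).
        return [1.0] * len(self.actions)

class UncertaintyVectorizer(BaseVectorizer):
    """Appends per-slot confidence scores to binary slot-filled features."""

    def state_vectorize(self, state):
        filled = [1.0 if v else 0.0 for v in state["slots"].values()]
        return filled + list(state["confidence"].values())

vec = UncertaintyVectorizer(["inform-area", "request-area"])
features = vec.state_vectorize(
    {"slots": {"area": "east", "price range": ""},
     "confidence": {"area": 0.9, "price range": 0.2}})
print(features)  # [1.0, 0.0, 0.9, 0.2]
```

The policy network never sees the change: it only consumes a longer feature vector.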

Evaluation Tools
The core metrics for a dialogue policy trained using RL are success and return (Gašić et al., 2015). Nevertheless, as these policies are built for interacting with humans, it is necessary to provide an in-depth analysis of their behavior. For instance, is the policy providing too many actions in a turn, which a simulator may handle well but which leads to information overload for real humans? Does the policy have a tendency to only use some specific intents but neglect others? How are the actions taken by the system distributed in general? To easily answer these questions, we provide plotting tools to conveniently compare algorithms with each other, both in terms of performance and action distributions. Moreover, ConvLab-3 provides an evaluation function to inspect generated dialogues, thereby enabling a full understanding of the dialogue policy's behavior. We will showcase the additional analysis tools in Section 6.2. Lastly, we use a stricter form of task success that we call strict success, which extends the previous success definition. While the previous version checked whether the necessary information was provided and an entity was booked if required, it did not check whether the entity fulfills the booking constraints. As a consequence, it was possible to book an entity for a wrong day or with a wrong number of people and still be successful.
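The distinction can be sketched in a few lines; the function and field names are illustrative, not ConvLab-3's actual evaluator, but the check mirrors the definition above: requested information must be provided and the booked entity must satisfy the goal constraints.

```python
# Illustrative sketch of the "strict success" check described above.
def strict_success(goal_constraints, requested, provided, booked_entity):
    # the old check: every requested slot was answered
    info_ok = all(slot in provided for slot in requested)
    # the stricter part: the booked entity matches the goal constraints
    booking_ok = booked_entity is not None and all(
        booked_entity.get(slot) == value
        for slot, value in goal_constraints.items())
    return info_ok and booking_ok

# A booking on the wrong day passes the old definition but not this one:
goal = {"day": "friday", "people": "2"}
entity = {"name": "A", "day": "saturday", "people": "2"}
print(strict_success(goal, ["address"], {"address": "1 Main St"}, entity))  # False
```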
In summary, these changes allow easy building, extension, and evaluation of dialogue policies, making them more accessible to researchers.

Experimental Setup
To show how the new features of our platform may promote TOD research, we conduct comprehensive experiments, which can be categorized into reinforcement learning for the dialogue policy module and supervised learning for other modules.

Supervised Learning
Based on the unified data format, conducting experiments on multiple TOD datasets is convenient. We believe this feature will encourage researchers to build general dialogue models that perform well on different datasets and to investigate knowledge transfer between them. To provide insight into how to transfer knowledge from other datasets, we explore three typical transfer learning settings on various tasks, namely (1) pre-training on other datasets and then fine-tuning on the target dataset, (2) joint training on the target dataset and other datasets, and (3) retrieving samples similar to those of the target dataset from other datasets as additional training data or as in-context learning examples.
Metrics We evaluate our newly integrated models covering various tasks with classic metrics: turn accuracy (ACC) and dialogue act F1 score for NLU (Zhu et al., 2020b), joint goal accuracy (JGA) and slot F1 score for DST (Li et al., 2021), BLEU and slot error rate (SER) for NLG (Wen et al., 2015), BLEU and Combined score (Comb.) for End2End (Mehri et al., 2019), turn accuracy and slot-value F1 score (Lin et al., 2021a) for user simulators that output user dialogue acts only (US-DA), and slot-value F1 score and SER for user simulators that output both user dialogue acts and responses (US-NL) (Lin et al., 2022). Note that the calculation of the Combined score for End2End is not exactly the same as in previous works (Peng et al., 2021) because we use our own dataset-agnostic delexicalization instead of the MultiWOZ-specific one (Budzianowski et al., 2018). Since SGD and Taskmaster do not have goal annotations, we calculate the turn-level slot-value F1 score between original and generated responses instead of the Combined score on these datasets. For NLG, we ignore utterances with empty dialogue acts.
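As a reference point, a micro-averaged slot-value F1 of the kind used above can be computed as follows; the toolkit's exact implementation may differ in details such as value normalization.

```python
# Sketch of a micro-averaged slot-value F1 over (domain, slot, value) pairs.
def slot_value_f1(gold_pairs, pred_pairs):
    gold, pred = set(gold_pairs), set(pred_pairs)
    tp = len(gold & pred)  # pairs predicted exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("hotel", "area", "east"), ("hotel", "price range", "cheap")]
pred = [("hotel", "area", "east"), ("hotel", "price range", "expensive")]
print(round(slot_value_f1(gold, pred), 2))  # 0.5
```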

Pre-training then Fine-tuning
In this experiment, we explore knowledge transfer through pre-training on other TOD datasets before fine-tuning. We pre-train models on the SGD and Taskmaster datasets and then fine-tune them on MultiWOZ in full-data or low-resource (1%, 10% of the entire training set) settings. As shown in Listing 1, it is easy to control the data used in few-shot learning experiments through the data loading function. For low-resource fine-tuning, we set the data ratios of both the training and validation sets to 1%/10% and run the training 3 times with different subsets of samples.
If not mentioned otherwise, for existing models we follow the settings of their original papers. We use T5-Small for T5-NLU, T5-DST, and T5-NLG. We initialize SC-GPT and SOLOIST with GPT2-Medium and T5-Base respectively instead of their checkpoints pre-trained on many TOD datasets. For pre-training, we merge the SGD and Taskmaster datasets. Training details and hyperparameters can be found in our GitHub repository.

Joint Training
In this experiment, we investigate the effect of training a model on multiple datasets jointly instead of separately. For joint training, we merge the MultiWOZ, SGD, and Taskmaster datasets into one and train a single model, which requires the model to handle datasets with different ontologies. Intuitively, the advantage of joint training is that knowledge transfer is bi-directional and persists for the whole training period, while the disadvantage is that there may be inconsistent labels for similar inputs across datasets, potentially confusing the models.
To avoid confusion, for T5-NLU, T5-DST, and T5-NLG, we prepend the dataset name to the original input to distinguish data from different datasets. For SetSUMBT, we only predict the state of the target dataset. Since SGD may have several services for one domain, we normalize the service name to the domain name (e.g., Restaurant_1 to Restaurant) when evaluating NLU and DST. However, similar slots of different services (e.g., city and location) will still confuse the model. While further normalization may help, we are aiming to compare independent training and joint training rather than to achieve SOTA performance. For Taskmaster-1/2/3, we evaluate each sample with the corresponding ontology and then calculate the metrics on all test samples of the three datasets. In addition, on SGD and Taskmaster, we build pseudo user goals for TUS and GenTUS by accumulating constraints and requirements in user dialogue acts during conversations.
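The service-name normalization is a one-line transformation: strip the numeric suffix SGD appends to services of the same domain. The helper name is ours.

```python
import re

# Map an SGD service name to its plain domain name, e.g. Restaurant_1 -> Restaurant.
def normalize_service(name):
    return re.sub(r"_\d+$", "", name)

print(normalize_service("Restaurant_1"))  # Restaurant
print(normalize_service("Hotels_4"))      # Hotels
```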

Retrieval Augmentation
We further explore transferring knowledge from other datasets through retrieval-based data augmentation. Here we only consider the single-turn NLU task where the input is an utterance, since utterance-level similarity is easier to model than dialogue-level similarity. For each utterance in the target dataset, we retrieve the top-k (k ∈ {1, 3}) most similar utterances from other datasets, as measured by the MiniLMv2 model (Wang et al., 2021). Since different datasets have different ontologies (i.e., definitions of intent, domain, slot), we prepend the corresponding dataset name to an input utterance as in the joint training experiment. We use the T5-NLU model and try two model sizes, T5-Small and T5-Large. We fine-tune the models on MultiWOZ using the same settings as in the pre-training-then-fine-tuning experiment.
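The retrieval step can be sketched with a toy similarity function. The experiments above score similarity with MiniLMv2 sentence embeddings; this self-contained stand-in uses bag-of-words cosine similarity purely for illustration.

```python
from collections import Counter
from math import sqrt

# Toy embedding: word counts stand in for MiniLMv2 sentence embeddings.
def embed(utterance):
    return Counter(utterance.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Rank a pool of utterances from other datasets by similarity to the query.
def retrieve_topk(query, pool, k=3):
    q = embed(query)
    ranked = sorted(pool, key=lambda u: cosine(q, embed(u)), reverse=True)
    return ranked[:k]

pool = ["book a cheap hotel in the east",
        "what time does the train leave",
        "find me a moderate hotel"]
print(retrieve_topk("i need a cheap hotel", pool, k=1))
# ['book a cheap hotel in the east']
```

The retrieved utterances (with their source-dataset annotations) can then be appended to the training set or prepended to the input as in-context examples.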

Reinforcement Learning
In order to showcase our RL toolkit, we conduct several experiments. The first set of experiments leverages the convenient unified format for pre-training policies before starting the RL training. The second set of experiments concerns training with multiple user simulators as well as incorporating uncertainty estimates in the state representation.

Pre-training then RL Training
In this experiment, we simulate different data scenarios before starting the RL training. We leverage the implemented DDPT policy model for training, as its flexibility allows transfer from one ontology to another. We test four different scenarios:
1. Scratch: the policy starts RL training with random initialization. This simulates the zero-data scenario.
2. SGD: the policy is pre-trained on the full SGD data before RL training starts. This simulates having related data.
3. 1%MWOZ: the policy is pre-trained on 1% of the MultiWOZ 2.1 data before RL training starts. This simulates having scarce in-domain data.
4. SGD->1%MWOZ: the policy is pre-trained on the full SGD data, then fine-tuned on 1% of the MultiWOZ 2.1 data before RL training starts. This simulates having related data as well as scarce in-domain data.
The unified format conveniently allows us to pre-train on different datasets by only specifying the dataset name. The experiments are conducted on the semantic level, leveraging the rule-based dialogue state tracker and the rule-based user simulator (Schatzmann et al., 2007) of ConvLab-3.

Training with Different User Simulators
To ensure a policy learns a general skill that can solve the task with diverse users instead of overfitting to a certain user simulator, it is important to test the policy with various user simulators. ConvLab-3 provides the SOTA user simulators TUS and GenTUS and allows easy usage through the simplified pipeline setup. In this set of experiments, we compare TUS, GenTUS, and the rule-based simulator ABUS by training a dialogue policy in interaction with each of them and performing cross-model evaluation afterwards. The dialogue policy used is a simple multilayer perceptron and is optimized using PPO (Schulman et al., 2017). The simulators and policy are pre-trained on MultiWOZ 2.1, and experiments are run on the semantic level using the rule-based dialogue state tracker.

Training with Uncertainty Features
The simplified vectorizer module in ConvLab-3 makes it possible to easily include dialogue-related features that might benefit dialogue policy learning.
One such example arises from the problem of resolving ambiguities in conversations. Humans naturally identify these ambiguities and resolve the uncertainty resulting from them. For a dialogue system to be robust to ambiguity, it is crucial to identify and resolve these uncertainties (van Niekerk et al., 2021).
ConvLab-3 provides the SetSUMBT dialogue belief tracker, which achieves SOTA performance in terms of the accuracy of its uncertainty estimates. Using the vectorizer class to incorporate these features, we can train a policy using the uncertainty features obtained from SetSUMBT. To illustrate the effectiveness of uncertainty features during RL, we train a PPO policy using these features. The template-based NLG in ConvLab-3 also allows for the inclusion of noise (van Niekerk et al., 2021).


Pre-training then Fine-tuning
Pre-training largely boosts the performance of TUS because TUS uses domain-agnostic features for semantic actions and can thus focus on cross-domain user behavior on the semantic level. Pre-training then fine-tuning on 1% of the MultiWOZ data is even better than directly fine-tuning on the full MultiWOZ data. For GenTUS, the observation is quite different. With pre-training, the F1 score is improved in the 1% data fine-tuning setting but is largely restricted when more fine-tuning data is available, while the SER is consistently improved. Note that the SER is calculated using self-generated dialogue acts instead of the gold ones, measuring the faithfulness of utterance generation. This indicates that pre-training biases GenTUS towards user behavior that differs from the users in MultiWOZ but makes utterance generation more accurate.

Joint Training
We compare independent training and joint training in Table 6. MultiWOZ, SGD, and Taskmaster have 8K, 16K, and 43K dialogues for training, respectively. Joint training on these datasets does not lead to substantial performance drops in most cases, indicating that models have sufficient capacity to encode knowledge of different datasets simultaneously. However, joint training does not always improve performance either. It consistently improves the End2End model SOLOIST but makes no difference for T5-NLU. For other models, the gains vary with the dataset. Connecting this with the previous pre-training-then-fine-tuning experiment, we believe the difference may be attributed to the varying task complexity on different datasets. When the original data of a certain dataset is sufficient for a model to solve the task, including other datasets via joint training may not bring further benefit.

Retrieval Augmentation
We compare the retrieval augmentation methods with direct fine-tuning (Baseline) and pre-training-then-fine-tuning, as shown in Table 7. We can see that the performance of the different models mainly differs in the 1% data setting, and retrieving top-1 or top-3 samples does not have a large effect. In the 1% data setting, using retrieved samples to augment inputs for training is even worse than the baseline, while augmenting the training set with retrieved samples is beneficial for T5-Small but disadvantageous for T5-Large.
Using other TOD datasets for pre-training is better than or on par with using them for the two retrieval augmentation methods and direct fine-tuning. The reason may be that the different datasets' ontologies for similar utterances consistently confuse the models when training on retrieved samples or augmented inputs, whereas pre-trained models are no longer distracted by other datasets' ontologies during fine-tuning. As the example in Table 4 shows, although retrieved samples may extract important values, they assign different intent, domain, and slot names to these values, which are not transferable to the target dataset, calling for a better way to use other TOD datasets.

Pre-training Then RL Training
Let us first take a look at the metrics strict success rate and number of turns, plotted in Figure 2(a) and (b). We can observe that pre-training on SGD does not yield an advantage in starting performance compared to training from scratch. Nevertheless, looking at the final performance, we can see that the model pre-trained on SGD behaves differently from the model trained from scratch and reaches the same final performance as the models that use MultiWOZ data. We hypothesize that pre-training on SGD effectively puts the initial policy parameters in an appropriate parameter space for dialogue learning. We can also observe that it does not seem to make a difference whether or not the policy is pre-trained on SGD before being fine-tuned on MultiWOZ, which suggests that the SGD information has been overwritten. Figure 2(c) shows the average number of actions taken in a turn, which is an important evaluation measure, as exposing a user to too much information can lead to information overload. Interestingly, the model only pre-trained on SGD initially takes far fewer actions in a turn, which exactly reflects what can be observed in the SGD data.
In addition to the plots depicted in Figure 2, the new toolkit allows us to look at the average intent distribution within a turn. Figure 3 (a), (b), and (c) depict the average probability of taking a request, offer, or inform intent in a turn, respectively. We can see that SGD pre-training initially leads to many offer intents, while mostly ignoring request intents compared to pre-training on MultiWOZ. The number of inform intents is in a similar range for all models apart from the scratch model. Nevertheless, the probabilities eventually converge to the same level, which generally holds for all intents.

Training with Different User Simulators
The strict success rates of PPO policies trained with ABUS, TUS, or GenTUS are shown in Table 8. The policy trained with ABUS only outperforms policies trained with the other USs when evaluated with ABUS. On the other hand, the policy trained with GenTUS outperforms the other policies when evaluated not only with GenTUS but also with TUS. It also achieves performance comparable to the policy trained with ABUS when evaluated with ABUS, indicating that GenTUS trains the most robust policy. These results show that a policy that performs well when evaluated with the US it is trained on does not necessarily perform well with other USs. Shi et al. (2019) suggest evaluating a policy with different types of USs to provide a more holistic view and show that the average performance correlates well with human evaluation. In ConvLab-3, we provide not only the rule-based US, i.e., ABUS, but also data-driven USs, e.g., TUS and GenTUS, to build a richer environment for training and evaluation. Moreover, with the two transferable USs TUS and GenTUS, researchers can easily evaluate their policy on custom datasets.
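The cross-model evaluation protocol amounts to filling a policies × simulators grid. A minimal sketch, with an illustrative evaluate callable and made-up scores:

```python
# Sketch of cross-model evaluation: score every policy against every
# simulator. All names and numbers are illustrative stand-ins.
def cross_evaluate(policies, simulators, evaluate):
    return {p_name: {s_name: evaluate(policy, sim)
                     for s_name, sim in simulators.items()}
            for p_name, policy in policies.items()}

# Toy scorer: pretend each (policy, simulator) pair has a fixed success rate.
scores = {("ppo_abus", "abus"): 0.9, ("ppo_abus", "gentus"): 0.5,
          ("ppo_gentus", "abus"): 0.8, ("ppo_gentus", "gentus"): 0.7}
results = cross_evaluate(
    {"ppo_abus": "ppo_abus", "ppo_gentus": "ppo_gentus"},
    {"abus": "abus", "gentus": "gentus"},
    evaluate=lambda policy, sim: scores[(policy, sim)],
)
print(results["ppo_gentus"]["gentus"])  # 0.7
```

Averaging each row then gives the holistic score that Shi et al. (2019) found to correlate with human evaluation.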

Training with Uncertainty Features
Figure 4 (a) reveals that the policy trained using uncertainty features performs at least as well as the policy trained without these features. van Niekerk et al. (2021) further showed that the policy trained with uncertainty features performs significantly better in conversation with humans than the policy trained without. This is an indication that the policy using uncertainty features can handle ambiguities in conversation better than the policy without uncertainty modeling. To investigate how this policy resolves uncertainty, we analyze the action distributions of the policy using the new RL toolkit evaluation tools, which provide new insights into the behavior of dialogue policy modules. Figure 4 (b) and (c) show that the policy trained using uncertainty features utilizes significantly more request actions than the policy without these features. This indicates that the policy aims to resolve uncertainty by requesting information from the user. For instance, if the policy recognizes uncertainty regarding the price range a user has requested, it can resolve this through the use of a request. See van Niekerk et al. (2021) for example dialogues with humans where this can be observed.

Conclusion
In this paper, we present the dialogue system toolkit ConvLab-3, which puts a large number of datasets under one umbrella through our proposed unified data format. The usage of the unified format facilitates comparability and significantly reduces the implementation cost required for conducting experiments on multiple datasets. In addition, we provide recent powerful models for all components of a dialogue system and improve the RL toolkit, which enables researchers to easily build, train, and evaluate dialogue systems.
We showcase the advantages of the unified format and RL toolkit in a large number of experiments, ranging from pre-training, joint training, and retrieval augmentation to RL training with different user simulators and state uncertainty estimation. We hope that ConvLab-3 can aid and accelerate the development of models that can be trained with different datasets and learning paradigms, empowering the community to develop the next generation of task-oriented dialogue systems.

Contributions
Qi Zhu, Christian Geishauser, Carel van Niekerk, Hsien-chin Lin, Baolin Peng, and Zheng Zhang are the main developers of the project. Jianfeng Gao, Milica Gašić, and Minlie Huang provided full support and gave valuable advice on project design. Qi Zhu led the project.
Code implementation Qi Zhu proposed and implemented the unified data format, transformed all datasets, adapted BERTNLU and MILU, and implemented the T5 series models. Christian Geishauser was the main contributor to the RL toolkit and adapted the DDPT, PPO, PG, and MLE models. Hsien-chin Lin adapted TUS and GenTUS, allowing policy training with different user simulators. Carel van Niekerk adapted SUMBT and SetSUMBT, contributed to the RL toolkit, and incorporated uncertainty estimates into policy learning. Baolin Peng adapted SOLOIST and implemented the End2End evaluation metrics. Zheng Zhang adapted SC-GPT and implemented the NLG evaluation metrics. Michael Heck adapted TripPy. Nurul Lubis adapted LAVA. Dazhen Wan contributed to the user simulators and SC-GPT. Xiaochen Zhu contributed to the RL toolkit and SC-GPT. Experiments for each model were performed by the corresponding author.
Paper writing Qi Zhu, Christian Geishauser, Hsien-chin Lin, and Carel van Niekerk wrote most of the paper. Baolin Peng and Zheng Zhang added the necessary details. Michael Heck, Nurul Lubis, and Jianfeng Gao proofread and polished the paper.

Figure 1 :
Figure 1: Unified data format is a bridge connecting different datasets and dialogue models.

Figure 3 :
Figure 3: Intent distributions observed during the training process of DDPT while interacting with the rule-based user simulator.

Table 1 :
Statistics and annotations of the current unified datasets. DA-U/DA-S denote the dialogue act annotations of user/system turns.

Table 2 :
Models and tasks supported by the unified data format. RG is response generation without database support. Goal2Dial is generating a dialogue from a user goal. NLU is natural language understanding. BERTNLU, MILU, PPO, and PG are from ConvLab-2 (Zhu et al., 2020b).

Table 3 :
Example serialized dialogue acts and state.

Table 4 :
An example of an input augmented with the retrieved top-3 samples from other TOD datasets for in-context learning. Dataset names are highlighted.

Table 6 :
Comparison of independent training and joint training (1st row vs. 2nd row of each model) on 3 datasets. We normalize the service name to the domain name when evaluating NLU and DST on SGD.

Figure 2 :
Figure 2: Pre-training then RL training experiments with the DDPT model in interaction with the rule-based simulator. Shaded regions show the standard error. Each model is evaluated on 9 different seeds.
Figure 4 :
Figure 4: Evaluation of the PPO policy trained in combination with a SetSUMBT DST model, with and without uncertainty features respectively. The policy is trained in an environment that contains 5% user NLG noise to illustrate the impact of uncertainty.

Table 8 :
The strict success rates of PPO policies trained on ABUS, TUS, and GenTUS when evaluated with various user simulators.