NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-Based Simulation

We propose NeuralWOZ, a novel dialogue collection framework that uses model-based dialogue simulation. NeuralWOZ has two pipelined models, Collector and Labeler. Collector generates dialogues from (1) the user's goal instructions, which are the user context and task constraints in natural language, and (2) the system's API call results, which are a list of possible query responses for user requests from the given knowledge base. Labeler annotates the generated dialogue by formulating the annotation as a multiple-choice problem, in which the candidate labels are extracted from goal instructions and API call results. We demonstrate the effectiveness of the proposed method in zero-shot domain transfer learning for dialogue state tracking. In the evaluation, the synthetic dialogue corpus generated by NeuralWOZ achieves a new state-of-the-art, with improvements of 4.4% point joint goal accuracy on average across domains and 5.7% point zero-shot coverage, on the MultiWOZ 2.1 dataset.


Introduction
For a task-oriented dialogue system to be scalable, it needs to quickly adapt and expand to new scenarios and domains. However, collecting and annotating an expanding dataset is not only labor-intensive, but its cost also grows with the size and variety of the unseen scenarios.
There are three types of dialogue system expansions. (1) The simplest expansion is the addition of new instances to the knowledge base (KB) under an identical schema. For example, the addition of newly opened restaurants to the KB of the restaurant domain falls under this category. (2) A slightly more complicated expansion involves modifications to the KB schema, and possibly the related instances. For example, adding new constraint types for accessing the KB, driven by changing user needs, often requires a restructuring of the KB. If a dialogue system built with only restaurant search in mind observes user requests not only about "restaurant location" but also about "traffic information" for navigating, the system now needs a new knowledge base covering the additional domain.

1 The code is available at github.com/naver-ai/neuralwoz.

Figure 1: Overview of NeuralWOZ. NeuralWOZ takes a goal instruction for the user side (U) and API call results for the system side (S) to synthesize a dialogue. It first generates the dialogue from these inputs and then labels the dialogue state (B_t) and active domain (Domain_t) for each turn t of the dialogue.
(3) The most complex expansion is one that expands across multiple domains. For example, imagine an already built dialogue system that supports restaurant and hotel reservation domains but now needs to expand to points of interest or other domains. It is difficult to expand to a new domain without collecting new data instances and building a new knowledge base if the schemas of the source domains (restaurant and hotel in this case) and the target domain (point of interest) differ.
To support the development of scalable dialogue systems, we propose NeuralWOZ, a model-based dialogue collection framework. NeuralWOZ uses goal instructions and KB instances for synthetic dialogue generation, mimicking the mechanism of a Wizard-of-Oz (Kelley, 1984; Dahlbäck et al., 1993); Figure 1 illustrates our approach. NeuralWOZ has two neural components, Collector and Labeler. Collector generates a dialogue using the given goal instruction and candidate relevant API call results from the KB as input. Labeler annotates the generated dialogue with appropriate labels, using the schema structure of the dialogue domain as meta information. More specifically, Labeler selects labels from candidate labels obtained from the goal instruction and the API call results. As a result, NeuralWOZ can generate a dialogue corpus without training data for the target domain.
We evaluate our method on the zero-shot domain transfer task (Campagna et al., 2020) to demonstrate the ability to generate corpora for unseen domains when no prior training data exists. On the dialogue state tracking (DST) task with MultiWOZ 2.1 (Eric et al., 2019), the synthetic data generated with NeuralWOZ achieves 4.4% point higher joint goal accuracy and 5.7% point higher zero-shot coverage than the existing baseline. Additionally, we examine few-shot and full data augmentation using both training data and synthetic data. We also illustrate how to collect synthetic data beyond the MultiWOZ domains, and discuss the effectiveness of the proposed approach as a data collection strategy.
Our contributions are as follows:
• NeuralWOZ, a novel method for generating dialogue corpora using goal instructions and knowledge base information
• New state-of-the-art performance on the zero-shot domain transfer task
• Analysis results highlighting the potential synergy of using the data generated from NeuralWOZ together with human-annotated data

Related Works

Wizard-of-Oz
Wizard-of-Oz (WOZ) is a widely used approach for constructing dialogue data (Henderson et al., 2014a,b; El Asri et al., 2017; Eric and Manning, 2017; Budzianowski et al., 2018). It works by facilitating a role play between two people. The "user" follows a goal instruction that describes the context of the task and the details of the request, while the "system" has access to a knowledge base and query results from it. They take turns conversing: the user makes requests one by one following the instructions, and the system responds according to the knowledge base and labels the user's utterances.

Synthetic Dialogue Generation
Other studies on dialogue datasets use user simulator-based data collection approaches (Schatzmann et al., 2007; Li et al., 2017; Bordes et al., 2017; Shah et al., 2018; Zhao and Eskenazi, 2018; Campagna et al., 2020). They define domain schemas, rules, and dialogue templates to simulate user behavior under certain goals. The ingredients of the simulation are designed by developers, and the dialogues are realized by predefined mapping rules or paraphrasing by crowdworkers. If a training corpus for the target domain exists, neural models that synthetically generate dialogues can augment the training corpus (Hou et al., 2018; Yoo et al., 2019). For example, Yoo et al. (2020) introduce the Variational Hierarchical Dialog Autoencoder (VHDA), where hierarchical latent variables exist for speaker identity, user's request, dialog state, and utterance. They show the effectiveness of their model on single-domain DST tasks. SimulatedChat (Mohapatra et al., 2020) also uses goal instructions for dialogue augmentation. Although it does not address zero-shot learning with domain expansion in mind, we run auxiliary experiments to compare it with NeuralWOZ; the results are in Appendix D.

Zero-shot Domain Transfer
In zero-shot domain transfer tasks, there is no data for the target domain, but plenty of data exists for other domains similar to it. Solving the problem of domain expansion of dialogue systems reduces quite naturally to zero-shot domain transfer. Wu et al. (2019) conduct a landmark study on zero-shot DST. They suggest a model, the Transferable Dialogue State Generator (TRADE), which is robust to a new domain where few or no training data for the domain exist. Kumar et al. (2020) and Li et al. (2021) follow the same experimental setup, and we also compare NeuralWOZ in the same setup. The Abstract Transaction Dialogue Model (ATDM) (Campagna et al., 2020), another method for synthesizing dialogue data, is the other baseline we adopt for the zero-shot domain transfer task. It uses rules, abstract state transitions, and templates to synthesize dialogues, which are then fed into a model-based zero-shot learner. They achieved state-of-the-art results in the task using the synthetic data on SUMBT, a pretrained BERT (Devlin et al., 2019) based DST model.

Figure 2: Illustration of Collector and Labeler. Collector takes a goal instruction G and API call results A as input, and outputs a dialogue D_T consisting of T turns. The state candidate C is prepopulated from G and A as a full set for labeling. Finally, Labeler takes a subset of its values O_{S_i} and a question q for each slot type S_i, together with the dialogue context D_t from Collector, and chooses an answer õ from O_{S_i}.

NeuralWOZ
In this section, we describe the components of NeuralWOZ in detail and how they interact with each other. Figure 2 illustrates the input and output of the two modules of NeuralWOZ. The synthetic corpus produced by Collector and Labeler is used for training the DST baselines, TRADE and SUMBT, in our experiments.

Problem Statement
Domain Schema In task-oriented dialogues, there are two slot types: informable and requestable slots (Henderson et al., 2014a; Budzianowski et al., 2018). The informable slots are the task constraints for finding relevant information from user requests, for example, "restaurant-pricerange", "restaurant-food", "restaurant-name", and "restaurant-book people" in Figure 1. The requestable slots are the additional details of user requests, like "reference number" and "address" in Figure 1. Each slot S can have a corresponding value V in a scenario. In multi-domain scenarios, each domain has a knowledge base KB, which consists of slot-value pairs corresponding to its domain schema. The API call results in Figure 1 are examples of KB instances of the restaurant domain.
Goal Instruction The goal instruction, G, is a natural language text describing the constraints of user behavior in the dialogue D, including informable and requestable slots. The paragraph consisting of four sentences at the top of Figure 1 is an example. We define the set of informable slot-value pairs explicitly expressed in G as C_G = {(S^G_i, V^G_i) | 1 ≤ i ≤ |C_G|}; for example, ("restaurant-pricerange", "expensive") and ("restaurant-food", "british") are elements of C_G (Figure 1).

API Call Results
The API call results, A = (a_1, ..., a_|A|), are the query results corresponding to C_G from the KB. Each a_i is associated with its domain, domain_{a_i}, and with slot-value pairs (k, v), where k can be either an informable or a requestable slot. For example, the restaurant instance "graffiti" in Figure 1 is a query result for ("restaurant-pricerange", "expensive") and ("restaurant-food", "british") described in the goal instruction.

State Candidate
We define the informable slot-value pairs that are not explicit in G but are accessible through A in D as C_A. It contains all informable slot-value pairs from C_{a_1} to C_{a_|A|}. The elements of C_A are likely to be uttered in summaries of the current state or recommendations of KB instances by the system side in D. The system utterance of the second turn in Figure 1 is an example ("I recommend graffiti."): the slot-value pair ("restaurant-name", "graffiti") can be obtained from A, but not from G. Finally, the state candidate set C is the union of C_G and C_A. It is the full set of possible dialogue state labels for the dialogue D given G and A, and thus can serve as the label candidates for dialogue state tracking annotation.
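As a concrete illustration, the construction of the state candidate set C = C_G ∪ C_A can be sketched as follows (a minimal sketch with hypothetical helper names, not the authors' released code):

```python
# Build the state candidate C = C_G ∪ C_A (illustrative helper names).
def build_state_candidates(goal_pairs, api_results):
    """goal_pairs: set of (slot, value) pairs explicit in G (C_G).
    api_results: list of dicts, one slot->value mapping per KB instance a_i."""
    c_g = set(goal_pairs)
    c_a = set()
    for instance in api_results:
        for slot, value in instance.items():
            if (slot, value) not in c_g:   # accessible via A but not explicit in G
                c_a.add((slot, value))
    return c_g | c_a                        # full candidate set C

goal = {("restaurant-pricerange", "expensive"), ("restaurant-food", "british")}
api = [{"restaurant-name": "graffiti",
        "restaurant-pricerange": "expensive",
        "restaurant-food": "british"}]
candidates = build_state_candidates(goal, api)
# ("restaurant-name", "graffiti") enters C through C_A, not C_G
```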

Collector
Collector is a sequence-to-sequence model that takes a goal instruction G and API call results A as input and generates a dialogue D_T. The generated dialogue D_T = (r_1, u_1, ..., r_T, u_T) is the sequence of system responses r and user utterances u, represented by N tokens (w_1, ..., w_N).²
We denote the input of Collector as <s> ⊕ G ⊕ </s> ⊕ A, where ⊕ is the concatenation operation. The <s> and </s> are special tokens indicating the start and separator, respectively. The tokenized natural language description of G is used directly as tokens. A is the concatenation of each a_i (a_1 ⊕ · · · ⊕ a_|A|).³ Each a_i is flattened to a token sequence of its domain and slot-value pairs, where <domain> and <slot> are additional special separator tokens. Our Collector model uses the transformer architecture (Vaswani et al., 2017) initialized with pretrained BART (Lewis et al., 2020). Collector is trained with the negative log-likelihood objective

L_Collector = - Σ_{j=1}^{M_C} Σ_{i=1}^{N_j} log P(w^j_i | w^j_{<i}, G^j, A^j),

where M_C is the number of training instances for Collector and N_j is the target length of the j-th instance. Following Lewis et al. (2020), label smoothing with a smoothing parameter of 0.1 is used during training.
2 Following Hosseini-Asl et al. (2020), we also utilize role-specific special tokens <system> and <user> for r and u, respectively.
3 We limit |A| to a maximum of 3.
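A minimal sketch of the input serialization described above (the exact flattening order of each a_i is our assumption based on the special tokens named in the text; `collector_input` is a hypothetical helper):

```python
# Serialize <s> ⊕ G ⊕ </s> ⊕ a_1 ⊕ ... ⊕ a_|A| into one token string.
def collector_input(goal_text, api_results, max_results=3):
    parts = ["<s>", goal_text, "</s>"]
    for a in api_results[:max_results]:        # |A| is limited to 3 (footnote 3)
        parts.extend(["<domain>", a["domain"]])
        for slot, value in a["slots"].items():
            parts.extend(["<slot>", slot, value])
    return " ".join(parts)

api = [{"domain": "restaurant",
        "slots": {"restaurant-name": "graffiti", "restaurant-food": "british"}}]
seq = collector_input("You are looking for an expensive british restaurant.", api)
# seq begins "<s> You are looking for ... </s> <domain> restaurant <slot> ..."
```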

Labeler
We formulate labeling as a multiple-choice problem. Specifically, Labeler takes a dialogue context D_t = (r_1, u_1, ..., r_t, u_t), a question q, and a set of answer options O = {o_1, o_2, ..., o_|O|}, and selects one answer õ ∈ O. Labeler encodes the input for each o_i separately, and s_{o_i} ∈ R is the corresponding logit score from the encoding. Finally, the logit scores are normalized via a softmax function over the answer option set O.
The input of Labeler is a concatenation of D_t, q, and o_i with special tokens: <s> ⊕ D_t ⊕ </s> ⊕ q ⊕ </s> ⊕ o_i ⊕ </s>. For labeling the dialogue states of D_t, we use the slot description of each corresponding slot type S_i as the question, for example, "what is area or place of hotel?" for "hotel-area" in Figure 2.
We populate the corresponding answer options O_{S_i} for each slot type S_i from the state candidate C. There are two special values: Dontcare, to indicate the user has no preference, and None, to indicate the user has yet to specify a value for this slot (Henderson et al., 2014a; Budzianowski et al., 2018). We include these values in O_{S_i}. For labeling the active domain of D_t, which is the domain at the t-th turn of D_t, we define a domain question, for example "what is the domain or topic of current turn?", as q, and use the predefined domain set O_domain as answer options. In MultiWOZ, O_domain = {"Attraction", "Hotel", "Restaurant", "Taxi", "Train"}.
Our Labeler model employs pretrained RoBERTa (Liu et al., 2019) as the initial weights. Dialogue state and domain labeling are trained jointly in the multiple-choice setting. A preliminary result shows that the class imbalance problem is significant in the dialogue state labels: most of the ground-truth answers are None for a given question. Therefore, we revise the negative log-likelihood objective to weight the other (not-None) answers by multiplying the log-likelihood by a constant β when the answer of a training instance is not None. The objective function of Labeler is

L_Labeler = - Σ_{j=1}^{M_L} Σ_{t=1}^{T_j} Σ_{i=1}^{N_q} β_{j,t,i} log P(õ^j_{t,i} | D^j_t, q_i, O_i),  with β_{j,t,i} = β if õ^j_{t,i} ≠ None and 1 otherwise,

where õ^j_{t,i} denotes the answer to the i-th question for the j-th training dialogue at turn t, N_q is the number of questions, and M_L is the number of training dialogues for Labeler. We empirically set β to a constant 5.
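The β-weighted objective can be illustrated with a small numeric sketch (pure Python for clarity; the actual model computes the option logits s_{o_i} with RoBERTa):

```python
import math

# One question's loss: softmax over option logits, up-weighted by β when the
# gold answer is not the special value None.
def weighted_nll(logits, answer_idx, options, beta=5.0):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]       # numerically stable softmax over O
    log_prob = math.log(exps[answer_idx] / sum(exps))
    weight = 1.0 if options[answer_idx] == "None" else beta
    return -weight * log_prob

options = ["None", "Dontcare", "centre"]
loss_none = weighted_nll([2.0, 0.1, 0.5], 0, options)    # gold answer is None
loss_value = weighted_nll([0.5, 0.1, 2.0], 2, options)   # not-None: scaled by β
```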

Synthesizing a Dialogue
We first define a goal template: a delexicalized version of the goal instruction G in which each value V^G_i expressed in the instruction is replaced by its slot S^G_i. For example, "expensive" and "british" in the goal instruction of Figure 1 are replaced with "restaurant-pricerange" and "restaurant-food", respectively. As a result, handling domain transitions in the template becomes convenient.
First, a goal template is sampled from the pre-defined set of goal templates. API call results A, corresponding to the domain transitions in the template, are randomly selected from the KB. In particular, we constrain the sampling space of A when consecutive scenarios among the domains in the template share slot values. For example, the sampled API call results for the restaurant and hotel domains should share the value of "area" to support an instruction such as "I am looking for a hotel nearby the restaurant". The template and A are then aligned to form the goal instruction G_A; that is, each value for S^G_i in the template is assigned using the corresponding values in A. Collector then generates a dialogue D with total turn number T, given G_A and A. More details are in Appendix A. Nucleus sampling (Holtzman et al., 2020) is used for the generation.
We denote the dialogue state and active domain at turn t as B_t and domain_t, respectively. B_t = {(S_j, V_{j,t}) | 1 ≤ j ≤ J} holds the J predefined slots and their values at turn t. This means Labeler is asked J questions (from slot descriptions) plus 1 (the domain question) for each dialogue context D_t from Collector. Finally, the output of Labeler is a set of (dialogue context, dialogue state, active domain) triples {(D_1, B_1, domain_1), ..., (D_T, B_T, domain_T)}.
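The alignment step that turns a delexicalized template into G_A can be sketched as a simple placeholder substitution (the bracketed placeholder format and the `fill_template` helper are our assumptions for illustration):

```python
# Delexicalized template -> G_A: each slot placeholder takes its value from
# the sampled API call result A.
def fill_template(template, api_result):
    text = template
    for slot, value in api_result.items():
        text = text.replace("[" + slot + "]", value)
    return text

template = ("You are looking for a [restaurant-pricerange] restaurant "
            "serving [restaurant-food] food.")
a = {"restaurant-pricerange": "expensive", "restaurant-food": "british"}
goal_instruction = fill_template(template, a)
# -> "You are looking for a expensive restaurant serving british food."
```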

Dataset
We use the MultiWOZ 2.1 (Eric et al., 2019) dataset for our experiments. It is one of the largest publicly available multi-domain dialogue datasets and contains about 10,000 dialogues over 7 travel-related domains (attraction, hotel, restaurant, taxi, train, police, hospital). The MultiWOZ data was created using WOZ, so it includes a goal instruction for each dialogue as well as domain-related knowledge bases. We first train NeuralWOZ using the goal instructions and knowledge bases. We then evaluate our method on dialogue state tracking, with and without synthesized data from NeuralWOZ, using five domains (attraction, restaurant, hotel, taxi, train) as our baseline, following the same preprocessing steps as Wu et al. (2019) and Campagna et al. (2020).

Training NeuralWOZ
We use pretrained BART-Large (Lewis et al., 2020) for Collector and RoBERTa-Base (Liu et al., 2019) for Labeler. They share the same byte-level BPE vocabulary (Sennrich et al., 2016) introduced by Radford et al. (2019). We train the pipelined models using the Adam optimizer (Kingma and Ba, 2017) with learning rate 1e-5, 1,000 warm-up steps, and batch size 32. The number of training epochs is set to 30 for Collector and 10 for Labeler.
For the training phase of Labeler, we use a state candidate set from the ground-truth dialogue states B_{1:T} of each dialogue, unlike the synthesizing phase, where the options are obtained from the goal instruction and API call results. We also evaluate the performance of Labeler itself in the same manner as the training phase, using validation data (Table 5). Before training Labeler on the MultiWOZ 2.1 dataset, we pretrain it on DREAM (Sun et al., 2019) to boost its performance. This is similar to coarse-tuning in Jin et al. (2019). The same hyperparameter setting is used for the pretraining.
For the zero-shot domain transfer task, we exclude dialogues containing the target domain from the training data. Our implementation is based on the Transformers library (Wolf et al., 2020). The best performing Collector and Labeler are selected by evaluation results on the validation set.

Synthetic Data Generation
We synthesize 5,000 dialogues for every target domain for both the zero-shot and few-shot experiments,⁹ and 1,000 dialogues for full data augmentation. For the zero-shot experiment, since training data are unavailable for a target domain, we only use goal templates that contain the target domain scenario in the validation set, similar to Campagna et al. (2020). We use nucleus sampling in Collector with the top-p ratio in the range {0.92, 0.98} and the temperature in the range {0.7, 0.9, 1.0}. It takes about two hours to synthesize 5,000 dialogues using one V100 GPU. More statistics are in Appendix B.

Baselines
We compare NeuralWOZ with baseline methods for both zero-shot learning and data augmentation using MultiWOZ 2.1 in our experiments. For zero-shot learning, we use a baseline scheme that does not use synthetic data. For data augmentation, we use ATDM and VHDA.

9 In Campagna et al. (2020), the average number of synthesized dialogues over domains is 10,140.
ATDM refers to a rule-based synthetic data augmentation method for zero-shot learning suggested by Campagna et al. (2020). It defines rules, including state transitions and templates, for simulating dialogues, and creates about 10,000 synthetic dialogues for each of the five domains in the MultiWOZ dataset. Campagna et al. (2020) feed the synthetic dialogues into zero-shot learner models to perform the zero-shot transfer task for dialogue state tracking. We also employ TRADE and SUMBT as baseline zero-shot learners for fair comparison with ATDM.

VHDA refers to a model-based generation method using a hierarchical variational autoencoder (Yoo et al., 2020). It generates dialogues incorporating information about the speaker, the speaker's goal, turn-level dialogue acts, and the utterance sequentially. Yoo et al. (2020) augment about 1,000 dialogues for the restaurant and hotel domains in the MultiWOZ dataset. For a fair comparison, we use TRADE as the baseline model for the full data augmentation experiments. We also compare with VHDA in the single-domain augmentation setting following their report.

Experimental Results
We use both joint goal accuracy (JGA) and slot accuracy (SA) as performance measures. JGA is the accuracy of checking whether all slot values predicted at each turn exactly match the ground-truth values, and SA is the slot-wise accuracy of partial matches against the ground-truth values. For the zero- and few-shot settings in particular, we follow the previous setup (Campagna et al., 2020). Following Campagna et al. (2020), the zero-shot learner model is trained on data excluding the target domain and tested on the target domain. We then add synthesized data from NeuralWOZ, which is trained in the same way, i.e., the leave-one-out setup, to the training data in the experiment.
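A minimal sketch of the two metrics under their standard definitions (our illustrative implementation, not the evaluation script used in the paper):

```python
# Joint goal accuracy: a turn counts only if the full predicted state matches.
def joint_goal_accuracy(preds, golds):
    hits = sum(1 for p, g in zip(preds, golds) if p == g)
    return hits / len(golds)

# Slot accuracy: per-slot partial match aggregated across all turns.
def slot_accuracy(preds, golds, all_slots):
    correct, total = 0, 0
    for p, g in zip(preds, golds):
        for slot in all_slots:
            total += 1
            correct += int(p.get(slot) == g.get(slot))
    return correct / total

golds = [{"area": "centre", "food": "british"}, {"area": "north", "food": "thai"}]
preds = [{"area": "centre", "food": "british"}, {"area": "north", "food": "chinese"}]
# one of two turns is fully correct; three of four slot predictions are correct
```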

Zero-Shot Domain Transfer Learning
Our method achieves a new state-of-the-art in zero-shot domain transfer learning for dialogue state tracking on the MultiWOZ 2.1 dataset (Table 1). Except for the hotel domain, the performance over all target domains is significantly better than the previous state-of-the-art method. We discuss the lower performance in the hotel domain in the analysis section. Following Campagna et al. (2020), we also measure zero-shot coverage, the accuracy ratio between zero-shot learning on the target domain and a fully trained model including the target domain. NeuralWOZ achieves 66.9% and 79.2% zero-shot coverage on TRADE and SUMBT, respectively, outperforming the previous state-of-the-art, ATDM, which achieves 61.2% and 73.5%, respectively.

Data Augmentation on Full Data Setting
For full data augmentation, our synthesized data come from a fully trained model including all five domains. Table 2 shows that our model still consistently outperforms in full data augmentation of multi-domain dialogue state tracking. Specifically, NeuralWOZ performs 2.8% point better than ATDM on the joint goal accuracy of TRADE. Our augmentation improves the performance by 1.6% point, while ATDM degrades it.
We also compare NeuralWOZ with VHDA, a previous model-based data augmentation method for dialogue state tracking (Yoo et al., 2020). Since VHDA only considers single-domain simulation, we use single-domain dialogues in the hotel and restaurant domains for the evaluation. Table 3 shows that our method still performs better than VHDA in this setting: NeuralWOZ obtains more than twice the joint goal accuracy gain of VHDA.

Table 4 shows the intrinsic evaluation results for the two components of NeuralWOZ (Collector and Labeler) on the validation set of MultiWOZ 2.1. We evaluate the components using perplexity for Collector and joint goal accuracy for Labeler, respectively. Note that the joint goal accuracy is achieved using the state candidate set prepopulated as multiple-choice options from the ground truth, B_{1:T}, as at training time of Labeler. This can be seen as using meta information, since its purpose is accurate annotation rather than dialogue state tracking itself. We also report results obtained by excluding the target domain from the full dataset to simulate the zero-shot environment. Surprisingly, the synthesized data performs effectively even though the annotation by Labeler is not perfect. We analyze the responsibility of each model further in the following section.

Analysis

Error Analysis

Figure 3 shows the slot accuracy for each slot type in the hotel domain, which is our weakest domain. Unlike the other four domains, the hotel domain has two boolean-type slots, "parking" and "internet", which can only take "yes" or "no" as their value. Since these have an abstract property for tracking, Labeler's labeling performance tends to be limited in this domain. However, it is notable that our accuracy on the booking-related slots (book stay, book people, book day) is much higher than ATDM's. Moreover, the model using synthetic data from ATDM totally fails to track the "book stay" slot.
In the synthesizing procedure of Campagna et al. (2020), they create the data with a simple substitution of a domain noun phrase when two domains have similar slots. For example, "find me a restaurant in the city center" can be replaced with "find me a hotel in the city center" since the restaurant and hotel domains share the "area" slot. We presume this is why they outperform on slots like "pricerange" and "area".

Few-shot Learning
We further investigate how our method complements human-annotated data. Figure 4 illustrates that NeuralWOZ yields a consistent gain in the few-shot domain transfer setting. Unlike ATDM, whose performance saturates as the few-shot ratio increases, the performance using NeuralWOZ improves continuously. We get about 5.8% point improvement over the case without synthetic data when using 10% of the human-annotated data for the target domain. This implies our method can be used even more effectively together with human-annotated data in a real scenario.

Ablation Study
We investigate whether Collector or Labeler is more responsible for the quality of the synthesized data. Table 5 shows ablation results where each model of NeuralWOZ is trained on data including or withholding the hotel domain. Except for the training data of each model, the pipelined models are trained and dialogues are synthesized in the same way. We then train a TRADE model using the synthesized data and evaluate it on the hotel domain as in the zero-shot setting. The performance gain from a Collector trained including the target domain is 4.3% point, whereas the gain from Labeler is only 0.8% point. This implies the generation quality of Collector is more responsible for the performance of the zero-shot learner than the accurate annotation of Labeler.

It is harder to generalize when the schema structure of the target domain is different from the source domain. Other examples can be found in Appendix C. We would like to extend NeuralWOZ to more challenging expansion scenarios like these in future work.

Comparison on End-to-End Task
To show that our framework can be used for other dialogue tasks, we test our data augmentation method on the end-to-end task in MultiWOZ 2.1. We describe the result in Appendix D with discussion.

In the full data setting, our method achieves 17.46 BLEU, 75.1 Inform rate, 64.6 Success rate, and 87.31 Combined score, showing a performance gain from using the synthetic data. Appendix D also includes the comparison and discussion on SimulatedChat (Mohapatra et al., 2020).

Conclusion
We propose NeuralWOZ, a novel dialogue collection framework, and show that it achieves state-of-the-art performance on the zero-shot domain transfer task. We find the dialogue corpus from NeuralWOZ is synergetic with human-annotated data. Further analysis shows that NeuralWOZ can be applied to scaling dialogue systems. We believe NeuralWOZ will spark further research into dialogue system environments where expansion target domains are distant from the source domains.
A Goal Instruction Sampling for Synthesizing in NeuralWOZ

Figure 7 shows further examples from NeuralWOZ. The left subfigure shows a synthesized dialogue in the restaurant domain, which is a seen domain with the same schema as the restaurant domain in the MultiWOZ dataset; however, "spicy club" is an unseen instance newly added to the schema for the synthesis. The right subfigure shows another synthetic dialogue in a seen domain but with a different schema from the restaurant domain in MultiWOZ: it describes an in-car navigation scenario borrowed from the KVret dataset (Eric and Manning, 2017). Adapting to an unseen scenario is a non-trivial problem, even within the same domain.

D Additional Explanation on Comparison in End-to-End Task
To compare our model with that of Mohapatra et al. (2020), we conduct the end-to-end task experiments of the previous work. Table 8 illustrates the result. Though the performance of the baseline implementation differs, we can see that the trend of performance improvement is comparable to the report of SimulatedChat. The two studies also differ in terms of modeling. In our method, all utterances in the dialogue are first collected by Collector, based on the goal instruction and KB information. After that, Labeler selects annotations from candidate labels, which can be induced from the goal instruction and KB information. SimulatedChat, on the other hand, creates the utterance and label sequentially, with knowledge base access, for each turn. Thus, each generated utterance is affected by the generated utterances and labels of the previous turn.
In detail, the two methods also differ in terms of complexity. SimulatedChat creates a model for each domain separately, and for each domain it creates five neural modules: a user response generator, user response selector, agent query generator, agent response generator, and agent response selector. This results in 25 neural models for data augmentation in the MultiWOZ experiments. In contrast, NeuralWOZ needs only two neural models for data augmentation: Collector and Labeler.
Another notable difference is that SimulatedChat does not generate multi-domain data in a natural way. The strategy of creating a model for each domain not only makes it difficult to transfer the knowledge to a new domain, but also makes it difficult to create multi-domain data. In SimulatedChat, the dialogue is created for each domain and then concatenated. Our model can properly reflect the information of all domains included in the goal instruction to generate synthetic dialogues, regardless of the number of domains.

E Other Experiment Details
The numbers of parameters of our models are 406M for Collector and 124M for Labeler. Both models are trained on two V100 GPUs with mixed precision floating point arithmetic. Training takes about 4 hours (10 epochs) and 24 hours (30 epochs), respectively. We optimize the hyperparameters of each model, learning rate {1e-5, 2e-5, 3e-5} and batch size {16, 32, 64}, based on greedy search. We set the maximum sequence length of Collector to 768 and that of Labeler to 512.
For the main experiments, we fix the hyperparameter settings of TRADE (learning rate 1e-4, batch size 32) and SUMBT (learning rate 5e-5, batch size 4) to be the same as in previous works. We use the script of Campagna et al. (2020) for converting TRADE's data format to SUMBT's.
For the GPT-2 (Radford et al., 2019) based model for the end-to-end task, we re-implement a model similar to SimpleTOD (Hosseini-Asl et al., 2020) but without using actions. Thus, it generates the dialogue context, dialogue state, database results, and system response in an autoregressive manner. We also use the special tokens of SimpleTOD (without the special tokens for actions). We follow the preprocessing procedure for the end-to-end task, including the delexicalization suggested by Budzianowski et al. (2018). We use a batch size of 8 and a learning rate of 5e-5. Note that we also train NeuralWOZ using 30% of the training data and synthesize 5,000 dialogues for the end-to-end experiments. However, we could not find the detailed experimental setup of Mohapatra et al. (2020), including hyperparameters, the seed for each portion of training data, and evaluation, so it is not a fair comparison.