Transferable Dialogue Systems and User Simulators

One of the difficulties in training dialogue systems is the lack of training data. We explore the possibility of creating dialogue data through the interaction between a dialogue system and a user simulator. Our goal is to develop a modelling framework that can incorporate new dialogue scenarios through self-play between the two agents. In this framework, we first pre-train the two agents on a collection of source domain dialogues, which equips the agents to converse with each other via natural language. With further fine-tuning on a small amount of target domain data, the agents continue to interact with the aim of improving their behaviors using reinforcement learning with structured reward functions. In experiments on the MultiWOZ dataset, two practical transfer learning problems are investigated: 1) domain adaptation and 2) single-to-multiple domain transfer. We demonstrate that the proposed framework is highly effective in bootstrapping the performance of the two agents in transfer learning. We also show that our method leads to improvements in dialogue system performance on complete datasets.


Introduction
This work aims to develop a modelling framework in which dialogue systems (DSs) converse with user simulators (USs) about complex topics using natural language. Although the idea of joint learning of two such agents has been proposed before, this paper is the first to successfully train both agents on complex multi-domain human-human dialogues and to demonstrate a capacity for transfer learning to low-resource scenarios without requiring redesign or re-training of the models.
One of the challenges in task-oriented dialogue modelling is to obtain adequate and relevant training data. A practical approach in moving to a new domain is transfer learning, where a model is first pre-trained on a general domain with rich data and then fine-tuned on the target domain. End-to-end DSs (Wen et al., 2017; Dhingra et al., 2017) are particularly suitable for transfer learning, in that such models are optimised as a single system. By comparison, pipeline-based DSs with multiple individual components (Young et al., 2013) require fine-tuning of each component system. These separate steps can be done independently, but it becomes difficult to ensure optimality of the overall system.
A similar problem arises in the data-driven USs commonly used in interaction with the DS. Though many USs have been proposed and widely studied, they usually operate at the level of semantic representation (El Asri et al., 2016). These models can capture user intent, but are otherwise somewhat artificial as user simulators in that they do not consume and produce natural language. As discussed above for DSs, the end-to-end architecture for the US also offers simplicity in transfer learning across domains.
There are also potential advantages to continued joint training of the DS and the US. If a user model is less than perfectly optimised after supervised learning over a fixed training corpus, further learning through interaction between the two agents offers the US the opportunity to refine its behavior. Prior work has shown benefits from this approach to dialogue policy learning, with a higher success rate at dialogue level (Liu and Lane, 2017b; Papangelis et al., 2019; Takanobu et al., 2020), but no previous work addresses multi-domain end-to-end dialogue modelling for both agents. Takanobu et al. (2020) address refinement of the dialogue policy alone at the semantic level, but do not address end-to-end system architectures. Liu and Lane (2017b) and Papangelis et al. (2019) address single-domain dialogues (Henderson et al., 2014), but not the more realistic and complex multi-domain dialogues.
This paper proposes a novel learning framework for developing dialogue systems that performs Joint Optimisation with a User SimulaTor (JOUST). 1 Through pre-training on complex multi-domain datasets, the two agents are able to interact using natural language, and further create more diverse and rich dialogues. Using reinforcement learning (RL) to optimise both agents enables them to depart from known strategies learned from a fixed limited corpus and to explore new, potentially better policies. Importantly, the end-to-end designs in the framework make it easier to transfer the two agents from one domain to another. We also investigate and compare two reward designs within this framework: 1) the common choice of task success at dialogue level; 2) a fine-grained reward that operates at turn level. Results on the MultiWOZ dataset show that our method is effective in boosting the performance of the DS in complicated multi-domain conversation. To further test our method in more realistic scenarios, we design specific experiments on two low-resource setups that address different aspects of data sparsity. Our contributions can be summarised as follows:
• Novel contributions in joint optimisation of a fully text-to-text dialogue system with a matched user simulator on complex, multi-domain human-human dialogues.
• Extensive experiments, including exploring different types of reward, showing that our framework with a learnable US boosts overall performance and reaches new state-of-the-art performance on MultiWOZ.
• Demonstration that our framework is effective in two transfer learning tasks of practical benefit in low-resource scenarios, with in-depth analysis of the source of improvements.

Pre-training the Dialogue System and User Simulator
In our joint learning framework, we first pre-train the DS and US using supervised learning so that the two models are able to interact via natural language. This section presents the architectures of the two agents, illustrated in Fig. 1, and the objectives used for supervised learning.
1 The code is released at https://github.com/andy194673/joust.

Dialogue system
Dialogue state tracking (DST) The first task of a DS is to process the dialogue history in order to maintain the belief state, which records essential information of the dialogue. A DST model is utilized to predict the set of slot-value pairs which constitute the constraints of the entity the user is looking for, e.g. {hotel_area=north, hotel_name=gonville_hotel}.
The DST model used here is an encoder-decoder model with an attention mechanism (Bahdanau et al., 2015). The set of slot-value pairs is formulated as a slot sequence together with a value sequence. For the $t$-th dialogue turn, the DST model first encodes the dialogue context and the most recent user utterance $x^{us}_{t-1}$ using a bi-directional LSTM (Graves et al., 2005). At each decoding step $i$, the decoder hidden state is then passed through separate affine transforms followed by the softmax function to predict a slot token and a value for step $i$. The final belief state is the aggregation of the predicted slot-value pairs of all decoding steps.
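The slot-sequence and value-sequence formulation can be illustrated with a minimal sketch. The function names and the alphabetical slot ordering are assumptions made for illustration only; the ordering actually used for training targets is not specified here.

```python
from typing import Dict, List, Tuple


def belief_to_sequences(belief: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Flatten a belief state into parallel slot and value sequences."""
    # Sorting by slot name is an assumed convention that keeps targets deterministic.
    slots = sorted(belief)
    values = [belief[slot] for slot in slots]
    return slots, values


def sequences_to_belief(slots: List[str], values: List[str]) -> Dict[str, str]:
    """Aggregate the slot-value pairs predicted at each decoding step."""
    return dict(zip(slots, values))


belief = {"hotel_name": "gonville_hotel", "hotel_area": "north"}
slots, values = belief_to_sequences(belief)
print(slots)   # ['hotel_area', 'hotel_name']
print(values)  # ['north', 'gonville_hotel']
```

Aggregating the two decoded sequences step by step recovers the original set of slot-value pairs.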
Database Query Based on the updated belief state, the system searches the database and retrieves the matched entities. In addition, a one-hot vector of size 3 characterises the result of every query.
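A rough sketch of the query step follows. The toy database entries are made up, and the three buckets used for the one-hot vector (no match, exactly one match, more than one match) are an assumption, since the text only states that the vector has size 3.

```python
from typing import Dict, List

# Hypothetical toy database with made-up entries.
DB: List[Dict[str, str]] = [
    {"name": "hotel_a", "area": "centre", "price": "expensive"},
    {"name": "hotel_b", "area": "north", "price": "moderate"},
    {"name": "hotel_c", "area": "north", "price": "moderate"},
]


def query_db(db: List[Dict[str, str]], constraints: Dict[str, str]) -> List[Dict[str, str]]:
    """Return all entities whose attributes satisfy every belief-state constraint."""
    return [e for e in db if all(e.get(s) == v for s, v in constraints.items())]


def match_vector(num_matches: int) -> List[int]:
    """One-hot vector of size 3 summarising the query result (assumed buckets)."""
    bucket = min(num_matches, 2)  # 0 -> no match, 1 -> one match, 2 -> several matches
    vec = [0, 0, 0]
    vec[bucket] = 1
    return vec


matches = query_db(DB, {"area": "north"})
print(match_vector(len(matches)))  # [0, 0, 1]
```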
Context Encoding To capture the dialogue flow, a hierarchical LSTM (Serban et al., 2016) encodes the dialogue context from turn to turn throughout the dialogue. At each turn $t$, the most recent user utterance $x^{us}_{t-1}$ is encoded by an LSTM-based sentence encoder to obtain a sentence embedding $e^{us}_t$ and hidden states $H^{us}_t$. Another LSTM is used as the context encoder, which encodes $e^{us}_t$ as well as the output of the context encoder on the user side $c^{us}_{t-1}$ from the previous turn (see Fig. 1). The context encoder produces the next dialogue context state $c^{ds}_t$ for the downstream dialogue manager.
Policy The dialogue manager determines the system dialogue act based on the current state of the dialogue. The system dialogue act is treated as a sequence of tokens in order to handle cases in which multiple system actions exist in the same turn. The problem is therefore formulated as a sequence generation task using an LSTM. At each decoding step, the inputs to the policy decoder are: 1) the embedding of the act token predicted at the previous step; 2) the previous hidden state; 3) the attention vector obtained by attending over the hidden states of the user utterance $H^{us}_t$ using 2) as query; 4) the database retrieval vector; 5) the summarized belief state, which is a binary vector where each entry corresponds to a domain-slot pair. The output space contains all possible act tokens. For better modeling of the dialogue flow, the initialization of the hidden state is set to the context state $c^{ds}_t$ obtained by the context encoder.
Figure 1: Overall architecture of the proposed framework, where the dialogue system (DS) and user simulator (US) discourse with each other. $t$ denotes the dialogue turn index. The context encoder is shared between the two agents.
Natural language generation (NLG) The final task of the DS is to generate the system response, based on the predicted system dialogue act. To generate the word sequence another LSTM is used as the NLG model. At each decoding step, the previous hidden state serves as a query to attend over the hidden states of the policy decoder. The resulting attention vector and the embedding of the previous output word are the inputs to an LSTM whose output is the word sequence with delexicalized tokens. These delexicalized tokens will be replaced by retrieval results to form the final utterance.

User Simulator
As in the DS, the proposed US has a dialogue manager, an NLG model and a dialogue context encoder. However, in place of a DST to maintain the belief state, the US maintains an internal goal state to track progress towards satisfying the user goals.

Goal State
The goal state is modelled as a binary vector that summarises the dialogue goal. Each entry of the vector corresponds to a domain-slot pair in the ontology. At the beginning of a dialogue, goal state entries are turned on for all slots that make up the goal. At each dialogue turn, the goal state is updated based on the previous user dialogue act. If a slot appears in the previous dialogue act, either as information from the user or as a request by the US, the corresponding entry is turned off.
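The goal state update can be sketched as follows. The four-slot ontology and the function names are hypothetical and serve only to make the example self-contained.

```python
from typing import Iterable, List

# Hypothetical toy ontology of domain-slot pairs.
ONTOLOGY = ["hotel_area", "hotel_name", "hotel_phone", "train_day"]


def init_goal_state(goal_slots: Iterable[str]) -> List[int]:
    """Turn on the entries of all slots that make up the user goal."""
    goal = set(goal_slots)
    return [1 if slot in goal else 0 for slot in ONTOLOGY]


def update_goal_state(state: List[int], prev_user_act_slots: Iterable[str]) -> List[int]:
    """Turn off every slot that was informed or requested in the previous user act."""
    mentioned = set(prev_user_act_slots)
    return [0 if slot in mentioned else bit for slot, bit in zip(ONTOLOGY, state)]


state = init_goal_state(["hotel_area", "hotel_phone"])
print(state)  # [1, 0, 1, 0]
state = update_goal_state(state, ["hotel_area"])
print(state)  # [0, 0, 1, 0]
```

Once every entry is off, all parts of the goal have been mentioned at least once by the US.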

Context encoding, Policy & NLG in the US
These steps follow their implementations in the DS. For context encoding in the US, a sentence encoder first encodes the system response using an LSTM to obtain hidden states $H^{ds}_t$ and sentence embedding $e^{ds}_t$. The context encoder takes $e^{ds}_t$ and the DS context state $c^{ds}_t$ as inputs to produce the dialogue context state $c^{us}_t$, which is passed to the DS at the next turn.
Also as in the DS, the policy and the NLG model of the US are based on LSTMs. The inputs to the policy are the goal state, the hidden states of the sentence encoder $H^{ds}_t$ and the context state $c^{us}_t$; the policy produces the user dialogue act, represented as in the DS as a sequence of tokens. The NLG model takes the hidden states of the policy decoder as input to generate the user utterance, which is then lexicalised by replacing delexicalised tokens using the user goal.

Supervised Learning
For each dialogue turn, the ground truth dialogue acts and the output word sequences are used as supervision for both the DS and the US. The losses of the policy and the NLG model are the cross-entropy losses between the predicted sequence probability $p$ and the ground-truth $y$:
$$L^{*}_{pol} = -\sum_{i=1}^{|A|} y^{*}_{a,i} \log p^{*}_{a,i}, \qquad L^{*}_{nlg} = -\sum_{i=1}^{|W|} y^{*}_{w,i} \log p^{*}_{w,i} \tag{1}$$
In the above, $*$ can be either $ds$ or $us$, referring either to the DS or the US: e.g. $p^{ds}_{a,i}$ is the probability of the system act token at the $i$-th decoding step in a given turn. The ground-truth $y$ contains both word sequences and act sequences, with $|W|$ and $|A|$ as their lengths.
The DST annotations are also used as supervision for the DS. The loss of the DST model is defined as the sum of the cross-entropy losses for slot and value:
$$L_{dst} = -\sum_{i=1}^{|SV|} \left( y^{ds}_{s,i} \log p^{ds}_{s,i} + y^{ds}_{v,i} \log p^{ds}_{v,i} \right) \tag{2}$$
where $|SV|$ is the number of slot-value pairs in a turn and $i$ is the decoding step index. $p^{ds}_{s,i}$ and $p^{ds}_{v,i}$ are the predictions of slot and value at the $i$-th step.
The overall losses for the DS and the US are:
$$L^{ds}(\theta_{ds}) = L_{dst} + L^{ds}_{pol} + L^{ds}_{nlg}, \qquad L^{us}(\theta_{us}) = L^{us}_{pol} + L^{us}_{nlg} \tag{3}$$
where $\theta_{ds}$ and $\theta_{us}$ are the parameters of the DS and the US, respectively. The two agents are updated jointly to minimize the sum of the losses ($L^{ds} + L^{us}$). The success rate of the generated dialogues is used as the stopping criterion for supervised learning.
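As a minimal sketch of the sequence-level cross-entropy used above: with a one-hot ground truth, each step contributes the negative log-probability of the target token. The probabilities below are made-up toy values, and `sequence_nll` is a hypothetical helper name, not part of the released code.

```python
import math
from typing import Sequence


def sequence_nll(step_probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Cross-entropy of a sequence: sum over steps of -log p(target token)."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


# Toy predicted distributions over a 2-token vocabulary (illustrative numbers only).
act_probs = [[0.5, 0.5], [0.25, 0.75]]   # 2-step dialogue act sequence
act_targets = [0, 1]
word_probs = [[0.8, 0.2]]                # 1-step word sequence
word_targets = [0]

pol_loss = sequence_nll(act_probs, act_targets)
nlg_loss = sequence_nll(word_probs, word_targets)

# The US loss sums the policy and NLG terms; the DS loss would add the DST term.
us_loss = pol_loss + nlg_loss
print(round(us_loss, 4))
```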

RL Optimisation of the Dialogue System and User Simulator
After the DS and US models are pre-trained from the corpus using supervised learning, they are fine-tuned using reinforcement learning (RL) based on the dialogues generated during their interactions. Two reward designs are presented, after which the optimisation strategy is given.

Dialogue-Level Reward
Following common practice (El Asri et al., 2014; Zhao et al., 2019), the success of the simulated dialogues is used as the reward, which can only be observed at the end of the dialogue. A small penalty is given at each turn to discourage lengthy dialogues. When updating the US jointly with the DS during interaction using RL, the reward is shared between the two agents.
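A minimal sketch of such a dialogue-level reward is shown below. The success reward of 1.0 follows Appendix A.1; the per-turn penalty of -0.05 is an assumed placeholder, as the tuned value is not stated in this section.

```python
from typing import List


def dialogue_level_rewards(num_turns: int, success: bool,
                           success_reward: float = 1.0,
                           turn_penalty: float = -0.05) -> List[float]:
    """Per-turn rewards: a small penalty each turn, task success added at the last turn."""
    rewards = [turn_penalty] * num_turns
    if success and num_turns > 0:
        rewards[-1] += success_reward  # success is only observed at the end of the dialogue
    return rewards


# The same reward sequence is shared by the DS and the US during joint RL.
shared = dialogue_level_rewards(num_turns=4, success=True)
print(shared)
```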

Turn-Level Reward
While the dialogue-level reward is straightforward, it only considers the final task success rate of the dialogues and neglects the quality of the individual turns. For complex multi-domain dialogues there is a risk that this will make it difficult for the system to learn the relationship between actions and rewards. We thus propose a turn-level reward function that encapsulates the desired behavioural features of fundamental dialogue tasks. The rewards are designed separately for the US and the DS according to their characteristics.
DS Reward A good DS should learn to refine the search by requesting needs from the user and providing the correct entities, with their attributes, that the user wishes to know. Therefore, at the current turn a positive reward is assigned to the DS if: 1) it requests slots that it has not requested before; 2) it successfully provides an entity; or 3) it correctly answers all additional attributes requested by the user. Otherwise, a negative reward is given.
US Reward A good US should not repeatedly give the same information or request attributes that have already been provided by the DS. Therefore, a positive reward is assigned to the US if: 1) it provides new information about slots; 2) it asks about new attributes of a certain entity; or 3) it replies correctly to a request from the DS. Otherwise, a penalty is given.
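The two turn-level reward rules can be sketched with simple set operations. The function signatures, the set-based bookkeeping and the reward magnitudes of +1/-1 are illustrative assumptions; correctness of answers is assumed to be checked upstream and passed in as sets of correctly handled slots.

```python
from typing import Set


def ds_turn_reward(requested: Set[str], requested_before: Set[str],
                   entity_provided: bool,
                   user_asked: Set[str], answered: Set[str],
                   pos: float = 1.0, neg: float = -1.0) -> float:
    """Positive if the DS requests a new slot, provides an entity, or answers all user requests."""
    new_request = bool(requested - requested_before)
    answered_all = bool(user_asked) and user_asked <= answered
    return pos if (new_request or entity_provided or answered_all) else neg


def us_turn_reward(informed: Set[str], informed_before: Set[str],
                   asked: Set[str], provided_by_ds: Set[str],
                   ds_requested: Set[str],
                   pos: float = 1.0, neg: float = -1.0) -> float:
    """Positive if the US gives new information, asks a new attribute, or answers the DS request."""
    new_info = bool(informed - informed_before)
    new_question = bool(asked - provided_by_ds)
    replied = bool(ds_requested) and ds_requested <= informed
    return pos if (new_info or new_question or replied) else neg


# The DS repeats a request it already made and does nothing else: penalised.
print(ds_turn_reward({"hotel_area"}, {"hotel_area"}, False, set(), set()))  # -1.0
# The US asks for an attribute the DS has not yet provided: rewarded.
print(us_turn_reward(set(), set(), {"hotel_phone"}, set(), set()))          # 1.0
```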

Optimization
We apply the Policy Gradient Theorem (Sutton et al., 2000) to the space of (user/system) dialogue acts. In the $t$-th dialogue turn, the reward $r^{ds}_t$ or $r^{us}_t$ is assigned to the corresponding agent at the final step of its generated act sequence. The return for the action at the $i$-th step is $R^{*}_i = \gamma^{|A^{*}|-i} r^{*}_t$, where $*$ denotes $ds$ or $us$, and $|A^{*}|$ is the length of the act sequence of each agent. $\gamma \in [0, 1]$ is a discounting factor. The policy gradient of each turn can then be written as:
$$\nabla_{\theta_{*}} J^{*}(\theta_{*}) = \sum_{i=1}^{|A^{*}|} R^{*}_i \, \nabla_{\theta_{*}} \log p^{*}_{a,i} \tag{4}$$
where $p^{*}_{a,i}$ is the probability of the act token at the $i$-th step in the predicted dialogue act sequence. The two agents are updated using Eqn. (4) at each turn within the entire simulated dialogue.
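The per-step returns and the corresponding REINFORCE-style surrogate objective can be sketched as below. The function names and the toy log-probabilities are assumptions for illustration; in practice an automatic differentiation framework would differentiate the surrogate with respect to the model parameters.

```python
import math
from typing import List, Sequence


def act_returns(turn_reward: float, act_len: int, gamma: float = 1.0) -> List[float]:
    """Return R_i = gamma^(|A| - i) * r_t for each step i = 1..|A| of the act sequence."""
    return [gamma ** (act_len - i) * turn_reward for i in range(1, act_len + 1)]


def pg_surrogate_loss(log_probs: Sequence[float], returns: Sequence[float]) -> float:
    """Surrogate whose negative gradient matches the policy gradient sum of R_i * grad log p_i."""
    return -sum(r * lp for r, lp in zip(returns, log_probs))


returns = act_returns(turn_reward=1.0, act_len=3, gamma=0.5)
print(returns)  # [0.25, 0.5, 1.0]

# Toy log-probabilities of the three sampled act tokens (illustrative values only).
log_probs = [math.log(0.5), math.log(0.5), math.log(0.5)]
loss = pg_surrogate_loss(log_probs, returns)
```

With a discount of 1, every act token in the turn receives the full turn reward; smaller discounts concentrate credit on the tokens closest to the end of the act sequence.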

Experiments
Dataset The MultiWOZ 2.0 dataset is used for all experiments. It contains 10.4k dialogues with an average of 13.6 turns. Each dialogue can span up to three domains. Compared to previous benchmark corpora such as DSTC2 (Williams et al., 2016) or WOZ2.0 (Wen et al., 2017), MultiWOZ is more challenging because 1) its rich ontology contains 39 slots across 7 domains; 2) the DS can take multiple actions in a single turn; 3) the complex dialogue flow makes it difficult to hand-craft a rule-based DS or an agenda-based US. Lee et al. (2019) provided the user act labels.
Training Details The positive and negative RL rewards of Sec. 3 are tuned in the range [-5, 5] based on the dev set. The user goals employed for interaction during RL are taken from the training data without synthesizing new goals. Further training details can be found in Appendix A.1.
Evaluation Metrics The proposed model is evaluated in terms of the inform rate (Info), the success rate (Succ), and BLEU. The inform rate measures whether the DS provides the correct entity matching the user goal, while the success rate further requires the system to answer all user questions correctly. Following Mehri et al. (2019), the combined performance (Comb) is also reported, calculated as 0.5 * (Info + Succ) + BLEU.
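The combined score is a direct arithmetic combination of the three metrics, as in the sketch below. The input numbers are made up for illustration and are not results from the experiments.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Comb = 0.5 * (Info + Succ) + BLEU, with all metrics on the same percentage scale."""
    return 0.5 * (inform + success) + bleu


# Illustrative values only, not taken from any results table.
print(combined_score(inform=80.0, success=70.0, bleu=20.0))  # 95.0
```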

Interaction Quality
First, it is examined whether the proposed learning framework improves the discourse between dialogue system and user simulator. Several variants of our model are examined: 1) two agents are pre-trained using supervised learning, serving as baseline; 2) RL is used to fine-tune only the DS (RL-DS) or both agents (RL-Joint). In each RL case, we can either use rewards at the dialogue level (dial-R, Sec. 3.1) or rewards at the turn-level (turn-R, Sec. 3.2). The two trained agents interact based on 1k user goals from the test corpus, with the generated dialogues being evaluated using the metrics above.
From Table 1, we can see that the application of RL in our framework improves the success rate by more than 10% (b-e vs. a). This indicates that the DS learns through interaction with the learned US, and the designed rewards, to be better at completing the task successfully. Moreover, the joint optimisation of both the US and the DS provides dialogues with a higher success rate than only optimising the DS (c&e vs. b&d). This shows that the behaviour of the US is realistic and diverse enough to interact with the DS, and that its behavior can be improved together with that of the DS during RL optimisation. Finally, comparing the two reward designs, the fine-grained rewards at the turn level seem to be more effective in guiding the interaction of the two agents (b&c vs. d&e), which is reasonable since they reflect more of the nature of the tasks than the success rate alone. Some dialogues generated through the interactions are provided in Appendix A.6; we note that after RL, both agents respond to requests more correctly and also learn not to repeat the same information, leading to a more successful and smooth interaction without loops in the dialogue. The corresponding error analysis of each of the agents is provided later in Sec. 4.4.1.

Benchmark Results
We conduct experiments on the official test set for comparison to existing end-to-end DSs. The trained DS is used to interact with the fixed test corpus, following the setup of prior work. Results are reported using a predicted belief state (Table 2) and using an oracle belief state (Table 3) for the variants of our model. Joint learning of the two agents using RL with the fine-grained rewards reaches the best combined score and success rate. This implies that the exploration of more dialogue states and actions in the simulated interactions reinforces the behaviors that lead to a higher success rate, and that these generalise well to unfamiliar states encountered in the test corpus. Our best RL model produces competitive results in Table 2 when using the predicted belief state, and further outperforms the previous work in Table 3 when using the oracle belief state. Note that we do not leverage powerful pre-trained transformer-based models such as SOLOIST or MinTL-BART. We found that with RL optimisation, our LSTM-based models can still perform competitively. In terms of DS model structure, the most similar work is the DAMD model. The performance gain found in comparing "JOUST Supervised Learning" to DAMD is partially due to the better performance of our DST model. 3
We also conduct experiments using only 50% of the training data for supervised learning to verify the efficacy of the proposed method under different amounts of data. As shown in Table 4, our method also improves upon supervised learning when trained with less data, and the improvements are consistent with the complete data scenario.

Transfer Learning
In this section, we demonstrate the capability of transfer learning of the proposed framework under two low-resource setups: Domain Adaptation and Single-to-Multiple Domain Transfer. Two fine-tuning methods are adopted: straightforward fine-tuning without any constraints (Naive) and elastic weight consolidation (EWC) (Kirkpatrick et al., 2017). We show that the proposed RL can be further applied to both methods and produces significantly improved results. Here we experiment with the best RL variant, which uses turn-level rewards (same as (e) in Table 1).
3 In correspondence, the DAMD authors report a DST model with joint accuracy of ca. 35%, while ours is 45%.
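For reference, the standard EWC regulariser of Kirkpatrick et al. (2017) adds a quadratic penalty that keeps each parameter close to its source-domain value, weighted by a diagonal Fisher information estimate. The sketch below uses plain Python lists and made-up numbers; the regularisation strength `lam` and how the Fisher terms are estimated in this work are not specified here and are treated as assumptions.

```python
from typing import Sequence


def ewc_penalty(params: Sequence[float], source_params: Sequence[float],
                fisher: Sequence[float], lam: float) -> float:
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - p0) ** 2 for p, p0, f in zip(params, source_params, fisher)
    )


def ewc_loss(target_loss: float, params: Sequence[float],
             source_params: Sequence[float], fisher: Sequence[float],
             lam: float = 1.0) -> float:
    """Fine-tuning objective: target-domain loss plus the EWC penalty."""
    return target_loss + ewc_penalty(params, source_params, fisher, lam)


# Toy example: only the first parameter has moved away from its source value.
penalty = ewc_penalty(params=[1.0, 2.0], source_params=[0.0, 2.0],
                      fisher=[2.0, 5.0], lam=1.0)
print(penalty)  # 1.0
```

Naive fine-tuning corresponds to `lam = 0`, where the penalty vanishes and only the target-domain loss is minimised.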

Domain Adaptation
In these experiments, each of five domains is selected as the target domain.
Taking the hotel domain for example, 300 dialogues involving the hotel domain are sampled from the training corpus as adaptation data. The rest of the dialogues, not involving the hotel domain, form the source data. Both the DS and the US are first trained on the source data (Source), and then fine-tuned on the limited data of the target domain (Naive, EWC). Afterwards, the pair of agents is trained in interaction using the proposed RL training regime (+RL). Results in the form of the combined score are given in Table 5 (corresponding success rates are provided in Appendix A.5). As expected, models pre-trained on source domains obtain low combined scores on target domains. Fine-tuning using the Naive or EWC method significantly bootstraps the systems, with the regularization in EWC being more beneficial in low-resource training. By applying our proposed framework to the two sets of fine-tuned models, the performance can be further improved by 7-10% on average, with both predicted and oracle belief states. This indicates that through the interaction with the US, the DS is not constrained by having seen only a very limited amount of target domain data, and that it can learn effectively from the simulated dialogues using the simple reward structure (the RL learning curve is presented in Sec. 4.4.3). With a better initialization point such as the EWC models, the models can learn from higher-quality interaction and produce better results (EWC+RL vs Naive+RL).

Single-to-Multiple Domain Transfer
Another transfer learning scenario is investigated where only limited multi-domain data is accessible but sufficient single-domain dialogues are available. This setup is based on the practical fact that single-domain dialogues are often easier to collect than multi-domain ones. All single-domain dialogues in the training set form the source data. For each target multi-domain combination, 100 dialogues are sampled as adaptation data.
As before, the DS and the US are first pre-trained on the source data and then fine-tuned on the adaptation data. Afterwards, the two agents improve themselves through interaction. The models are tested using the multi-domain dialogues of the test corpus. Results in the form of the combined score are given in Table 6 (refer to Appendix A.5 for success rates). Although the Source models capture individual domains, they cannot manage the complex flow of multi-domain dialogues and hence produce poor combined scores, with the worst results on combinations of three domains. Fine-tuning improves performance significantly, as the systems learn to transition between domains in the multi-domain dialogue flow. Finally, applying our RL optimization further increases the performance by 6-9% on average. This indicates that the dialogue agents can learn more complicated policies through exploring more dialogue states and actions while interacting with the user simulator. We analyse the sources of improvements in the following section.

Error Analysis
We first examine the behavior of the US and the DS to understand the improved success rate in transfer learning. The models are those of Table 5 and are examined after fine-tuning using the Naive method (Naive) and then after reinforcement learning (Naive+RL). For the DS, the rates of missing entities (Miss Ent.) and of wrong answers (Wrong Ans.) are reported. For the US, rates of repetitions of attributes (Rep. Att.) and of missing answers (Miss Ans.) are reported. The results shown in Table 7 are averaged over the five adaptation domains. We see that with RL optimisation the errors made by the two agents are reduced significantly. Notably, the user model learns not to repeat the information already provided and attempts to answer more of the questions from the dialogue agent. These are the behaviors the reward structure of Sec. 3.2 is intended to encourage, and they lead to more successful interactions in policy learning.

Exploration of States and Actions
We now investigate whether our framework encourages exploration through increased interaction in transfer learning. We report the number of unique belief states in the training corpus and in the dialogues generated during RL interaction, as well as the unique action sequences per state that each agent predicts. As shown in Table 8, the DS encounters more states in interaction with the US and also takes more unique actions in reinforcement learning relative to what it sees in supervised learning. In this way the DS considers additional strategies during the simulated training dialogues, with the opportunity to reach better performance even with only limited supervised data. Detailed results for each adaptation case are provided in Appendix A.4.
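These exploration statistics can be computed with a simple pass over (state, action) pairs, as in the sketch below. Representing a belief state as a tuple of slot-value strings and an action sequence as a tuple of act tokens is an assumed encoding chosen so that both are hashable; the toy turns are made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def exploration_stats(turns: Iterable[Tuple[Hashable, Hashable]]) -> Tuple[int, float]:
    """Return (number of unique states, average number of unique actions per state)."""
    actions_by_state: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for state, action in turns:
        actions_by_state[state].add(action)
    num_states = len(actions_by_state)
    if num_states == 0:
        return 0, 0.0
    avg_actions = sum(len(a) for a in actions_by_state.values()) / num_states
    return num_states, avg_actions


turns = [
    (("hotel_area=north",), ("request", "hotel_price")),
    (("hotel_area=north",), ("inform", "hotel_name")),
    (("hotel_area=north",), ("request", "hotel_price")),  # repeated action, counted once
    (("hotel_area=north", "hotel_price=cheap"), ("inform", "hotel_name")),
]
print(exploration_stats(turns))  # (2, 1.5)
```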

RL Learning Curve
Here we show that the designed reward structure is indeed a useful objective for training. We can see that both the reward value and model performance are consistently improved during RL, and their high correlation verifies the efficacy of the proposed reward design for training task-oriented dialogue systems.

Human Evaluation
The human assessment of dialogue quality is performed to confirm the improvements of the proposed methods. 400 dialogues, generated by the two trained agents, are evaluated by 14 human assessors. Each assessor is shown a comparison of two dialogues where one dialogue is generated by the models using supervised learning (SL) and another is generated by the models after RL optimization. Note that here we are evaluating the performance gain during interactions between two agents (Sec. 4.1), instead of the gain in benchmark results by interacting with the static corpus (Sec. 4.2). This is why the baseline is our SL model instead of the existing state-of-the-art systems.
Table 9: Human assessment of the system quality under supervised learning and reinforcement learning.
The assessor offers judgement regarding:
• Which dialogue system completes the task more successfully (DS Success)?
• Which user simulator behaves more like a real human user (US Human-like)?
• Which dialogue is more natural, fluent and efficient (Dialogue Flow)?
The results with relative win ratio, shown in Table 9, are consistent with the automatic evaluation. With the proposed RL optimisation, the DS is more successful in dialogue completion. More importantly, joint optimisation of the US is found to produce more human-like behavior. The improvement under the two agents leads to a more natural and efficient dialogue flow.

Related Work
In the emerging field of end-to-end DSs, all components of a system are trained jointly (Liu and Lane, 2017a; Wen et al., 2017; Lei et al., 2018). RL methods have been used effectively to optimize end-to-end DSs (Dhingra et al., 2017; Zhao et al., 2019), although using rule-based USs or a fixed corpus for interaction. Recent works utilise powerful transformers such as GPT-2 (Peng et al., 2020; Hosseini-Asl et al., 2020) or T5 (Lin et al., 2020b) for dialogue modeling and reach state-of-the-art performance; however, involving a user simulator during training remains unexplored in that line of work. By comparison, this work uses a learned US as the environment for RL. The two agents we propose are able to generate abundant high-quality dialogue examples and they can be extended easily to unseen domains. By utilizing an interactive environment instead of a fixed corpus, more dialogue strategies are explored and more dialogue states are visited.
There have been various approaches to building USs. One line of research is rule-based simulation such as the agenda-based user simulator (ABUS) (Schatzmann and Young, 2009; Li et al., 2016). The ABUS's structure is such that it has to be re-designed for different tasks, which presents challenges in shifting to new scenarios. Another line of work is data-driven modelling. El Asri et al. (2016) modelled user simulation as a seq2seq task, where the output is a sequence of user dialogue acts at the level of semantics. Gur et al. (2018) proposed a variational hierarchical seq2seq framework to introduce more diversity in generating the user dialogue act. The Neural User Simulator (NUS) is a seq2seq model that learns the user behaviour entirely from a corpus, generates natural language instead of dialogue acts and possesses an explicit goal representation. The NUS outperformed the ABUS on several metrics. Kreyssig (2018) also compared the NUS and ABUS to a combination of the ABUS with an NLG component. However, none of these prior works are suitable for modelling complex, multi-domain dialogues in an end-to-end fashion. By contrast, the user model proposed here consumes and generates text and so can be directly employed to interact with the DS, communicating via natural language.
The literature on joint optimization of the DS and the US is the line of research most relevant to our work. Takanobu et al. (2020) proposed a hybrid value network using MARL (Lowe et al., 2017) with role-aware reward decomposition used in optimising the dialogue manager. However, their model requires separate NLU/NLG models to interact via natural language, which hinders its application in transfer learning to new domains. Liu and Lane (2017b) and Papangelis et al. (2019) learn both the DS and the US in a (partially) end-to-end manner. However, their systems are designed for the single-domain dataset (DSTC2) and cannot handle the complexity of multi-domain dialogues: 1) their models can only predict one dialogue act per turn, which is not sophisticated enough for modelling multiple concurrent dialogue acts; 2) the simple DST components cannot achieve satisfactory performance in the multi-domain setup; 3) the change of the user goal as the dialogue proceeds is not modelled, which we found in our experiments to be very important for learning complex behaviors of user simulators. Relative to these three publications, this paper focuses on joint training of two fully end-to-end agents that are able to participate in complex multi-domain dialogues. More importantly, it is shown that the proposed framework is highly effective for transfer learning, which is a novel contribution relative to previous work.

Conclusion and Future Work
We propose a novel joint learning framework for training both the DS and the US for complex multi-domain dialogues. In low-resource scenarios, the two agents can generate more dialogue data by interacting with each other, and their behaviors can be significantly improved using RL through this self-play strategy. Two types of reward are investigated, and the turn-level reward is found to be more beneficial due to its fine-grained structure. Experiments show that our framework outperforms previously published results on the MultiWOZ dataset. In two transfer learning setups, our method further improves the well-performing EWC models and substantially boosts the final performance. Future work will focus on improving the two agents' underlying capability with powerful transformer-based models.

A.1 Training Details
Both the DS and the US are trained in an end-to-end fashion using the Adam optimizer. The sizes of the embedding and of the hidden layers are set to 300. During supervised training, the batch size is 100 and the learning rate is 0.001, while during RL, 10 is used as the batch size and 0.0001 as the learning rate for stability. We set the discounting factor γ to 1. The computing infrastructure used is Linux 4.4.0-138-generic x86_64 with the NVIDIA GPU GTX-1080. Average run time per model using 100% training data is around 6 hours. Model parameters is around 11M in total.
The turn-level rewards used for the best models in the benchmark results are reported in Table 10. All rewards are tuned based on the combined score on the validation set, averaged over three seeds. As for dialogue-level rewards, a positive reward of 1.0 is given if a dialogue is successful.
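To illustrate how the two reward types interact with a discounting factor of γ = 1, the sketch below computes per-turn returns for both. It rests on assumptions not stated in the text: the dialogue-level reward is placed on the final turn, an unsuccessful dialogue receives 0, and the turn-level values in the example are made up for illustration (the tuned values are those of Table 10). The function names are hypothetical.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every turn t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def dialogue_level_rewards(num_turns: int, success: bool) -> List[float]:
    """Assumed placement: 1.0 on the last turn if successful, 0 elsewhere."""
    rewards = [0.0] * num_turns
    if success and num_turns > 0:
        rewards[-1] = 1.0
    return rewards


# Dialogue-level reward: with gamma = 1 every turn receives the same return,
# so individual turns are not distinguished.
dialogue_returns = discounted_returns(dialogue_level_rewards(3, success=True))

# Turn-level reward (illustrative values only): returns differ per turn.
turn_returns = discounted_returns([0.5, -0.5, 1.0])

print(dialogue_returns, turn_returns)
```

Under these assumptions, the dialogue-level signal assigns an identical return to every turn of a successful dialogue, whereas the turn-level signal yields a different return for each turn, which is one way to read the "fine-grained structure" mentioned in the conclusion.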

A.2 Details of Dataset
As noted in the paper, we follow the original split of the MultiWOZ dataset, with 8420/1000/1000 dialogues in the train/dev/test split. The numbers of dialogues in the two transfer learning scenarios are provided in Tables 11 and 12.

                               Taxi
Train   300   300   300   300   300
Dev     438   415   400   484   206
Test    437   394   396   495   195

A.3 Error Analysis
The error analysis for each domain adaptation case is provided in Tables 13 and 14.

A.4 Exploration
The number of explored dialogue states and the average number of unique dialogue actions per state, for each case of the two transfer learning scenarios, are provided in Tables 15 and 16.
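The two exploration statistics can be computed from a log of (state, action) pairs collected during the agents' interaction. The sketch below is a minimal illustration under the assumption that dialogue states and actions are encoded as hashable keys; how they are encoded in the actual experiments is not specified here, and the function name `exploration_stats` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def exploration_stats(
    visits: Iterable[Tuple[Hashable, Hashable]]
) -> Tuple[int, float]:
    """Return (number of distinct states, mean unique actions per state)."""
    actions_per_state: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for state, action in visits:
        actions_per_state[state].add(action)
    num_states = len(actions_per_state)
    if num_states == 0:
        return 0, 0.0
    total_unique = sum(len(actions) for actions in actions_per_state.values())
    return num_states, total_unique / num_states


# Toy log: state "s1" is visited three times with two distinct actions,
# state "s2" once with a single action.
toy_log = [("s1", "inform"), ("s1", "request"), ("s1", "inform"), ("s2", "book")]
num_states, avg_actions = exploration_stats(toy_log)
print(num_states, avg_actions)
```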

A.5 Transfer Learning
Here we provide the success rates in the two transfer learning setups.

A.6 Generated Dialogue Examples
Here we provide two pairs of example dialogues generated through the interaction of the two agents, trained using supervised learning (SL) and RL respectively. As seen in the first dialogue of Table 19, since the SL user model answers the system's request incorrectly, the system keeps asking the unresolved question, leading to an unsuccessful dialogue with a loop (highlighted in color). On the other hand, the interaction between the RL models (second dialogue) is much smoother and more efficient, since both agents answer each other correctly. This indicates that the designed rewards are useful for improving the agents' behaviors. A similar trend can be observed in Table 20, where the user model keeps stating the same information about the restaurant.

19-sys: TR0269 arrives at 18:15 . would you like me to book it for you ?
20-usr: yes , please . i need the price and travel time .
21-sys: the train id is TR0269 , the price is 10 and the travel time is 50 .
22-usr: i am also looking for a particular attraction called club salsa .
23-sys: club salsa is a nightclub in the centre . the address is 1 station road . entrance is free . is there anything else i can help you with ?
24-usr: that is all i need . thank you for your help .
25-sys: thank you for using our system !

Dialogue generated by two agents trained using RL (Dialogue is smooth and successful)
00-usr: i am looking for a train from kings lynn on wednesday .
01-sys: what is your destination ?
02-usr: i need to arrive in cambridge by 18:15 .
03-sys: i have train TR0415 that leaves at 17:11 and arrives at 17:58 . would you like me to book it for you ?
04-usr: yes , please book 2 tickets .
05-sys: booking was successful , the total fee is 9.80 pounds payable at the station . reference number is fztwszhh . is there anything else i can help you with ?
06-usr: yes , i am also looking for a particular attraction called club salsa .
07-sys: club salsa is located at 1 station road , postcode cb12jb . is there anything else i can help you with ?
08-usr: no , that s all i need today . thank you .
09-sys: thank you for using our service . have a great day !

Table 19: Example 1: two dialogues generated by the SL system and the RL system respectively. The dialogue loop with repeated information is highlighted. Output utterances are lexicalised with corresponding slot values (belief state for system; user goal for user). If the system generates attributes that do not exist in the belief state, the values are randomly sampled from the ontology, e.g., the train departure "london kings cross" in the first dialogue.