Domain-independent User Simulation with Transformers for Task-oriented Dialogue Systems

Dialogue policy optimisation via reinforcement learning requires a large number of training interactions, which makes learning with real users time consuming and expensive. Many set-ups therefore rely on a user simulator instead of humans. These user simulators have their own problems. While hand-coded, rule-based user simulators have been shown to be sufficient in small, simple domains, for complex domains the number of rules quickly becomes intractable. State-of-the-art data-driven user simulators, on the other hand, are still domain-dependent. This means that adaptation to each new domain requires redesigning and retraining. In this work, we propose a domain-independent transformer-based user simulator (TUS). The structure of TUS is not tied to a specific domain, enabling domain generalization and the learning of cross-domain user behaviour from data. We compare TUS with the state-of-the-art using automatic as well as human evaluations. TUS can compete with rule-based user simulators on pre-defined domains and is able to generalize to unseen domains in a zero-shot fashion.


Introduction
Task-oriented dialogue systems are designed to help users accomplish specific goals within a particular task such as hotel booking or finding a flight. Solving this problem typically requires tracking and planning (Young, 2002). In tracking, the system keeps track of information about the user goal from the beginning of the dialogue until the current dialogue turn. In planning, the dialogue policy makes decisions at each turn to maximise future rewards at the end of the dialogue. The system typically needs thousands of interactions to train a usable policy (Schatzmann et al., 2007; Pietquin et al., 2011; Li et al., 2016; Shi et al., 2019). The number of interactions required makes learning from real users time-consuming and costly. It is therefore appealing to automatically generate a large number of dialogues with a user simulator (US) (Eckert et al., 1997).
Rule-based USs are interpretable and have shown success when applied in small, simple domains. However, expert knowledge is required to design their rules and the number of rules needed for complex domains quickly becomes intractable (Schatzmann et al., 2007). In addition, handcrafted rules are unable to capture human behaviour to its fullest extent, leading to suboptimal performance when interacting with real users (Schatzmann et al., 2006).
Data-driven USs on the other hand can learn user behaviour directly from a corpus. However, they are still domain-dependent. This means that in order to accommodate an unseen domain one needs to collect and annotate a new dataset, and retrain or even re-engineer the simulator.
We propose a transformer-based domain-independent user simulator (TUS). Unlike existing data-driven simulators, we design the feature representation to be domain-independent, allowing the simulator to easily generalise to new domains without modifying or retraining the model. We utilise a transformer architecture (Vaswani et al., 2017) so that the input sequence can have a variable length and dynamic order. The dynamic order takes into account the user's priorities and the varying input length enables the US to incorporate system actions in a seamless manner. TUS predicts the value of each slot and the domains of the current turn, allowing the model to optimise its performance in multiple granularities. By disentangling the user behaviour from the domains, TUS can learn a more general user policy to train the dialogue policy.
We compare policies trained with our TUS to policies trained with other USs through indirect and direct evaluation as well as human evaluation. The results show that policies trained with TUS outperform those that are trained with another data-driven US and are on par with policies trained with the agenda-based US (ABUS). Moreover, the policy generalises better when evaluated with a different US. Automatic and human evaluations on our zero-shot study show that leave-one-domain-out TUS is able to generalise to unseen domains while maintaining a comparable performance to ABUS and TUS trained on the full training data.

Related Work
The quality of a US has a significant impact on the performance of a reinforcement-learning based task-oriented dialogue system (Schatzmann et al., 2005). One of the earliest models is the N-gram user simulator proposed by Eckert et al. (1997). It uses a 2-gram model $P(a_u \mid a_m)$ to predict the user action $a_u$ according to the system action $a_m$. Since it only has access to the latest system action, its behaviour can be illogical if the goal changes. Therefore, models which can take into account a given user goal were introduced (Georgila et al., 2006; Eshky et al., 2012). The Bayesian model of Daubigney et al. (2012) predicts the user action based on the user goal, and hidden Markov models are used to model the user and the system behaviour (Cuayáhuitl et al., 2005). The graph-based US of Scheffler and Young (2002) combines all possible dialogue paths in a graph. It can generate reasonable and consistent behaviour, but is impractical to implement, since extensive domain knowledge is required.
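To make the limitation of the 2-gram simulator concrete, the following is a minimal sketch of such a model, estimated by counting. The action labels and the toy corpus are invented for illustration and do not come from any dataset; the function names are likewise hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram(pairs):
    """Count how often each user action follows each system action."""
    counts = defaultdict(Counter)
    for system_action, user_action in pairs:
        counts[system_action][user_action] += 1
    return counts


def bigram_probability(counts, system_action, user_action):
    """Maximum-likelihood estimate of P(a_u | a_m)."""
    total = sum(counts[system_action].values())
    return counts[system_action][user_action] / total if total else 0.0


def sample_user_action(counts, system_action, rng):
    """Sample a user action given only the latest system action."""
    actions = list(counts[system_action])
    weights = [counts[system_action][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Toy corpus of (system action, user action) pairs; the labels are made up.
corpus = [
    ("request_area", "inform_area"),
    ("request_area", "inform_area"),
    ("request_area", "request_price"),
    ("offer_hotel", "request_address"),
]
model = train_bigram(corpus)
print(bigram_probability(model, "request_area", "inform_area"))
print(sample_user_action(model, "offer_hotel", random.Random(0)))
```

Note that neither function receives the user goal or the dialogue history, which is why the simulated behaviour can become inconsistent with the goal.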
The agenda-based user simulator (ABUS) (Schatzmann et al., 2007) models the user state as a stack-like agenda, ordered according to the priority of the user actions. The probabilities of updating the agenda and choosing user actions are set manually or learned from data (Keizer et al., 2010). Still, the stacking and popping rules are domain-dependent and need to be designed carefully.
To build a data-driven model, the sequence-to-sequence (Seq2Seq) model structure is widely used. El Asri et al. (2016) propose a Seq2Seq semantic level US with an encoder-decoder structure. Each turn is fed into the encoder recurrent neural network (RNN) and embedded as a context vector. Then this context vector is passed to the decoder RNN to generate user actions. To add new domains, it is necessary to modify the domain-dependent feature representation and retrain the model.
Figure 1: The difference between USs. We compare to which extent a model is data-driven, domain-independent and interpretable. (Models shown: TUS, VHUS, NUS, ABUS, Graph-based, Seq2Seq.)
Instead of generating semantic level output, the neural user simulator (NUS) by Kreyssig et al. (2018) generates responses in natural language, thus requiring less labeling, at the expense of interpretability. However, its feature representation is still domain-dependent.
A variational hierarchical Seq2Seq user simulator (VHUS) is proposed by Gür et al. (2018). Instead of designing dialogue history features, the model encodes the user goal and system actions with a vector using an RNN, which alleviates the need for heavy feature engineering. However, the inputs are represented as one-hot encodings, which are also dependent on the ontology. In addition, the output generator is not constrained by the ontology in any way, so it can generate impossible actions.
As shown in Fig. 1, ABUS and graph-based models are domain-dependent and require significant design efforts. Data-driven models such as Seq2Seq, NUS, and VHUS can learn from data, but are constrained by the underlying domain. NUS generates natural language responses, which requires less labeling, but comes with reduced interpretability. Shi et al. (2019) compared different ways to build a US and indicated that the data-driven models suffer from bias in the corpus. If some actions are rare in the corpus, the model cannot capture them. Thus, the dialogue policy cannot explore all possible paths during training with the data-driven USs. It is important to learn more general human behaviour to reduce the impact of the corpus bias.

Problem Description
Task-oriented dialogue systems are defined by a given ontology, which specifies the concepts that the system can handle. The ontology can include multiple domains. In each domain, there are informable slots, which are the attributes that users can assign values to, and requestable slots, which are the attributes that users can query. For example, in Fig. 2 the user goal has two domains, "hotel" and "restaurant". The slot Area is an informable slot with the value North in domain "hotel" and Addr is a requestable slot in domain "restaurant". The system state records the slots and values mentioned in the dialogue history. A US for task-oriented dialogue systems needs to provide coherent responses according to a given user goal $G = \{\text{domain}_1 : [(\text{slot}_1, \text{value}_1), (\text{slot}_2, \text{value}_2), \dots], \dots\}$. The domains, slots and values are selected from the ontology.
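As an illustration of this goal format, the sketch below encodes the user goal of the running example as a mapping from domains to lists of (slot, value) pairs and checks it against a toy ontology. The ontology contents and the convention of marking requestable slots with "?" are assumptions made for this example only.

```python
# Toy ontology: domain -> slot -> allowed values (hypothetical subset).
# An empty list means the slot is requestable only in this toy example.
ONTOLOGY = {
    "hotel": {"area": ["north", "south"], "name": [], "price": ["cheap", "expensive"]},
    "restaurant": {"area": ["north", "south"], "addr": []},
}

# User goal in the form G = {domain: [(slot, value), ...]}.
# Assumption: "?" marks a slot whose value the user wants to request.
USER_GOAL = {
    "hotel": [("area", "north"), ("name", "?")],
    "restaurant": [("area", "north"), ("addr", "?")],
}


def constraints(goal):
    """Informable slots with the values the user wants."""
    return [(d, s, v) for d, pairs in goal.items() for s, v in pairs if v != "?"]


def requests(goal):
    """Requestable slots the user wants to ask about."""
    return [(d, s) for d, pairs in goal.items() for s, v in pairs if v == "?"]


def is_consistent(goal, ontology):
    """Check that every domain, slot and value of the goal is in the ontology."""
    for domain, pairs in goal.items():
        if domain not in ontology:
            return False
        for slot, value in pairs:
            if slot not in ontology[domain]:
                return False
            if value != "?" and value not in ontology[domain][slot]:
                return False
    return True


print(constraints(USER_GOAL))
print(requests(USER_GOAL))
print(is_consistent(USER_GOAL, ONTOLOGY))
```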
The user action is composed of user intents, domains, slots, and values. We consider user intents that appear in the MultiWOZ dataset. It is of course possible to consider arbitrary intents within the same model architecture, as long as they are defined a priori. The two possible user intents we consider are Inform and Request. With Inform, the user can provide information, correct the system or confirm the system's recommendations. When a user goal cannot be fulfilled, the user can also randomly select a value from the ontology and change the goal. With Request, the user can request information about certain slots.
The system action is similar to the user action, but there exist more (system) intents. For example, the system can provide suggestions to users with the intent Recommendation and make reservations for users with the intent Book. More system intents can be found in Appendix A.
We view user simulation in a task-oriented dialogue as a sequence-to-sequence problem. For each turn $t$, we extract the input feature vectors $V^t$ of the input list of slots $S^t = [s_1, s_2, \dots]$, which is composed of the slots from the user goal and the system action. The output sequence $O^t = [o^t_1, o^t_2, \dots]$ is then generated by the model, where $o^t_i$ shows how the value for slot $s_i$ is obtained. The input feature representation and the output target should be domain-independent in order to generalise to unseen domains without redesigning and retraining. More details can be found in Sec. 4. By working on the semantic level during training, we retain interpretability. To interact with real users during human evaluation, we rely on template-based natural language generation to convert the semantic-level actions into utterances, as language generation is out of the scope of this work.
[Figure 2: example user goal and conversation]
User Goal
  Info: Hotel-Area=North, Rest-Area=North
  Reqt: Hotel-Name, Rest-Addr
Conversation
  Turn 0
    USR: I want to find a hotel in the north and a nearby restaurant.
         Inform(Hotel-Area=North, Rest-Area=North)
    SYS: There are some good hotels in the south. Which price range do you prefer? Would you mind providing more information?
         Recom(Hotel-Area=South), Request(Hotel-Price), general-reqmore()
  Turn 1
    USR: No, I want one in the north and I don't care about the price range.
         Inform(Hotel-Area=North, Hotel-Price=dontcare)

Transformer-based Domain-independent User Simulator
The TUS model structure is shown in Fig. 3. For each turn $t$, the list of input feature vectors $V^t = [v^t_1, \dots, v^t_{n_t}]$ is generated based on the system actions and the user goal, where $v^t_i$ is the feature vector of slot $s_i$ and $n_t$ is the length of the input list $V^t$ in turn $t$. We explain the feature representation in detail in Sec. 4.1. Inspired by ABUS, which models the user state as a stack-like agenda, the length of the input list $n_t$ at each turn $t$ varies by taking into account slots mentioned in the system's action. For example, in Fig. 3 the input list $V^0$ only contains the slots in the user goal at the first turn. Then the system mentions a slot not in the user goal, Hotel-Price. So in turn 1 the length of the input list $V^1$ is $n_1 = n_0 + 1$ because one slot is inserted into the input list $V^1$. The whole input sequence to the model is […]. The user policy network is a transformer (Vaswani et al., 2017; Devlin et al., 2019). We choose this structure because transformers are able to handle input sequences of arbitrary lengths and to capture the relationship between slots thanks to self-attention. The model structure includes a linear layer and position encoding for inputs, two transformer layers, and one linear layer for outputs.
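The growth of the input list from one turn to the next can be sketched as follows. This is a minimal illustration that assumes newly mentioned slots are appended at the end of the list; the text only states that the slot is inserted, so the position is an assumption, and `update_slot_list` is a hypothetical helper name.

```python
def update_slot_list(slot_list, system_action_slots):
    """Return the slot list for the next turn.

    Slots already in the list keep their position; slots that the system
    mentions for the first time are appended (assumed insertion position).
    """
    updated = list(slot_list)
    for slot in system_action_slots:
        if slot not in updated:
            updated.append(slot)
    return updated


# Turn 0: only the slots of the user goal from the running example.
slots_turn_0 = ["Hotel-Area", "Hotel-Name", "Rest-Area", "Rest-Addr"]

# The system recommends Hotel-Area=South and requests Hotel-Price.
slots_turn_1 = update_slot_list(slots_turn_0, ["Hotel-Area", "Hotel-Price"])

print(len(slots_turn_0), len(slots_turn_1))
print(slots_turn_1)
```

Because only Hotel-Price is new, the list grows by exactly one entry, matching $n_1 = n_0 + 1$ in the example.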

User Policy Network
The output sequence $O^t$ consists of one-hot vectors $o^t_i$ which determine the values of the slots $s_i$ at turn $t$. The dimensions of $o^t_i \in \{0, 1\}^6$ correspond to "none", "don't care", "?", "from the user goal", "from the system state", or "randomly selected". More precisely, "none" means that this slot is not mentioned in this turn, "don't care" signifies that the US does not care about this slot, "?" means the US wants to request information about this slot, "from the user goal" implies that the value is the same as in the user goal, "from the system state" means that the value is as mentioned by the system, and lastly "randomly selected" indicates that the US wants to change its goal by randomly selecting a value from the ontology.
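The six output classes describe where a value comes from rather than the value itself, which is what keeps the output independent of the ontology. A minimal sketch of how such an output could be turned back into a concrete value is shown below; the class labels, the function name and the candidate value list are illustrative assumptions.

```python
import random

# Order follows the description above.
OUTPUT_CLASSES = ["none", "dontcare", "?", "from_goal", "from_system", "random"]


def decode_slot(one_hot, goal_value, system_value, candidate_values, rng):
    """Map a 6-dimensional one-hot output to a concrete slot value.

    Returns None if the slot is not mentioned in this turn.
    """
    label = OUTPUT_CLASSES[one_hot.index(1)]
    if label == "none":
        return None
    if label in ("dontcare", "?"):
        return label
    if label == "from_goal":
        return goal_value
    if label == "from_system":
        return system_value
    # "random": the user changes the goal to a value drawn from the ontology.
    return rng.choice(candidate_values)


rng = random.Random(0)
areas = ["north", "south", "east"]
print(decode_slot([0, 0, 0, 1, 0, 0], "north", "south", areas, rng))
print(decode_slot([0, 0, 0, 0, 1, 0], "north", "south", areas, rng))
print(decode_slot([0, 1, 0, 0, 0, 0], None, "?", [], rng))
```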
The loss function for slots measures the difference between the predicted output $O^t$ and the target $Y^t$ at each turn $t$ from the dataset as computed by cross entropy (CE), i.e.,
$$\text{loss}_{\text{slots}} = \frac{1}{n_t} \sum_{i=1}^{n_t} \text{CE}(o^t_i, y^t_i), \quad (1)$$
where $n_t$ is the number of slots in the input list, $o^t_i$ is the output, and $y^t_i$ is the target of slot $s_i$ in turn $t$.
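A per-turn slot loss of this kind can be sketched in plain Python as follows. The sketch assumes that the model outputs a probability distribution over the six classes for each slot and that the per-slot cross entropies are averaged over the $n_t$ slots; both are assumptions of this example rather than details confirmed by the text.

```python
import math


def cross_entropy(probabilities, target_one_hot):
    """Cross entropy between a predicted distribution and a one-hot target."""
    target_index = target_one_hot.index(1)
    return -math.log(probabilities[target_index])


def slot_loss(predictions, targets):
    """Average cross entropy over all slots of one turn (assumed normalisation)."""
    losses = [cross_entropy(p, y) for p, y in zip(predictions, targets)]
    return sum(losses) / len(losses)


# Two slots in the turn: one confident correct prediction, one uniform guess.
predictions = [
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [1 / 6] * 6,
]
targets = [
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
print(slot_loss(predictions, targets))
```

In practice a small constant or a log-softmax formulation would be used to avoid taking the logarithm of zero.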

Domain-independent Input Features
We design the input feature representation $v^t_i$ of each slot $s_i$ in turn $t$ consisting of a set of sub-vectors, all of which are domain-independent. For better readability, we drop the slot index $i$ and the turn index $t$, i.e. we write $v$ for $v^t_i$.

Basic Information Features
Inspired by the feature representation proposed in El Asri et al. (2016), we use a feature vector $v_{\text{basic}}$ that is composed of binary sub-vectors to represent the basic information for each slot. Each slot has two value vectors: $v_{\text{sys value}}$ represents the value in the system state, and $v_{\text{user value}}$ represents the value in the user goal. Each value vector is a 4-dimensional one-hot vector, with coordinates encoding "none", "?", "don't care" or "other values", in this order. For example, in turn 1 in Fig. 2, for slot Hotel-Price $v_{\text{user value}} = [1, 0, 0, 0]$, i.e., "none", because it is not in the user goal, and $v_{\text{sys value}} = [0, 1, 0, 0]$, i.e., "?", because the system requests it.
The slot type vector $v_{\text{type}}$ is a 2-dimensional vector which represents whether a slot is in the user goal as a constraint or a request. For example, in Fig. 2 for Hotel-Area $v_{\text{type}} = [1, 0]$ (constraint), while for Hotel-Name $v_{\text{type}} = [0, 1]$ (request). A value of $[0, 0]$ means that the slot is not included in the user goal.
The state vector $v_{\text{ful}}$ encodes whether or not a constraint or informable slot has been fulfilled. The value is set to 1 if the constraint has been fulfilled, and to 0 otherwise. The vector $v_{\text{first}}$ similarly encodes whether a slot is mentioned for the first time.
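The sub-vectors described above can be assembled with a few lines of code. The sketch below builds them for a single slot and concatenates them in the order user value, system value, type, fulfilled, first mention. Function and argument names are hypothetical, and setting the first-mention flag for Hotel-Price in turn 1 is an assumption of the example.

```python
VALUE_CLASSES = ["none", "?", "dontcare", "other"]


def value_vector(value):
    """4-dimensional one-hot over none, ?, don't care, other values."""
    if value is None:
        key = "none"
    elif value in ("?", "dontcare"):
        key = value
    else:
        key = "other"
    return [1 if name == key else 0 for name in VALUE_CLASSES]


def type_vector(slot_kind):
    """[1, 0] for a constraint, [0, 1] for a request, [0, 0] if not in the goal."""
    return {"constraint": [1, 0], "request": [0, 1]}.get(slot_kind, [0, 0])


def basic_features(user_value, sys_value, slot_kind, fulfilled, first_mention):
    """Concatenate the binary sub-vectors into one 12-dimensional vector."""
    return (
        value_vector(user_value)
        + value_vector(sys_value)
        + type_vector(slot_kind)
        + [int(fulfilled)]
        + [int(first_mention)]
    )


# Hotel-Price in turn 1: not in the user goal, requested by the system.
print(basic_features(None, "?", None, fulfilled=False, first_mention=True))

# Hotel-Area in turn 1: goal value North, system mentions South, a constraint.
print(basic_features("north", "south", "constraint", fulfilled=False, first_mention=False))
```

None of the entries depends on the identity of the domain or the slot, only on the role the slot plays in the goal and in the dialogue so far.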
The basic information feature vector $v_{\text{basic}}$ is the concatenation of these vectors, i.e.,
$$v_{\text{basic}} = v_{\text{user value}} \oplus v_{\text{sys value}} \oplus v_{\text{type}} \oplus v_{\text{ful}} \oplus v_{\text{first}}. \quad (2)$$

System Action Features
The system action feature vector $v_{\text{system action}}$ encodes system actions in each turn. There are two kinds of system actions, general actions and domain-specific actions. The general actions are composed only of general intents, such as "reqmore" and "bye", for example general-reqmore(). The feature vector of general actions $v_{\text{gen}}$ is a multi-hot encoding of whether or not a general intent appears in the dialogue. With a total number of $n_{\text{gen}}$ general intents, for each $k \in \{1, \dots, n_{\text{gen}}\}$, the $k$-th entry of $v_{\text{gen}}$ is set to 1 if the $k$-th general intent is part of the system act.
On the other hand, domain-specific actions are composed of domains, slots, values, and domain-specific intents such as "recommend" and "select", for example Recom(Hotel-Area=South). Each domain-specific action vector $v_{\text{spec}_j}$ with the domain-specific $j$-th intent, $j \in \{1, \dots, n_{\text{spec}}\}$, where $n_{\text{spec}}$ is the total number of domain-specific intents, is represented by a 3-dimensional one-hot encoding that describes whether the value is "none", "?" or "other values".
The final action representation $v_{\text{system action}}$ is formed by concatenating $n_{\text{spec}}$ domain-specific action representations together with the general action representation, i.e.,
$$v_{\text{system action}} = v_{\text{spec}_0} \oplus \cdots \oplus v_{\text{spec}_{n_{\text{spec}}}} \oplus v_{\text{gen}}. \quad (3)$$
For the slot Hotel-Area in Fig. 3, we have a vector for each intent. For the intent "recommend" $v_{\text{spec}_0} = [0, 0, 1]$, which means that "other values" (in this case South) are mentioned. For all other domain-specific intents, the vectors are $[1, 0, 0]$ since no value is mentioned. In terms of the general intents, only "reqmore" is mentioned, so $v_{\text{gen}}[1] = 1$, as "reqmore" is the first general intent.
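The construction of the system action features for one slot can be sketched as below. The intent lists are small illustrative subsets chosen for this example, not the full intent inventory of the dataset, and the tuple format for system actions is an assumption.

```python
# Illustrative subsets only; the order fixes the position of each sub-vector.
SPEC_INTENTS = ["recommend", "request", "select"]
GENERAL_INTENTS = ["reqmore", "bye"]


def spec_vector(value):
    """3-dimensional one-hot over none, ?, other values."""
    if value is None:
        return [1, 0, 0]
    if value == "?":
        return [0, 1, 0]
    return [0, 0, 1]


def system_action_features(slot, system_actions):
    """Encode the system actions of one turn from the viewpoint of one slot.

    system_actions is a list of (intent, slot, value) tuples; general
    actions carry None for both slot and value.
    """
    features = []
    for intent in SPEC_INTENTS:
        value = None
        for act_intent, act_slot, act_value in system_actions:
            if act_intent == intent and act_slot == slot:
                value = act_value
        features += spec_vector(value)
    general = {intent for intent, act_slot, _ in system_actions if act_slot is None}
    features += [1 if intent in general else 0 for intent in GENERAL_INTENTS]
    return features


# System turn of the running example.
actions = [
    ("recommend", "Hotel-Area", "South"),
    ("request", "Hotel-Price", "?"),
    ("reqmore", None, None),
]
print(system_action_features("Hotel-Area", actions))
print(system_action_features("Hotel-Price", actions))
```

The general part of the vector is identical for every slot in the turn, while the domain-specific part differs depending on which slot each action refers to.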

User Action Features
The output vector from the previous turn $O^{t-1}$ is also included in the input features of the next turn $t$ to take into account what has been mentioned by the US itself, i.e. for slot $s_i$ in turn $t$, the user action feature is $v_{\text{user action}} = o^{t-1}_i$.

Domain and Slot Index Features
In some cases, multiple slots may share the same basic feature $v_{\text{basic}}$, system action feature $v_{\text{system action}}$ and user action feature $v_{\text{user action}}$. This similarity in features of different slots makes it difficult for the model to distinguish one slot from another, despite the positional encoding. In particular, it is challenging for the model to learn the relationship between turns for a given slot because the number and the order of slots vary from one turn to the next. This may lead to over-generation: the model selects all slots with the same feature vector.
To counteract this issue, we introduce the index feature $v_{\text{index}}$, which consists of the domain index feature $v_{\text{domain index}} \in \{0, 1\}^{l_d}$ and the slot index feature $v_{\text{slot index}} \in \{0, 1\}^{l_s}$, where $l_d$ is the maximum number of domains in a user goal and $l_s$ is the maximum number of slots in any given domain.
To make the index feature ontology-independent, for a particular slot, $v_{\text{index}}$ remains consistent throughout a dialogue, but varies between dialogues. The order of the index in each dialogue is determined by the order in the user goal. For example, the "hotel" domain can be the first domain in the user goal of one dialogue and the second domain in the next.
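The following sketch shows one way to derive such per-dialogue index features from the order of domains and slots in the user goal. It covers only slots that appear in the goal; the handling of slots introduced later by the system, the function name, and the values chosen for $l_d$ and $l_s$ are assumptions of the example.

```python
def index_features(user_goal, max_domains, max_slots):
    """Assign one-hot domain and slot indices by order of appearance in the goal.

    user_goal maps each domain to an ordered list of slot names.
    Returns a dict from (domain, slot) to the concatenated index vector.
    """
    features = {}
    for domain_idx, (domain, slots) in enumerate(user_goal.items()):
        domain_vec = [1 if k == domain_idx else 0 for k in range(max_domains)]
        for slot_idx, slot in enumerate(slots):
            slot_vec = [1 if k == slot_idx else 0 for k in range(max_slots)]
            features[(domain, slot)] = domain_vec + slot_vec
    return features


# Same domains, different order in two dialogues.
goal_a = {"hotel": ["area", "name"], "restaurant": ["area", "addr"]}
goal_b = {"restaurant": ["area", "addr"], "hotel": ["area", "name"]}

features_a = index_features(goal_a, max_domains=3, max_slots=4)
features_b = index_features(goal_b, max_domains=3, max_slots=4)
print(features_a[("hotel", "area")])
print(features_b[("hotel", "area")])
```

The same slot receives different indices in the two dialogues, so the model cannot memorise a fixed mapping from index to domain and has to rely on the remaining features.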
Then for each slot in each turn the input feature vector $v$ is formed by concatenating all sub-vectors:
$$v = v_{\text{basic}} \oplus v_{\text{system action}} \oplus v_{\text{user action}} \oplus v_{\text{index}}. \quad (4)$$
An example of $v$ for slot Hotel-Area is shown in Fig. 3 based on the dialogue history in Fig. 2. Examples of how the feature representation is constructed can be seen in Appendix D.
During training, the dropout rate is 0.1. We train our model on the MultiWOZ 2.1 dataset (Eric et al., 2020), consisting of dialogues between two humans, one posing as a user and the other as an operator. The dialogues in the dataset are complex because there may be more than one domain involved in one dialogue, even in the same turn. During training and testing with the dataset, the order of slots in the input list is derived from the data, which means slot $s_i$ is before slot $s_{i+1}$ if the user mentioned slot $s_i$ first. For inference without the dataset, such as when using TUS to train a dialogue policy, the order of slots is randomly generated.

Domain Prediction
We measure how well a US can fit the dataset by precision, recall, F1 score, and turn accuracy. The turn accuracy measures how many model predictions per turn are identical to the corpus, based on the oracle dialogue history.
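These metrics can be computed as in the sketch below, which represents the user action of each turn as a set of hashable items. Micro-averaging precision and recall over all turns is an assumption of this example; the text does not state how the scores are aggregated.

```python
def turn_metrics(predicted_turns, reference_turns):
    """Precision, recall, F1 and turn accuracy over aligned turns.

    Each turn is a set of items such as (intent, domain, slot, value).
    """
    true_pos = false_pos = false_neg = exact = 0
    for predicted, reference in zip(predicted_turns, reference_turns):
        true_pos += len(predicted & reference)
        false_pos += len(predicted - reference)
        false_neg += len(reference - predicted)
        exact += int(predicted == reference)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = exact / len(reference_turns) if reference_turns else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "turn_acc": accuracy}


# Toy example: the first predicted turn over-generates one slot.
predicted = [
    {("inform", "hotel", "area"), ("inform", "hotel", "price")},
    {("request", "hotel", "name")},
]
reference = [
    {("inform", "hotel", "area")},
    {("request", "hotel", "name")},
]
scores = turn_metrics(predicted, reference)
print(scores)
```

The toy example shows how over-generation keeps recall high while lowering precision and turn accuracy.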

Training Policies with USs
User simulators are designed to train dialogue systems, thus a better user simulator should result in a better dialogue system. We train different dialogue policies by proximal policy optimization (PPO) (Schulman et al., 2017), a simple and stable reinforcement learning algorithm, with ABUS, VHUS, and TUS as USs in the ConvLab-2 framework (Zhu et al., 2020). The policies are trained for 200 epochs, each of which consists of 1000 dialogues. The reward function gives a reward of 80 for a successful dialogue and of -1 for each dialogue turn, with the maximum number of dialogue turns set to 40. For failed dialogues, an additional penalty of -40 is applied. Each dialogue policy is trained on 5 random seeds. The dialogue policies are then evaluated using all USs by cross-model evaluation (Schatzmann et al., 2005) to demonstrate the generalisation ability of the policy trained with a particular US when evaluated with a different US.
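The reward scheme can be summarised in a small function that computes the total return of one dialogue. The sketch assumes that the per-turn penalty and the final success reward or failure penalty are simply summed; `dialogue_return` is a hypothetical helper name and not part of ConvLab-2.

```python
def dialogue_return(num_turns, success, max_turns=40):
    """Total undiscounted return of one dialogue under the described reward scheme."""
    turns = min(num_turns, max_turns)
    total = -1 * turns              # -1 for each dialogue turn
    total += 80 if success else -40  # success reward or failure penalty
    return total


print(dialogue_return(10, success=True))
print(dialogue_return(40, success=False))
```

Under these assumptions the return ranges from -80 for a failed dialogue that uses all 40 turns up to 79 for a dialogue that succeeds after a single turn.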

Leave-one-domain-out Training
To evaluate the ability of TUS in handling unseen domains, we remove one domain during supervised learning of TUS. The leave-one-domain-out TUSs are used to train dialogue policies with all possible domains. For example, TUS-noHotel is trained on the dataset without the "hotel" domain. During policy training, the user goal is generated randomly from all possible domains.
Some domains in MultiWOZ may share the same slots, such as the "restaurant" and "hotel" domains, which contain property-related slots, e.g. "area," "name," and "price range." However, the corpus also includes domains that are quite different from the rest, for example the "train" domain, which contains many time-related slots such as "arrival time" or "departure time", as well as unique slots such as "price" and "duration." The different properties of the domains will allow us to study the zero-shot transfer capability of the model.
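The data preparation for a leave-one-domain-out simulator can be sketched as a simple filter. The example assumes that every dialogue involving the held-out domain is dropped entirely and that each dialogue is annotated with the set of domains it covers; the toy records and field names are invented for illustration.

```python
def leave_one_domain_out(dialogues, held_out):
    """Keep only dialogues that do not involve the held-out domain."""
    return [d for d in dialogues if held_out not in d["domains"]]


def removed_share(dialogues, held_out):
    """Fraction of dialogues removed when holding out one domain."""
    removed = sum(1 for d in dialogues if held_out in d["domains"])
    return removed / len(dialogues) if dialogues else 0.0


# Toy corpus; identifiers and domain annotations are made up.
corpus = [
    {"id": "d1", "domains": {"hotel", "restaurant"}},
    {"id": "d2", "domains": {"train"}},
    {"id": "d3", "domains": {"hotel", "train"}},
    {"id": "d4", "domains": {"attraction"}},
]

no_hotel = leave_one_domain_out(corpus, "hotel")
print([d["id"] for d in no_hotel])
print(removed_share(corpus, "hotel"), removed_share(corpus, "train"))
```

Because a dialogue can span several domains, the removed shares of the individual domains can add up to more than the whole corpus.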

Human Evaluation
Following the setting in Kreyssig et al. (2018), we select 2 of the 5 trained versions of each dialogue policy for evaluation in a human trial: the version performing best on ABUS, and the version performing best in interaction with TUS. The results of the two versions are averaged. For each version we collect 200 dialogues, which means there are 400 dialogues for each policy in total. Dialogue policies trained with VHUS significantly underperform, so we only consider policies trained with ABUS or TUS for the human trial (see Table 1). The best and the worst policies in the leave-one-domain-out experiment are also included to see the upper and lower bound of the zero-shot domain generalisation performance.
Human evaluation is performed via DialCrowd (Lee et al., 2018) connected to Amazon Mechanical Turk. Users are provided with a randomly generated user goal and are required to interact with our systems in natural language.

Experimental Results

Cross-model Evaluation
The results of our experiments are shown in Table 1. The policy trained with TUS performs well when evaluated with ABUS, with 10% absolute improvement in the success rate over its performance on TUS. On the other hand, while a policy trained with ABUS performs almost perfectly when evaluated with ABUS, the performance drops significantly, by 35% absolute, when this policy interacts with TUS. This signals that, in the case of ABUS, the policy overfits to the US used for training, and is not able to generalise well to the behaviour of other USs. We found that VHUS is neither able to train nor to evaluate a multi-domain policy adequately. This was also observed in the experiments by Takanobu et al. (2019). We suspect that this is due to the fact that VHUS was designed to operate on a single domain and does not generalise well to the multi-domain scenario. To the best of our knowledge, no other data-driven US has been developed for the multi-domain scenario. The success rates of policies trained with ABUS and TUS during training, evaluated with both USs, are shown in Fig. 4. Each of the systems is trained 5 times on different random seeds. We report the average success rate as well as the standard deviation. Although the policy trained with TUS is more unstable when evaluated on ABUS, it still shows an improvement from the initial policy, converging at around 79%. On the other hand, the policy trained with ABUS and evaluated with TUS barely shows any improvement.

Impact of features and loss functions
We conduct an ablation study to investigate the usefulness of the proposed features and loss functions. The result is shown in Table 2. First, we measure the performance of the basic model which uses only the basic information feature $v_{\text{basic}}$, the system action feature $v_{\text{system action}}$, and the user action feature $v_{\text{user action}}$ as the input. While this model can have a high recall rate, the precision and the turn accuracy are fairly low. We deduce that without the index features the model cannot distinguish the difference between slots and therefore tends to select slots of the same slot type in one turn. For example, it provides all constraints in the first turn, which leads to high recall and over-generation. Analysis of the generated user actions shows that the basic model tends to mention four or more slots in the first turn. This is unnatural, since human users tend to only mention one or two slots at the beginning of a dialogue. More details about the average slots per turn can be found in Appendix B.
After adding the index feature $v_{\text{index}}$, the recall rate is decreased by 17% absolute, but the turn accuracy is increased by 35% absolute, along with improvements on the precision and the F1 score. Furthermore, the average number of slots per turn is closer to that of a real user. Although the recall rate with respect to the target in the data is decreased, this is not necessarily a concern since in dialogue there are many different plausible actions for a given context. For example, when searching for a restaurant, we may provide the information of the area first, or the food type. The order of communicating these constraints may vary. When we include the domain loss $\text{loss}_{\text{domain}}$ during training, both the recall rate and the turn accuracy improve while a similar average slot length per turn is maintained. These results indicate that the proposed ontology-independent index features can help the model to distinguish one slot from the other, which solves the over-generation problem of the basic model. The domain loss allows for more accurate prediction of the domain at turn level and the value for each slot at the same time.
Table 3: The success rates of dialogue policies trained with leave-one-domain-out TUSs. For example, the TUS-noAttr model is trained without the "attraction" domain. The sum of all removed data is more than 100% because some dialogues have multiple domains. We report results on all domains.

Zero-shot Transfer
We test the capability of the model to handle unseen domains in a zero-shot experiment. In a leave-one-domain-out fashion we remove dialogues involving one particular domain when training the US. The share of each domain in the total dialogue data ranges from 19.60% to 45.21%. During dialogue policy training we sample the user goal from all domains. As presented in Table 3, removing one domain from the training data when training the US does not dramatically influence the policy on the corresponding domain. The final performance of the policies trained with leave-one-domain-out TUSs is still reasonably comparable to the policy trained with the full TUS. This is especially noteworthy considering the substantial amount of data removed during US training and the difference between each domain.
We observe that the model is able to learn about the removed domain from the other domains, although the removed domain is different from the remaining ones. For example, the "train" domain is very different from "attraction", "restaurant", and "hotel", and it is more complex than "taxi", but TUS-noTrain still performs reasonably well on the "train" domain. This signals that the model can do zero-shot transfer by leveraging other domain information. The worst performance on the "train" domain happens instead when the "hotel" domain is removed, i.e. the domain with the most substantial amount of data.
Our results also show that some domains are more sensitive to data removal than others, irrespective of which domain is removed. This indicates that some domains are more involved and simply require more training data. This result demonstrates that TUS has the capability to handle new unseen domains without modifying the feature representation or retraining the model. It also shows that our model is sample-efficient.

Human Evaluation
The result of the human evaluation is shown in Table 4. In total, 156 users participated in the human evaluation. The number of interactions per user ranges from 10 to 80. The success rate measures whether the given goal is fulfilled by the system and the overall rating grades the system's performance from 1 star (poor) to 5 stars (excellent). TUS is able to achieve a success rate comparable to that of ABUS, without domain-specific information, and even scores slightly better in terms of overall rating. We were not able to observe any statistically significant differences between ABUS and TUS in the human evaluation. For leave-one-domain-out models, the performance of TUS-noAttr is similar to that of ABUS and TUS, without a statistically significant difference. We do however observe a statistically significant decrease in the success rate of TUS-noHotel when compared to TUS and ABUS (p < 0.05). This is unsurprising as the hotel domain accounts for 40.15% of the training data. For both TUS-noAttr and TUS-noHotel, the success rate on the domain "attraction" is comparable to TUS and ABUS, but the success rate on the domain "hotel" is relatively low. As observed in the simulation, removing a domain does not decrease the success rate in the corresponding domain as the feature representation is domain agnostic. Instead, it impacts domains which need plenty of data to learn.

Conclusion
We propose a domain-independent user simulator with transformers, TUS. We design ontology-independent input and output feature representations. TUS outperforms the data-driven VHUS and it has a comparable performance to the rule-based ABUS in cross-model evaluation. Human evaluation confirms that TUS can compete with ABUS even though ABUS is based on carefully designed domain-dependent rules. Our ablation study shows that the proposed features and loss functions are essential to model natural user behavior from data. Lastly, our zero-shot study shows that TUS can handle new domains without feature modification or model retraining, even with substantially fewer training samples.
In future work, we would like to learn the order of slots and add output language generation to make the behaviour of TUS more human-like. Applying reinforcement learning to this model would also be of interest.

A All System Intents
All system intents in the MultiWOZ 2.1 dataset are listed in Table 5, including 5 general intents and 9 domain-specific intents.

B Average Action Length in Each Turn
The average number of slots mentioned by TUS in each turn when interacting with the rule-based dialogue system is shown in Fig. 5. When the index feature $v_{\text{index}}$ and the domain loss $\text{loss}_{\text{domain}}$ are added, TUS can deal with the over-generation problem and behave more similarly to what is observed in the corpus.

C Success Rates of Leave-one-domain-out Training
The training success rates of dialogue policies trained with leave-one-domain-out TUSs, which are evaluated on TUS, are shown in Fig. 6. In comparison to the full TUS, the leave-one-domain-out TUSs are more unstable, but they can achieve a comparable success rate at the end.
For turn 0, $V^0$ only includes 4 vectors from the user goal. For turn 1, the system mentions slot Hotel-Price, which is not in the user goal, so the feature vector of slot Hotel-Price is inserted into $V^1$, where the first dimension of $v_{\text{domain index}}$ is 1 because domain Hotel is the first domain in this conversation and the third dimension of $v_{\text{slot index}}$ is 1 because it is the third slot in domain Hotel.
Comparing the feature vectors of slot Hotel-Area in turn 0, $v^0_1$, and turn 1, $v^1_1$, the sub-vectors $v_{\text{sys value}}$ and $v_{\text{spec}_0}$ are different because of the system's domain-specific action Recom(Hotel-Area=South).
The system also mentioned a general action, general-reqmore(), thus $v_{\text{gen}}$ is changed. In addition, this slot is first mentioned at turn 0, so $v_{\text{first}}$ is changed from 0 to 1. Similarly, $v_{\text{user action}}$ is also modified according to the user action. On the other hand, $v_{\text{user value}}$ is the same because the user does not update its goal, $v_{\text{type}}$ is not changed because the slot is still a constraint, and $v_{\text{ful}}$ is 0 because it has not been fulfilled yet. $v_{\text{domain index}}$ and $v_{\text{slot index}}$ also stay the same throughout the whole conversation.

E Example Dialogue Generated by TUS
An example dialogue with a multi-domain user goal is shown in Fig. 8. It shows that TUS is able to switch between different domains (from turn 2 to 6), respond to the system's requests, and generate multi-domain actions (in turn 5).