Controllable User Dialogue Act Augmentation for Dialogue State Tracking

Prior work has demonstrated that data augmentation is useful for improving dialogue state tracking. However, user utterances come in many types, while prior methods only considered the simplest one for augmentation, raising concerns about poor generalization capability. To better cover diverse dialogue acts and control the generation quality, this paper proposes controllable user dialogue act augmentation (CUDA-DST) to augment user utterances with diverse behaviors. With the augmented data, different state trackers gain improvement and show better robustness, achieving state-of-the-art performance on MultiWOZ 2.1.


Introduction
Dialogue state tracking (DST) serves as a backbone of task-oriented dialogue systems (Chen et al., 2017), where it aims at keeping track of user intents and associated information in a conversation. The dialogue states encapsulate the required information for the subsequent dialogue components. Hence, an accurate DST module is crucial for a dialogue system to perform successful conversations.
Recently, we have seen tremendous improvement in DST, mainly due to the curation of large datasets (Budzianowski et al., 2018; Eric et al., 2020; Rastogi et al., 2020) and many advanced models. These models can be broadly categorized into three types: span prediction, question answering, and generation-based models. The question answering models define natural language questions for each slot to query the model for the corresponding values (Li et al., 2021). TRADE (Wu et al., 2019) performs zero-shot transfer between multiple domains via slot-value embeddings and a state generator. SimpleTOD (Hosseini-Asl et al., 2020) combines all components in a task-oriented dialogue system with a pre-trained language model. Recently, TripPy (Heck et al., 2020) categorized value prediction into seven types and designed a different prediction strategy for each. This paper focuses on generalized augmentation covering all of these categories.
Another research line leverages data augmentation techniques to improve performance (Song et al., 2021; Yin et al., 2020; Summerville et al., 2020; Kim et al., 2021). Most prior work used simple augmentation techniques such as word insertion and state value substitution. With recent advances in pre-trained language models (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020), generation-based augmentation has been proposed (Kim et al., 2021). These methods have demonstrated impressive improvement and zero-shot adaptability (Yoo et al., 2020; Campagna et al., 2020), while our work focuses on data augmentation with in-domain data.
The closest work is CoCo, a framework that generates user utterances given augmented dialogue states. Examples are shown in Figure 1, where the main differences between CoCo and our approach are: 1) CoCo augments user utterances only at the slot and value levels, while dialogue acts and domains stay fixed, limiting the augmented data; our method can augment reasonable user utterances with diverse dialogue acts and domain-switching scenarios. 2) Boolean slots and referred slots are not handled by CoCo due to their higher complexity, while our approach handles all value types for better generalization.
This paper proposes CUDA-DST (Controllable User Dialogue Act augmentation), a generalized framework of generation-based augmentation for improving DST. Our contribution is 2-fold:
• We present CUDA, which generates diverse user utterances via controllable user dialogue act augmentation.
• Our augmented data helps most DST models achieve better performance and robustness.

Figure 1: An example dialogue context and the user utterances augmented by VS, CoCo, and CUDA.
[System]: Hello, how can I help you?
[User]: I need to find a restaurant in the center.
[System]: I recommend Pho Bistro, a popular restaurant in the center.
[User]: No, it needs to serve British food and I'd like a reservation for 18:00.
[VS]: No, it needs to serve Chinese food and I'd like a reservation for 17:00.
[CoCo]: No, it should serve Chinese food and I need to book a table for 2 people.
[CUDA]: Thank you, can you also find me a hotel with parking near the restaurant?

Controllable User Dialogue Act Augmentation (CUDA)
The goal of our method is to augment diverse user utterances that fit the dialogue context, so that the augmented data can help DST models learn better. More formally, given a system utterance $U^{sys}_t$ in turn $t$ and the dialogue history $H_{t-1}$ before this turn, our approach focuses on augmenting a user dialogue act and state, $\hat{A}_t$, and generating the corresponding user utterance $\hat{U}^{usr}_t$. Note that each user utterance can be augmented.
To achieve this goal, we propose CUDA with three components, illustrated in Figure 2: 1) a user dialogue act generation process for producing $\hat{A}_t$, 2) a user utterance generator for producing $\hat{U}^{usr}_t$, and 3) a state match filtering process.

User Dialogue Act Generation
Considering that a user dialogue act represents the core meaning of the user's behavior (Goo and Chen, 2018; Yu and Yu, 2021), we focus on simulating reasonable user dialogue acts given the system context for data augmentation. After analyzing task-oriented user utterances, we identify the following user dialogue acts:
1. Confirm: The system provides a recommendation to the user, and the user confirms whether to accept the recommended item.
2. Reply: The system asks for a user-desired value of a slot, and the user replies with the corresponding value.
3. Inform: The user directly informs the system of the desired slot values.
Heck et al. (2020) designed a dialogue state tracker that tackles utterances with different dialogue acts in different ways and achieved good performance, implying that different dialogue acts carry diverse interaction behaviors. To augment more diverse user utterances, we introduce a random process for each user dialogue act. Unlike the prior work CoCo, which does not generate utterances whose dialogue acts differ from the original one, our design can simulate diverse behaviors for better augmentation, as illustrated in Figure 2.
Confirm When the system provides recommendations, our augmented user behavior accepts the recommended values with a probability of $P_{confirm}$. When the user confirms the recommendation, the suggested slot values are added to the augmented user dialogue state $\hat{A}_t$, as shown in Figure 1. In the example, the augmented user dialogue act confirms the suggested restaurant and then includes it in the state (restaurant-name=pho bistro, restaurant-area=center).
Reply When the system requests a constraint for a specific slot, e.g., "which area do you prefer?", the user gives the value of the requested slot with a probability of $P_{reply}$. $P_{reply}$ may not be 1, because users sometimes revise their previous requests without providing the requested information.
Inform At any time in the conversation, the user can provide the desired slot values to convey his/her preferences. As shown in the original user utterance of Figure 1, the user rejects the recommendation and then directly informs the additional constraints (food and time). The number of additionally informed values is chosen randomly, and the slots and values are randomly sampled from the pre-defined ontology and dictionary. Note that the confirmed and replied information cannot be changed during additional informing. Considering that a user may change the domain within the dialogue, our algorithm allows the user to switch domains with a probability of $P_{domain}$, in which case the informed slots and values are sampled from the new domain's dictionary. The new domain is selected randomly from all other domains.

[Figure excerpt, candidate augmented utterances: 2. Thank you, can you also find me a hotel without parking near the restaurant? 3. Thank you, can you also find me a hotel with parking in the center of the town? 4. Thank you, can you also find me a hotel with free wifi near the restaurant?]
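The stochastic act generation described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the ontology format, and the act representation are hypothetical simplifications, not the authors' implementation.

```python
import random

# Illustrative probabilities matching the paper's Confirm/Reply/Inform process.
P_CONFIRM, P_REPLY, P_DOMAIN = 0.7, 0.9, 0.8

def generate_user_act(system_act, ontology, rng=random):
    """Build an augmented user dialogue state from the system's act.

    system_act: {"domain": str,
                 "recommend": {slot: value},   # system recommendations
                 "request": [slot, ...]}       # slots the system asked about
    ontology:   {domain: {slot: [candidate values]}}
    """
    state = {}
    # Confirm: accept each recommended value with probability P_confirm.
    for slot, value in system_act.get("recommend", {}).items():
        if rng.random() < P_CONFIRM:
            state[slot] = value
    # Reply: answer each requested slot with probability P_reply.
    domain = system_act["domain"]
    for slot in system_act.get("request", []):
        if rng.random() < P_REPLY:
            state[slot] = rng.choice(ontology[domain][slot])
    # Inform: optionally switch to another domain with probability P_domain,
    # then add a random number of extra constraints sampled from the ontology,
    # never overwriting confirmed or replied slots.
    if rng.random() < P_DOMAIN:
        others = [d for d in ontology if d != domain]
        if others:
            domain = rng.choice(others)
    free_slots = [s for s in ontology[domain] if s not in state]
    for slot in rng.sample(free_slots, k=rng.randint(0, len(free_slots))):
        state[slot] = rng.choice(ontology[domain][slot])
    return state
```

Passing an explicit `rng` keeps the augmentation reproducible, and the probabilities can be tuned to simulate different user behaviors, as in Section "Experimental Setting".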

User Utterance Generation
Coreference Augmentation In the generated user dialogue act and state, all informed slot values come from the pre-defined dictionary. However, it is natural for a user to refer to previously mentioned information, e.g., "I am looking for a taxi that can arrive by the time of my reservation". To further enhance the capability of handling coreference, our algorithm switches a slot value in the generated user dialogue state to a referring phrase with a probability of $P_{coref}$. Since not all slots can be referred to, we define a coreference list containing all referable slots and the corresponding referring phrases, e.g., "the same area as", listed in Appendix A.
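A minimal sketch of the coreference switch is shown below. The coreference list here is a hypothetical two-entry excerpt (the full list is given in Appendix A), and the `mentioned_slots` argument stands in for tracking which information already appeared earlier in the dialogue.

```python
import random

P_COREF = 0.6

# Hypothetical excerpt of the coreference list: referable slot -> referring phrase.
COREF_PHRASES = {
    "hotel-area": "the same area as the restaurant",
    "taxi-arriveby": "the time of my reservation",
}

def apply_coreference(state, mentioned_slots, rng=random):
    """With probability P_coref, replace a referable slot value with a
    referring phrase, provided the referred information was already
    mentioned earlier in the dialogue."""
    new_state = dict(state)
    for slot in state:
        if slot in COREF_PHRASES and slot in mentioned_slots:
            if rng.random() < P_COREF:
                new_state[slot] = COREF_PHRASES[slot]
    return new_state
```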
With the generated user dialogue acts and the system action, we form the corresponding turn-level dialogue act and state based on the confirmed suggestions and referred slot values, as shown in the green block of Figure 2.

Utterance Generation
To generate the user utterance associated with the augmented user dialogue act and state, we adopt a pre-trained T5 (Raffel et al., 2020) and fine-tune it on the MultiWOZ dataset with the language modeling objective formulated below:
$$p(U^{usr}_t \mid H_{t-1}, U^{sys}_t, A_t) = \prod_k p(U^{usr}_{t,k} \mid U^{usr}_{t,<k}, H_{t-1}, U^{sys}_t, A_t),$$
where $U^{usr}_{t,k}$ denotes the $k$-th token in the user utterance, $H_{t-1}$ represents all dialogue history before turn $t$, and $A_t$ is the user dialogue act and state in the $t$-th turn. With the trained generator, we can generate the augmented user utterance by inputting the augmented user dialogue act and state $\hat{A}_t$, as shown in the green block of Figure 2. In decoding, we apply beam search so that we can augment diverse utterances for improving DST.
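The exact serialization of $(H_{t-1}, U^{sys}_t, \hat{A}_t)$ into the T5 input sequence is not specified above, so the sketch below shows one plausible linearization; the function name and the field prefixes are our own assumptions.

```python
def linearize_input(history, system_utt, act_state):
    """Flatten dialogue context and the augmented user dialogue act/state
    into a single string for a text-to-text generator such as T5.

    history:   list of previous turns as plain strings
    act_state: {slot: value} for the augmented user dialogue state
    """
    acts = " ; ".join(f"{slot} = {value}" for slot, value in act_state.items())
    turns = " ".join(history)
    return f"history: {turns} system: {system_utt} acts: {acts}"
```

At inference time, this string would be tokenized and decoded with beam search, e.g., `model.generate(..., num_beams=k, num_return_sequences=k)` in Hugging Face Transformers, to obtain multiple diverse candidate utterances.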

State Match Filtering
To make sure the generated user utterance faithfully reflects its dialogue state, we propose two modules to check state matching: a slot appearance classifier and a value consistency filter, where the former checks whether the given slots are included, and the latter ensures value consistency between dialogue states and user utterances.
Slot Appearance Following Li et al., we employ a BERT-based multi-label classification model to predict whether each slot appears in the given $t$-th turn. Augmented user utterances are eliminated if, according to the model's predictions, they do not contain all slots in the user dialogue state.
Value Consistency We apply BERT (Devlin et al., 2019) to encode the $t$-th turn in a dialogue as
$$R_t = \text{BERT}(U^{sys}_t, U^{usr}_t),$$
where $R^{CLS}_t$ denotes the output of the [CLS] token, which can be considered a summary of turn $t$. We then obtain the probability of the value types as
$$P^{bool}_s = \text{softmax}(W^{bool}_s R^{CLS}_t + b^{bool}_s)$$
for each boolean slot $s$, and
$$P^{span}_s = \text{softmax}(W^{span}_s R^{CLS}_t + b^{span}_s)$$
for each span-based slot $s$. Our multi-task BERT-based slot-gate classifier is trained with the cross-entropy loss. The neural filters are trained on the original MultiWOZ data, and the slot-level prediction performance (for both appearance and value consistency) is 92.9% F1 on the development set. In our CUDA framework, we apply the trained filters to ensure the quality of the augmented user utterances, as shown in Figure 2.
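The overall filtering decision can be sketched as follows. Here `predicted_slots` stands in for the output of the BERT slot appearance classifier, and a naive substring check replaces the trained slot-gate value consistency classifier (which, unlike this sketch, also covers boolean and referred slots); the function name is hypothetical.

```python
def passes_filter(utterance, aug_state, predicted_slots):
    """Keep an augmented utterance only if it matches its dialogue state.

    utterance:       generated user utterance (str)
    aug_state:       {slot: value} augmented user dialogue state
    predicted_slots: slots the appearance classifier predicts for this turn
    """
    # Slot appearance: every slot in the augmented state must be predicted
    # to appear in the utterance.
    if not set(aug_state) <= set(predicted_slots):
        return False
    # Value consistency (simplified): every value should be recoverable from
    # the text; the paper's slot-gate classifier handles this neurally.
    return all(str(value).lower() in utterance.lower()
               for value in aug_state.values())
```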

Experiments
To evaluate whether our augmented data is beneficial for improving DST models, we perform experiments with three popular DST models: TripPy, TRADE, and SimpleTOD.

Experimental Setting
Our CUDA generator is trained on the training set of MultiWOZ 2.3 (Han et al., 2020) due to its additional coreference labels. Note that all dialogues are the same as in MultiWOZ 2.1. We then generate the augmented dataset for the training set of MultiWOZ 2.1 for fair comparison with prior work. The pre-defined slot-value dictionary is taken from CoCo's out-of-domain dictionary, and the defined coreference list is shown in Appendix A.
In user dialogue act generation, the parameters are set as $(P_{confirm}, P_{reply}, P_{domain}, P_{coref}) = (0.7, 0.9, 0.8, 0.6)$, which can be flexibly adjusted to simulate different user behaviors. We report the distribution of slot types in our augmented data and the original MultiWOZ data in Table 1, which shows that our augmented slots cover diverse slot types with a distribution reasonably similar to the original MultiWOZ. Unlike the prior work CoCo, which only tackled span-based slots, our augmented data may better reflect natural conversational interactions. Additionally, we perform CUDA with $P_{coref} = 0$ to check the impact of coreference augmentation.
We train three DST models on the augmented data and evaluate the results using joint goal accuracy. The compared augmentation baselines include value substitution (VS) and CoCo with the same setting. Table 2 shows that CUDA significantly improves TripPy and TRADE results by 3.6% and 0.8% respectively on MultiWOZ, and even outperforms the prior work CoCo. In addition, our CUDA augmentation process has a 78% success rate, while CoCo only has 57%, demonstrating the efficiency of our augmentation method and its greater data utility. Interestingly, CUDA without coreference achieves slightly better performance for TripPy, while the performance of TRADE and SimpleTOD degrades.

Robustness to Rare Cases
We also evaluate our models on CoCo+ (rare) 2 , a test set generated by CoCo's algorithm, to examine model robustness under rare scenarios. Table 3 presents the results on CoCo+ (rare), which focuses on rare cases for validating the models' robustness. The model trained on our augmented data clearly shows better generalization than the one trained on the original MultiWOZ data, demonstrating the effectiveness of our method in improving the robustness of DST models. The performance of CoCo is listed only for reference, because comparing against its self-generated data would be unfair.

Slot Performance Analysis
To further investigate the efficacy for each slot type, Figure 3 presents the per-slot performance gain on TripPy. Compared with CoCo, CUDA improves more on informed, refer, and dontcare slots. This implies that CUDA augments diverse user dialogue acts that help the informed and refer slots, and that the proposed slot-gate better ensures value consistency for improving dontcare slots, even though these are rare cases in MultiWOZ. Our model also maintains the same performance on frequent span slots, demonstrating strong generalization across diverse slot types thanks to our controllable augmentation. A qualitative study can be found in Appendix B.
2 CoCo+ (rare) applies CoCo and value substitution (VS) with a rare slot-combination dictionary.
Figure 3: Performance gain across slots on TripPy.

Conclusion
We introduce a generalized data augmentation method for DST based on utterance generation with controllable user dialogue act augmentation. Experiments show that our approach improves the results of multiple state trackers and achieves state-of-the-art performance on MultiWOZ 2.1. Further study demonstrates that trackers' robustness and generalization capabilities can be improved by diverse generation covering different user behaviors.

A Reproducibility
Our CUDA generator is trained on the training set of MultiWOZ 2.3 (Han et al., 2020) due to its additional coreference labels. Note that all dialogues are the same as in MultiWOZ 2.1. We then generate the augmented dataset using CUDA for the training set of MultiWOZ 2.1 for fair comparison with prior work. The pre-defined slot-value dictionary is taken from CoCo's out-of-domain dictionary, shown in Table 4, and the defined coreference list is shown in Table 5.