Leveraging Slot Descriptions for Zero-Shot Cross-Domain Dialogue State Tracking

Zero-shot cross-domain dialogue state tracking (DST) enables us to handle task-oriented dialogue in unseen domains without the expense of collecting in-domain data. In this paper, we propose a slot description enhanced generative approach for zero-shot cross-domain DST. Specifically, our model first encodes dialogue context and slots with a pre-trained self-attentive encoder, and generates slot values in an auto-regressive manner. In addition, we incorporate Slot Type Informed Descriptions that capture the shared information across slots to facilitate cross-domain knowledge transfer. Experimental results on the MultiWOZ dataset show that our proposed method significantly improves existing state-of-the-art results in the zero-shot cross-domain setting.


Introduction
Task-oriented dialogue systems are designed to assist users in performing daily activities, such as restaurant booking, travel planning, and online shopping. These virtual assistants provide natural language interfaces to services and online APIs (Rastogi et al., 2020). Based on users' needs, these systems frequently require support for new domains. However, current state-of-the-art systems require a substantial amount of in-domain data to properly model a new domain. The data-collection process is both expensive and time-consuming, and thus it is very important to study methods that can build robust and scalable dialogue systems using little to no in-domain data.
Dialogue state tracking (DST) is an essential component of task-oriented dialogue systems that tracks users' requirements over multi-turn conversations. A popular formulation represents the dialogue state as a list of slot-value pairs. In DST, tracking unseen slots in a new domain, a.k.a. zero-shot domain adaptation, is a significant challenge, since the model has never seen in-domain training samples. There are two main lines of work that tackle this problem. The first proposes domain-transferable models using copy mechanisms or ontology graph information (Zhou and Small, 2019). A limitation of such models is that they may not fully leverage pre-trained language models due to their specialized architectures. The second line of work uses slot descriptions as input to the model to facilitate slot understanding (Rastogi et al., 2020). However, the provided slot descriptions are collected from crowdsourced human annotators and may be inconsistent across domains. In general, the optimal approach for constructing slot descriptions in zero-shot settings remains unexplored.

* Work done during internship at Facebook.
In this work, we tackle the challenge of zero-shot cross-domain DST by leveraging large-scale pre-trained sequence-to-sequence (seq2seq) models and effective encodings of slot descriptions. We first introduce a generative DST model called T5DST, which models the relation of a slot and its dialogue context with a self-attentive encoder, and generates the slot value with a decoder in an autoregressive manner. This simple design allows us to effectively incorporate a pre-trained seq2seq model (e.g., T5 (Raffel et al., 2020)) without any task-specific modification. To further enhance the model's cross-domain transferability, we propose Slot Type Informed Descriptions that capture the shared information of different slots. Experimental results on the MultiWOZ benchmark (Budzianowski et al., 2018) suggest that 1) our model achieves significantly higher joint goal accuracy than existing models in zero-shot cross-domain DST; and 2) models using the proposed slot description formulation substantially outperform those using other slot description variants. Our contributions are summarized as follows:
• We propose a simple yet novel generative DST model based on T5 that significantly improves existing zero-shot cross-domain DST results;
• We investigate the effectiveness of different slot description formulations. To the best of our knowledge, this is the first work that comprehensively studies the effectiveness of slot descriptions in zero-shot cross-domain DST.

Related Work
Dialogue state tracking has been of broad interest to the dialogue research community (Williams and Young, 2007; Williams et al., 2014; Heck et al., 2020; Wu et al., 2020). Current state-of-the-art models (Heck et al., 2020; Hosseini-Asl et al., 2020; Ye et al., 2021; Li et al., 2020) trained with extensive annotated data have shown promising performance in complex multi-domain conversations (Budzianowski et al., 2018). However, collecting large amounts of data for every new domain is costly. Gao et al. (2019, 2020) formulated DST as a question answering problem by casting slot names into questions. However, these works did not demonstrate the effectiveness of slot descriptions by comparing the performance of models with and without them, and there is no study on how to construct slot descriptions.
In this paper, we aim to fill this research gap by providing an empirical study on the different slot description formulations.

T5DST
The design of our model follows generative question answering models. As illustrated in Figure 1, given a dialogue history consisting of an alternating set of utterances from two speakers, we add the "user:" and "system:" prefixes to the user and system utterances respectively. All the utterances and the slot name s_i are then concatenated into a single sequence, i.e., user:U_1 ... system:R_{t-1} user:U_t [sep] s_i. This sequence is used as the input to the encoder, and the decoder generates the corresponding slot value v_i. The learning objective of this generation process is to minimize the negative log-likelihood of v_i given the dialogue context C_t and slot name s_i, that is,

$\mathcal{L} = -\sum_{i=1}^{n} \log P(v_i \mid C_t, s_i),$

where n is the number of slots to be tracked. We initialize the model parameters with T5 (Raffel et al., 2020), an encoder-decoder Transformer with relative position embeddings (Shaw et al., 2018) pre-trained on a massive amount of English text. We denote our model as T5DST. To incorporate slot descriptions into T5DST, we replace the slot name with its corresponding slot description in the model input.

Table 2: Zero-shot cross-domain results on MultiWOZ 2.0. We run each experiment three times with different random seeds and report the mean and standard deviation. Note that the reported average zero-shot joint goal accuracy is not comparable to multi-domain joint goal accuracy. *Result from Campagna et al. (2020).
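As a rough sketch of the input serialization described above (function and variable names here are illustrative, not from the paper's code):

```python
# Sketch of the T5DST encoder-input serialization: prefixed utterances
# concatenated with the slot name after a [sep] token.

SEP = "[sep]"

def serialize(history, slot_name):
    """Flatten a dialogue history into the encoder input string.

    history: list of (speaker, utterance) pairs, speaker in {"user", "system"}.
    slot_name: e.g. "restaurant-food", or a slot description in its place.
    """
    turns = [f"{speaker}: {utt}" for speaker, utt in history]
    return " ".join(turns) + f" {SEP} {slot_name}"

history = [
    ("user", "I need a cheap place to eat."),
    ("system", "What type of food would you like?"),
    ("user", "Italian, please."),
]
print(serialize(history, "restaurant-food"))
# -> user: I need a cheap place to eat. system: What type of food
#    would you like? user: Italian, please. [sep] restaurant-food
```

The decoder then generates the slot value (e.g., "italian") conditioned on this sequence.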

Slot Type Informed Descriptions
Although different slots may have distinct names, they can share the same slot type. As shown in Table 1, for example, arrive by and leave at are both time slots, while departure and destination are both location slots. We therefore construct Slot Type Informed Descriptions that use a shared description template for all slots of the same type.
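A minimal sketch of this idea, assuming a hand-built slot-to-type mapping; the type assignments and template wordings below are illustrative assumptions, not the paper's exact descriptions:

```python
# Hypothetical slot-type mapping and per-type description templates.
# Slots of the same type (time, location, number) share one template.

SLOT_TYPES = {
    "arrive by": "time", "leave at": "time",
    "departure": "location", "destination": "location", "area": "location",
    "book stay": "number", "book people": "number",
}

TEMPLATES = {
    "time": "{slot} time of the {domain}",
    "location": "{slot} location of the {domain}",
    "number": "number of {slot} for the {domain}",
}

def slot_type_description(domain, slot):
    slot_type = SLOT_TYPES.get(slot)
    if slot_type is None:  # fall back to a naive description for untyped slots
        return f"{slot} of the {domain}"
    return TEMPLATES[slot_type].format(slot=slot, domain=domain)

print(slot_type_description("train", "arrive by"))  # arrive by time of the train
print(slot_type_description("hotel", "name"))       # name of the hotel
```

Because the template depends only on the type, an unseen domain's time slot is described in the same way as the time slots seen during training.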

Dataset and Evaluation
We evaluate the proposed method on the MultiWOZ 2.0 dataset (Budzianowski et al., 2018), which has 7 domains. We use the pre-processing and evaluation setup of prior work, where the restaurant, train, attraction, hotel, and taxi domains are used for training, as the test set only contains these 5 domains.
In the zero-shot cross-domain experiments, the models are first trained with four domains and then evaluated on the test-set of the unseen domain. Joint goal accuracy is used to evaluate the performance of the models. The generated dialogue states are considered to be correct if and only if all of the predicted values exactly match the oracle values.
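The joint goal accuracy metric described above can be sketched as follows (a minimal illustration; the helper name is ours, not the official evaluation script):

```python
# Joint goal accuracy: a turn counts as correct only if every predicted
# slot value exactly matches the oracle value for that turn.

def joint_goal_accuracy(predictions, oracles):
    """predictions/oracles: lists of per-turn {slot: value} dicts."""
    correct = sum(pred == gold for pred, gold in zip(predictions, oracles))
    return correct / len(oracles)

preds = [{"hotel-area": "north", "hotel-stars": "4"},
         {"hotel-area": "north", "hotel-stars": "3"}]
golds = [{"hotel-area": "north", "hotel-stars": "4"},
         {"hotel-area": "north", "hotel-stars": "4"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```

Note the all-or-nothing nature of the metric: a single wrong slot (hotel-stars in the second turn) makes the whole turn incorrect.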

Implementation
We implement T5DST based on the T5-small model (60M parameters), which has 6 encoder and decoder layers and a hidden size of d_model = 512. All models are trained using the AdamW optimizer (Loshchilov and Hutter, 2018) with an initial learning rate of 0.0001. In all zero-shot cross-domain experiments, we train the models with batch size 128 for 5 epochs. For the few-shot experiments, the models are first trained on 4 domains for 5 epochs, then fine-tuned with 1%, 5%, and 10% of the target-domain data for 10 epochs. For full-shot training, we train our model for at most 10 epochs with batch size 64 and early stop according to the loss on the validation set. Other hyperparameters are the same as in the zero-shot cross-domain setting. We use 8 NVIDIA V100 GPUs for all of our experiments, and greedy decoding at test time.

Table 3: Few-shot experimental results on MultiWOZ 2.0. We evaluate our proposed model with 1%, 5%, and 10% in-domain data against TRADE and DSTQA (Zhou and Small, 2019).

DSTQA. Dialogue state tracking via question answering over an ontology graph (Zhou and Small, 2019).

SimpleTOD++. SimpleTOD (Hosseini-Asl et al., 2020) uses a single causal language model, GPT-2 (Radford et al., 2019), to generate the dialogue states. To adapt this model to the zero-shot cross-domain setting, we also provide the slot name as model input. We denote this variant as SimpleTOD++.
Naive. Simple transformation of the slot name from "domain-slot" to "[slot] of the [domain]".
Slot Value. Following recent works (Zhang et al., 2019; Rastogi et al., 2020), we append the candidate values of categorical slots to the slot description.

Question. Similar to Gao et al. (2019, 2020), we reformulate the slot into a natural language question, i.e., "What is the [slot] of the [domain] that the user is interested in?".
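The description variants above can be sketched as simple string templates. This is an illustration under our own assumptions: the exact "slot value" format (the value list and its joining) is hypothetical, while the naive and question templates follow the wording given above.

```python
# Handcrafted slot-description variants compared in this section.

def naive(domain, slot):
    # "[slot] of the [domain]"
    return f"{slot} of the {domain}"

def with_values(domain, slot, values):
    # Hypothetical "slot value" variant: append candidate values.
    return f"{naive(domain, slot)} is {' or '.join(values)}"

def question(domain, slot):
    # Question-style reformulation of the slot.
    return f"What is the {slot} of the {domain} that the user is interested in?"

print(naive("restaurant", "area"))
# area of the restaurant
print(with_values("restaurant", "area", ["centre", "north", "south"]))
# area of the restaurant is centre or north or south
print(question("restaurant", "area"))
# What is the area of the restaurant that the user is interested in?
```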

Zero-Shot Cross-Domain
The results of the zero-shot cross-domain experiments are shown in Table 2. Overall, T5DST achieves significantly higher average joint goal accuracy than the three baseline models TRADE, SUMBT, and SimpleTOD++. These results demonstrate that our model can effectively capture the slot-context relation, and thus generalizes better to unseen domains.
Replacing slot names with human-annotated slot descriptions does not improve zero-shot performance. This might be because of the diverse and inconsistent human descriptions across domains. For example, the human descriptions of attraction-area and restaurant-area are "area to search for attractions" and "area or place of the restaurant" respectively. Such inconsistent descriptions increase the difficulty of slot understanding in the zero-shot learning setting. The model using naive slot descriptions gives similar performance to the one using the original slot names, as the two approaches lead to similar semantic representations of the slots. In contrast, incorporating slot values hurts learning, leading to lower joint goal accuracy in the restaurant domain. We observe that even though adding value candidates improves some of the categorical slots (e.g., restaurant-area, 68.35% → 82.25% slot accuracy), it hurts the unseen non-categorical slots (e.g., restaurant-food, 40.63% → 26.10% slot accuracy). These non-categorical slots are usually the bottleneck of joint goal accuracy. Finally, models trained with question-style descriptions improve performance in some domains but fail in others.
Our proposed slot type informed descriptions consistently improve the zero-shot performance of T5DST in all domains, yielding an average improvement of 2% joint goal accuracy over the human-labeled and naive description formulations. This result indicates that slot type information may better capture the shared properties (e.g., time, location) of different slots, thus facilitating domain knowledge transfer for DST. Figures 3 and 4 show the slot accuracy of the models using the Naive and Slot Type descriptions. Compared to the naive descriptions, we observe significant gains on time slots (e.g., arrive by and leave at), location slots (e.g., departure and destination), and number slots (e.g., book stay and book people) from adding slot type information. We conjecture that explicit information about the target value (i.e., the slot type) is important in the low-resource condition, where the model does not have enough data to capture the semantic meaning of a new slot.

Few-Shot Cross-Domain
We further conduct experiments in the few-shot cross-domain setting, as in prior work (Zhou and Small, 2019), where the models are first trained on 4 domains and then fine-tuned with 1%, 5%, and 10% of the target-domain data. As shown in Table 3, our model outperforms the DSTQA model in 4 out of 5 domains. Moreover, our approach is more practical in a real-world learning scenario, as it does not require the supervision of a full ontology graph. We also conduct full-shot experiments and compare our model with previous methods; the results are reported in Appendix A.2.
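The few-shot data selection described above (fine-tuning on a small random fraction of target-domain dialogues) can be sketched as follows; the dialogue IDs and the sampling helper are illustrative, not the paper's actual data pipeline:

```python
# Sample a fixed fraction of target-domain dialogues for few-shot fine-tuning.
import random

def sample_few_shot(dialogue_ids, fraction, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = max(1, int(len(dialogue_ids) * fraction))
    return rng.sample(dialogue_ids, k)

ids = [f"MUL{i:04d}" for i in range(1000)]  # hypothetical dialogue IDs
subset = sample_few_shot(ids, 0.01)         # 1% of target-domain dialogues
print(len(subset))                          # 10
```

Sampling whole dialogues (rather than individual turns) keeps each few-shot training example a coherent multi-turn conversation.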

Conclusion
In this paper, we propose leveraging large-scale pre-trained models with an effective slot description formulation to tackle the zero-shot cross-domain DST challenge. Specifically, we propose T5DST, a novel generative DST model based on the T5 language model, and incorporate Slot Type Informed Descriptions to facilitate cross-domain knowledge transfer. In evaluations on the MultiWOZ dataset, our approach substantially improves over existing results in both the zero-shot and few-shot settings.

A.1 Slot Type Informed Description Construction
We construct the slot type informed descriptions by assigning each slot a type (e.g., time, location, number) and applying a shared description template to all slots of that type.

A.2 Full Shot Results
To understand the full-shot performance of our T5DST model, and whether slot descriptions are still helpful when there is enough training data, we also conduct experiments in the full-data setting. As shown in Table 5, using slot descriptions only improves the joint goal accuracy by 0.56% on MultiWOZ 2.0 and 0.30% on MultiWOZ 2.1, which indicates that the descriptions are less effective when there is a large amount of training data. Compared to prior models with zero-shot capability, T5DST shows promising performance. Compared to other state-of-the-art models optimized for full-shot training, our model achieves competitive results on MultiWOZ 2.0 but inferior results on MultiWOZ 2.1. We note that there are many training strategies (e.g., token masking (Heck et al., 2020)), additional supervision (e.g., a full ontology), and label cleaning strategies (Heck et al., 2020) that may impact the final full-shot result. We also expect higher performance with a larger T5 model, such as T5-base or T5-large. However, achieving state-of-the-art results in full-shot training is beyond the scope of this work.