Zero-Shot Dialogue State Tracking via Cross-Task Transfer

Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the cross-task knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle none value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.


Introduction
Virtual assistants are designed to help users perform daily activities, such as travel planning, online shopping and restaurant booking. Dialogue state tracking (DST), as an essential component of these task-oriented dialogue systems, tracks users' requirements throughout multi-turn conversations as dialogue states, which are typically in the form of a list of slot-value pairs. Training a DST model often requires extensive annotated dialogue data. These data are often collected via a Wizard-of-Oz (Woz) (Kelley, 1984) setting, where two workers converse with each other and annotate the dialogue states of each utterance (Wen et al., 2017; Budzianowski et al., 2018), or with a Machines Talking To Machines (M2M) framework (Shah et al., 2018), where dialogues are synthesized via system and user simulators (Campagna et al., 2020; Lin et al., 2021b). However, both of these approaches have inherent challenges when scaling to large datasets. For example, the data collection process in a Woz setting incurs expensive and time-consuming manual annotations, while M2M requires exhaustive hand-crafted rules for covering various dialogue scenarios.
* Work done during internship at Facebook.
In industrial applications, virtual assistants are required to add new services (domains) frequently based on users' needs, but collecting extensive data for every new domain is costly and inefficient. Therefore, performing zero-shot prediction of dialogue states is becoming increasingly important since it does not require the expense of data acquisition. There are mainly two lines of work in the zero-shot transfer learning problem. The first is cross-domain transfer learning (Lin et al., 2021a), where the models are first trained on several domains and then applied zero-shot to new domains. However, these methods rely on a considerable amount of DST data to cover a broad range of slot types, and it is still challenging for the models to handle new slot types in the unseen domain. The second line of work leverages machine reading question answering (QA) data to facilitate low-resource DST (i.e., cross-task transfer). However, this approach relies on two independent QA models, i.e., a span extraction model for non-categorical slots and a classification model for categorical slots, which hinders knowledge sharing across the different types of QA datasets. Furthermore, unanswerable questions are not considered during the QA training phase. Therefore, in a zero-shot DST setting, such a model is not able to handle "none" value slots (e.g., unmentioned slots) that are present in the dialogue state.
Figure 1: A high-level representation of the cross-task transfer for zero-shot DST (best viewed in color). During the QA training phase (top figure), the unified generative model (T5) is pre-trained on QA pairs of extractive questions (blue), multiple-choice questions (purple), and negative extractive questions (green). At inference time for zero-shot DST (bottom figure), the model predicts slot values as answers for synthetically formulated extractive questions (for non-categorical slots) and multiple-choice questions (for categorical slots). Note that the negative QA training allows the model to effectively handle "none" values for unanswerable questions.
In this paper, to address the above challenges, we propose TransferQA, a unified generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework (Raffel et al., 2020; Khashabi et al., 2020). Such a design not only allows the model to leverage both extractive and multi-choice QA datasets, but also provides a simple unified text-to-text interface for tracking both categorical slots and non-categorical slots. To handle the "none" value slots in a zero-shot DST setting, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which simulate the out-of-domain slots and in-domain unmentioned slots in multi-domain DST. We evaluate our approach on two large multi-domain DST datasets: MultiWoz (Budzianowski et al., 2018) and Schema-Guided Dialogue (SGD). The experimental results suggest that our proposed model, without using any DST data, achieves a significantly higher joint goal accuracy compared to previous zero-shot DST approaches. Our contributions are summarized as follows:
• We propose TransferQA, the first model that performs domain-agnostic DST without using any DST training data;
• We introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle "none" value slots in the zero-shot DST setting;
• We demonstrate the effectiveness of our approach on two large multi-domain DST datasets. Our model achieves 1) state-of-the-art zero-shot and few-shot results on MultiWoz and 2) competitive performance compared to a fully trained baseline on the SGD dataset.

Text-to-Text Transfer Learning for DST
In multi-choice QA, each sample consists of a context passage $C$, a question $q_i$, multiple answer candidates $A = \{a_1, a_2, \ldots, a_n\}$, and the correct answer $a_i$. In extractive QA, answer candidates are not available, and $A$ becomes the empty set, $A = \emptyset$. Therefore, in QA training, the models learn to predict the answer $a_i$ to a question $q_i$ by reading the context passage $C$ and answer candidates $A$ (if available), while in DST inference, the models predict the value $a_i$ of a slot $q_i$ by reading the dialogue history $C$ and value candidates $A$ (in categorical slots).
Figure 2: Negative sampling strategy for adding unanswerable questions to the training. Given a passage, we randomly sample a question from other passages and train the QA model (T5) to predict "none".
QA Training. As illustrated in Figure 1, we prepend special prefixes to each input source. For instance, in multi-choice QA, "Multi-Choice Question:" is added to the question sequence, and "Choices:" is added to the answer candidates sequence, where each candidate is separated by a special token "[sep]". All the input sources are concatenated into a single sequence as input to a sequence-to-sequence (Seq2Seq) model. Then, the model generates the correct answer $a_i$ token by token.
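The input serialization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the "Multi-Choice Question:", "Extractive Question:" and "Choices:" prefixes and the "[sep]" separator come from the text, while the "Context:" prefix, the ordering of the parts, and the function name `build_qa_input` are assumptions made for this example.

```python
from typing import List, Optional

SEP = " [sep] "  # candidate separator named in the text


def build_qa_input(question: str, context: str,
                   choices: Optional[List[str]] = None) -> str:
    """Serialize one QA sample into a single source string for a Seq2Seq model.

    The "Context:" prefix and the question/choices/context ordering are
    assumptions for illustration; the paper only names the question and
    choices prefixes.
    """
    if choices:
        parts = [f"Multi-Choice Question: {question}",
                 "Choices: " + SEP.join(choices)]
    else:
        parts = [f"Extractive Question: {question}"]
    parts.append(f"Context: {context}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_qa_input("Which color is the car?", "The car is red.",
                         ["red", "blue"]))
```

The target side of each training pair would simply be the answer string (or "none" for an unanswerable question), which keeps extractive and multi-choice samples in one text-to-text format.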
It is worth noting that some of the questions $q_i$ are unanswerable given the context. In these cases, $a_i$ = none. The training objective of our QA model is minimizing the negative log-likelihood of $a_i$ given $q_i$, $A$ and $C$, that is

$$\mathcal{L} = -\sum_{i} \log p(a_i \mid q_i, A, C).$$

We initialize the model parameters with T5 (Raffel et al., 2020), an encoder-decoder Transformer with relative position embeddings (Shaw et al., 2018). The model is pre-trained on 750GB of clean and natural English text with a masked language modeling objective (masking out 15% of input spans, then predicting the missing spans using the decoder).
DST Zero-Shot. In DST, we consider tracking a slot value as finding the answer to a slot question from a dialogue history. Therefore, we first formulate all the slots as natural language questions, with each question roughly following the format "what is the <slot> of the <domain> that user wants?". The context is a dialogue history which consists of an alternating sequence of utterances from two speakers, $C = \{U_1, S_1, \ldots, S_{t-1}, U_t\}$. The "user:" and "system:" prefixes are added to the user and system utterances, respectively. Following the QA training phase, the "Multi-Choice Question:" and "Extractive Question:" prefixes are added to the categorical and non-categorical slot question sequences, respectively. Then, the slot question $q_i$, value candidates $A$, and dialogue context $C$ are concatenated into a single sequence as model input, and the model decodes the answer $a_i$ with greedy decoding.
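As a rough sketch of how a slot and a dialogue history could be turned into a model input, the example below follows the question template and the "user:" / "system:" prefixes given in the text. The function names, the "Context:" prefix, and the way the parts are joined are hypothetical choices for illustration only.

```python
from typing import List, Optional, Tuple


def slot_to_question(domain: str, slot: str) -> str:
    """Question template quoted in the text."""
    return f"what is the {slot} of the {domain} that user wants?"


def flatten_history(turns: List[Tuple[str, str]]) -> str:
    """turns holds (speaker, utterance) pairs, speaker in {"user", "system"}."""
    return " ".join(f"{speaker}: {utterance}" for speaker, utterance in turns)


def build_dst_input(domain: str, slot: str, turns: List[Tuple[str, str]],
                    candidates: Optional[List[str]] = None) -> str:
    """Categorical slots (with candidates) become multi-choice questions,
    non-categorical slots become extractive questions.
    The "Context:" prefix is an assumption, not taken from the paper."""
    question = slot_to_question(domain, slot)
    if candidates:
        head = (f"Multi-Choice Question: {question} "
                "Choices: " + " [sep] ".join(candidates))
    else:
        head = f"Extractive Question: {question}"
    return f"{head} Context: {flatten_history(turns)}"


if __name__ == "__main__":
    history = [("user", "i need a hotel in the center"),
               ("system", "any price range?"),
               ("user", "cheap please")]
    print(build_dst_input("hotel", "price range", history,
                          ["cheap", "moderate", "expensive"]))
```

The resulting string would then be passed to the QA-trained model and decoded greedily, with the decoded text taken as the slot value.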

Unanswerable Questions
In DST, at any given turn of the conversation, the slots not mentioned by the user are marked with "none" in the dialogue state. Especially in multi-domain dialogues, there are typically two kinds of "none" value slots: out-of-domain and in-domain unmentioned. The out-of-domain slots are the slots in other domains that are irrelevant to the current conversation. For example, when the user asks for a hotel in the center, all the slots that are not in the hotel domain (e.g., restaurant-price) have the value "none". The second kind, in-domain unmentioned, are those slots in the domain of interest but not yet mentioned by the user. For example, the user asks about a hotel in the center, and thus the slot hotel-star is "none" since the user does not specify this information. Therefore, we introduce two methods to simulate the out-of-domain and in-domain unmentioned slots in the QA training phase.
Negative Question Sampling. The out-of-domain slots in DST are similar to out-of-context questions in QA, that is, the model must predict "none" when the question is irrelevant to the context. To construct this kind of unanswerable question, we adapt the negative sampling strategy (Mikolov et al., 2013). As illustrated in Figure 2, during QA training, we sample these negative questions from a pool of questions collected from other passages.
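Negative question sampling can be sketched as below. This is an illustrative version under stated assumptions: the data layout (a dictionary from passage id to its questions) and the function name `sample_negative_question` are hypothetical, and the sketch does not check whether a sampled question happens to be answerable from the passage.

```python
import random
from typing import Dict, List


def sample_negative_question(passage_id: str,
                             questions_by_passage: Dict[str, List[str]],
                             rng: random.Random) -> Dict[str, str]:
    """Pair a passage with a question drawn from *other* passages and
    label the pair with the target "none"."""
    pool = [q
            for pid, questions in questions_by_passage.items()
            if pid != passage_id
            for q in questions]
    if not pool:
        raise ValueError("need at least one question from another passage")
    return {"passage_id": passage_id,
            "question": rng.choice(pool),
            "answer": "none"}


if __name__ == "__main__":
    questions = {"p1": ["Who founded the club?"],
                 "p2": ["When was the bridge built?",
                        "How long is the bridge?"]}
    print(sample_negative_question("p1", questions, random.Random(0)))
```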
Context Truncation. The in-domain unmentioned slots often appear in the middle of conversations, where some of the in-domain slots have not yet been mentioned by the user. We simulate this scenario by truncating the context passage at the first sentence that contains the answer span, keeping only the text before it. As illustrated in Figure 3, given a question and a passage from a QA training set, we first truncate the passage according to the answer span annotation, then we pair the question and the truncated passage as an unanswerable sample.
QA datasets. The extractive QA datasets are from the MRQA 2019 shared task (Fisch et al., 2019), and the two multi-choice datasets are RACE (Lai et al., 2017) and DREAM (Sun et al., 2019). The main train/dev statistics are reported in Table 1.
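The context truncation strategy can be sketched as follows. This is only an illustration under assumptions: sentences are split with a naive regular expression, the answer span is given by its character offset (`answer_start`, as in SQuAD-style annotations), and the extra check that discards samples whose truncated passage still contains the answer text is a safeguard added here, not something stated in the paper.

```python
import re
from typing import Optional


def truncate_context(passage: str, answer_text: str,
                     answer_start: int) -> Optional[str]:
    """Keep only the sentences that precede the first sentence containing
    the answer span. Returns None when no usable context is left."""
    kept = []
    offset = 0
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", passage):
        start = passage.index(sentence, offset)
        end = start + len(sentence)
        if start <= answer_start < end:
            break  # this sentence holds the answer: cut here
        kept.append(sentence)
        offset = end
    truncated = " ".join(kept)
    # Drop the sample if nothing is left or the answer text still occurs.
    if not truncated or answer_text.lower() in truncated.lower():
        return None
    return truncated


if __name__ == "__main__":
    text = ("The tower opened in 1889. It was designed by Gustave Eiffel. "
            "It is in Paris.")
    print(truncate_context(text, "Gustave Eiffel",
                           text.index("Gustave Eiffel")))
```

A question paired with the returned passage would then receive the target "none", mimicking a dialogue in which the slot has not yet been mentioned.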
DST datasets. The evaluation is conducted on two multi-domain task-oriented dialogue benchmarks, MultiWoz (Budzianowski et al., 2018) and Schema-Guided Dialogue (SGD). Both datasets provide turn-level annotations of dialogue states. In MultiWoz, we follow the pre-processing and evaluation setup of previous work, where the restaurant, train, attraction, hotel, and taxi domains are used for training and testing. In SGD, the test set has 18 domains, and 5 of the domains are not present in the training set.
Table 1: Train/dev statistics of the QA datasets. The extractive datasets are from Fisch et al. (2019), and the multiple-choice datasets are from RACE (Lai et al., 2017) and DREAM (Sun et al., 2019).

Evaluation
Joint Goal Accuracy (JGA) and Average Goal Accuracy (AGA) are used to evaluate our models and baselines. For JGA, the model output for a turn is counted as correct only when all of the predicted values exactly match the oracle values. AGA is the average accuracy of the active slots in each turn. In order to make consistent comparisons with previous work on cross-domain zero-shot/few-shot DST (Zhou and Small, 2019) in MultiWoz, we compute JGA per domain. For the SGD dataset, we use the official evaluation script.
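To make the two metrics concrete, the sketch below computes them on toy dialogue states. It is a simplified illustration and not the official SGD evaluation script: it assumes exact string matching, treats a missing slot as "none", and micro-averages AGA over all active (gold non-"none") slots; the function names are hypothetical.

```python
from typing import Dict, List

State = Dict[str, str]  # slot name -> value ("none" if not active)


def _active(state: State, none_value: str = "none") -> State:
    """Drop "none" slots so that omitted and "none" slots compare equal."""
    return {slot: value for slot, value in state.items()
            if value != none_value}


def joint_goal_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of turns whose full predicted state matches the gold state."""
    correct = sum(_active(p) == _active(g) for p, g in zip(preds, golds))
    return correct / len(golds)


def average_goal_accuracy(preds: List[State], golds: List[State],
                          none_value: str = "none") -> float:
    """Accuracy over active slots only (gold value is not "none")."""
    hits = 0
    total = 0
    for pred, gold in zip(preds, golds):
        for slot, value in _active(gold, none_value).items():
            total += 1
            hits += int(pred.get(slot, none_value) == value)
    return hits / total if total else 0.0


if __name__ == "__main__":
    golds = [{"hotel-area": "centre", "hotel-stars": "none"},
             {"hotel-area": "centre", "hotel-stars": "4"}]
    preds = [{"hotel-area": "centre"},
             {"hotel-area": "centre", "hotel-stars": "3"}]
    print(joint_goal_accuracy(preds, golds))    # one of two turns is fully correct
    print(average_goal_accuracy(preds, golds))  # two of three active slots are correct
```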

Implementation
We implement TransferQA based on T5-large (Raffel et al., 2020). All models are trained using the AdamW (Loshchilov and Hutter, 2018) optimizer with an initial learning rate of 0.00005. In the QA training stage, we set the ratio of unanswerable questions to α = 0.3, within which the ratio of negatively sampled questions to truncated contexts is 0.95 : 0.05, and we train the models with batch size 1024 for 5 epochs.
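One way to read the mixing ratios above is as a two-step random choice per training sample, sketched below. This interpretation is an assumption: the paper gives α = 0.3 and the 0.95 : 0.05 split but does not spell out the sampling procedure, and the function names are hypothetical. Under this reading, roughly 70% of samples stay answerable, 28.5% become negative-question samples, and 1.5% become truncated-context samples.

```python
import random
from collections import Counter
from typing import Dict


def choose_sample_type(rng: random.Random, alpha: float = 0.3,
                       nqs_share: float = 0.95) -> str:
    """With probability alpha make the sample unanswerable; among
    unanswerable samples, pick negative question sampling with
    probability nqs_share and context truncation otherwise."""
    if rng.random() >= alpha:
        return "answerable"
    if rng.random() < nqs_share:
        return "negative_question"
    return "truncated_context"


def empirical_shares(n: int, seed: int = 0) -> Dict[str, float]:
    """Draw n sample types and return the observed share of each type."""
    rng = random.Random(seed)
    counts = Counter(choose_sample_type(rng) for _ in range(n))
    return {kind: count / n for kind, count in counts.items()}


if __name__ == "__main__":
    print(empirical_shares(100_000))
```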
In the DST zero-shot testing, we first treat all the slots as non-categorical and generate all the slot values. The slots whose generated value is not "none" are considered active slots. Then the model generates the value of active categorical slots by using a multi-choice QA formulation. In SGD, we follow the split of non-categorical and categorical slots in the dataset, while in MultiWoz, we follow the split of MultiWoz 2.2, except that all the number-type slots are considered as non-categorical slots. We also apply a canonicalization technique from previous work in MultiWoz.
Table 2: Zero-shot results on MultiWoz. Results marked with † are from previous work and those marked with ‡ are from Campagna et al. (2020). We also report the averaged zero-shot joint goal accuracy among five domains. Note that this averaged per-domain accuracy is not comparable to the JGA in the full-shot setting.
Table 3: Results on SGD. The SGD-baseline is trained with the whole training set, and its results are taken from previously reported numbers. Domains that appear in the test set but are not present in the training set are marked with *. For TransferQA, all the domains are unseen because the model is not trained with any DST data.
For the few-shot experiments, the QA pretrained models are fine-tuned with 1%, 5% and 10% of the target domain data for 20 epochs. Other hyper-parameters are the same as in the QA training. We use 8 Tesla V100 GPUs for all of our experiments.
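The two-stage zero-shot inference described in this section (first an extractive pass over all slots, then a multi-choice pass over active categorical slots) can be sketched as below. The QA model is abstracted as a callable so the example stays self-contained; the `Answerer` signature, the slot specification format, and `track_state` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Optional

# (question, context, choices or None) -> generated answer string
Answerer = Callable[[str, str, Optional[List[str]]], str]


def track_state(slots: Dict[str, dict], context: str,
                answer: Answerer) -> Dict[str, str]:
    """Stage 1: ask every slot as an extractive question.
    Stage 2: for slots that are active (answer is not "none") and
    categorical (have candidate values), ask again as multi-choice."""
    state: Dict[str, str] = {}
    for name, spec in slots.items():
        question = spec["question"]
        value = answer(question, context, None)
        if value == "none":
            continue  # inactive slot, left out of the state
        if spec.get("choices"):
            value = answer(question, context, spec["choices"])
        state[name] = value
    return state


def toy_answerer(question: str, context: str,
                 choices: Optional[List[str]]) -> str:
    """Stand-in for the QA model, only for demonstration."""
    if "area" in question:
        return "centre" if choices else "center of town"
    return "none"


if __name__ == "__main__":
    slots = {
        "hotel-area": {
            "question": "what is the area of the hotel that user wants?",
            "choices": ["centre", "north"]},
        "hotel-name": {
            "question": "what is the name of the hotel that user wants?"},
    }
    print(track_state(slots, "user: a hotel in the center of town",
                      toy_answerer))
```

Using the extractive pass as a gate means the multi-choice formulation never has to choose "none" itself, which matches the description that categorical values are generated only for active slots.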

Baselines
TRADE. Transferable dialogue state generator, which utilizes a copy mechanism to facilitate domain knowledge transfer.
MA-DST. A multi-attention model which encodes the conversation history and slot semantics by using attention mechanisms at multiple granularities.
DSTQA. Dialogue state tracking via question answering over the ontology graph (Zhou and Small, 2019).
STARC. Applies two machine reading comprehension models based on RoBERTa-Large (Liu et al., 2019) for tracking categorical and non-categorical slots.

Zero-Shot
In Table 2, the previous approaches are evaluated in the cross-domain setting, where the models are trained on four domains in MultiWoz and then evaluated zero-shot on the held-out domain. Our TransferQA, without any DST training data, achieves significantly higher JGA (7.59% on average) compared to the previous zero-shot results. Table 3 summarizes the results on the SGD dataset, where the SGD-baseline is trained with the whole SGD training set. TransferQA's zero-shot performance is consistently higher in terms of JGA and AGA in the unseen domains, and competitive in the seen domains.
Table 4: Few-shot performance on MultiWoz 2.0 in terms of Joint Goal Accuracy (JGA). N/A denotes results not presented in the original paper. All models are evaluated with 1%, 5%, and 10% in-domain data.
The results on both datasets show the effectiveness of cross-task zero-shot transfer. In the cross-domain transfer scenario, despite the large amount of dialogue data, only a limited number of slots appear in the source domains. For example, MultiWoz has 8,438 dialogues with 113,556 annotated turns, but only 30 different slots in 5 domains. Thus, cross-domain transfer requires the models to generalize to new slots after being trained with fewer than 30 slots. By contrast, in cross-task transfer, each question in the QA datasets can be considered as a slot. Therefore, a model trained with diverse questions (around 500,000) on QA datasets is more likely to achieve better generalization.
Table 4 shows the few-shot results on MultiWoz 2.0, where TRADE and DSTQA (Zhou and Small, 2019) are trained on four source domains in MultiWoz and then fine-tuned with the target domain data, while STARC and our model TransferQA are first trained on the same QA datasets and then fine-tuned with the target domain data. The few-shot experiments are conducted on MultiWoz 2.0 for comparison with previous work. We experiment with 1%, 5% and 10% of the target domain data. The results show that both cross-task transfer approaches (i.e., STARC and TransferQA) outperform the cross-domain transfer approaches (i.e., TRADE and DSTQA) in 4 out of 5 domains. Compared to STARC, TransferQA achieves around 1% lower JGA in the hotel domain, but consistently higher JGA in the other domains under different data ratio settings. Especially when only 1% of in-domain data are available, our model outperforms STARC in most domains (except hotel) by a large margin (e.g., 11.95% in the attraction domain and 4.49% in the train domain). This significant improvement can be attributed to the generated unanswerable samples, which bridge the gap between the source data distribution and the target data distribution.

Impact of Unanswerable Questions
In Table 5, we study the effect of the two unanswerable question generation strategies, Context Truncation (CT) and Negative Question Sampling (NQS), described in Section 2.2. Applying both CT and NQS gives the best result in terms of average JGA for both TransferQA-large and TransferQA-base. While removing the CT strategy during the QA training only affects the performance in the train domain, removing both NQS and CT decreases the JGA dramatically in all the domains. This is because the ratio of unanswerable ("none") slots in MultiWoz is high (55.25%), and removing the simulated unanswerable questions during QA training affects the Slot Gate Accuracy (SGA) in DST inference. Indeed, by adding NQS and CT, we observed a large JGA improvement (around 30%) in the taxi domain, which has the highest ratio of unanswerable slots (71.85%), and a relatively small JGA improvement (around 10%) in the attraction and train domains, where the ratios of unanswerable slots are 47.70% and 49.58%. Overall, these results demonstrate the importance of generating unanswerable questions.
In Figure 4, we show the effect of using different ratios α for generating unanswerable questions. When the ratio is too low, the model is not able to capture the unmentioned slots; when the ratio is too high, the model tends to over-predict "none". In general, we find that α = 0.3 and α = 0.6 give the highest JGA.

Error Analysis
To understand the current limitations of cross-task transfer learning, we conducted an error analysis on the MultiWoz 2.1 zero-shot results. We found that 79.79% of the errors come from slot gate prediction (i.e., whether the slot is unanswerable or answerable): 37.54% of all errors are false positives (i.e., the slot is unanswerable and the model predicts answerable) and 42.25% are false negatives (i.e., the slot is answerable and the model predicts unanswerable). Only 20.21% of the errors come from wrong value predictions for answerable slots. In Table 6, we show two typical errors that we found in the zero-shot DST setting. The first, as shown in the example of dialogue MUL2321, is the model predicting slot values that have not been confirmed by the user yet (e.g., pricerange="expensive"). The second, as shown in dialogue PMUL0089, is the model not capturing slot values when the user does not explicitly mention the domain (e.g., "a place to stay" refers to the hotel domain). These errors occur because of question-context mismatch, and they might be addressed with well-designed or learned slot questions (Li and Liang, 2021; Wallace et al., 2019). We leave this exploration to future work.
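The error categories used in this analysis can be made explicit with a small sketch. It assumes each slot prediction is compared to its gold value by exact match and that "none" marks an unanswerable slot; the function names and the toy data are hypothetical and are not taken from the paper's analysis code.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def error_type(pred: str, gold: str,
               none_value: str = "none") -> Optional[str]:
    """Classify one slot prediction; returns None when it is correct."""
    if pred == gold:
        return None
    if gold == none_value:
        return "gate_false_positive"   # unanswerable, predicted answerable
    if pred == none_value:
        return "gate_false_negative"   # answerable, predicted unanswerable
    return "wrong_value"               # answerable, wrong value predicted


def error_breakdown(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Share of each error type among all erroneous (pred, gold) pairs."""
    counts: Counter = Counter()
    for pred, gold in pairs:
        kind = error_type(pred, gold)
        if kind is not None:
            counts[kind] += 1
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


if __name__ == "__main__":
    toy = [("expensive", "none"), ("none", "cheap"), ("north", "centre"),
           ("centre", "centre"), ("none", "none"), ("4", "none")]
    print(error_breakdown(toy))
```

In this scheme the two gate categories together correspond to the slot gate errors reported above, and the remaining category to wrong value predictions.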

Oracle Study
We further conducted an oracle study on our model by providing the gold slot gate information. The results are shown in the last row of Table 3 and Table 2. We found that this oracle information dramatically increases the JGA (20.7% → 48.0% in SGD, 35.77% → 56.06% in MultiWoz). Therefore, by improving the accuracy of predicting "none" value slots, we have the potential to increase the overall zero-shot DST performance by a large margin.

Related Work
Machine Reading for Question Answering (MRQA) is an important task for evaluating how well computer systems understand human language (Fisch et al., 2019). In MRQA, a model must answer a question by reading one or more context documents. There are mainly two types of MRQA tasks. The first is extractive QA (Rajpurkar et al., 2016, 2018; Trischler et al., 2017; Joshi et al., 2017; Dunn et al., 2017; Yang et al., 2018; Kwiatkowski et al., 2019), where the answer to each answerable question appears as a span of tokens in the passage. A popular approach for this task is to predict the start token and end token of the answer span. The second is multi-choice QA (Lai et al., 2017; Sun et al., 2019), where the answer candidates are provided. In this task, classification-based models are usually applied to predict the correct candidate.
Dialogue State Tracking is an essential yet challenging task in conversational AI research (Williams and Young, 2007; Williams et al., 2014). Recent state-of-the-art models (Lei et al., 2018; Wu et al., 2020; Peng et al., 2020; Zhang et al., 2019; Heck et al., 2020; Mehri et al., 2020; Hosseini-Asl et al., 2020; Li et al., 2020) trained with extensive annotated dialogue data have shown promising performance in complex multi-domain conversations (Budzianowski et al., 2018). However, collecting large amounts of data for every dialogue domain is often costly and inefficient. To reduce the expense of data acquisition, zero-shot (few-shot) transfer learning has been proposed as an effective solution. Despite the effectiveness of these approaches, a considerable amount of DST data is still required to cover a broad range of slot categories.
On the other hand, STARC utilizes abundant QA data to overcome the data scarcity issue in DST tasks. The authors first train a classification model and a span-extraction model on multi-choice QA and extractive QA datasets independently. Then, they use the two QA models to track categorical and extractive slots. Compared to this approach, our method is fundamentally different in two aspects: 1) our model can effectively handle "none" value slots (e.g., unmentioned and out-of-domain slots) in the zero-shot setting, which is important to DST performance as there are many "none" slots in multi-domain dialogues; 2) our method provides a simple text-to-text input-output interface for tracking both categorical and extractive slots with a single generative model.

Conclusion
In this paper, we present TransferQA, a unified generative model that performs DST without using any DST training data. TransferQA uses the text-to-text transfer learning framework that seamlessly combines extractive QA and multi-choice QA for tracking both categorical slots and non-categorical slots. To enable our model to predict "none" value slots in the zero-shot setting, we introduce two effective ways to construct unanswerable questions, i.e., negative question sampling and context truncation. The experimental results on the MultiWoz and SGD datasets demonstrate the effectiveness of our approach in both zero-shot and few-shot settings. We also show that improving the "none" value slot accuracy has