Zero-shot Generalization in Dialog State Tracking through Generative Question Answering

Dialog State Tracking (DST), an integral part of modern dialog systems, aims to track user preferences and constraints (slots) in task-oriented dialogs. In real-world settings with constantly changing services, DST systems must generalize to new domains and unseen slot types. Existing methods for DST do not generalize well to new slot names and many require known ontologies of slot types and values for inference. We introduce a novel ontology-free framework that supports natural language queries for unseen constraints and slots in multi-domain task-oriented dialogs. Our approach is based on generative question-answering using a conditional language model pre-trained on substantive English sentences. Our model improves joint goal accuracy in zero-shot domain adaptation settings by up to 9% (absolute) over the previous state-of-the-art on the MultiWOZ 2.1 dataset.


Introduction
Dialog agents are gaining increasing prominence in daily life. These systems aim to assist users via natural language conversations, taking the form of digital assistants that help accomplish everyday tasks by interfacing with connected devices and services. A key component of understanding and enabling these task-oriented dialogs is Dialog State Tracking (DST): extracting user intent and goals from conversations by filling in belief slots (Lemon et al., 2006; Wang and Lemon, 2013). Assistive and recommendation use-cases for dialog agents in production settings are particularly challenging due to the constantly changing services and applications with which they interface.
Traditional DST systems have achieved high accuracy when presented with a known ontology of slot types and valid values (Chen et al., 2020). In a real-world setting, however, a DST model must generalize to new slot values (e.g. new entities that are not present at training time) and new slot types (e.g. requirements regarding a new application). Recent work has sought to address these issues by posing DST as a reading comprehension or question answering (QA) task: such models predict each slot value independently at any given turn and can, in principle, be queried for new slots at inference time.
Some approaches to DST as QA learn embedding vectors for each slot and/or domain word, but this is not robust to unseen slots whose specific names (e.g. 'Internet Access') may be totally unlike those in the training set. Other work attempts to remedy this by posing a natural language question for each slot, but such hybrid span-extraction and classification-based systems nonetheless require access to the full ontology for unknown domains. We present an ontology-free model that uses natural language questions to represent slots and builds on conditional language modeling techniques, taking advantage of the rise of powerful generative language models (Radford et al., 2019), to tackle DST as a generative QA task. Our model can generalize to unseen domains, slot types, and values, and allows developers to query for arbitrary user requirements via simple questions. To summarize our main contributions:

• We propose an ontology-free conditional language modeling framework for dialog state tracking via generative question answering, achieving state-of-the-art performance in zero-shot domain adaptation settings for DST on MultiWOZ 2.1 (Eric et al., 2020) across all domains, with average per-domain gains of 5.9% joint accuracy over previous best methods;
• We demonstrate performance competitive with state-of-the-art methods in a fully supervised setting;
• We show that our approach can be easily adapted to predict slot carry-over and to transfer knowledge from a larger, more diverse dataset, improving zero-shot DST performance across all domains for an average gain of 11% joint accuracy over the state-of-the-art.

Approach
We follow prior work in treating Dialog State Tracking as a reading comprehension problem: at each turn of dialog, our model reads the dialog history and answers a fixed set of queries about user requirements and preferences (slots), with predictions aggregated to form the belief state. In our framework (Figure 1), we query for a given slot (e.g. Hotel Price Range) by asking a natural language question: "What is the price range of the hotel the user prefers?". As our model's predictive ability is based on its general understanding of language and task-oriented conversation, we support zero-shot inference without the need to re-train the model or extend a formal ontology. For example, if a model has not been trained on data from the hotel domain, when presented with a hotel booking conversation we may nonetheless ask it a question like "In what area is the user looking for a hotel?" and receive a prediction for that unseen requirement (Hotel Area). While we conduct our experiments on English-language DST datasets, our approach is applicable to state tracking in any language, provided a conversation history is available.

Problem Statement
We consider a conversation with T turns of user utterances u_t and system utterances y_t: C = {y_1, u_1, ..., y_T, u_T}. The belief state B_t at turn t comprises tuples of slots s ∈ S and their associated values v_{s,t} ∈ V_s, extracted from the conversation history C_t = {y_1, u_1, ..., y_t, u_t}. The set of possible values V_s can be arbitrarily large (e.g. possible hotel names), so we represent these values as sequences of vocabulary tokens v_{s,t} = {w_1, w_2, ..., w_k}, w_i ∈ W. At inference time we pose a natural language question s = {w_1, ..., w_n} and our model predicts an answer (slot value v_{s,t}) based on its understanding of the dialog history C_t. To predict the belief state B_t, our model independently answers |S| different questions (Figure 1). In zero-shot DST, the system must predict values for slots outside of the initial ontology; these slot queries correspond to arbitrary natural language questions s about entities and relationships in the conversation C_t.

Figure 2: Our model performs DST via generative question-answering. Natural language questions for dialog slots allow our model to generalize to new slot types through its understanding of general language.
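The per-slot querying scheme above can be sketched in a few lines of Python; the answer function, slot names, and questions below are illustrative stand-ins, not the paper's actual model or ontology.

```python
# Sketch of per-slot belief-state prediction: the model answers one
# natural language question per slot, and the answers are aggregated
# into the belief state B_t for turn t. All names are hypothetical.

def predict_belief_state(dialog_history, slot_questions, answer_fn):
    """Query the model once per slot and collect non-empty answers.

    dialog_history: list of (speaker, utterance) pairs forming C_t
    slot_questions: dict mapping slot name -> natural language question
    answer_fn:      stand-in for the generative QA model
    """
    belief_state = {}
    for slot, question in slot_questions.items():
        value = answer_fn(dialog_history, question)
        if value is not None:            # slot not mentioned: no entry
            belief_state[slot] = value
    return belief_state

# Toy answer function standing in for the language model.
def toy_answer_fn(history, question):
    text = " ".join(u for _, u in history).lower()
    if "price" in question.lower() and "cheap" in text:
        return "cheap"
    return None

history = [("usr", "I need a cheap hotel in the centre.")]
questions = {
    "hotel-pricerange": "What is the price range of the hotel the user prefers?",
    "hotel-parking": "Does the user need parking at the hotel?",
}
state = predict_belief_state(history, questions, toy_answer_fn)
```

Because each slot is queried independently, a new slot can be tracked at inference time simply by adding one more question to the dict.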
Generalizing to New Domains and Slots
Dialog State Tracking systems in real-world settings must scale to new users and services, accommodating new slot values (e.g. a new movie release) as well as new domains and slot types (e.g. a service update, or a new connected API). Existing methods require the developer to either write a complete ontology of slots and allowed values or modify their model architecture to add slot-specific prediction heads (Chen et al., 2020). Span-based approaches (Zhang et al., 2019; Zhou and Small, 2019) can correctly predict values that appear verbatim in a conversation but fail when a user paraphrases or mis-phrases a value. They also fall back to treating open-valued slots as classification problems (Zhang et al., 2019). We approach DST as an ontology-free generative question answering task, as generative methods have shown promise in few-shot and supervised DST settings. While some approaches to DST as QA learn a set of embeddings for each slot and/or domain, this is not robust to unseen slots. We encode slots as natural language questions, manually formulating one question per slot, allowing us to share a pre-trained encoder for both dialog context and slot to leverage shared linguistic knowledge. Thus, our model is also agnostic to ontologies and can answer arbitrary English questions about the dialog history. We treat DST via QA as a conditional language modeling task, and train our model to predict the conditional likelihood of question (slot s) and answer (value v_{s,t}) tokens given a dialog context C_t at turn t:

P([s; v_{s,t}] | C_t) = ∏_{i=1}^{n} P(w_i | C_t, w_1, ..., w_{i-1}), where [s; v_{s,t}] = {w_1, ..., w_n}.

At inference time, the model is given the dialog context alongside a question, [C_t; s], and asked to predict the value v_{s,t} for that slot.

Model Architecture
For our conditional language model, we compared two common architectures: 1) an encoder-decoder model (Sutskever et al., 2014) with a bi-directional encoder; and 2) a purely auto-regressive decoder-only model. We conducted preliminary experiments using both a Transformer (Vaswani et al., 2017) encoder-decoder language model pre-trained using a de-noising auto-encoder objective (Lewis et al., 2020), as well as a Transformer decoder pre-trained with next-token prediction on English web pages. We achieved 1% better supervised DST performance with the decoder-only model in half the training time. Our model architecture thus comprises a Transformer decoder language model, which allows us to leverage pre-trained language models like GPT2 (Radford et al., 2019) and the common-sense world knowledge accrued through pre-training (Petroni et al., 2019). We use a BPE (Sennrich et al., 2016) tokenizer to convert input text into a sequence of tokens. These are embedded in R^h and added to an R^h sinusoidal positional embedding. This input embedding is processed by l Transformer layers with hidden dimensionality h, each of which applies multi-headed attention with k heads followed by a feed-forward layer with a GELU nonlinearity. The final output hidden states are then projected into our vocabulary space of 50,257 sub-word tokens. We initialize our model weights with DistilGPT2 (Sanh et al., 2019), GPT2 (Radford et al., 2019), or GPT2-medium, with h = 768, 768, 1024; l = 6, 12, 24; and k = 12, 12, 16 respectively.
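The three model sizes above can be restated as a small configuration table; this is a plain restatement of the reported hyperparameters, with dictionary keys chosen for illustration.

```python
# Hyperparameters of the three model initializations reported above:
# h = hidden dimensionality, l = Transformer layers, k = attention heads.
CONFIGS = {
    "distilgpt2":  {"h": 768,  "l": 6,  "k": 12},
    "gpt2":        {"h": 768,  "l": 12, "k": 12},
    "gpt2-medium": {"h": 1024, "l": 24, "k": 16},
}

VOCAB_SIZE = 50_257  # shared BPE sub-word vocabulary size
```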
As seen in Figure 2, our input sequence consists of a concatenation of the dialog context C_t, slot query s, and slot value v_{s,t}: [C_t; s; v_{s,t}]. We prepend each utterance with a speaker token [usr] or [sys] for a user or system speaker, allowing our model to identify additional context about each utterance. We prepend the slot query and value with question: and answer: respectively, to distinguish slot queries from user-posed questions in the conversation. At training time, we calculate a cross-entropy loss similar to that of encoder-decoder models by maximizing the log likelihood of the slot query and value conditioned on the dialog context:

L = -∑_{i=1}^{n} log P(w_i | C_t, w_1, ..., w_{i-1}), where [s; v_{s,t}] = {w_1, ..., w_n} and n = |[s; v_{s,t}]|.

We find through ablation experiments on our architecture that this loss computation method out-performs a naïve language-modeling approach that maximizes the log likelihood of the full concatenated sequence [C_t; s; v_{s,t}] via the factorized joint distribution (Peng et al., 2020; Hosseini-Asl et al., 2020):

log P([C_t; s; v_{s,t}]) = ∑_{i=1}^{m} log P(x_i | x_1, ..., x_{i-1}), where [C_t; s; v_{s,t}] = {x_1, ..., x_m}.

This allows for flexibility in learned representations for dialog context while regularizing slot query hidden states.
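The masked loss described above can be sketched with the common convention of an ignore index for positions excluded from cross-entropy; the token ids and the ignore value are illustrative, not tied to any particular library.

```python
# Sketch of the loss masking described above: cross-entropy is computed
# only over the slot query and value tokens [s; v_{s,t}], while dialog
# context positions are masked out. IGNORE_INDEX follows the common
# convention of marking positions excluded from the loss.

IGNORE_INDEX = -100

def build_labels(context_ids, question_ids, value_ids):
    """Return per-position labels: context masked, [s; v] supervised."""
    labels = [IGNORE_INDEX] * len(context_ids)
    labels += list(question_ids) + list(value_ids)
    return labels

context = [101, 102, 103]   # token ids for C_t (toy values)
question = [201, 202]       # token ids for the slot question s
value = [301]               # token ids for the value v_{s,t}

labels = build_labels(context, question, value)
# Only the last n = |[s; v_{s,t}]| positions contribute to the loss.
```

Under the plain language-modeling baseline (LM), every position would carry a label; the CLMQ variant keeps only the question and value positions supervised.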

Data
We perform our experiments on MultiWOZ, which contains over 10K single- and multi-domain task-oriented dialogs written by crowd-workers. We use the 2.1 version, with corrected and standardized annotations from Eric et al. (2020). We follow prior work in lower-casing all dialogs and removing dialogs from training-only domains (Police and Hospital). The final dataset contains 9,906 conversations from 5 domains (Restaurant, Hotel, Attraction, Train, Taxi) covering 30 domain-slot pairs. Each dialog contains an average of 7 user and system turns. We also experiment with augmenting our training dataset in zero-shot settings with observations drawn from the DSTC8 dataset, which contains 16,152 dialogs from 45 domains. DSTC8 was created via template-based dialog models provided with service APIs, and then edited by crowd-workers (Shah et al., 2018). We normalize domains and slots corresponding to the same domain (e.g. Bus 1, Bus 2) for a total of 19 domains and 124 slot types in DSTC8. We further manually annotate each dataset with slot value types: open-valued (e.g. Hotel Name), numeric (e.g. Restaurant Guests), temporal (e.g. Taxi LeaveAt), and categorical (e.g. Attraction Type). Dataset statistics are shown in

Experiments
We measure DST performance via Joint Goal Accuracy (JGA): the proportion of turns with all belief slots predicted correctly, including those not present. In Section 5.1, we evaluate our model on fully supervised DST, in which all domains and slots are known at training time. In Section 5.2, we investigate zero-shot domain adaptation in which the model is evaluated on conversations from an unseen domain with previously unseen slots. We then explore how our framework seamlessly accommodates teaching a model to predict slot carryover (Section 5.3) and transfer learning with significantly more diverse domains and slot types (Section 5.4). To measure zero-shot JGA, we follow Campagna et al. (2020) and only consider slots specific to the held-out domain. We focus our analysis on the zero-shot setting, as our goal is to build DST systems that can easily and effectively generalize to new domains and services. We train all models to convergence with a maximum of 10 epochs on Nvidia V100 GPUs, using the Lamb optimizer (You et al., 2020) with a base learning rate of 2e-5. All predictions are made using greedy decoding.
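Joint Goal Accuracy as defined above can be sketched directly: a turn counts as correct only if the full predicted belief state matches the gold state, including slots that should be absent. The toy belief states below are illustrative.

```python
# Sketch of Joint Goal Accuracy (JGA): the proportion of turns whose
# entire belief state is predicted exactly, including absent slots
# (an absent slot is simply missing from the dict).

def joint_goal_accuracy(predictions, gold):
    """predictions, gold: lists of per-turn belief-state dicts."""
    assert len(predictions) == len(gold)
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold) if gold else 0.0

gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "south", "hotel-stars": "4"},  # one wrong value
]
jga = joint_goal_accuracy(pred, gold)  # 1 of 2 turns fully correct
```

Note that a single wrong or spurious slot makes the whole turn incorrect, which is why JGA is a strict metric for multi-slot dialogs.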

Supervised DST
We first evaluate on the commonly benchmarked supervised DST task to demonstrate performance competitive with the state-of-the-art. In this setting we compare our approach against prior methods capable of zero-shot inference in Table 3 (TRADE, STARC, SUMBT, and MA-DST) and those incapable of doing so in Table 4, including DSTQA (Zhou and Small, 2019), DS-DST (Zhang et al., 2019), SOM-DST (Kim et al., 2020), SST (Chen et al., 2020), TripPy (Heck et al., 2020), and SimpleToD (Hosseini-Asl et al., 2020). Our model outperforms all prior models that support zero-shot generalization and is competitive with methods that focus solely on supervised DST, most of which require extra supervision at training and inference time, including dialog actions and prior dialog states. We distinguish models by their prediction type as (C)lassification-, (S)pan extraction-, and (G)eneration-based methods.
As seen in Table 1, our formulation of DST as a generative QA task benefits significantly from the use of a conditional decoder-style model. A standard auto-regressive language modeling formulation (LM), with loss computed over the entire input sequence, achieves 13% lower JGA compared to computing cross-entropy loss only over slot value tokens (CLM). Pre-training is also crucial: we see a 10-point drop in JGA when randomly initializing model weights (no PT) compared to initializing from pre-trained DistilGPT2 weights. We also compare two other sizes of our model: GPT2-based (comparable in size to SUMBT's 112M parameters) and GPT2-medium-based (comparable in size to STARC's 355M parameters). We find that scaling up the model yields modest improvements in supervised JGA. We hypothesize that extending our loss to cover both slot query and value tokens (+Question/CLMQ) helps regularize the hidden representations of question tokens; with it, we achieve a 1.3% improvement in JGA.

Zero-Shot DST
Our primary focus lies in the zero-shot domain adaptation setting, where conversations and target slots at inference time come from unseen domains. We use a leave-one-out setup, training our models on four domains from MultiWOZ and evaluating on the held-out domain. Our model must understand a wide variety of possible questions about unseen conversations to generalize well. We compare our model against strong baseline models for zero-shot DST: TRADE, SUMBT, and MA-DST; Table 5 contains results from our models alongside baseline results reported in prior work and by Campagna et al. (2020). These baselines represent slots as domain-slot tuples: TRADE learns a separate embedding for each domain and each word in slot names, while SUMBT and MA-DST encode domain-slot tuples via BERT (Devlin et al., 2019) and an RNN encoder, respectively.
Our GPT2-medium based model achieves state-of-the-art zero-shot performance on all five domains, and by a significant (5-10%) margin on the Restaurant, Hotel, Attraction, and Train domains. While increased model size only modestly impacts supervised DST performance (Table 1), larger models perform significantly better in the zero-shot setting, with average absolute gains of 4.8% and relative gains of 22% in JGA across domains. Such improvements are consistent with the findings of Brown et al. (2020) that up-sizing language models improves zero-shot performance across various tasks, and of Petroni et al. (2019), who observe that larger pre-trained models can retain more common-sense and world knowledge from their pre-training corpus, which may help our model understand queries for unseen domains and slots.

Effect of Natural Language Questions
Prior work that frames DST as QA typically represents the slot query as a concatenation (tuple) of domain and slot name. Zhang et al. (2019) explore the impact of three different slot representations (domain-slot tuples, short slot descriptions, and full questions) on a hybrid classification-extraction model for DST, and find little difference in performance. However, we find that full questions work much better than domain-slot tuples for our generative framework, especially in zero-shot DST. We hypothesize that natural language questions, being structurally similar to dialog utterances and pre-training sentences, allow our model to best leverage its linguistic knowledge with minimal friction when jointly encoding the dialog history, slot query, and slot value. Prior analysis finds that zero-shot generalization in models that represent slots as tuples is primarily due to shared slot names between domains (e.g. Taxi and Train 'leaveAt'). In a real-world setting, a newly added dialog service is unlikely to share slot names verbatim with existing services. To fairly compare tuples and natural language questions under our framework, we perform zero-shot experiments using each representation. For tuple-based questions, our model takes as slot query a synonym of the slot name (e.g. Taxi 'leaveAt' → 'Pick Up Time') instead of a full question (e.g. 'What time does the user want the taxi to pick them up?'). Full-question models achieved 6% higher per-domain JGA compared to slot-tuple models, supporting the notion that slot-tuple models memorize slot names rather than understanding their meaning and thus do not generalize well in real-world settings. Using full questions, our model (Table 5) achieves state-of-the-art performance in zero-shot settings.
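The two slot representations compared above can be sketched as alternative query tables; the Taxi 'leaveAt' synonym and question come from the example in the text, and the table structure is illustrative.

```python
# Sketch of the two slot-query representations compared above:
# a synonym of the slot name (tuple-style) vs. a full natural
# language question. Table structure is illustrative.

SLOT_AS_SYNONYM = {
    ("taxi", "leaveAt"): "Pick Up Time",
}
SLOT_AS_QUESTION = {
    ("taxi", "leaveAt"):
        "What time does the user want the taxi to pick them up?",
}

def slot_query(domain, slot, use_full_question=True):
    """Look up the query string the model is conditioned on."""
    table = SLOT_AS_QUESTION if use_full_question else SLOT_AS_SYNONYM
    return table[(domain, slot)]
```

Only the query string changes between the two settings; the model architecture and decoding are identical, which isolates the effect of the slot representation.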
Error Modalities
Following prior work, we categorize DST errors into three modalities: 1) the model predicts a spurious value for an irrelevant slot; 2) the model ignores a relevant slot; and 3) the model correctly infers the presence of a slot but predicts a wrong value. We also examine the source of dialog slots: users explicitly express the majority (79.5%) of slot values, while a minority are either derived via user reactions to system suggestions (9.7%) or implicitly valued (10.8%), i.e. not present verbatim in a conversation. However, our errors are distributed evenly between user-, system-, and implicitly-sourced slots, suggesting that it is challenging for our model to track dialog states that are updated reactively via user feedback. We thus see a future opportunity to improve DST models by emphasizing multi-hop reasoning and common-sense inference.

Predicting Carried Over Slots
Long-range dependencies and slot values carried over from early turns are particularly important to model for accurate DST in long conversations. We observe this in the zero-shot setting: our model is able to predict all slots accurately for 61% of conversation first-turns, dropping to 46% after one turn, and to 5.7% after seven turns (the average conversation duration). We implement an oracle module to discard predictions when a dialog state does not need updating, obtaining an upper bound for DST improvements due to carry-over prediction. With this oracle, we see an average 5-point improvement in JGA across domains, indicating that carry-over prediction can greatly benefit our model. State-of-the-art models for fully supervised DST often rely on explicitly processing previous dialog states, via slot-value graphs (Zhou and Small, 2019; Chen et al., 2020) or as a separate input to the model at each turn (Heck et al., 2020; Kim et al., 2020). In our framework we can target slot carry-over by training a model to predict a special carried over token in place of the true slot value whenever a slot does not need updating at the current turn (+ Carryover). At inference time, we replace predicted carry-over tokens with the slot's last predicted value.
Our carry-over implementation improved JGA for all domains (Table 7) by an average of 3.14%, and improved JGA across all context lengths, with the largest improvements (+7%) at the second and third turns of a conversation (Figure 4). The carried over token allows our model to hedge against low-confidence slots, falling back to predictions from previous turns where the target slot may be directly mentioned. This helps reduce the wrong-value error rate by an average of 31% across domains. Our model can also propagate null values with carry-over, reducing spurious predictions by an average of 36% across domains. However, we also observe our carry-over model propagating 78% of its errors from previous turns, suggesting that further improvements can come from more accurately predicting slot updates.
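The inference-time carry-over step described above can be sketched as a simple post-processing pass; the token string and slot names below are assumptions for illustration.

```python
# Sketch of carry-over resolution at inference time: whenever the model
# emits the special carry-over token, substitute the slot's last
# resolved value from a previous turn. Token string is assumed.

CARRYOVER = "carried over"

def resolve_carryover(turn_predictions):
    """turn_predictions: list (one dict per turn) of slot -> value,
    where a value may be the CARRYOVER token."""
    resolved, last = [], {}
    for preds in turn_predictions:
        turn = {}
        for slot, value in preds.items():
            if value == CARRYOVER:
                value = last.get(slot)   # fall back to previous turns
            if value is not None:
                turn[slot] = value
                last[slot] = value       # remember for later turns
        resolved.append(turn)
    return resolved

turns = [
    {"hotel-area": "north"},
    {"hotel-area": CARRYOVER, "hotel-stars": "4"},
]
resolved = resolve_carryover(turns)
```

Because a carry-over token with no earlier value resolves to nothing, the same mechanism also propagates null (absent) slots across turns.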

Transfer Learning for Generalization
Our framework is ontology-agnostic and thus easily supports transfer learning without modifying the architecture: we simply write natural language questions for additional slots. Prior work found that intermediate fine-tuning of RoBERTa-Large (Liu et al., 2019) on passage-based QA tasks (Fisch et al., 2019) improved zero-shot DST performance. In preliminary experiments, we found no significant impact from intermediate fine-tuning on the SQuAD v2.0 (Rajpurkar et al., 2018) passage-based QA dataset. However, we observe significant improvements when training with joint, non-curriculum learning (McCann et al., 2018; Raffel et al., 2020): augmenting our training data with an equal number of examples sampled from DSTC8, taking care to remove data from the held-out domain in both MultiWOZ and DSTC8. Our framework allows for easy joint optimization with carry-over and transfer learning: by training new models on MultiWOZ 2.1 augmented with DSTC8 (+ DSTC8), we gain a further average 3.5-point improvement in per-domain JGA (Table 7). On average, our model makes 29% fewer spurious errors and 6.9% fewer errors on open-valued slots, suggesting that it scales well with additional training data containing semantically distinct slot types and values. Our model also makes 9.7% fewer errors on categorical slots and 63% fewer mistakes where it assigns the value of one categorical slot to another, despite being unable to observe the set of possible categorical options, suggesting that exposure to more diverse categorical slots allows our model to better understand and distinguish between such slots. While temporal slots comprise only 17% of MultiWOZ and 10% of DSTC8 slots, these additional examples seem to help our model better disambiguate temporal references, making 32% fewer errors on such slots.
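The joint-training mixture can be sketched as follows; the example record format (a 'domain' field per example) and the equal-size sampling detail are assumptions based on the description above, not the paper's actual data pipeline.

```python
import random

# Sketch of the joint, non-curriculum training mixture: augment the
# MultiWOZ training set with an equal number of DSTC8 examples, after
# removing the held-out domain from both sources. Record format is
# illustrative.

def build_joint_training_set(multiwoz, dstc8, held_out_domain, seed=0):
    mwoz = [ex for ex in multiwoz if ex["domain"] != held_out_domain]
    pool = [ex for ex in dstc8 if ex["domain"] != held_out_domain]
    rng = random.Random(seed)
    k = min(len(mwoz), len(pool))
    extra = rng.sample(pool, k)        # equal-sized DSTC8 sample
    mixed = mwoz + extra
    rng.shuffle(mixed)                 # no curriculum: shuffle jointly
    return mixed

multiwoz = [{"domain": d} for d in ["hotel", "train", "taxi", "hotel"]]
dstc8 = [{"domain": d} for d in ["flights", "hotel", "media", "banks", "events"]]
mixed = build_joint_training_set(multiwoz, dstc8, held_out_domain="hotel")
```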
By applying both carry-over and transfer learning to our largest model, we observe further improvements in zero-shot JGA for all domains, averaging 5.1 points better than GPT2-m CLMQ, for an average gain of 11% JGA over the previous state-of-the-art across domains (Table 7).

Qualitative Analysis
We manually reviewed errors made by our GPT2-medium CLMQ model in the zero-shot setting, annotating 20 errors from each modality (spurious, ignored, wrong value) in each domain with the gold label quality and perceived cause of error, for a total of 300 annotated examples. As widely observed in recent DST work (Zhou and Small, 2019), a significant proportion of DST errors on MultiWOZ are unavoidable, caused by annotation errors. While version 2.1 corrected some of these, annotation errors and inconsistencies remain responsible for 30% of sampled errors; in particular, in 10% of errors the original annotator did not record reactive preferences, while in 5% of errors the original annotator did. These inconsistencies can hurt our model's ability to infer reactive and implied requirements and preferences. We are also particularly interested in slot transfers, in which our model mistakenly predicts one slot's value for a different slot; these comprise 36% of our manually reviewed errors. In the Taxi and Hotel domains, our model transfers slots from the same domain over 75% of the time, with most swaps occurring between same-category slots (e.g. temporal slots like Taxi 'LeaveAt' and 'ArriveBy'). Slots in these domains are closely semantically related, with values that can fit any slot of that category (e.g. 13:10 vs. 15:15). While a human can easily infer that the earlier of two times must be the departure and the later the arrival, our model has no inherent understanding of temporal mechanics or numeracy (Wallace et al., 2019). In future work, we will explore learning such knowledge directly, via hierarchical softmax output distributions to distinguish between output modalities (Spithourakis and Riedel, 2018), and by fine-tuning our model with contrastive losses to learn to rank numerals and times (Hoffer and Ailon, 2015).
For Restaurant, Attraction, and Train, our model tends to swap slot values with those from other domains in the conversation. This is often due to semantically similar slots whose values, at first glance, may not be obviously identifiable as such (e.g. 'Bridge' or 'The Place'). Prior work similarly observes a particularly high incidence of slot transfers between different domains' 'Name' slots. Other such slots include price ranges and numbers of guests. We have seen that data augmentation with DSTC8 can improve our model's ability to disambiguate such slots; this suggests that we could further improve our model by exposing it to in-domain, conversational reading comprehension data.
While no such dataset currently exists, in future work we aim to explore using question generation (Du et al., 2017) and paraphrasing (Tseng et al., 2014) models to perform in-domain data augmentation, creating reading comprehension questions for task-oriented dialogs that target entities and relations not covered by an ontology. We also wish to explore methods for generating general reading comprehension questions for out-of-domain conversations (Shakeri et al., 2020) to improve our model's domain adaptation ability.

Related Work
Modern dialog state tracking seeks to capture evolving user intents in a structured belief state (Thomson and Young, 2010). Traditional systems rely on hand-crafted features (Henderson et al., 2014) and classify slot values from a fixed ontology (Mrksic et al., 2017). Span-based methods, including Zhou and Small (2019), fill some slots via spans extracted from the dialog history, although they treat non-numeric slots as categorical. Generative methods (Xu and Hu, 2018) can predict arbitrary unseen values, with Hosseini-Asl et al. (2020) achieving state-of-the-art supervised DST performance on MultiWOZ 2.1, although they cannot predict unseen slots.
By posing DST as generative QA, our framework can leverage language models pre-trained on open-domain documents (Radford et al., 2019) to understand unfamiliar queries. Like prior question-based approaches, we seek to answer natural language questions about each slot. We contrast our approach to zero-shot DST, which never has access to slots or dialog from the target domain, with that of Campagna et al. (2020), who expose their 'zero-shot' models to synthetic in-domain conversations that require access to the full ontology of the 'held-out' evaluation domain.
We take inspiration from previous work that frames a wide selection of natural language understanding (NLU) tasks as QA (McCann et al., 2018) and span extraction (Keskar et al., 2019). While question answering can be posed as a span extraction task (Wang et al., 2016), generative approaches have proven successful in answering questions about complex passages (Fan et al., 2019). We use a language modeling approach, taking cues from Raffel et al. (2020), who demonstrate that a large language model trained on next-token prediction can learn to solve many different NLU tasks posed as text. Recent work has also shown that large pre-trained language models can generalize to new NLU tasks with few or no examples (Brown et al., 2020), and we leverage this, alongside world knowledge acquired during pre-training (Petroni et al., 2019), to build a DST model that is robust to new domains and slot-value ontologies.

Conclusion
This paper proposes a conditional language modeling approach to multi-domain DST posed as a generative question answering task. By leveraging natural language questions as state queries, our model can generalize to unseen domains, slots, and values via its understanding of language. Our model achieves state-of-the-art zero-shot results on the MultiWOZ 2.1 dataset with average per-domain absolute improvements of 5.9% joint accuracy. We also demonstrate that our framework is easily extensible to support transfer learning and learning slot carry-over. In the future, it is worth exploring mechanisms for our model to better understand relative temporal values and general reading comprehension questions from conversations in order to disambiguate semantically similar dialog slots.