Dialogue State Tracking with a Language Model using Schema-Driven Prompting

Task-oriented conversational systems often use dialogue state tracking to represent the user’s intentions, which involves filling in values of pre-defined slots. Many approaches have been proposed, often using task-specific architectures with special-purpose classifiers. Recently, good results have been obtained using more general architectures based on pretrained language models. Here, we introduce a new variation of the language modeling approach that uses schema-driven prompting to provide task-aware history encoding that is used for both categorical and non-categorical slots. We further improve performance by augmenting the prompting with schema descriptions, a naturally occurring source of in-domain knowledge. Our purely generative system achieves state-of-the-art performance on MultiWOZ 2.2 and achieves competitive performance on two other benchmarks: MultiWOZ 2.1 and M2M. The data and code will be available at https://github.com/chiahsuan156/DST-as-Prompting.


Introduction
In task-oriented dialogues, systems communicate with users through natural language to accomplish a wide range of tasks, such as food ordering, tech support, and restaurant/hotel/travel booking. The backbone module of a typical system is dialogue state tracking (DST), where the user goal is inferred from the dialogue history (Henderson et al., 2014; Shah et al., 2018; Budzianowski et al., 2018). User goals are represented in terms of values of pre-defined slots associated with a schema determined by the information needed to execute task-specific queries to the backend. In other words, user goals are extracted progressively via slot filling based on the schema throughout the conversation. In this paper, we focus on multi-domain DST where the dialogue state is encoded as a list of triplets in the form of (domain, slot, value), e.g. ("restaurant", "area", "centre").
There are two broad paradigms of DST models, classification-based and generation-based, where the major difference is how the slot value is inferred. In classification-based models (Ye et al., 2021), the prediction of a slot value is restricted to a fixed set for each slot, and non-categorical slots are constrained to values observed in the training data. In contrast, generation-based models decode slot values sequentially (token by token) based on the dialogue context, with the potential of recovering unseen values. Recently, generation-based DST models built on large-scale pretrained neural language models (LMs) have achieved strong results without relying on domain-specific modules. Among them, the autoregressive model (Peng et al., 2020a; Hosseini-Asl et al., 2020) uses a uni-directional encoder, whereas the sequence-to-sequence model (Lin et al., 2020a; Heck et al., 2020) represents the dialogue context using a bi-directional encoder.
In this study, we follow a generation-based DST approach using a pretrained sequence-to-sequence model, but with the new strategy of adding task-specific prompts as input for sequence-to-sequence DST models, inspired by prompt-based fine-tuning (Radford et al., 2019; Brown et al., 2020a). Specifically, instead of generating domain and slot symbols in the decoder, we concatenate the dialogue context with domain and slot prompts as input to the encoder, where prompts are taken directly from the schema. We hypothesize that jointly encoding dialogue context and schema-specific textual information can further benefit a sequence-to-sequence DST model. This allows task-aware contextualization that guides the decoder more effectively when generating slot values.
Although the domain and slot names typically have interpretable components, they often do not reflect standard written English, e.g. "arriveby" and "ref". Those custom meaning representations are typically abbreviated and/or under-specified, which creates a barrier for effectively utilizing the pretrained LMs. To address this issue, we further incorporate natural language schema descriptions into prompting for DST, which include useful information to guide the decoder. For example, the description of "ref" is "reference number of the hotel booking"; the values of "has_internet" are "yes", "no", "free", and "don't care".
In short, this work advances generation-based DST in two ways. First, candidate schema labels are jointly encoded with the dialogue context, providing a task-aware contextualization for initializing the decoder. Second, natural language descriptions of schema categories associated with database documentation are incorporated in encoding as prompts to the language model, allowing uniform handling of categorical and non-categorical slots. When implemented using a strong pretrained text-to-text model, this simple approach achieves state-of-the-art (SOTA) results on MultiWOZ 2.2, and performance is on par with SOTA on MultiWOZ 2.1 and M2M. In addition, our analyses provide empirical results that contribute towards understanding how schema description augmentation can effectively constrain the model prediction.

Multi-Domain Dialogue State Tracking
Task-oriented dialogue datasets (Shah et al., 2018; Henderson et al., 2014) have spurred the development of dialogue systems (Zhong et al., 2018; Chao and Lane, 2019). Recently, to further examine generalization abilities, large-scale cross-domain datasets have been proposed (Budzianowski et al., 2018; Zang et al., 2020; Eric et al., 2019; Rastogi et al., 2020b). Classification-based models (Ye et al., 2021) pick the candidate from the oracle list of possible slot values. The assumption of full access to the schema limits their generalization ability. On the other hand, generation-based models (Lin et al., 2020a) directly generate slot values token by token, making it possible to handle unseen domains and values. Most of these models require task-specific modular designs.
Recently, generation-based models that are built on large-scale autoregressive pretrained language models (Ham et al., 2020; Hosseini-Asl et al., 2020; Peng et al., 2020a) achieve promising state tracking results on MultiWOZ 2.0 and 2.1 when trained on additional supervision signals or dialogue corpora. Both Ham et al. (2020) and Hosseini-Asl et al. (2020) require dialogue acts as inputs. Both Hosseini-Asl et al. (2020) and Peng et al. (2020a) require DB search results as inputs. Peng et al. (2020a) also leverage other dialogue corpora to finetune the language model. Our work requires only the dialogue state labels and does not utilize any external dialogue datasets.

Language Models
Large-scale pretrained language models have obtained state-of-the-art performance on diverse generation and understanding tasks. They include bidirectional encoder-style language models (Devlin et al., 2019; Liu et al., 2019), auto-regressive language models (Radford et al., 2019; Brown et al., 2020b), and more flexible sequence-to-sequence language models (Raffel et al., 2020). To adapt to dialogue tasks, variants of these systems are finetuned on different dialogue corpora, including chit-chat systems (Adiwardana et al., 2020; Roller et al., 2020) and task-oriented dialogue systems (Mehri et al., 2019; Wu et al., 2020; Henderson et al., 2020; Peng et al., 2020b). We leave it as future work to leverage domain-adapted language models.

Prompting Language Models
Extending a language model's knowledge via prompts is an active line of research. Radford et al. (2019) obtain empirical success by using prompts to guide zero-shot generation without finetuning on any prompts. Raffel et al. (2020) use task-specific prompts in both the finetuning and testing phases. Recent studies have also tried to discover prompts automatically rather than having humans write them (Jiang et al., 2020). Our proposed prompting method is largely inspired by this body of work. Instead of prompt engineering/generation, we focus on using available natural language descriptions of schema categories associated with database documentation as task-specific prompts for DST.

Prompting Language Model for Dialogue State Tracking
In this section, we first set up the notation used throughout the paper, and then review generative DST with the sequence-to-sequence framework. Based on that, we formally introduce our prompt-based DST model and the corresponding backbone pretrained model.

Notation.
For task-oriented dialogues considered in this paper, a dialogue consists of a sequence of utterances alternating between two parties, $U_1, A_1, \ldots, U_T, A_T$, where $U$ and $A$ represent the user utterance and the system response, respectively. In a turn $t$, the user provides a new utterance $U_t$ and the system agent responds with utterance $A_t$. As shown at the bottom of Figure 1, at turn $t$, we denote the dialogue context as $C_t = \{U_1, A_1, \ldots, A_{t-1}, U_t\}$, which excludes the latest system response $A_t$. In this work, we assume a multi-domain scenario, in which case the schema contains $M$ domains $\mathcal{D} = \{d_1, \ldots, d_M\}$ and $N$ slots $\mathcal{S} = \{s_1, \ldots, s_N\}$ to track, as illustrated in Figure 1. $B_t$, the dialogue state at turn $t$, is then defined as a mapping from a pair $(d_m, s_n)$ into values $v$. Here, we define $B_t(d_m, s_n) = \phi$ if $(d_m, s_n)$ is not in the current dialogue state. In the given example of Figure 1, the pair (domain=hotel, slot=ref) is not in the dialogue state, and the value "none" is assigned.
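The notation above can be made concrete with a minimal Python sketch. It is illustrative only: the function names, the 1-indexed turn convention, the example utterances, and the slot name "bookpeople" are assumptions introduced here and are not taken from the released code.

```python
from typing import Dict, List, Tuple

NONE_VALUE = "none"  # value assigned to (domain, slot) pairs absent from the state

State = Dict[Tuple[str, str], str]


def dialogue_context(user_utts: List[str], system_utts: List[str], t: int) -> List[Tuple[str, str]]:
    """Return the context at turn t (1-indexed): U_1, A_1, ..., A_{t-1}, U_t."""
    context = []
    for i in range(t):
        context.append(("user", user_utts[i]))
        if i < t - 1:  # the latest system response A_t is excluded
            context.append(("system", system_utts[i]))
    return context


def lookup(state: State, domain: str, slot: str) -> str:
    """Dialogue state as a mapping; pairs not in the state map to "none"."""
    return state.get((domain, slot), NONE_VALUE)


# Hypothetical two-turn dialogue and its state at turn 2.
users = ["I need a place to eat in the centre.", "Book a table for two, please."]
systems = ["Sure, what kind of food?", "Your table is booked."]
state_t2 = {("restaurant", "area"): "centre", ("restaurant", "bookpeople"): "2"}

context_t2 = dialogue_context(users, systems, 2)
```

Here `lookup(state_t2, "hotel", "ref")` returns "none", mirroring the Figure 1 example in which the pair (domain=hotel, slot=ref) is not in the dialogue state.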

Generation-based DST with the Sequence-to-sequence Model
There are primarily two decoding strategies for generation-based DST in the literature for inferring the dialogue state at a particular turn: sequential (a) and independent (b)(c). Both are explored in this paper, as illustrated in Figure 1.
In the first case (top system (a) in Figure 1), the dialogue history is taken as input to the encoder, and domain-slot-value triplets $(d_m, s_n, v)$ are generated sequentially, where $B_t(d_m, s_n) \neq \phi$, i.e., only pairs present in the current dialogue state are generated. This approach is adopted in many systems that leverage autoregressive LMs (Peng et al., 2020a; Hosseini-Asl et al., 2020). Despite being simple, this kind of sequential generation of multiple values is more likely to suffer from optimization issues when decoding long sequences, resulting in lower performance. However, given its wide adoption in the literature, we still include this type of generative DST with the same backbone pretrained encoder-decoder Transformer model in our experiments. To partially address this issue, Lin et al. (2020b) propose domain-independent decoding, where the decoder only has to generate a sequence of slot-value pairs within a specific given domain. Although their model leverages the same backbone model as ours, we empirically find that this strategy is still of limited effectiveness.
In the second case (middle two systems (b)(c) in Figure 1), the values for each domain-slot pair are generated independently, potentially in parallel. The domain and slot names (embedded as continuous representations) are either the initial hidden state of the decoder or the first input of the decoder. Values are generated for all possible domain-slot pairs $(d_m, s_n)$ with a possible value of "none", and/or a separate gating mechanism handles domain-slot combinations that are not currently active. Since we are interested in enriching the input with task-specific information, we focus on extending independent decoding for our prompt-based DST.
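The difference between the two decoding strategies is easiest to see in the shape of the generation targets. The following Python sketch contrasts them under stated assumptions: the " ; " separator used to serialize triplets and the function names are hypothetical choices made for illustration, not the format used in the experiments.

```python
from typing import Dict, List, Tuple

State = Dict[Tuple[str, str], str]


def sequential_target(state: State) -> str:
    """Sequential decoding: one long output listing every active triplet."""
    return " ; ".join(f"{d} {s} {v}" for (d, s), v in state.items())


def independent_targets(state: State, schema: List[Tuple[str, str]]) -> List[Tuple[Tuple[str, str], str]]:
    """Independent decoding: one short target per schema pair, "none" if inactive."""
    return [((d, s), state.get((d, s), "none")) for d, s in schema]


# Hypothetical schema and dialogue state.
schema = [("hotel", "ref"), ("train", "day"), ("restaurant", "area")]
state = {("train", "day"): "friday", ("restaurant", "area"): "centre"}

seq = sequential_target(state)
ind = independent_targets(state, schema)
```

With sequential decoding, the target length grows with the number of active slots. With independent decoding, every target is a single value, at the cost of one generation per domain-slot pair.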

Prompt-based DST
In this section, we formally present the flow of our prompt-based DST with an encoder-decoder architecture. Here, we are interested in an encoder-decoder model with a bi-directional encoder (Raffel et al., 2020; Lewis et al., 2020), in contrast with the uni-directional encoder used in autoregressive LMs (Radford et al., 2019; Brown et al., 2020a).
The input of the prompt-based DST is made up of a dialogue context and a task-specific prompt. Here, we use two types of task-specific prompts, the domain-related prompt $M(d_m)$ and the slot-related prompt $M(s_n)$, both of which are derived from the given schema. We leave the discussion of two specific realizations of task-specific prompts to the later part of this section. Specifically, all sub-sequences are concatenated with special segment tokens, i.e., "[user] $U_1$ [system] $A_1$ ... [system] $A_{t-1}$ [user] $U_t$ [domain] $M(d_m)$ [slot] $M(s_n)$", where [user], [system], [domain], and [slot] are special segment tokens indicating the start of a specific user utterance, system utterance, domain-related prompt, and slot-related prompt, respectively.
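As a rough illustration of this concatenation, the Python sketch below serializes a dialogue context and the two prompts into a single encoder input string. The function name, the whitespace joining, and the example utterances are assumptions made here. The `[system]` token name is likewise assumed by analogy with `[user]`, `[domain]`, and `[slot]`.

```python
from typing import List, Tuple


def build_input(context: List[Tuple[str, str]], domain_prompt: str, slot_prompt: str) -> str:
    """Concatenate context turns and prompts, each introduced by a segment token."""
    parts = [f"[{speaker}] {utterance}" for speaker, utterance in context]
    parts.append(f"[domain] {domain_prompt}")
    parts.append(f"[slot] {slot_prompt}")
    return " ".join(parts)


# Hypothetical context at turn 2; speakers are "user" or "system".
context = [
    ("user", "I need a train on Friday."),
    ("system", "Where are you departing from?"),
    ("user", "From Cambridge."),
]

encoder_input = build_input(context, "train", "day")
```

One such input is built for every domain-slot pair in the schema, so the same context is encoded once per pair with a different prompt suffix.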
Given this prompt-augmented input, consisting of the dialogue context $C_t$, the domain-related prompt $M(d_m)$, and the slot-related prompt $M(s_n)$, the bi-directional encoder then outputs
$$H_t = \mathrm{Encoder}(C_t, M(d_m), M(s_n)), \tag{1}$$
where $H_t \in \mathbb{R}^{L \times k}$ is the hidden states of the encoder, $L$ is the input sequence length, and $k$ is the encoder hidden size. Then, the decoder attends to the encoder hidden states and decodes the corresponding slot value $B_t(d_m, s_n)$:
$$B_t(d_m, s_n) = \mathrm{Decoder}(H_t). \tag{2}$$
The overall learning objective of this generation process is maximizing the log-likelihood of $B_t(d_m, s_n)$ given $C_t$, $M(d_m)$ and $M(s_n)$, that is
$$\sum_{(m, n)} \log P\big(B_t(d_m, s_n) \mid C_t, M(d_m), M(s_n)\big).$$
During inference, a greedy decoding procedure is directly applied, i.e., only the most likely token in the given model vocabulary is predicted at each decoding step.

Schema-Based Prompt.
The first realization of the task-specific prompt considered in this paper is based on the domain and slot names as defined in the task-dependent schema. As shown in (b) of Figure 1, given the domain name train and the slot name day, the specific prompt is in the form of "[domain] train [slot] day". Different from Lin et al. (2020a), where the task-specific information is used on the decoder side, our symbol-based prompt, as additional input to the bi-directional encoder, can potentially achieve task-aware contextualization. Observing that users often revise/repair their earlier requests in dialogues, we posit that the resulting encoded representations can be more effectively used by the decoder for generating corresponding slot values.

Natural Language Augmented Prompt.
One main drawback of the symbol-based prompt is that those domain/slot names contain limited information that can be utilized by pretrained LMs. In other words, those symbols from the custom schema are typically under-specified and unlikely to appear in the corpora used for LM pretraining. Fortunately, documentation is commonly available for real-world databases, and it is a rich resource for domain knowledge that allows dialogue systems to better understand the meanings of the abbreviated domain and slot names.
The documentation includes but is not limited to domain/slot descriptions and the list of possible values for categorical slots. In this work, we experiment with a simple approach that augments the input by incorporating the domain description after the domain name and the slot description (with the sequence of values, if any) following the slot name, as illustrated in the system (c) in Figure 1.
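A minimal sketch of this augmentation is given below. The separator strings (": " and " possible values: "), the function names, and the description text for "has_internet" are hypothetical. The "ref" description and the "has_internet" value list are the examples quoted in the introduction.

```python
from typing import Optional, Sequence


def augmented_domain_prompt(name: str, description: str) -> str:
    """Domain name followed by its natural language description."""
    return f"{name}: {description}"


def augmented_slot_prompt(name: str, description: str,
                          possible_values: Optional[Sequence[str]] = None) -> str:
    """Slot name followed by its description and, for categorical slots, its values."""
    prompt = f"{name}: {description}"
    if possible_values:  # non-categorical slots have no value list
        prompt += " possible values: " + ", ".join(possible_values)
    return prompt


ref_prompt = augmented_slot_prompt("ref", "reference number of the hotel booking")
internet_prompt = augmented_slot_prompt(
    "has_internet",
    "whether the hotel has internet",  # hypothetical description text
    ["yes", "no", "free", "don't care"],
)
```

The resulting strings take the place of the bare domain and slot names after the [domain] and [slot] segment tokens, so categorical and non-categorical slots share one input format.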

Backbone Sequence-to-sequence Model
Our prompt-based DST model is initialized with weights from a pretrained LM in an encoder-decoder fashion. In this paper, we use the Text-to-Text Transformer (T5) (Raffel et al., 2020) as our backbone model. T5 is an encoder-decoder Transformer with relative position encodings (Shaw et al., 2018). We refer interested readers to the original paper for more details.
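Independent of the choice of backbone, the greedy decoding procedure used at inference can be sketched as follows. This is a toy illustration: `toy_scores` is a hypothetical stand-in for the decoder's next-token distribution, and the end-of-sequence marker and length limit are assumed values, not settings from the experiments.

```python
from typing import Callable, Dict, List


def greedy_decode(next_token_scores: Callable[[List[str]], Dict[str, float]],
                  eos: str = "</s>", max_len: int = 16) -> List[str]:
    """At each step keep only the single most likely token; stop at end-of-sequence."""
    output: List[str] = []
    for _ in range(max_len):
        scores = next_token_scores(output)
        token = max(scores, key=scores.get)  # most likely token at this step
        if token == eos:
            break
        output.append(token)
    return output


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a decoder conditioned on the encoder states."""
    table = {
        0: {"fri": 0.1, "friday": 0.7, "</s>": 0.2},
        1: {"friday": 0.1, "</s>": 0.9},
    }
    return table[min(len(prefix), 1)]


decoded = greedy_decode(toy_scores)
```

Because each slot value is decoded separately and is typically only a few tokens long, greedy search keeps inference simple. Other search strategies are not considered in this sketch.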

MultiWOZ 2.2: Fully Annotated Natural Language Augmented Prompt
We present the evaluation results on MultiWOZ 2.2 in Table 2. The following baseline models are considered: TRADE, DS-DST (Zhang et al., 2019), and Seq2Seq-DU (Feng et al., 2020). Similar to ours, the decoding strategy of TRADE is independent. However, the sum of the domain and slot embeddings is the first input of the decoder, so its dialogue history representation is not task-aware contextualized. The sequential decoding strategy is worse than the independent decoding strategy by over 5% in joint goal accuracy (JGA) with both T5-small and T5-base. Even with T5-small (almost half the model size of BERT-base, which is used in most previous benchmark models), our system achieves SOTA performance using independent decoding. As expected, T5-base systems outperform T5-small systems. With the augmentation of descriptions, we improve the overall JGA by over 1% with both T5-small and T5-base.
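For reference, joint goal accuracy counts a turn as correct only when the predicted dialogue state matches the gold state on every slot. The Python sketch below shows that computation under the assumption that each state is stored as a dictionary of active (domain, slot) pairs. The function name and example states are illustrative and are not taken from the official evaluation scripts.

```python
from typing import Dict, List, Tuple

State = Dict[Tuple[str, str], str]


def joint_goal_accuracy(predictions: List[State], references: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return correct / len(references)


# Hypothetical two-turn example: the second turn has one wrong slot value.
golds = [
    {("restaurant", "area"): "centre"},
    {("restaurant", "area"): "centre", ("train", "day"): "friday"},
]
preds = [
    {("restaurant", "area"): "centre"},
    {("restaurant", "area"): "centre", ("train", "day"): "monday"},
]

jga = joint_goal_accuracy(preds, golds)
```

A single wrong or missing slot value makes the whole turn count as incorrect, which is why JGA is a strict metric.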

MultiWOZ 2.1: Partially Annotated Natural Language Augmented Prompt
Different from MultiWOZ 2.2 studied in the previous section, MultiWOZ 2.1 only contains natural language descriptions for slots but not domains. In addition, there is no possible slot value information.
The evaluation results on MultiWOZ 2.1 are shown in Table 3, where we compare with TRADE, MinTL (Lin et al., 2020a), SST, TripPy (Heck et al., 2020), Simple-TOD (Hosseini-Asl et al., 2020), SOLOIST (Peng et al., 2020a), and TripPy+SCORE. Note that both SOLOIST and TripPy+SCORE use external dialogue datasets to finetune their models. As expected, we observe that T5-base models perform consistently better than T5-small models. Moreover, using descriptions consistently improves the performance of both models. All our models outperform baselines that do not use extra dialogue data. It is worth noting that, compared with MinTL (T5-small), our model is better by over 4% even without descriptions. Further, our T5-small system is even better than MinTL built on BART-LARGE (Lewis et al., 2020), which has substantially more parameters. Similar to ours, MinTL leverages a sequence-to-sequence LM. One difference is that their domain information is fed only to the decoder, while our approach enables task-aware contextualization by prompting the LM with domain and slot information on the encoder side. Another difference is that they jointly learn DST together with dialogue response generation, which provides more supervision signals. Therefore, the better performance of our systems implies that schema-driven prompting is effective.
Lastly, compared with MultiWOZ 2.2, the performance gain brought by augmenting natural language descriptions is less pronounced, which is likely caused by the reduced information available in MultiWOZ 2.1 descriptions.

M2M: Borrowed Natural Language Augmented Prompt
... notated in a different manner. We achieve SOTA performance on Sim-R and Sim-M+R while being comparable on Sim-M. The improvements from descriptions are only evident in the restaurant domain. The lack of improvement from slot descriptions for the movie domain may be because the slot descriptions do not add much beyond the slot name (compared to "category" for the restaurant domain) or because it has slots that generalize better across domains (e.g. date, time, number of people).
Table 4: Results on M2M. All numbers are reported in joint goal accuracy (JGA) (%). Rastogi et al. (2017) should be considered a kind of oracle upper bound because the target slot value is guaranteed to be in the candidate list considered by the model.

Breakdown Evaluation for MultiWOZ
In Table 5, we follow the categorization provided in Zang et al. (2020) and show the breakdown evaluation of categorical and non-categorical slots on MultiWOZ 2.2. The breakdown accuracy scores for both categorical and non-categorical slots are consistent with the overall JGA. For both T5-small and T5-base, models with sequential decoding perform worse than the corresponding models with independent decoding on both categorical and non-categorical slots. In particular, the independent decoding models achieve a more pronounced improvement on categorical slots, indicating that the task-specific prompt is very helpful for guiding the decoder to predict valid values. When comparing models with and without natural language descriptions, we observe performance gains for both types of slots with T5-base but only for non-categorical slots with T5-small. It is likely that the smaller T5 has limited representation capability to effectively utilize the additional textual description information regarding types and possible values.
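One way to compute such a breakdown is to restrict both the predicted and the gold state to one slot type before requiring an exact match. The sketch below follows that reading, which is an assumption and not necessarily the exact protocol behind Table 5. The slot groupings, function names, and example values are illustrative.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]
State = Dict[Pair, str]


def restricted_accuracy(predictions: List[State], references: List[State],
                        slots: Set[Pair]) -> float:
    """Joint accuracy computed only over the given subset of (domain, slot) pairs."""
    def restrict(state: State) -> State:
        return {pair: value for pair, value in state.items() if pair in slots}

    matches = [restrict(p) == restrict(g) for p, g in zip(predictions, references)]
    return sum(matches) / len(matches) if matches else 0.0


# Hypothetical grouping of slots by type.
categorical = {("hotel", "has_internet")}
non_categorical = {("train", "arriveby")}

golds = [
    {("hotel", "has_internet"): "yes", ("train", "arriveby"): "16:45"},
    {("hotel", "has_internet"): "free"},
]
preds = [
    {("hotel", "has_internet"): "yes", ("train", "arriveby"): "4:45"},
    {("hotel", "has_internet"): "free"},
]

cat_acc = restricted_accuracy(preds, golds, categorical)
noncat_acc = restricted_accuracy(preds, golds, non_categorical)
```

In this toy example the only mistake is on a non-categorical slot, so the categorical score stays at 1.0 while the non-categorical score drops to 0.5.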

Ablation Study on Schema Descriptions
To understand which parts of the schema descriptions are most important, we experiment with three kinds of description combinations on MultiWOZ 2.2 using the T5-small configuration: (i) excluding the list of possible values for categorical slots, (ii) excluding slot descriptions, and (iii) excluding domain descriptions. For (i), there is a 0.4 percentage point drop in JGA, validating that value sets can successfully constrain the model output, as we illustrate in Table 6. For (ii), there is a 0.8 percentage point drop in JGA. For (iii), there is a 0.1 percentage point drop in JGA. This shows that slot descriptions are the most important part of the schema prompts and that domain descriptions are relatively less effective. This is probably because there are 61 slots in MultiWOZ 2.2 but only 8 domains. Also, the domain names are all self-contained single words.
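The three ablation settings can be viewed as switches on the prompt construction. The Python sketch below is a hypothetical configuration layer written for illustration. The flag names, the prompt layout, the description texts, and the assumption that the three components are toggled independently are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PromptConfig:
    use_domain_description: bool = True
    use_slot_description: bool = True
    use_possible_values: bool = True


def build_prompt(domain: str, domain_desc: str, slot: str, slot_desc: str,
                 values: Sequence[str], config: PromptConfig) -> str:
    """Assemble the schema prompt, dropping the components disabled in config."""
    domain_part = f"[domain] {domain}"
    if config.use_domain_description:
        domain_part += f": {domain_desc}"
    slot_part = f"[slot] {slot}"
    if config.use_slot_description:
        slot_part += f": {slot_desc}"
    if config.use_possible_values and values:
        slot_part += " possible values: " + ", ".join(values)
    return f"{domain_part} {slot_part}"


# The full prompt plus the three ablations (i), (ii), (iii) from the text.
ABLATIONS = {
    "full": PromptConfig(),
    "no_values": PromptConfig(use_possible_values=False),
    "no_slot_desc": PromptConfig(use_slot_description=False),
    "no_domain_desc": PromptConfig(use_domain_description=False),
}

# Hypothetical description texts for one categorical slot.
example = ("hotel", "hotel reservations", "has_internet",
           "whether the hotel has internet", ["yes", "no"])
prompts = {name: build_prompt(*example, cfg) for name, cfg in ABLATIONS.items()}
```

Keeping the switches in one configuration object makes it straightforward to train one model per setting with otherwise identical inputs.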

The Effectiveness of Natural Language Augmented Prompt
In order to understand the benefit of natural [...] the dialogue history are partially matched to the slot description of arriveby, making it easier for the description-augmented system to detect the mention of the correct slot. For the second example, the type information in the description implicitly guides the model to focus on time-related information, leading to the correct output of the normalized time expression, 16:45. In contrast, the model without descriptions only generates the partial answer 4:45, ignoring PM. Lastly, "London Kings Street" is a typographical error in this case. By utilizing the possible values included in the slot descriptions, the model is able to generate the correct slot value without the spelling error, demonstrating that the natural language augmented prompt can successfully constrain the model output and potentially provide robustness to the dialogue state tracking system.

Error Analysis of Natural Language Augmented Prompt-based DST
Here, we further carry out an error analysis of the natural language augmented prompt-based T5-base model on MultiWOZ 2.2. As shown in Table 7, we randomly sample 50 turns and categorize them into different types. In summary, there are four types of errors: (i) The most common error type is annotation error, in which the model prediction is actually correct, similar to the findings of Zhou and Small (2019). (ii) 20% of the errors come from the model failing to capture information provided by the system. (iii) 16.66% of the errors are caused by the model missing at least one gold slot. (iv) 10% of the errors are correct slot predictions with the wrong corresponding values. In general, most errors are likely caused by the lack of explicit modeling of user-system interactions.

Conclusion
In this work, we propose a simple but effective task-oriented dialogue system based on a large-scale pretrained LM. We show that, by reformulating the dialogue state tracking task as prompting knowledge from the LM, our model can benefit from the knowledge-rich sequence-to-sequence T5 model. Based on our experiments, the proposed natural language augmented prompt-based DST model achieves SOTA on MultiWOZ 2.2 and performance comparable to recent SOTA models on MultiWOZ 2.1 and M2M. Moreover, our analyses provide evidence that the natural language prompt is effectively utilized to constrain the model prediction.
... (Rastogi et al., 2020a), and the descriptions of the restaurant domain are taken from Zang et al. (2020).