Knowledge-grounded Dialog State Tracking

Knowledge (including structured knowledge such as schema and ontology, and unstructured knowledge such as web corpus) is a critical part of dialog understanding, especially for unseen tasks and domains. Traditionally, such domain-specific knowledge is encoded implicitly into model parameters for the execution of downstream tasks, which makes training inefficient. In addition, such models are not easily transferable to new tasks with different schemas. In this work, we propose to perform dialog state tracking grounded on knowledge encoded externally. We query relevant knowledge of various forms based on the dialog context where such information can ground the prediction of dialog states. We demonstrate superior performance of our proposed method over strong baselines, especially in the few-shot learning setting.


Introduction
Pre-trained language models (LMs; Radford et al., 2019; Raffel et al., 2020) are the backbone of contemporary task-oriented dialog (TOD) models (Peng et al., 2020; Yang et al., 2021). However, these models are pre-trained on large generic corpora and therefore do not contain task-specific knowledge. Previous work primarily suggests further pre-training or fine-tuning the LMs on in-domain data for adaptation (Wu et al., 2020; Hosseini-Asl et al., 2020), but this cannot capture information above the surface level. This makes downstream tasks challenging, especially in the few-shot learning setting, because mapping representations to the output space and encoding knowledge into the model parameters are entangled, and the latter may require more training data. More recent research proposes to incorporate external knowledge for response generation tasks (Dinan et al., 2019; Shuster et al., 2022; Chen et al., 2022; Komeili et al., 2022), but it is not clear how to utilize such information for language understanding.
Figure 1: Model architecture for our proposed knowledge-grounded DST. The encoder first encodes the query and knowledge into representations, and we find the top-k knowledge elements most relevant to the context in step 2. We flatten the retrieved elements in step 3 and append them to the query context as the input to the encoder-decoder model. The retrieved elements serve as a prior for DST.
In TOD settings, because the API call structure is restricted to certain intents, slots, and values, the schema is often provided. For example, in a flight booking system, queries like departure location and airlines are pre-defined. Although users are not directly restricted in what they can say to agents, their vocabulary is limited and to some extent predictable. If the schema information is utilized, a model does not need to learn from the LM that "San Francisco" represents a departure place rather than a general city name. This is particularly important for new information, such as movie titles or locations that do not appear in the LM training corpus. Just as for human annotators, grounding a dialog model on such knowledge makes understanding conversations easier and more accurate.
In this paper, we investigate knowledge-grounded understanding for dialog state tracking (DST). In addition to using structured knowledge such as the ontology of slot type-value pairs, we also consider unstructured knowledge from the raw training data. We train a TOD model to query relevant knowledge for each turn in the context, and leverage the retrieved knowledge to predict the dialog state. We evaluate our method on MultiWOZ (Budzianowski et al., 2018) for both the full-data and few-shot settings, and show superior performance compared to previous methods.
Related Work

Knowledge grounding
To relax the requirement of encoding knowledge of the whole world into model parameters, one direction is to disentangle knowledge representation from LMs. Most of these methods are applied to knowledge-intensive text generation tasks such as open-domain question answering (Lee et al., 2019; Karpukhin et al., 2020; Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2021), and response generation with factual information (Dinan et al., 2019; Komeili et al., 2022; Thoppilan et al., 2022; Kim et al., 2020; Thulke et al., 2021; Chen et al., 2022). Similarly, some work also considers retrieving information to serve as a reference to refine the model generation process (Weston et al., 2018; Gonzalez et al., 2019; Khandelwal et al., 2021; Zhang et al., 2021). Different from these approaches, our method focuses on learning and utilizing available domain-relevant knowledge for language understanding tasks. Moreover, we propose to leverage knowledge of various formats.

Knowledge guided dialog understanding
Encoding domain schema into model parameters (Hosseini-Asl et al., 2020; Madotto et al., 2020) may not be efficient for unseen domains and tasks where the ontology can be different. One line of research (Ren et al., 2018; Wu et al., 2019; Zhou and Small, 2019; Rastogi et al., 2020; Du et al., 2021; Lee et al., 2021) leverages question-answering techniques to predict values for each slot, or prepends all slot-value information to the context (Zhao et al., 2022). However, this approach does not scale when the number of slot-value pairs is large, especially in multi-domain TOD systems. In addition, probably due to blurry attention over long context (Fan et al., 2021), Lee et al. (2021) find that adding potential slot values does not improve model performance. In contrast, retrieving only the relevant schema addresses the scalability problem by limiting the knowledge to a fixed length. Alternatively, instead of structured schema knowledge, recent research proposes to use handcrafted demonstrations as prompts (Gupta et al., 2022) or to find similar examples to guide understanding tasks such as conversational semantic parsing (Yu et al., 2021; Pasupat et al., 2021; Yao et al., 2021). However, one turn can contain multiple dialog states, so the examples retrieved by previous methods may not provide sufficient evidence. Furthermore, our method can be applied to unify different forms of knowledge, both structured and unstructured.

Methodology
Our proposed method is illustrated in Figure 1. Given the context $x$, we first retrieve $k$ relevant knowledge entries $e$ by the similarity between Enc($x$) and Enc($e$) using an encoder Enc. Then we integrate the retrieved entries $e_1, e_2, \ldots, e_k$ with the original context to form $x'$, which is used as the input for the target DST task.
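The retrieval step described above can be sketched as a dot-product ranking over pre-computed representations. This is a minimal illustration, not the authors' implementation: the function names `dot` and `retrieve_top_k` and the toy two-dimensional vectors are assumptions, and in the paper the vectors would come from the shared encoder Enc.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    entry_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (entry index, score) for the k highest-scoring entries.

    The result is ordered from most similar (e_1) to least similar (e_k).
    """
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(entry_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for Enc(x) and Enc(e_i) (hypothetical values).
query = [1.0, 0.0]
entries = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
top_two = retrieve_top_k(query, entries, k=2)
```

With a large knowledge base, an approximate nearest-neighbor index would likely replace the exhaustive scan shown here.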
Knowledge retrieval. Different from previous work (such as question answering) where there is only one ground-truth knowledge entry for each query, multiple entries in the form of slot-value pairs may exist in the ontology base that match the conversation context. Importantly, unlike passage retrieval, where the query (e.g., a sentence) and the target (e.g., another sentence or passage) are similar to the pre-training corpus, structured knowledge such as schema pairs may have a different representation distribution. Thus, an off-the-shelf encoder may retrieve noisy elements and degrade final performance, especially when the target task is optimized for DST generation. Moreover, nonparametric retrieval methods such as TF-IDF and BM25 (Robertson and Zaragoza, 2009) rely on lexical overlap, which can be detrimental when entries in schemas share many words (e.g., the same value for different slots).
We therefore train our knowledge retriever to promote similar representations between a query and its ground-truth knowledge. We started with optimizing the marginal likelihood over all positive knowledge entries, but found in our preliminary studies that it resulted in a peaky distribution centered around specific elements. Instead, we minimize binary cross-entropy with contrastive loss over the retrieval scores sim($x$, $e_i$), where $y_i$ is 1 if $e_i$ appears in the target dialog state and 0 otherwise. In our model, we use the same encoder Enc for both the context and the knowledge, and Enc is also used for the target DST task. sim defines the retrieval score, computed as the dot product between the last-layer representations of the first token.
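For concreteness, one plausible form of such an objective is the standard binary cross-entropy over retrieval scores. This is a sketch under stated assumptions, not necessarily the exact loss used here: the sigmoid $\sigma$, the index set $\mathcal{E}$ of candidate entries, and the symbol $\mathcal{L}_{\text{ret}}$ are assumptions; $y_i$, sim, and Enc follow the definitions in the text.

```latex
% Assumed form: sigmoid-normalized scores and a sum over a candidate set E.
\mathcal{L}_{\text{ret}}
  = -\sum_{e_i \in \mathcal{E}}
    \Big[\, y_i \log \sigma\big(\mathrm{sim}(x, e_i)\big)
        + (1 - y_i) \log\big(1 - \sigma(\mathrm{sim}(x, e_i))\big) \Big],
\qquad
\mathrm{sim}(x, e_i) = \mathrm{Enc}(x)^{\top}\, \mathrm{Enc}(e_i)
```

Because each entry is scored independently, several entries can be positive for the same context, which matches the multi-label nature of slot-value retrieval described above.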
Knowledge integration. Once the most relevant knowledge elements are retrieved by the model, this extra information can serve as a strong inductive bias for the downstream, knowledge-sensitive tasks. One common approach for knowledge integration is fusion-in-decoder (Izacard and Grave, 2021). Although efficient, it has been shown that retrieved information is likely to be ignored by a pre-trained model (Shuster et al., 2022). Hence, we concatenate the retrieved knowledge with the context, $x' = e_k, e_{k-1}, \ldots, e_1, x$, where the entries are ordered from the least similar ($e_k$) to the most similar ($e_1$). The similarity can also be considered as the confidence that an element $e_i$ is relevant to the current context. We take $x'$ as the input to the DST task. Therefore, our method is unified for knowledge of any format, and a bounded number of elements can solve the memory constraint problem of previous research (Zhao et al., 2022).
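The ordering above places the most confident entry closest to the dialog context. A minimal sketch of that flattening step follows; the function name `build_grounded_input` and the `" ; "` separator are hypothetical, since the text does not specify how entries are delimited.

```python
from typing import List


def build_grounded_input(
    context: str,
    ranked_entries: List[str],
    sep: str = " ; ",  # assumed delimiter, not specified in the text
) -> str:
    """Form x' = e_k, ..., e_1, x from entries ranked most-similar first.

    ranked_entries[0] is e_1 (most similar), so the list is reversed to put
    the least similar entry first and e_1 immediately before the context.
    """
    least_to_most = list(reversed(ranked_entries))
    return sep.join(least_to_most + [context])


grounded = build_grounded_input(
    "user: I need a hotel with parking",
    ["hotel-parking: yes", "hotel-area: north"],
)
```

Because the number of retrieved entries is fixed at $k$, the length added to the context stays bounded regardless of the ontology size.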

Experiments and baselines
We conduct experiments on MultiWOZ 2.4 (Budzianowski et al., 2018; Ye et al., 2021) for DST in both full-shot and few-shot (1%) learning settings. For all experiments, we use T5-base and T5-XXL encoder-decoder models (Raffel et al., 2020) as the initial checkpoints. We use the publicly available T5 checkpoints for our experiments. T5-base has 250 million parameters, and T5-XXL has 11B parameters. We train all models on 64 (for T5-base) and 128 (for T5-XXL) TPU v3 chips (Jouppi et al., 2017). For fine-tuning, we set a learning rate of 1e-4 and a batch size of 32. We set the input and output sequence lengths to 1024 and 512 tokens, respectively. We train all models for 200k steps and report the performance on the test set from the checkpoints achieving the best results on the development set. When multitask training on both retrieval and DST, we assign a weight of 0.1 to the retrieval loss and a weight of 1 to the DST loss, since retrieval converges relatively faster. We also experimented with 0.01, 0.05, 0.5, and 1 for the retrieval loss weight, and found that 0.1 performs the best.
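The multitask weighting described above can be written as a single objective. The symbols $\mathcal{L}_{\text{DST}}$ and $\mathcal{L}_{\text{ret}}$ are names assumed here for the DST generation loss and the retrieval loss; only the weights come from the text.

```latex
% Weights 1 and 0.1 are from the text; the loss symbols are assumed names.
\mathcal{L} = 1 \cdot \mathcal{L}_{\text{DST}} + 0.1 \cdot \mathcal{L}_{\text{ret}}
```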
We compare with two baselines, seq2seq and D3ST. seq2seq takes the context as input and predicts a sequence of linearized dialog states for each turn. Similarly, D3ST (Zhao et al., 2022) adds descriptions of each slot with potential values as the prompt and predicts dialog states as multiple choice. Both baselines use the same T5 initial checkpoints. We report joint goal accuracy (JGA) averaged across three random seeds.
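As a reminder of how the metric behaves, JGA counts a turn as correct only when the full predicted state matches the reference. The sketch below assumes each state is a dictionary from slot type to value and that matching is exact; the function name and the toy states are illustrative, and real evaluation scripts typically add value normalization.

```python
from typing import Dict, List

DialogState = Dict[str, str]


def joint_goal_accuracy(
    predictions: List[DialogState],
    references: List[DialogState],
) -> float:
    """Fraction of turns whose predicted state equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)


preds = [
    {"hotel-parking": "yes"},
    {"hotel-parking": "yes", "hotel-area": "north"},
]
refs = [
    {"hotel-parking": "yes"},
    {"hotel-parking": "yes", "hotel-area": "south"},
]
jga = joint_goal_accuracy(preds, refs)
```

A single wrong slot value makes the whole turn incorrect, which is why the second turn above does not count.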
For our proposed method, we consider slot type (type), slot type and value (type+value), and training data (training) as knowledge sources. Specifically, for type, we consider all slot types (35 in total, such as "hotel-parking") as the knowledge base and retrieve the top ten elements. For type+value, we consider each combination of types and their values in the form of "type: value" (1858 in total, such as "hotel-parking: don't care") as the knowledge elements. Because there are more elements, we retrieve the top 30 in our experiments to achieve higher recall (with analysis later in Section 4.3). For training data, because of memory concerns, we randomly sample 500 training examples as the knowledge base, and we take the ground-truth training example to be the one with the highest F1 overlap in the dialog slot types. We only consider the top-1 example due to the length constraint. For each knowledge source, we train retrieval together with the DST generator using the same model parameters.
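The two constructions above can be sketched as follows: flattening an ontology into "type: value" entries, and scoring a candidate training example by F1 overlap between its slot types and those of the target. The function names and the toy ontology are hypothetical; the "type: value" format and the F1 criterion come from the text.

```python
from typing import Dict, List, Set


def build_type_value_entries(ontology: Dict[str, List[str]]) -> List[str]:
    """Flatten an ontology into 'type: value' knowledge entries."""
    return [
        f"{slot_type}: {value}"
        for slot_type, values in ontology.items()
        for value in values
    ]


def slot_type_f1(candidate_types: Set[str], target_types: Set[str]) -> float:
    """F1 overlap between the slot types of a candidate and of the target."""
    overlap = len(candidate_types & target_types)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_types)
    recall = overlap / len(target_types)
    return 2 * precision * recall / (precision + recall)


# Toy ontology (hypothetical subset of slot types and values).
toy_ontology = {
    "hotel-parking": ["yes", "no", "don't care"],
    "hotel-area": ["north"],
}
kb_entries = build_type_value_entries(toy_ontology)
f1 = slot_type_f1({"hotel-parking", "hotel-area"}, {"hotel-parking"})
```

Under this criterion, the training example with the highest `slot_type_f1` against the current turn would be treated as the positive retrieval target.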

Results
Table 1 shows DST results produced by different methods. Compared to the seq2seq baseline and D3ST, grounding on relevant knowledge by retrieval achieves better JGA by a large margin, especially in the few-shot learning setting (more than 4% absolute). In the full-data setting, our method performs on par with D3ST, mostly because with more training data the model can encode knowledge into its parameters rather than relying on a separate, disentangled knowledge base. However, our method is not limited by the sequence length, since we can choose the number of retrieved elements regardless of the ontology size.
When comparing different knowledge formats, type+value performs better than retrieving type only, even though its retrieval task is harder. As shown by recall (see footnote 3), with a large pre-trained model (XXL), recall for retrieving type only can achieve near-perfect scores (> 99%), but recall for type+value reaches only 48% in the few-shot setting and close to 70% in the full-data setting. This indicates that the model can filter out distracting elements and make use of relevant knowledge as a positive inductive bias. Meanwhile, retrieving training data is similar to utilizing prompts (Gupta et al., 2022), but the worse performance compared to other knowledge formats suggests that selecting the top-1 element is not optimal despite the relatively high recall. This is mostly because the retrieval results are noisy, as the small set of examples may contain slot types or values that differ from the ground truth. It is even less likely to find an example with exactly the same dialog state when the context is long. We leave to future work further investigation into separating the knowledge memory to support different knowledge sizes and external knowledge.
Footnote 3: Precision is determined by the number of retrieved elements we set, whereas recall measures the percentage of ground-truth knowledge elements being correctly retrieved. Therefore, recall is more informative.
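To illustrate why recall is the more informative retrieval metric when $k$ is fixed, the sketch below computes both quantities for one turn. The function name, the toy entries, and the convention of returning recall 1.0 when there are no ground-truth elements are assumptions.

```python
from typing import List, Tuple


def recall_and_precision(
    retrieved: List[str],
    gold: List[str],
) -> Tuple[float, float]:
    """Recall and precision of retrieved entries against ground-truth ones."""
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    # Assumed convention: a turn with no gold entries has recall 1.0.
    recall = hits / len(gold_set) if gold_set else 1.0
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision


gold_entries = ["hotel-parking: yes", "hotel-area: north"]
retrieved_entries = [
    "hotel-parking: yes",
    "hotel-area: south",
    "hotel-stars: 4",
    "hotel-area: north",
]
recall, precision = recall_and_precision(retrieved_entries, gold_entries)
```

With two ground-truth entries and $k = 30$, precision can never exceed 2/30 even under perfect retrieval, whereas recall still ranges from 0 to 1.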

Analysis
We study the relationship between retrieval and JGA in this section, and provide error analysis. We also analyze the detailed comparison between our method and D3ST.
Figure 2: JGA under controlled retrieval recall in the sanity check, using T5-XXL in the full-data setting. Consistent with our findings, even noisy retrieval improves model performance on DST.
Relationship between retrieval quality and JGA. To understand the relationship between retrieval and the downstream task, we show JGA as a function of recall in a controlled sanity check. Specifically, we randomly sample ground-truth slot type-value pairs to match a target recall score and replace the remaining dialog states with pairs uniformly sampled without replacement from the whole ontology (excluding the ground truth) as negative examples. Results (detailed in Figure 2) show that with T5-XXL in the full-data setting, 50% recall already improves model performance significantly (83.75 JGA), while 90% recall results in 91.24 JGA. This suggests that high retrieval recall is critical to JGA, while the model remains robust to noisy retrieval results. It also indicates that a better retrieval method (such as an external one; Lazaridou et al., 2022) may achieve better performance. On the other hand, if we consider DST as a multi-class classification task with a retrieval module only, the model has to pick relevant elements from the top-k, which is non-trivial. We also consider separating retrieval from DST, i.e., training the model for retrieval first and then on DST. Results show that although the model can achieve 97.38% recall, JGA drops to 70.33 in the full-data setting with XXL. We conjecture that the main reason is that, unlike previous question-answering work where the retrieval index is frozen, knowledge such as the ontology or training data is more homogeneous and thus more sensitive. This result is similar to our findings when training the two tasks jointly: retrieval metrics keep improving while JGA may drop as retrieval improves, even if we decrease the retrieval loss weight.
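The controlled sampling procedure above can be sketched as follows. This is one reading of the description, not the authors' script: the function name, the rounding of the number of kept entries, and the choice to keep the total number of entries equal to the number of ground-truth pairs are all assumptions.

```python
import random
from typing import List


def simulate_retrieval(
    gold: List[str],
    ontology: List[str],
    target_recall: float,
    rng: random.Random,
) -> List[str]:
    """Keep a fraction of gold pairs and replace the rest with negatives.

    Negatives are sampled uniformly without replacement from ontology
    entries that are not in the ground truth.
    """
    n_keep = round(target_recall * len(gold))  # assumed rounding rule
    kept = rng.sample(gold, n_keep)
    gold_set = set(gold)
    negatives_pool = [entry for entry in ontology if entry not in gold_set]
    negatives = rng.sample(negatives_pool, len(gold) - n_keep)
    return kept + negatives


# Toy ontology of ten entries, four of which are ground truth (hypothetical).
toy_ontology = [f"slot-{i}: value" for i in range(10)]
toy_gold = toy_ontology[:4]
simulated = simulate_retrieval(toy_gold, toy_ontology, 0.5, random.Random(0))
```

Feeding such simulated entries to the DST model in place of real retrieval output is what allows recall to be varied independently of the retriever.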
When we optimize separate parameters (i.e., two additional layers) for retrieval instead of the whole model, we observe slightly lower performance on JGA (54.76 compared to 55.32 on 1% data) and lower retrieval recall (36 compared to close to 46). Lastly, compared to top-30 with a JGA of 75.47, we observe an absolute drop of 0.40 for top-20, and 3.25 for top-10. This indicates that retrieving the ground-truth elements (recall) matters more to JGA than avoiding noisy elements (precision).
Comparison to D3ST. D3ST decodes the sequence of dialog states based on the order of slot types provided in the prompt by data preprocessing. In comparison, the order of retrieved elements varies while the order of dialog states depends on the ground-truth annotation. In other words, similar to the seq2seq baseline, our method requires learning the annotation order for DST prediction. This makes it more challenging to train, especially when similar knowledge elements are retrieved. This is consistent with the slightly lower JGA in the full-data setting. On the other hand, D3ST can be considered a special setting of our grounding method where all knowledge elements are provided, and the DST generation model needs to implicitly detect relevant information and decode accordingly. We conjecture that the better performance over D3ST in the few-shot setting arises because retrieving target elements while filtering noisy ones is easier than selecting the corresponding knowledge, as suggested by the high recall scores of our method compared to the lower JGA of D3ST. One future direction is to combine the benefits of both approaches by utilizing the retrieved knowledge without length restriction.
Error analysis. We found qualitatively that instead of ignoring retrieved elements as shown in previous research, the model does attend to retrieved slot-value pairs when decoding dialog states.
The main errors come from noisy retrieval, where a very similar element with a higher rank than the ground-truth knowledge (and thus closer to the context in $x'$) may either stop the model from generating more states (i.e., missing target dialog states) or signal the model to generate the wrong elements directly. On the other hand, the model always predicts correctly if the ground-truth elements are the most confident retrieved elements. To deal with the influence of attending only to the nearest few elements (which have the highest retrieval scores), we also experimented with randomly shuffling the retrieved knowledge, but this results in lower scores (71.0 compared to 75.5) because the model needs to denoise the top-k elements without any additional information.

Conclusion
In this paper, we propose to disentangle domain knowledge and encode knowledge as a prior to dialog state tracking. Compared to previous research on grounding on knowledge for factual generation, our method can be applied to multiple sources of knowledge in the task-oriented dialog understanding setting. We conduct experiments on the MultiWOZ dataset and show superior performance, especially in the few-shot learning setting. We plan to apply our method to more general natural language understanding tasks in the future.

Limitations
In the experiments, we show model improvements over strong baselines. Despite the simplicity of the method, we acknowledge that the domain ontology is not always available, since knowledge such as type+value in DST may not be a closed set (e.g., for non-categorical slots). However, this limitation can be lifted in two ways. Firstly, as shown in our experiments, retrieving slot type alone can also improve model performance, which indicates that we may choose a knowledge base mixing type and type+value when the assumption that all values are predefined does not hold. Moreover, in most DST applications, the schema is specified before data collection and model training, where all target types and values need to match a database for information lookup. If the schema is unavailable, we may consider schema induction (Hudeček et al., 2021; Yu et al., 2022), where we can build the schema before DST. We plan to investigate these directions in our future work.

Table 1: Dialog state tracking results on MultiWOZ.