DiSTRICT: Dialogue State Tracking with Retriever Driven In-Context Tuning

Dialogue State Tracking (DST), a key component of task-oriented conversation systems, represents user intentions by determining the values of pre-defined slots in an ongoing dialogue. Existing approaches use hand-crafted templates and additional slot information to fine-tune and prompt large pre-trained language models and elicit slot values from the dialogue context. Significant manual effort and domain knowledge are required to design effective prompts, limiting the generalizability of these approaches to new domains and tasks. In this work, we propose DiSTRICT, a generalizable in-context tuning approach for DST that retrieves highly relevant training examples for a given dialogue to fine-tune the model without any hand-crafted templates. Experiments with the MultiWOZ benchmark datasets show that DiSTRICT outperforms existing approaches in various zero-shot and few-shot settings using a much smaller model, thereby providing an important advantage for real-world deployments that often have limited resource availability.


Introduction
Task-oriented dialogue systems are increasingly used to enable users to perform tasks through multi-turn conversations in different domains such as travel reservations, banking transactions, or online shopping. Dialogue state tracking (DST) is a critical component of these systems that tracks user requirements by determining key information at each turn in the dialogue (Jacqmin et al., 2022). Given a predefined schema of task parameters or slots, DST models identify and represent the dialogue state as pairs of slots and their corresponding values, as shown in Figure 1.
In real-world deployments, new task domains and slots are frequently added to improve user functionality. Hence, periodic updates to the models may be required, even when the new domains offer little to no dialogue data for training. To address these challenges, DST solutions need to be generalizable to new domains in zero-shot and few-shot settings with minimal overhead, while also maintaining a small model footprint to enable compute-efficient and cost-effective deployments.
Recent advances in DST have leveraged pre-trained language models (LMs) to elicit slot values from dialogues, primarily using two methods: fine-tuning and in-context learning. However, both methods suffer from several shortcomings.

Fine-tuning LMs - Most approaches condition models for DST by fine-tuning their parameters using prompts derived from historical dialogue data. However, they rely on hand-crafted prompt templates that can include slot-specific questions, value-based functions, and even text-to-code snippets (Lin et al., 2021b; Cao and Zhang, 2021). In addition to the manual effort required, these templates are customized to specific domains and slots, and hence have low generalizability to new domains. Some approaches also extend fine-tuning by first training models on additional natural language tasks using external datasets, which is expensive and requires access to significantly more data. They also often rely on additional information from the schema, such as slot descriptions and task instructions (Mi et al., 2022). However, real-world datasets may not always have this necessary information, which again limits their generalizability.

In-context learning - LMs have shown remarkable performance through in-context learning of new tasks (Brown et al., 2020), where a raw LM (i.e., a pre-trained LM without fine-tuning on task-specific data) is prompted during inference with input-output task examples to condition the generated output. For DST, approaches have leveraged this to craft prompts containing examples of dialogue history and slot values (Hu et al., 2022). However, similar to fine-tuning approaches, these prompt examples are hand-crafted, customized, and require significant effort. Additionally, as a result of using raw LMs, these approaches have to rely on extremely large models, limiting their practical use. Importantly, it has been shown that prompting raw LMs without fine-tuning is often oversensitive to example choices and instruction wording (Chen et al., 2022), and can demonstrate undesirable biases that significantly reduce performance (Zhao et al., 2021; Liu et al., 2021).
In this work, we address these challenges and present DiSTRICT, a generalizable approach for dialogue state tracking that fine-tunes a LM using relevant examples, i.e., in-context tuning. For a given input dialogue and slot to be tracked, DiSTRICT retrieves semantically similar dialogues and slots from available historical data in zero-shot and few-shot settings, and concatenates them into a prompt with no hand-crafted template requirements. We first fine-tune the language model using in-context examples and input dialogues from the training set, and subsequently perform similar inference on test inputs as shown in Figure 2. Specifically, we make the following contributions:
• To the best of our knowledge, DiSTRICT is the first DST approach to fine-tune a LM with in-context examples, i.e., in-context tuning.
• We address shortcomings of prior approaches, wherein DiSTRICT leverages relevant existing dialogue and slot examples and does not require hand-crafted prompts or external datasets, thereby improving generalizability and reducing manual overhead.
• Our evaluation shows that DiSTRICT outperforms existing approaches on most zero-shot and few-shot settings, while using a much smaller model, thus demonstrating its applicability to real-world deployments.

Related Work
Dialogue state tracking (DST) is a critical yet challenging task for task-oriented dialogue systems (Williams et al., 2014), and several multi-domain benchmark conversation datasets have been proposed (Budzianowski et al., 2018;Eric et al., 2019;Rastogi et al., 2020) to evaluate research efforts.
A majority of state-of-the-art approaches fine-tune language models using hand-crafted templates containing descriptions or questions related to the dialogue slots. Shah et al. (2019) used slot descriptions and examples of slot values to create templates, while Lin et al. (2021b) and Lee et al. (2021) provided different types of manually annotated slot descriptions to the model. Mi et al. (2022) extended this by also including task instructions and other constraints. In contrast, our approach does not require hand-crafted templates for fine-tuning and is hence more easily generalizable.
Another set of approaches aims to improve zero-shot performance by exploiting external knowledge and datasets from other natural language tasks before fine-tuning a model for DST. For instance, Gao et al. (2020), Li et al. (2021), and Lin et al. (2021a) pre-train models on reading comprehension data, Shin et al. (2022) reformulate DST as a dialogue summarization task with external annotated data, and Hudeček et al. (2021) use semantic analysis and named entity recognition to identify slots. In contrast, our approach does not require any extra datasets or training efforts.
In-context learning (ICL) for DST has been explored as part of a larger set of few-shot generative tasks (Madotto et al., 2021; Xie et al., 2022), but the lack of a task-specific prompt design resulted in low performance. Hu et al. (2022) and Gupta et al. (2022) achieved improved performance using extremely large models, where the former reformulated DST as a text-to-SQL task by using semantic matching to identify relevant examples that were subsequently crafted into SQL queries, and the latter manually created example dialogues containing combinations of slots in the schema.
Recent efforts have shown that the shortcomings of ICL (Liu et al., 2022; Min et al., 2022) can be overcome through in-context tuning of LMs. Gururangan et al. (2020) and Liu et al. (2022) demonstrate the improved performance of models fine-tuned with examples compared to ICL over a variety of language tasks. In this work, we leverage this concept specifically for DST.

Approach
We first present the background and some definitions for dialogue state tracking before describing our approach.

DST Background
A task-oriented dialogue consists of a multi-turn conversation between a user U and the system A. Given a dialogue context C_t as the sequence of utterances until turn t, the goal of DST is to predict the dialogue state y_t, defined as a set of (slot, value) pairs:

y_t = {(s_i, v_i) | s_i ∈ S}

where S denotes the set of possible slots predefined in an ontology or schema. In a multi-domain setting, the schema can comprise different domains or topics, each corresponding to a service such as restaurant booking or banking. The slots associated with each domain can be either categorical with a set of candidate values (e.g., restaurant-open = 'True' / 'False'), or non-categorical, where the value is a span in the dialogue context (e.g., hotel-name = 'Courtyard Marriott').
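As an illustration (our own sketch, not code from the paper), the dialogue state and schema above can be represented with simple data structures; the slot names mirror the examples in the text, and the validation logic for categorical slots is an assumption:

```python
# Schema S: slot names mapped to candidate values (categorical) or None
# (non-categorical, i.e. the value is a span from the dialogue context).
SCHEMA = {
    "restaurant-open": ["True", "False"],   # categorical slot
    "hotel-name": None,                     # non-categorical slot
}

def update_state(state, slot, value):
    """Return a new dialogue state with (slot, value) set, validating
    categorical slots against the schema."""
    candidates = SCHEMA.get(slot)
    if candidates is not None and value not in candidates:
        raise ValueError(f"{value!r} is not a candidate for slot {slot!r}")
    new_state = dict(state)
    new_state[slot] = value
    return new_state

# The dialogue state y_t is a mapping from slots to their current values.
state = {}
state = update_state(state, "hotel-name", "Courtyard Marriott")
state = update_state(state, "restaurant-open", "True")
```

A DST model updates this mapping at every turn as new information appears in the conversation.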

In-Context Retriever
The key concept behind our approach is the identification of the most semantically relevant in-context examples from the available training set of dialogues. Intuitively, historical labeled dialogues contain information about slots and their values under different conversation contexts. Hence, for an input dialogue and given query slot, conditioning the model during fine-tuning with example dialogues that are semantically similar and additionally contain the same or similar slots enables the model to better learn the association between slots, their values, and the context.
As shown in Figure 2, the retriever in DiSTRICT performs semantic matching of the input dialogue and slot with single-turn training set conversations as examples (i.e., one pair of user-system utterances). This design choice is due to the fact that large prompts require additional memory and compute, significantly increasing the training time of the model. Furthermore, real-world dialogues can be lengthy, and the context needed to find the value of a particular slot can often be limited to a single sentence. Hence, constraining the in-context examples to single-turn conversations reduces the prompt size, enables the addition of more examples, and removes irrelevant dialogue context.
Formally, we define a dataset D = {(e_j, s_ij, v_ij)} consisting of single-turn dialogue examples e_j, each containing an observed slot s_ij and its corresponding value v_ij. For a given input dialogue context C_t and query slot s_q, we retrieve the k most relevant examples E_k ⊂ D based on the similarity between their text embeddings:

E_k = top-k_{(e_j, s_ij, v_ij) ∈ D} sim(emb(C_t ⊕ s_q), emb(e_j ⊕ s_ij))

where emb(·) denotes the text embedding, sim(·, ·) the cosine similarity, and ⊕ concatenation.
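The similarity search can be sketched with plain numpy as follows; in DiSTRICT the embeddings come from a pretrained sentence encoder, whereas the toy 2-d vectors here are only stand-ins for illustration:

```python
import numpy as np

def retrieve_top_k(query_emb, example_embs, k=3):
    """Return indices of the k examples whose embeddings are most similar
    to the query embedding under cosine similarity.

    query_emb: (d,) embedding of the input dialogue context + query slot.
    example_embs: (n, d) embeddings of single-turn (dialogue, slot) examples.
    """
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                    # cosine similarity to each example
    return np.argsort(-sims)[:k]    # indices of the k most similar examples

# Toy usage: three example embeddings, retrieve the 2 closest to the query.
examples = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
idx = retrieve_top_k(query, examples, k=2)
```

Normalizing both sides turns the dot product into cosine similarity, which is the same quantity a FAISS inner-product index computes over normalized vectors.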

Applicability to Zero-shot and Few-shot Settings
To illustrate the generalizability of the retriever to zero-shot and few-shot settings, we use the example shown in Figure 2. Given a test input dialogue from the restaurant domain and the query slot restaurant-people, the retriever identifies the k most relevant single-turn in-context examples derived from the training set.
In a zero-shot setting (Figure 2-(1)), dialogue and slot examples from the restaurant domain would not be available. Hence, the retriever identifies semantically similar examples from other domains. For instance, conversations about hotel reservations have similar contexts, and the slot hotel-people is semantically similar to the query slot. The retrieved example thus conditions the model to look for the number of people mentioned in the dialogue.
In few-shot and full-shot settings (Figure 2-(2)), the set of available examples would include other dialogues from the same domain, which could also contain the query slot. Hence, the most similar examples retrieved would demonstrate the values of the query slot when used in similar restaurant reservation contexts (e.g., restaurant-people: 8). We note that our approach requires no changes for the different settings, and can be easily extended to include additional information like slot descriptions to further enhance the semantic retrieval.

In-context Tuning
We fine-tune the language model by retrieving in-context examples for each dialogue in the training set. As shown in Figure 2, to create the input to the model, we annotate prefixes to each of the k in-context examples E_k, the dialogue context C_t, and the query slot s_q to enable the model to distinguish between them, and then concatenate them into a single input sequence. We then fine-tune an encoder-decoder based language model, where the input is passed to the encoder and the decoder generates the corresponding value for the query slot. The in-context tuning objective is to minimize the negative log-likelihood loss:

L = - Σ_{i=1}^{n} log P(v_i | E_k ⊕ C_t ⊕ s_i)

where n is the total number of slots in the ontology and ⊕ denotes concatenation.
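The input assembly step can be sketched as below. The paper does not specify the exact prefix tokens, so the bracketed markers here are purely illustrative assumptions; only the overall structure (prefixed examples, then context, then query slot, concatenated into one sequence) follows the description above:

```python
def build_input(examples, context, query_slot):
    """Assemble the encoder input sequence: each retrieved example, the
    dialogue context, and the query slot are annotated with a prefix so
    the model can distinguish them, then concatenated together.
    Prefix markers ([example], [context], [slot], [value]) are hypothetical.
    """
    parts = []
    for ex_text, ex_slot, ex_value in examples:
        parts.append(f"[example] {ex_text} [slot] {ex_slot} [value] {ex_value}")
    parts.append(f"[context] {context}")
    parts.append(f"[slot] {query_slot}")
    return " ".join(parts)

# One retrieved single-turn example plus the input dialogue and query slot.
prompt = build_input(
    [("[User] i need a cab by 9:15.", "taxi-leave at", "9:15")],
    "[User] book a table for 8 people at a cheap restaurant.",
    "restaurant-people",
)
```

The resulting string would be tokenized and fed to the encoder, with the decoder trained to emit the query slot's value.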
Table 1: Comparison baselines and their pre-trained model size. TRADE (Wu et al., 2019) does not use a pre-trained model.

Datasets and Evaluation
MultiWOZ (Budzianowski et al., 2018) is a multi-domain task-oriented dialogue benchmark dataset that consists of around 10k multi-turn dialogues over 7 domains. The dataset has been refined and erroneous annotations have been corrected over multiple versions. To enable comparisons with most existing approaches, we use the MultiWOZ 2.0 and MultiWOZ 2.1 (Eric et al., 2019) datasets in our evaluation, and follow the same data preprocessing and domain selection steps as prior work (Wu et al., 2019; Gao et al., 2020; Lin et al., 2021b). To make consistent comparisons to prior work in zero-shot and few-shot settings, we use the same Joint Goal Accuracy (JGA) metric to evaluate our approach. For a given turn, JGA compares predicted dialogue states to the corresponding ground truth states and considers the prediction as accurate if and only if all predicted values match the ground truth values (Wu et al., 2019; Wen et al., 2017; Kumar et al., 2020).
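A minimal sketch of the JGA computation as described above (our own illustration; the slot values in the usage example are made up):

```python
def joint_goal_accuracy(predictions, ground_truths):
    """Fraction of turns where the predicted dialogue state matches the
    gold state exactly, i.e., every slot value is correct. Each state is
    a dict mapping slot names to values."""
    assert len(predictions) == len(ground_truths)
    correct = sum(pred == gold for pred, gold in zip(predictions, ground_truths))
    return correct / len(predictions)

# Two turns: the first state is fully correct, the second has one wrong slot.
preds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}]
golds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "3"}]
jga = joint_goal_accuracy(preds, golds)  # 1 of 2 turns fully correct -> 0.5
```

Note the all-or-nothing scoring: a single wrong slot value makes the whole turn count as incorrect, which is why JGA is a strict metric.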

Implementation
DiSTRICT uses T5-small (60M parameters) (Raffel et al., 2020), one of the smallest pre-trained models available. It has 6 encoder-decoder layers, each of size 512. We fine-tune using the AdamW optimizer (Loshchilov and Hutter, 2018) with an initial learning rate of 1e-4. For the retriever, we use Sentence-BERT (Reimers and Gurevych, 2019) and perform semantic search with cosine similarity using the FAISS library (Johnson et al., 2019). Unless specified otherwise, we set the number of in-context examples to k = 3. We use a single NVIDIA V100 GPU for our experiments and provide further details in the appendix.

Comparison Baselines
We evaluate DiSTRICT against existing DST baselines as shown in Table 1.
The table shows that, except for T5DST which also uses T5-small, prior DST approaches use models that are significantly larger compared to DiSTRICT.
TRADE (Wu et al., 2019) uses slot and domain embeddings as well as a copy mechanism to track slot values across domains.
STARC (Gao et al., 2020) prompts two different instances of the RoBERTa-large (Liu et al., 2019) model with separate natural language questions for categorical and non-categorical slots.
T5-DST (Lin et al., 2021b) fine-tunes a T5-small model (Raffel et al., 2020) with multiple hand-crafted prompts that include questions and different descriptions of slots.
TransferQA (Lin et al., 2021a) represents DST as a QA task, where the model is pre-trained on six external QA datasets and individual questions are manually crafted for each slot in the ontology to use in the prompt.
DS2 (Shin et al., 2022) treats DST as a dialogue summarization task, and fine-tunes T5-large and BART models with synthetic summary templates.

IC-DST (Hu et al., 2022) reformulates DST as a text-to-SQL task, transforms relevant in-context examples into SQL queries, and prompts a Codex model without any fine-tuning.

Experimental Settings
Zero-shot - Similar to prior work (Lin et al., 2021a; Wu et al., 2019), the retriever and model have access to training data from all domains except one 'unseen' domain, on which the model is evaluated. We note that our retriever does not result in any information leakage since no examples and slots are included from the unseen domain.

Cross-domain few-shot - We include three few-shot settings, where the retriever and model additionally have access to 1%, 5%, and 10% of training data from the unseen domain, similar to Shin et al. (2022); Wu et al. (2019).

Multi-domain few-shot - We follow the multi-domain scenario from Shin et al. (2022); Wu et al. (2020), where 1%, 5%, and 10% of the entire training data are sampled for model training and retrieval.

Note: We do not include the zero-shot results from prior in-context learning work (IC-DST, Hu et al., 2022) since their prompt examples are designed to include information from the 'unseen' domain, which results in information leakage to the model and hence does not reflect the traditional zero-shot learning setting. Additionally, we include results from the other comparison approaches where available.

Zero-shot DST
Table 2 shows the dialogue state tracking performance of DiSTRICT in zero-shot settings along with the available baseline results for TRADE, T5DST, and TransferQA. We observe that DiSTRICT outperforms the baseline approaches in most domains across both datasets. DiSTRICT achieves an 8% improvement in JGA on average over the next best approach on both datasets (i.e., T5DST in MultiWOZ 2.0 and TransferQA in MultiWOZ 2.1), and obtains improvements of up to 23% on the 'Train' domain.
Both TransferQA and T5DST use hand-crafted prompts, where the former annotates all slots in the form of questions and the latter uses individual slot descriptions. In the zero-shot setting, this implies that the query slot is not truly "unseen", since semantic information about the slot is being provided to the model in the hand-crafted prompt. Furthermore, the addition of new domains and slots would first require crafting new prompts, thereby limiting generalizability.
In contrast, DiSTRICT does not possess any additional information about the unseen query slot, and instead relies on identifying other semantically similar slots and dialogues from the data available in other domains to enable model reasoning. The improved performance hence reflects the effectiveness of our retriever-driven approach in zero-shot settings, and also demonstrates the generalizability of our solution.

Per-domain few-shot DST
Tables 3 and 4 show the per-domain few-shot performance on MultiWOZ 2.0 and MultiWOZ 2.1, respectively. DiSTRICT outperforms the baseline approaches across all domains and both datasets. For MultiWOZ 2.1, DiSTRICT achieves a JGA improvement of over 15% in the best case and 9% on average, both when compared to the next best approach, DS2-T5.
Additionally, the availability of even a few labeled examples significantly improves the performance of our retriever, as evidenced by a 46% improvement in JGA on average across all domains over the zero-shot setting with just 1% of available few-shot data in MultiWOZ 2.1, compared to a 34% improvement by TransferQA. This improvement stems from the increased relevance of available in-context examples, since the retriever now has access to a few (i.e., 1%, 5%, or 10% of the) dialogues from the few-shot domain.

Multi-domain few-shot and full-shot DST
From Table 5, we see that DiSTRICT achieves the best performance in the full-shot (100%) setting, obtaining a ~10% improvement in JGA on average over the other approaches. However, we observe a significant drop in performance in the multi-domain few-shot setting, when the total available training data is reduced. DiSTRICT suffers a 75% drop in JGA when only 1% of training data is available in MultiWOZ 2.1, compared to a 35% drop for DS2-T5, which suffered the smallest drop in performance.
This performance drop can be attributed to the limited diversity of in-context examples arising from the unavailability of training data. For instance, the 1% setting in MultiWOZ 2.1 translates to the availability of only 84 training examples. Hence, the retriever is limited to repeatedly using the same examples, restricting the model's reasoning capabilities. In contrast, the hand-crafted prompts used by the baseline approaches appear to provide sufficient additional information to the model, reducing the performance drop in limited-data settings.

Impact of number of in-context examples
We study the effect of varying the number of retrieved in-context examples used in the prompt on DiSTRICT's performance in all our experimental settings. From Figure 3, we observe that using no examples, i.e., model fine-tuning and inference using only input dialogues, results in very poor performance that is worse than all the baseline comparison approaches. This shows that the addition of relevant examples has a significant impact on conditioning the model for dialogue state tracking. We also observe an improvement in performance as the number of in-context examples increases, highlighting the potential of using a larger number of examples as part of future work. However, this involves a trade-off: the improvement is not linear and has diminishing returns, and using a larger number of examples would require more memory, more compute resources, and increased training time.

Retriever performance
We examined the effectiveness of the retriever by analyzing the domains and slots that were selected for the input dialogues in the test set. Figure 4 shows the heatmaps for zero-shot and per-domain few-shot settings, depicting the relative number of examples picked from each domain for test inputs across all domains.
In the zero-shot setting, since data from the unseen test domain is unavailable, the main diagonal is empty, and we observed that examples were picked relatively evenly across the other available domains. In particular, as illustrated by the examples in Table 6, we found that the retriever identified examples containing semantically similar slots or having similar dialogue contexts, thereby demonstrating the effectiveness of our approach.
In the few-shot setting, we observed that the majority of examples were selected from the same domain as the input (i.e., darker diagonals), reflecting the higher semantic and contextual match between intra-domain dialogues. We also studied the distribution of examples at an individual slot level, shown in the appendix, and observed the same patterns. In particular, for the few-shot setting, the retriever prioritized examples containing the same slot, followed by the same domain, validating the use of semantic matching.

Impact of model size

Finally, we studied the effectiveness of using larger models for DST. We evaluate the performance of DiSTRICT and T5DST, as both approaches use the T5-small model (Table 1), with multiple sizes of T5 in the zero-shot setting on the 'Taxi' domain. As shown in Table 7, in both approaches the larger T5-base and T5-large models achieve modest improvements over T5-small. However, these improvements may be too limited to justify the potentially significant increase in compute resources required to support larger model sizes in real-world deployments.

Conclusion
In this work, we present DiSTRICT, a novel approach for dialogue state tracking that uses in-context tuning of language models (LMs). For an input dialogue instance and slots, DiSTRICT retrieves the most relevant examples from the training data through semantic matching, and uses these examples as part of the input to the LM to obtain the dialogue state. The fully automated prompt construction, without requiring hand-crafted templates or additional schema information, overcomes drawbacks of prior DST approaches and also reflects the high generalizability of DiSTRICT to new task domains and slots. Our experiments show that DiSTRICT outperforms existing baselines in different zero-shot and few-shot experiments despite using a smaller and lower-resource model. We also demonstrate the effectiveness of our semantic-search based retriever for the DST task and highlight several trade-offs between model performance and resource requirements that impact real-world use. As part of future work, we intend to improve robustness to dialogue quality and distribution drifts.

Limitations
The performance of DiSTRICT hinges on the effective retrieval of relevant in-context examples from the training data. This makes our approach sensitive to issues with data quantity and quality. As shown in our results, when the amount of training data is limited, the retriever often has to select from a pool of examples with low diversity and low semantic similarity to the input, thereby adversely impacting performance.
Additionally, data quality issues such as poorly named slots (i.e., not sufficiently descriptive) and incorrect or mislabeled slot values would also impact the semantic matching and performance of our approach. Also, our zero-shot learning relies on semantic relationships between the unseen samples and the known data. However, if the new task domains are highly disparate from the existing domains, this relationship may not hold, presenting a challenge for zero-shot learning.
Recently, research efforts have studied domain generalization in the context of model robustness under data distribution shifts, i.e., out-of-distribution (OOD) generalization (Gulrajani and Lopez-Paz, 2020; Venkateswaran et al., 2021a,b), which can also occur in real-world task-oriented dialogue systems. We did not address this as part of our work, and intend to explore OOD model robustness as part of future work.

Ethical Considerations
In this work, we focus on improving task-oriented dialogue systems by using prior examples to condition DST models. We conduct our experiments on publicly available datasets with English conversations that cover domains typically seen in task-oriented dialogues. In real-world deployments, it is important to ensure that user information is not revealed, which can be achieved by using a publicly available benchmark for examples, as done in this work, or by limiting users to examples from their own conversations. Additionally, reliance on large models results in high resource use, a challenge that we also address. We hope that our work inspires more efforts to create DST solutions with smaller models and low manual overhead, significantly reducing resource requirements.

A Implementation
For the zero-shot and full-shot experiments, we train our model for 3 epochs and use early stopping based on the loss on the validation set. For the few-shot experiments, we use the models from the zero-shot experiments that were trained on 4 domains (out of 5), and then train them on the fifth domain with 1%, 5%, or 10% of the target domain data for 10 epochs. We again use early stopping based on the validation loss. The results reported are from a single run. We note that our use of the MultiWOZ datasets is consistent with their MIT license, that the data does not contain any identifying information (Budzianowski et al., 2018), and that we intend to release our code under the same license.

B Retriever Performance
We analyzed the distribution of in-context examples at a slot level for different test inputs. Figures 5 and 6 show heatmaps depicting the slots within the in-context examples that were picked for each test query slot in zero-shot and few-shot settings.
For the zero-shot setting, we observed that, whenever possible, the retriever prioritized examples containing slots with a similar semantic meaning as the query slot (e.g., restaurant-area and attraction-area, hotel-price range and restaurant-price range, train-arrive by and taxi-arrive by).
In cases where the query slot had no similar example slots (e.g., hotel-internet), the retriever picked examples based on dialogue context similarity.
For the few-shot setting, we observed that the retriever prioritized examples containing the same slot as the query, reflected by the dark diagonal in the heatmap. Additionally, the retriever also typically picked examples from the same domain as the test input, shown clearly by the clusters within the heatmap. This serves to show that identifying examples using semantic matching is a viable and effective approach.

Figure 1 :
Figure 1: An example of multi-domain dialogue state tracking (DST) in a conversation to book a hotel stay and reserve a table at a restaurant.

Figure 2 :
Figure 2: Overview of DiSTRICT. Given an input dialogue and slot to be tracked, high-relevance examples are first identified by the retriever and used to create the prompt. In (1) zero-shot settings, the retriever must search for semantically similar dialogue contexts and slots from other domains, while in (2) few/full-shot settings, the retriever additionally has access to some example dialogues from the same domain that could also contain the query slot.

Figure 3 :Figure 4 :
Figure 3: Impact of the number of in-context training examples on joint goal accuracy (JGA) for zero-shot, few-shot, and full-shot settings using the 'Train' domain.

Figure 5 :
Figure 5: Slot-based heatmap of in-context examples picked by retriever in zero-shot settings

Figure 6 :
Figure 6: Slot-based heatmap of in-context examples picked by retriever in few-shot settings.
Input dialogue: [System] is there anything else that i can do for you? [User] can you find me an expensive place to stay, it should be only 4 stars and include free wifi.
Query slot + gold label: hotel-price range: expensive
Retrieved example: [System] there are several restaurants. what type of food would you like? [User] i want somewhere cheap in the centre please

Table 6 :
In-context examples selected by the retriever for different inputs in the zero-shot setting.

Table 7 :
Impact of model size on JGA for the 'Taxi' domain zero-shot setting in MultiWOZ 2.0. *T5DST results for larger models obtained from a local implementation based on Lin et al. (2021b).