In-Context Learning for Few-Shot Dialogue State Tracking

Collecting and annotating task-oriented dialogues is time-consuming and costly; thus, zero- and few-shot learning could greatly benefit dialogue state tracking (DST). In this work, we propose an in-context learning (ICL) framework for zero-shot and few-shot DST, where a large pre-trained language model (LM) takes a test instance and a few exemplars as input and directly decodes the dialogue state without any parameter updates. To better leverage a tabular domain description in the LM prompt, we reformulate DST into a text-to-SQL problem. We also propose a novel approach to retrieving annotated dialogues as exemplars. Empirical results on MultiWOZ show that our method, IC-DST, substantially outperforms previous fine-tuned state-of-the-art models in few-shot settings. In addition, we test IC-DST in zero-shot settings, in which the model only takes a fixed task instruction as input, and find that it outperforms previous zero-shot methods by a large margin.


Introduction
Dialogue state tracking (DST) is an important module in many task-oriented dialogue systems. The goal of this module is to extract users' intentions at each turn in the dialogue as represented in the slot values of a predefined schema. Collecting and annotating turn-level dialogue states is notoriously hard and expensive (Budzianowski et al., 2018). Also, in commercial applications, it is common to extend the schema and incorporate new domains. Thus, it is important to develop DST learning strategies that are flexible and scalable, in addition to requiring less data.
Previous studies have explored zero/few-shot DST, but with some limitations. Most few-shot methods are based on finetuning pretrained language models (Wu et al., 2020; Li et al., 2021; Su et al., 2022; Lin et al., 2021b; Xie et al., 2022). Such systems are less flexible, since they need to be retrained when new slots or domains are added, and finetuning large LMs is computationally expensive. Most zero-shot methods have involved domain-transfer approaches (Hosseini-Asl et al., 2020; Lin et al., 2021b,a), which have not yielded good performance.

Figure 1: Illustration of the DST task and the IC-DST approach. The task is to track the slot values associated with a user request up to the current turn (the dialogue state). In few-shot settings, given a test turn (1), IC-DST first retrieves the most similar turns from the labeled dialogues as examples (2). The task schema (not shown in the figure), the examples, and the test dialogue turn are concatenated in the prompt to a LM (e.g., GPT-3) (3), which produces the current turn's dialogue state changes as a SQL query (4).

1 Our code: https://github.com/Yushi-Hu/IC-DST
To address the above challenges, we propose the IC-DST model to solve the DST problem with the in-context learning (ICL) paradigm (Brown et al., 2020), in which a large language model makes predictions based on the task instruction and/or examples in the prompt. In few-shot settings, the prompt contains exemplars that are retrieved from a small set of labeled training data. A motivation behind this framework is that it requires no finetuning (i.e., no parameter updates), which makes systems flexible in that they can handle queries in a new domain via the exemplar retrieval process without re-training. This enables developers to quickly prototype systems in new domains and rapidly leverage new collected data. ICL has been used successfully in semantic parsing (Rajkumar et al., 2022;Pasupat et al., 2021;Rubin et al., 2022), especially in few-shot scenarios. However, these studies focus on sentence-level tasks. ICL has been explored for DST (Madotto et al., 2021;Xie et al., 2022), but the performance fell short of pretraining and domain-transfer approaches to few/zero-shot learning. DST involves long, two-party dialogue histories with grounding in a structured ontology. We believe these challenges cause the poor ICL performance on DST tasks in previous work.
To address these challenges, we explore in-context learning with three novel contributions. First, we reformulate DST as a text-to-SQL task, including a tabular description of the ontology in the prompt. This is a better match to the knowledge-grounded scenario, and it takes advantage of large language models pretrained on code: Codex (Chen et al., 2021), GPT-Neo (Black et al., 2021), and CodeGen (Nijkamp et al., 2022). Second, we use the dialogue state to represent context, rather than the full conversation history, which is more efficient and better suited to domain changes. Lastly, in the few-shot scenario, we propose a new approach to learning a similarity score for selecting in-context examples, trained to match similarity based on dialogue state changes. The IC-DST approach, which incorporates these advances, achieves a new state of the art in MultiWOZ few-shot settings, i.e., when using 1-10% of the training data. We also substantially improve the zero-shot state of the art, by 10-30% absolute accuracy on each domain. A further contribution is an extensive analysis demonstrating the impact of each innovation.
In summary, our work makes the following contributions:
• To our knowledge, we are the first to successfully apply in-context learning to DST, building on a text-to-SQL approach.
• To extend in-context learning to dialogues, we introduce an efficient representation for the dialogue history and a new objective for dialogue retriever design.
• Our system achieves a new state of the art on MultiWOZ in zero/few-shot settings.
• We provide insights into how in-context learning works for dialogue, including the importance of good in-context examples and the LM's ability to generalize beyond examples.

DST Task Framing
Notation  A task-oriented dialogue consists of a sequence of utterances alternating between the system and the user, A_1, U_1, ..., A_T, U_T, where A and U represent system and user utterances, respectively. The task of DST is to predict the dialogue state y_t at each turn t, given the dialogue context C_t = [A_1, U_1, ..., A_t, U_t], where y_t is a set of slot-value pairs {(s_i, v_i)}. The set of possible slots s_i is given in a pre-defined schema. The schema can contain multiple domains, where a "domain" corresponds to a backend capability such as hotel or restaurant booking. Each domain is associated with a set of slots; for example, the 'hotel' domain has slots 'hotel-name', 'hotel-price_range', 'hotel-area', etc. Each observed slot is associated with a value, which may be drawn from predefined categories (e.g., 'hotel-price_range' may be 'cheap,' 'moderate,' or 'expensive') or open-ended (e.g., 'hotel-name'). We focus on a multi-domain scenario in this work, in which dialogue states may contain slots from multiple domains. One popular way to generate the dialogue state for the current turn is to finetune an auto-regressive language model (Hosseini-Asl et al., 2020; Peng et al., 2021). For each turn, the model takes the dialogue context C_t as input and generates a sequence of slot-value pairs (s_i, v_i). Equivalently, one can generate a sequence of slot-value pairs describing the dialogue state change.
Dialogue states vs. state changes  Dialogues can be lengthy and complex, resulting in dialogue states that can include several slots and values, which means that coverage of the possible states is sparse for few-shot learning. However, the dialogue state change from one turn to the next typically involves a small number of slots. For that reason, we use the state change at each turn as the label for prediction. The concept of state changes is illustrated in Figure 1. Possible state changes include slot addition, slot deletion, and slot value change. For example, in the current test turn of Figure 1, the user asks for Catalan food in turn t-1 and changes it to French food in turn t. The state change updates the dialogue state by replacing 'Catalan' with 'French'. Specifically, given the previous turn's dialogue state y_{t-1} and the predicted current turn state change c_t, we update the dialogue state by first copying y_{t-1} to y_t, and then executing add, delete, and change operations according to each slot-value pair (s_i, v_i) in c_t. Our analysis in Section 5 shows that using state changes leads to substantial improvements.

Figure 2: Illustration of the dialogue context representation: the full dialogue context C_{t-1} before the current turn is replaced by the associated dialogue state y_{t-1}.
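The update procedure can be sketched in a few lines of Python. This is an illustrative sketch, not our exact implementation; in particular, the "[DELETE]" sentinel marking slot deletion is an assumption made for the example.

```python
def apply_state_change(prev_state, change, delete_marker="[DELETE]"):
    """Apply a turn-level state change c_t to the accumulated state y_{t-1}.

    prev_state: dict mapping "domain-slot" -> value (y_{t-1})
    change:     dict of slot updates (c_t); a slot mapped to delete_marker
                is removed, any other slot is added or overwritten.
    """
    state = dict(prev_state)  # copy y_{t-1} to y_t
    for slot, value in change.items():
        if value == delete_marker:
            state.pop(slot, None)   # slot deletion
        else:
            state[slot] = value     # slot addition or value change
    return state
```

For instance, applying the change {"restaurant-food": "french"} to a state that holds "catalan" implements the Catalan-to-French value change from Figure 1.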
DST as Text-to-SQL  Here we propose a new representation for dialogue states: SQL. This is inspired by the fact that dialogue states are used to determine how to query backend databases for the information users need. Our representation follows three rules: (1) each domain is defined as a table and each slot as a column; (2) all slots and values appear in the WHERE clause; and (3) for turns with multiple domains, we rename the domains to d_1, ..., d_m. With the SQL state representation and a generative LM, DST becomes a text-to-SQL problem. This approach is facilitated by language models pretrained on code (e.g., Codex and GPT-Neo), as SQL is closer to the code used for pre-training.
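The three rules can be sketched as a small conversion function. The surface form below is hypothetical (our actual prompt format appears in the appendices); it only illustrates how a state maps to a query under rules (1)-(3).

```python
def state_to_sql(state):
    """Render a dialogue state (dict "domain-slot" -> value) as a SQL query:
    one table per domain (rule 1), slots and values in the WHERE clause
    (rule 2), and aliases d_1..d_m for multi-domain turns (rule 3)."""
    domains = sorted({slot.split("-")[0] for slot in state})
    # Alias domains only when the turn spans more than one of them.
    alias = {d: f"d_{i+1}" for i, d in enumerate(domains)} if len(domains) > 1 else {}
    conds = []
    for slot, value in state.items():
        domain, col = slot.split("-", 1)
        prefix = alias.get(domain, domain)
        conds.append(f"{prefix}.{col} = {value}")
    tables = ", ".join(f"{d} AS {alias[d]}" if d in alias else d for d in domains)
    return f"SELECT * FROM {tables} WHERE " + " AND ".join(conds)
```

A single-domain state such as {"restaurant-food": "french"} yields a plain single-table query, while a state spanning hotel and taxi produces aliased tables d_1 and d_2.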
Dialogue context representation  Previous work generally represents dialogue context by concatenating all the system and user utterances A_1, U_1, ..., A_t, U_t (Lee et al., 2021; Lin et al., 2021b; Peng et al., 2021). However, real-world dialogues can be lengthy, and current large language models have a length limit (2048 tokens for GPT-3, 4096 tokens for Codex). It is not practical to represent the full dialogue context for multiple exemplars in the prompt. A simple solution is to include only the N most recent turns of the dialogue history (Lei et al., 2018; Budzianowski and Vulić, 2019; Wu et al., 2021). We instead adopt a new approach that takes advantage of the fact that the dialogue state is a summary of the dialogue history, as shown in Figure 2. Specifically, we represent the dialogue context by [y_{t-1}, A_t, U_t], in which y_{t-1} is the accumulated dialogue state after user turn t-1.

In-Context Learning
In-context learning is an alternative to finetuning that keeps pretrained language model parameters fixed (Brown et al., 2020). The language model takes a prompt P that contains task descriptions, in-context examples, and the test instance as input, and predicts the label by capturing the patterns in the context. ICL has two advantages over finetuning. First, it avoids the need for repeated finetuning when the schema is updated or new examples are added. This is particularly important for large models like GPT-3, since finetuning at this scale is extremely expensive. Second, by simply adding/removing training examples, in-context learning enables us to quickly manipulate the model's predictions and correct mistakes without re-training.
An overview of our IC-DST system is shown in Figure 1 for the few-shot setting. The details of our prompt are shown in Figure 3. The task description is the schema associated with the task ontology, and a retriever is used to select labeled example turns from the training data. In the zero-shot setting, there is no retriever.
Schema prompting We use an SQL table for each domain to represent the dialogue schema in the prompt. Each table includes a row of slot names followed by three rows of example values associated with each slot, as illustrated in Figure 4. Slots like "restaurant-name" or "restaurant-book time" typically have many possible values. Thus, for these slots, we only list a few example values. In our experiments, we create SQL tables for all domains and concatenate them to be part of our input.
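A rough sketch of how one table of the schema prompt might be assembled is below. The helper name, its arguments, and the exact textual layout are hypothetical; the real prompts are shown in the appendices.

```python
def schema_table_prompt(domain, slots, example_rows):
    """Format one domain as a CREATE TABLE statement followed by a few
    example value rows inside a SQL comment, roughly mirroring Figure 4."""
    cols = ",\n".join(f"  {s} text" for s in slots)
    table = f"CREATE TABLE {domain} (\n{cols}\n)"
    header = " ".join(slots)
    rows = "\n".join(" ".join(r) for r in example_rows)
    comment = (f"/* {len(example_rows)} example rows:\n"
               f"SELECT * FROM {domain} LIMIT {len(example_rows)};\n"
               f"{header}\n{rows}\n*/")
    return table + "\n" + comment
```

Concatenating the output of this function over all domains gives the schema portion of the prompt; slots with open values (e.g., "name") simply show a few sampled example values in the comment rows.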

In-context examples
In the few-shot scenario, a retriever takes the dialogue context as input (either the previous dialogue state with the current turn, [y_{t-1}, A_t, U_t], or the full history) and retrieves similar example contexts from the labeled training set. Advantages of using the dialogue state y_{t-1} rather than the full history C_{t-1} are that it is shorter (allowing for more examples) and it leads to a more effective retrieval similarity score. In the zero-shot scenario, following previous work on zero-shot learning, the in-context example is a single formatting example turn. We call this setting "zero-shot" because the prompt is fixed by the system developer and does not use any labeled data. Prompt examples for the few-shot and zero-shot settings are given in Appendices A.1 and A.2, respectively.

Dialogue Retriever
In few-shot settings, successful in-context learning relies on the quality of the in-context examples. Usually, examples are selected by semantic retrieval using the test input as the query. Previous studies have explored methods for building sentence-level retrievers (Poesia et al., 2022; Rubin et al., 2022). Our work goes beyond sentences to retrieving dialogue histories (contexts).
We want to retrieve example dialogue contexts that are relevant to the predicted state change of a test sample. Formally, suppose X = {(e_i, c_i)} is a dataset of pairs of dialogue contexts e_i and corresponding state changes c_i. Each labeled turn in a dialogue is a candidate. Given a test dialogue context x, our goal is to retrieve k examples {(e_1, c_1), (e_2, c_2), ..., (e_k, c_k)} such that c_1, c_2, ..., c_k are similar to the state change associated with x.
Unsupervised retriever  One approach is to use a pretrained embedding model with a cosine similarity score as the retriever. This approach does not need any DST data. Let M be an embedding model. For each test example x, we retrieve the k training examples whose contexts e_1, e_2, ..., e_k are the nearest neighbors of x under the similarity score score(x, e) = cos(M(x), M(e)).
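A runnable sketch of this nearest-neighbor retrieval is below. The bag-of-words embedding is a toy stand-in for the embedding model M (SBERT or RoBERTa in our experiments), used only to keep the example self-contained.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for the embedding model M: a sparse bag-of-words vector.
    In practice this would be a sentence encoder such as SBERT."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors (Counters).
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, pool, k=2, embed=toy_embed):
    """Return the k pool examples (context, label) nearest to the query
    under score(x, e) = cos(M(x), M(e))."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]
```

Swapping `toy_embed` for a real sentence encoder changes nothing else in the retrieval loop, which is what makes the unsupervised retriever a convenient baseline.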
We experiment with RoBERTa (Liu et al., 2019) and SBERT (Reimers and Gurevych, 2019) as embedding models. We also try BM25 (Robertson and Zaragoza, 2009), a retriever based on surface text similarity. We find SBERT leads to the best result; therefore, all IC-DST results reported here use SBERT as the retriever.
Retriever finetuning  We also finetune SBERT on the few-shot examples to obtain a retriever that is better matched to the objective of predicting state changes. We first define the similarity between state changes. Suppose c_a and c_b are two state changes, each a set of slot-value pairs. Let F(set_1, set_2) be the average of the two F_1 scores calculated by using set_1 vs. set_2 as the target, where F_1 = 2PR/(P+R), with P the precision and R the recall. We define the slot similarity F_slot(c_a, c_b) as F computed over the sets of slots in c_a and c_b, and the slot-value pair similarity F_slot-value(c_a, c_b) as F computed over the sets of slot-value pairs. Then the similarity between c_a and c_b is

s(c_a, c_b) = (F_slot(c_a, c_b) + F_slot-value(c_a, c_b)) / 2.

The positive and negative examples for training sample x_i = (e_i, c_i) are identified by computing s(c_i, c_j) for every other sample x_j = (e_j, c_j), sorting, and taking the k highest and lowest scoring samples, respectively. We finetune the embedding model M with a contrastive loss so that the similarity between a positive example pair is high and the similarity between a negative example pair is low.
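The state-change similarity can be sketched as follows, assuming it is the equal-weight average of the slot-level and slot-value-level F_1 scores (our reading of the definition above); note that F_1 between two sets is symmetric, so averaging over both target directions reduces to a single computation.

```python
def f1(set_a, set_b):
    """Standard F1 = 2PR/(P+R) between two sets (symmetric in its arguments)."""
    if not set_a and not set_b:
        return 1.0  # two empty state changes are identical
    tp = len(set_a & set_b)
    if tp == 0:
        return 0.0
    p, r = tp / len(set_b), tp / len(set_a)
    return 2 * p * r / (p + r)

def state_change_similarity(c_a, c_b):
    """Average of slot-level F1 and slot-value-level F1 between two state
    changes, each given as a set of (slot, value) pairs."""
    f_slot = f1({s for s, _ in c_a}, {s for s, _ in c_b})
    f_slot_value = f1(c_a, c_b)
    return (f_slot + f_slot_value) / 2
```

Under this measure, two changes that touch the same slot with different values score 0.5 (slots match, pairs do not), which is exactly the graded signal the contrastive finetuning exploits.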

Baselines
TRADE (Wu et al., 2019) An encoder-decoder framework is applied to the DST problem, enabling generalization to unseen values and domains. This was the first work to explore cross-domain transfer in DST. Different from IC-DST, TRADE has to make a prediction for each domain and slot pair in separate passes.

SGP-DST (Lee et al., 2021)
In SGP-DST, schema information is used as a prompt to query a sequence-to-sequence language model (e.g., T5). It achieves SOTA on MultiWOZ 2.2. Similar to TRADE, the value for each domain and slot pair is predicted in a separate pass.
TransferQA (Lin et al., 2021a)  TransferQA reformulates DST as a QA problem. It is the state-of-the-art model for zero-shot DST. The model is pretrained with a large amount of QA data. At inference time, the model predicts slot values by taking synthesized extractive questions as input.
DS2  In DS2, DST is reformulated as a dialogue summarization problem. Sequence-to-sequence language models are trained with synthetic summary templates, and the dialogue states can be recovered by reversing the template generation rules. This is by far the strongest few-shot model in the literature, outperforming recent few-shot models like T5-DST (Lin et al., 2021b). However, unlike our approach, this model still requires finetuning on DST labels.

Experimental settings
Few-shot setting We follow the multi-domain scenario from Wu et al. (2020), where 1%, 5%, and 10% of training data are sampled as the selection pool. The retriever is fine-tuned on the selection pool and does not see any other DST data.
Zero-shot setting  There are no labeled examples to retrieve, but a single formatting example turn is included, following previous zero-shot learning work.

Experimental Details
Language models  GPT-3 (Brown et al., 2020) is a language model with 175B parameters pretrained on a large web corpus. It demonstrates strong zero-shot results on language modeling benchmarks. Its successor, Codex (Chen et al., 2021), is additionally pretrained on open-source code from GitHub, 4 which enables applications such as code completion. In our initial studies, as in Shin and Van Durme (2022), Codex substantially outperformed GPT-3; therefore, we use Codex for the following experiments. In this paper, we use Codex-Davinci. 5 In addition, we report results using GPT-Neo (Black et al., 2021) and CodeGen (Nijkamp et al., 2022).

Evaluation  The standard joint goal accuracy (JGA) is used as the evaluation metric. It treats a prediction as correct only if, for every domain, all slots exactly match the ground-truth values. To be consistent with prior work (Wu et al., 2019), we report all-domain JGA in few-shot settings and per-domain JGA in zero-shot settings. We also report the F_1 on slot-value pairs for analysis.
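Joint goal accuracy as described above can be sketched in a few lines; a turn contributes to the score only when the entire predicted state matches the gold state exactly.

```python
def joint_goal_accuracy(predictions, golds):
    """Joint goal accuracy over a list of turns. Each state is a dict
    mapping "domain-slot" -> value; a turn is correct only if the full
    predicted state exactly matches the gold state across all domains."""
    correct = sum(1 for p, g in zip(predictions, golds) if p == g)
    return correct / len(golds)
```

Because a single wrong or missing slot anywhere in the state makes the whole turn incorrect, JGA is a strict metric, which is why we additionally report slot-value F_1 for analysis.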

IC-DST details
The retriever is initialized with SBERT all-mpnet-v2 (220M). We use the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 2 × 10^-5 and 1000 warmup steps. During retriever finetuning, for each training sample, we first compute similarity based on the target labels. The maximum length for retriever input is 512 tokens; with the dialogue-state context representation, no turn contexts in MultiWOZ exceed this limit. For few-shot experiments, we use 10 context exemplars with Codex and 5 with GPT-Neo, due to their different length limits. We set the temperature to 0 to enable greedy (argmax) decoding during generation.

Results
Few-shot DST on MultiWOZ  Table 1 shows the results in the few-shot and full-shot settings for IC-DST compared with several recent baselines on MultiWOZ 2.1 and 2.4. As discussed in Section 3.1, MultiWOZ 2.4 is a cleaned version of MultiWOZ 2.1, and therefore performance on it is better. Our system achieves state-of-the-art performance in the 1%, 5%, and 10% few-shot settings using Codex, outperforming previous works that require model finetuning. When given more data as retrieval candidates, our systems improve. GPT-Neo and CodeGen show a similar trend but are generally worse than Codex and the other baselines, which suggests that the size of the language model matters when deploying ICL. The prior ICL DST systems did not report on the standard few-shot configurations, so they were not included as baselines; however, our system represents a significant advance over these systems as well, substantially outperforming both Unified-SKG (Xie et al., 2022) and the ICL approach of Madotto et al. (2021).

Zero-shot DST on MultiWOZ  Table 2 shows the zero-shot DST results on MultiWOZ 2.1 and 2.4. IC-DST outperforms previous results by a large margin. In addition, IC-DST has the advantage over these approaches that no training is required. The multi-domain JGA of IC-DST is 35.3% on MultiWOZ 2.4, which can be compared to 48.35% for the system using few-shot learning with 1% of the training data (roughly 80 labeled dialogues). SimpleTOD++ (Hosseini-Asl et al., 2020) and T5DST (Lin et al., 2021b) are trained on four MultiWOZ domains and tested on the unseen domain. In addition, T5DST uses human-written slot descriptions to boost zero-shot performance. TransferQA (Lin et al., 2021a) does not need any training on DST data; however, each slot has to be reformatted into a question, and the model is trained on a large amount of QA data. Our results show the flexibility of IC-DST: for each new domain or slot added, by updating the SQL tables and adding a demonstration example, the model attains good performance on the new ontology without any training.

Analysis
To better understand the effectiveness of our proposed methods, we provide a detailed analysis in this section. All ablation experiments are conducted on 100 random MultiWOZ 2.4 development set dialogues in the 5% few-shot setting. Table 3 compares approaches to representing the dialogue context of the examples in the prompt, with different retriever fine-tuning objectives. For each setting, we train a retriever with the given dialogue context representation and retrieval objective for a fair comparison. We experiment with representing the dialogue history by: (1) concatenating the whole dialogue history, (2) only the latest turn (one system utterance and one user utterance), and (3) the previous dialogue state together with the latest turn.

Figure 5: JGA at each turn. The blue line is the JGA of state changes predicted by the system in Table 3 row 5 (our IC-DST setting). The red line is the JGA of dialogue states produced by accumulating predicted state changes. The yellow line is the JGA of dialogue states predicted by the system in Table 3 row 2.

Representation of dialogue context
Under both retrieval objectives, representing the dialogue context by the previous dialogue state and the current turn gives the best performance. There are multiple possible explanations. First, the previous state is more relevant than the full context for predicting the next-turn dialogue state, especially after a topic shift. Second, full dialogue contexts are too long: to fit within the length limit of Codex, we sometimes need to truncate examples when using the full context, which lowers system performance, as evidenced by the fact that a single turn outperforms the full context but not the state-based representation.
Dialogue states vs. state changes  We also explore the benefit of using state changes vs. full dialogue states as labels (c_t vs. y_t). Note that example labels also act as the basis for the similarity objective in training the retriever. Table 3 shows that using state changes gives a substantial improvement. A likely reason is that state changes contain fewer slots than dialogue states, making for an easier prediction task for the LM. Also, there are many more example turns with the same state change than with the same dialogue state, making it easier for the retriever to fetch good examples in the few-shot scenario. We further compare the two kinds of example labels by investigating the JGA at each turn, illustrated in Figure 5. Because states accumulate more slots as turns progress, the JGA of the full state (red and yellow lines) decreases for later turns. However, all turns have a relatively small number of state changes, so the state change JGA (blue line) remains high throughout the dialogue. As a result, the JGA of the full state benefits from using predicted state change updates (red line, Table 3 row 5) as compared to predicting the full state directly (yellow line, Table 3 row 2).

Performance dependence on the retriever  Table 4 compares Codex's performance when using different retrievers. F_1 is computed over the predicted and gold full dialogue states. The "copy" baseline makes a prediction by copying the label of the nearest example retrieved by the corresponding retriever in the selection pool. This baseline gives a direct measure of the quality of the retrieved examples. First, the results show that Codex's performance is highly dependent on the quality of the retrievers and their retrieved examples. We can also see that the F_1 of the "copy" baseline correlates with the end-task performance (JGA), which supports the use of state-change F_1 as a learning objective for our similarity measure.
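The "copy" baseline can be sketched as follows; the `score` argument stands for whichever retriever similarity is in use (the function and argument names here are hypothetical).

```python
def copy_baseline(test_context, pool, score):
    """Predict by copying the label of the single nearest example in the
    selection pool under the retriever's similarity score, with no
    language model involved."""
    nearest = max(pool, key=lambda ex: score(test_context, ex[0]))
    return nearest[1]
```

Because no LM ever sees the test instance, the gap between this baseline and IC-DST isolates how much the inference LM contributes beyond the retrieved labels themselves.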
Our finetuned SBERT retriever obtains substantial improvements over the unsupervised one. The oracle retriever fetches the examples that have the highest state change F_1, assuming prior knowledge of the test instance's gold dialogue states; it shows the upper bound of retriever performance in this few-shot setting. Note that the state-of-the-art full-shot performance on MultiWOZ 2.4 is 73.62% (Ye et al., 2021b). The oracle experiment shows that few-shot ICL may match or outperform the state-of-the-art full-shot model with further retriever improvements. Table 4 also shows that the LM is not simply copying the example labels: for both evaluation metrics and all the retrievers, the "copy" baseline performance is much lower than IC-DST, indicating that the inference language model generalizes over the examples rather than copying the closest example's labels.

Effect of DST as Text-to-SQL

Table 5 shows the performance of our ICL framework given different input-output formats. We follow SimpleTOD (Hosseini-Asl et al., 2020) as the traditional format: the exemplar labels and the generation target are slot-value pairs, and the dialogue schema in the prompt is rewritten in the same format, replacing the SQL tables. The prompt for this setting is shown in Appendix A.3. By reformulating DST as a text-to-SQL task and leveraging large language models pretrained on code, ICL can make better use of the structured knowledge associated with the task schema. As shown in Table 5, GPT-Neo, CodeGen, and Codex all perform better with the text-to-SQL format. Notably, GPT-Neo performs much worse with the traditional format than with the text-to-SQL format, possibly due to its much smaller size compared to Codex: it is easier for GPT-Neo to work with SQL than to learn a slot-value pair format.
Error analysis  In examining a subset of IC-DST errors, we identified three common types, as shown in Table 6. The first type of error is caused by a noisy training example, such as a missing slot. ICL is sensitive to noisy training data because the inference LM sees only a few examples in the prompt during prediction. The second type of error reflects retrieval limitations: the retrieved samples are not good exemplars because they lack some slots that should be predicted in the test instance. This could be due to sparse annotated data (which impacts all few-shot learning methods) or a retriever error.

Related Work

Zero/few-shot DST  The state-of-the-art few-shot DST model is DS2, in which the authors reformulate DST as a summarization task. We instead propose to represent dialogue states as SQL and reformulate DST into a text-to-SQL task. Most zero-shot methods have involved domain-transfer approaches, including multi-task training with related task-oriented domains (Hosseini-Asl et al., 2020; Lin et al., 2021b) and question-answering datasets (Lin et al., 2021a). Unfortunately, the performance of these systems is quite low.

In-context learning  Shin and Van Durme (2022) and Rajkumar et al. (2022) find that GPT-3 and Codex can generalize to produce different target programs with a few in-context exemplars. Xie et al. (2022) and Madotto et al. (2021) were the first to apply ICL to DST, but their systems underperform other methods. Future work may consider improving dialogue retrieval methods and task prompt formulation.
Retrieval  Most current work on ICL focuses on sentences or documents, while our task involves retrieving dialogues. There are two general kinds of semantic retrievers. The first is similarity-based retrieval: Poesia et al. (2022) and Das et al. (2021) define a similarity metric between semantic parsing results and use this similarity as the training objective for the retriever. The second is LM-score-based retrieval: Rubin et al. (2022) and Shin et al. (2021) measure the quality of an example by the probability of a large language model decoding the correct answer, and the k highest and lowest quality samples are used as positive and negative samples for retriever training. The most relevant retrieval studies on dialogue focus on tasks like knowledge identification (Wu et al., 2021) and response selection (Yuan et al., 2019; Han et al., 2021), whose tasks and settings differ from ours.

Conclusion
We successfully apply in-context learning to dialogue state tracking by introducing a new approach to representing dialogue context, a novel objective for retriever training, and a reformulation of DST into a text-to-SQL task. On MultiWOZ, our system achieves a new state of the art in both few-shot and zero-shot settings, and our analyses show that each of these innovations benefits performance, studying the contribution of each design decision in detail. Future work may apply this in-context learning framework to a wider range of dialogue tasks.

Acknowledgments
This research was supported in part by funding from Allstate. We thank OpenAI for free access to Codex and Amazon for AWS credits.

Limitations
The performance of our framework is highly dependent on the inference language model, which may limit the framework's use. For example, our framework may not work as well on speech recognition transcripts or in other languages because of the lack of such data during language model pretraining. Future work may explore the robustness and generalization ability of in-context learning, for which IC-DST can serve as a test bed. Also, there is a tradeoff between avoiding the cost of finetuning a large language model and the cost of inference. Fortunately, thanks to recent efforts in open-source large models like OPT (Zhang et al., 2022) and BLOOM (BigScience) and model compression techniques like LLM.int8() (Dettmers et al., 2022), the cost of running large language models has been drastically reduced, and we are optimistic that this trend will continue. Further, it is possible to use zero/few-shot methods as a teacher model to generate "pseudo-labels" for training a system with lower inference cost, as in Ye et al. (2022). Future work may investigate low-cost approaches to applying in-context learning and large language models.

A.1 Prompt in Few-Shot Settings
Below is the full version of the prompt for few-shot settings. Here we include 5 examples retrieved by our finetuned retriever from a 5% subset of the MultiWOZ training set. Notice that the test instance is preceded by "Example #6", but its label must be completed by the LM. The completion of Codex is at the end.

A.2 Prompt in Zero-Shot Settings
Below is an example of the prompt in zero-shot settings. Some slots are renamed to be easier for the model to understand. There is one crafted turn for task demonstration, marked as "Example #1". "Example #2" is the test instance that needs to be completed. The schema portion of the prompt contains one table per domain; for example, the hotel table:

CREATE TABLE hotel (
  name text,
  pricerange text CHECK (pricerange IN (dontcare, cheap, moderate, expensive)),
  type text CHECK (type IN (hotel, guest house)),
  parking text CHECK (parking IN (dontcare, yes, no)),
  book_number_of_days text,
  book_day text,
  book_people text,
  area text CHECK (area IN (dontcare, centre, east, north, south, west)),
  stars text CHECK (stars IN (dontcare, 0, 1, 2, 3, 4, 5)),
  internet text CHECK (internet IN (dontcare, yes, no))
)
/* 4 example rows:
SELECT * FROM hotel LIMIT 4;
name pricerange type parking book_number_of_days book_day book_people area stars internet
a and b guest house moderate guest house dontcare 3 friday 5 east 4 yes
ashley hotel expensive hotel yes 2 thursday 5 north 5 yes
el shaddia guest house cheap guest house yes 5 friday 2 centre dontcare no
express by holiday inn cambridge dontcare guest house yes 3 monday 2 east dontcare no
*/

Training details  For retriever finetuning, we train all our retrievers on a single NVIDIA A10G graphics card. In the 5% few-shot scenario, each training epoch takes 10 minutes on average with the given hyperparameters, and we train for 10 epochs. In the 100% full-shot scenario, we train for 2 epochs, which takes 6 hours; training time is linear in the data size.
Only retriever finetuning requires hyperparameter tuning. Hyperparameters are given in our code, and the best configuration is provided in the experiment section; they were chosen by manual tuning. The development evaluation metric is the average F_1 score of the retrieved examples, as described in the paper. For learning rates, we experimented with 1 × 10^-5 and 2 × 10^-5. For contrastive learning, we scale the number of positive/negative examples for each data point linearly with the selection pool size: when using 1% of the labeled data, the number of positive/negative examples is 2; for 5% of the data, it is 10. We also experimented with doubling the number of positive/negative examples. All few-shot test numbers are the average of three evaluation runs, and all retrievers are trained just once per hyperparameter configuration.

Dataset  MultiWOZ (Budzianowski et al., 2018) is an English multi-domain task-oriented dialogue dataset containing 7 domains. Following previous work (Wu et al., 2019), we use 5 domains: hotel, taxi, attraction, restaurant, and train. There are 8438 dialogues in the training set and 1000 dialogues in each of the dev and test sets. On average, there are 13.46 turns per dialogue and 13.13 tokens per turn. For preprocessing, we use the scripts of Ye et al. (2021a), which mainly fix typos and standardize formatting. All data are downloadable from Ye et al. (2021a).