Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue

Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don’t Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization: the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.


Introduction
Task-oriented dialogue (TOD) systems need to support an ever-increasing variety of services. Since many service developers lack the resources to collect labeled data and/or the requisite ML expertise, zero and few-shot transfer to unseen services is critical to the democratization of dialogue agents.
New approaches to TOD that can generalize to new services primarily rely on combining two techniques: large language models like BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), and schema-guided modeling, i.e. using natural language descriptions of schema elements (intents and slots) as model inputs to enable inference on unseen services (Rastogi et al., 2020a,b). Models combining the two currently show state-of-the-art results on dialogue state tracking (DST) (Heck et al., 2020; Lee et al., 2021a; Zhao et al., 2022).*

*Equal contribution
However, description-based schema representations have drawbacks. Writing precise natural language descriptions requires manual effort and can be tricky, and descriptions only provide indirect supervision about how to interact with a service compared to an example. Furthermore, Lee et al. (2021b) showed that state-of-the-art schema-guided DST models may not be robust to variation in schema descriptions, causing significant accuracy drops.
We propose using a single dialogue example with state annotations as an alternative to the description-based schema representation, similar to one-shot priming (Brown et al., 2020). Rather than tell the model about schema element semantics in natural language, we show the schema through a demonstration, as in Figure 1. Applying our approach, Show, Don't Tell (SDT), to two SoTA DST models consistently results in superior accuracy and generalization to new APIs across both the Schema-Guided Dialogue (SGD) (Rastogi et al., 2020b) and MultiWOZ leave-one-out (Budzianowski et al., 2018; Lin et al., 2021b) benchmarks, while being more data-efficient and robust to schema variations.

Show, Don't Tell
[Figure 1: Illustration of all prompt formats for a payment service for both description-based and Show, Don't Tell models with independent (top) and sequential (bottom) decoding of dialogue state.]

Following SoTA models, we pose DST as a seq2seq task (Wu et al., 2019; Zhao et al., 2021a), where the seq2seq language model (in our case, T5) is finetuned on a DST dataset. During finetuning and evaluation, the model input consists of a prompt and context, and the target contains ground truth belief states. We compare against two baselines:
• T5-ind (Lee et al., 2021a): Model input comprises the description of a single slot as the prompt, followed by the dialogue history as the context. The target is the value of that slot, i.e. each slot is decoded independently.
• T5-seq (Zhao et al., 2022): Model input comprises the descriptions of all slots as the prompt, followed by the dialogue history as the context. The target is the sequence of slot-value pairs in the dialogue state, i.e. the dialogue state is decoded sequentially in a single pass.
We modify the prompt formats above to utilize demonstrations instead of descriptions as described below and illustrated in Figure 1.
• SDT-ind: A prompt P_ind^i comprises a single utterance labeled with a slot-value pair, formatted as P_ind^i = [ex] u_i [slot] sv_i, where u_i is a user utterance in which slot i is active and sv_i is the slot-value pair.
• SDT-seq: A prompt P_seq comprises a single labeled dialogue, formatted as P_seq = [ex] u_1 ... u_n [slot] sv_1 ... sv_m, where u_j is an utterance and other symbols are shared with SDT-ind. In simple terms, the prompt is constructed by concatenating all utterances in the example dialogue, followed by all slot-value pairs in the final dialogue state.
The context in both formats is a concatenation of the dialogue history for the current training example. The final model input is formed by concatenating the prompt and the context strings. The target string is the same as for the T5-* baselines, containing only a single slot value for *-ind models and the entire turn's belief state for *-seq models.
For both the T5-* baselines and SDT-*, we enumerate the categorical slot values in multiple-choice format in the prompt and task the models with decoding the correct multiple-choice letter.
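To make the format concrete, here is an illustrative sketch (not the authors' released code) of assembling an SDT-seq model input from the Figure 1 payment-service example. The function name and the dict-based state representation are our own assumptions; the delimiter tokens [ex], [user], [system], and [slot] follow the format described above:

```python
def build_sdt_seq_input(example_turns, example_state, categorical_values, history):
    """Assemble an SDT-seq model input: demonstration prompt + dialogue context.

    example_turns: list of (speaker, utterance) pairs from the demonstration.
    example_state: dict of slot -> value for the demonstration's final state.
    categorical_values: dict of slot -> list of possible values, enumerated as
        multiple-choice letters so the model decodes a letter, not a value string.
    history: list of (speaker, utterance) pairs for the current dialogue.
    """
    def letters(values):
        # Enumerate values as "a) v1 b) v2 ..." multiple-choice options.
        return " ".join(f"{chr(ord('a') + i)}) {v}" for i, v in enumerate(values))

    prompt = "[ex] " + " ".join(f"[{spk}] {utt}" for spk, utt in example_turns)
    annotations = []
    for slot, value in example_state.items():
        if slot in categorical_values:
            opts = categorical_values[slot]
            letter = chr(ord("a") + opts.index(value))
            annotations.append(f"{slot}={letter} of {letters(opts)}")
        else:
            annotations.append(f"{slot}={value}")
    prompt += " [slot] " + " ".join(annotations)
    context = " ".join(f"[{spk}] {utt}" for spk, utt in history)
    return prompt + " " + context
```

With the Figure 1 payment dialogue, this produces annotations such as `payment_method=a of a) credit card b) debit card c) app balance`, matching the multiple-choice enumeration described above.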
More details on the prompt design and its impact on performance are provided in Appendix C.
Formulating prompt examples: It is imperative that SDT prompts contain sufficient information to infer the semantics of all slots in the schema. This is easy for SDT-ind, which uses a separate prompt for each slot. For SDT-seq, we therefore only choose example dialogues in which all slots in the schema are used.
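A minimal sketch of this selection criterion (a hypothetical helper, assuming dialogue states are dicts from slot name to value):

```python
def covers_all_slots(dialogue_state, schema_slots):
    """An SDT-seq demonstration qualifies only if its final dialogue state
    assigns a (non-null) value to every slot defined in the service schema."""
    filled = {slot for slot, value in dialogue_state.items() if value is not None}
    return set(schema_slots) <= filled
```

A candidate dialogue is kept only when this predicate holds for the service's full slot list.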

Experimental Setup
Datasets: We conduct experiments on two DST benchmarks: Schema-Guided Dialogue (SGD) (Rastogi et al., 2020b) and MultiWOZ 2.1 (Budzianowski et al., 2018; Eric et al., 2020). For MultiWOZ, we evaluate on the leave-one-out setup (Wu et al., 2019; Lin et al., 2021a), where models are trained on all domains but one and evaluated on the holdout domain. Additionally, we apply the recommended TRADE pre-processing script for fair comparison with other work. For both datasets, we created concise prompt dialogues modeled after dialogues observed in the datasets.

Implementation: We train SDT models by finetuning pretrained T5 1.1 checkpoints. For both datasets, we select one example prompt per service schema (for SDT-seq) or per slot (for SDT-ind).

SGD Results
We create 5 versions of the prompts and report the average JGA across them, along with 95% confidence intervals. SDT-seq achieves the highest JGA, showing major gains, particularly on unseen services, over its description-based counterpart T5-seq and the next-best model T5-ind. SDT-ind is comparable to its counterpart T5-ind and better than T5-seq. Based on these results, conveying service semantics via a single dialogue example appears more effective than using natural language descriptions. We hypothesize that SDT-seq outperforms SDT-ind because the full dialogue prompts used in SDT-seq demonstrate more complex linguistic patterns (e.g. coreference resolution, long-term dependencies) than the single-utterance prompts of SDT-ind. On the other hand, T5-seq does not outperform T5-ind because no additional information is conveyed to the model through stacking descriptions. Also, all else equal, decoding all slots in one pass is more challenging than decoding each slot independently.
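As a reference point for the numbers above, joint goal accuracy (JGA) counts a turn as correct only when the entire predicted dialogue state matches the ground truth exactly; a minimal sketch of the metric:

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose full predicted belief state matches the
    reference state exactly (every slot-value pair must agree)."""
    assert len(predictions) == len(references)
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)
```

Because a single wrong slot fails the whole turn, JGA is a strict metric, which is why gains on unseen services are notable.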
We also experimented with using more than one dialogue to prompt SDT-seq, but we did not see an increase in performance.

MultiWOZ Results
SDT establishes new state-of-the-art results on the overall task (+2% JGA) and in 3 of the 5 domains.

Impact of Model Size
T5's XXL size may be unsuitable in a number of settings; consequently, we measure SDT's performance on SGD across other model sizes in Table 3.
For the base and large model sizes, both SDT variations offer higher JGA than their description-based counterparts, possibly due to smaller T5 models being less capable of inferring unseen slots with just a description. Additionally, SDT-ind outperforms SDT-seq for these sizes.

Data Efficiency
To examine the data efficiency of SDT models, we train SDT-seq in a low-resource setting with 0.16% (10-shot), 1%, and 10% of the SGD training data and evaluate on the entire test set. For 10-shot, we randomly sample 10 training dialogues from every service; for 1% and 10%, we sample uniformly across the entire dataset. SDT-seq demonstrates far higher data efficiency than T5-seq (Table 4).
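A sketch of the two sampling schemes described above (a hypothetical helper; we assume each training dialogue record carries a 'service' key):

```python
import random

def sample_low_resource(dialogues, fraction=None, per_service=None, seed=0):
    """Subsample training dialogues for the low-resource experiments: either
    k dialogues per service (k-shot) or a uniform fraction of the dataset."""
    rng = random.Random(seed)
    if per_service is not None:
        # k-shot: sample up to k dialogues from every service.
        by_service = {}
        for d in dialogues:
            by_service.setdefault(d["service"], []).append(d)
        return [d for dlgs in by_service.values()
                for d in rng.sample(dlgs, min(per_service, len(dlgs)))]
    # Uniform sampling across the entire dataset.
    return rng.sample(dialogues, int(len(dialogues) * fraction))
```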

[Figure 2: Examples of common error patterns made by T5-seq but not SDT-seq, and vice versa. (#1) T5-seq misses active slots: for "Can you please add an alarm called Grocery run.", T5-seq predicts new_alarm_name=None. (#3) SDT-seq misses categorical values not seen in the prompt: for "I like Broadway shows and want to see one on Tuesday next week.", SDT-seq predicts event_type=music (ground truth: theater).]

Robustness
Schema-guided models should ideally be robust to changes in schema wording. To this end, we evaluate SDT-seq on the SGD-X (Lee et al., 2021b) benchmark, which comprises 5 variants with paraphrased slot names and descriptions per schema (see Appendix Figure 4). Table 5 shows SDT-seq achieves the highest average JGA (JGA v1-5) and lowest schema sensitivity (SS JGA), indicating it is the most robust of the compared models. Even so, the JGA decline indicates SDT-seq is sensitive to how slot names are written.

Writing descriptions vs. demonstrations
We note that the information provided to SDT is not identical to what is provided to typical schema-guided models, as SDT exchanges natural language descriptions for a demonstration of identifying slots in a dialogue. However, we argue that from the developer standpoint, creating a single example is similar in effort to writing descriptions, so we consider the methods comparable. Creating the SDT-seq prompts for all 45 services in SGD took an experienced annotator ∼2 hours, compared to ∼1.5 hours for generating slot descriptions. SDT-ind prompts are even simpler to write because they relax the requirement of creating a coherent dialogue where all slots are used. One advantage of descriptions is that they can be easier to generate than a succinct dialogue that covers all slots. However, given the performance gain, example-based prompts may be a better choice for many settings, especially for smaller model sizes where the gain is more pronounced.

Descriptions plus demonstrations
Training with descriptions has proven effective for improving DST performance (Zhao et al., 2022; Lee et al., 2021a), and our experiments show that demonstrations are even more effective. We combine the two to see whether they are complementary or overlapping, and find that performance does not improve over using demonstrations alone (Appendix Table A1). We hypothesize that demonstrations already convey slot semantics sufficiently, and descriptions become extraneous.

Prompting vs. traditional finetuning
To understand the impact of using a single demonstration as a prompt vs. traditional finetuning, we finetune T5-seq a second time on the same set of dialogues used in SDT-seq prompts; it therefore has access to both slot descriptions as well as a single demonstration for each service. In this case, T5-seq is provided strictly more information than SDT-seq. T5-seq with finetuning obtains a JGA of 87.7% on SGD, on par with T5-ind but still lower than SDT-seq, suggesting that dialogue examples are better used as prompts (Le Scao and Rush, 2021). Interestingly, finetuning on more than one dialogue example per service did not improve performance (Appendix Figure 3).

Error analysis
Figure 2 compares some common errors made by T5-seq and SDT-seq. The patterns suggest that SDT's demonstrations are helpful when multiple slots are similar to each other (#1) and when prompt dialogues closely match target dialogues (#2). However, SDT can be limited by its prompt. For instance, in #3 it has only seen the "music" value for the event_type slot in the prompt, potentially resulting in under-predicting the other categorical value ("theater").

Conclusion
We study the use of demonstrations as LM prompts to convey the semantics of APIs in lieu of natural language descriptions for TOD. While taking similar effort to construct, demonstrations outperform description-based prompts in our experiments across DST datasets (SGD and MultiWOZ), model sizes, and training data sizes, while being more robust to changes in schemata. This work provides developers of TOD systems with more options for API representations to enable transfer to unseen services. In future work, we would like to explore this representation for other TOD tasks (e.g. dialogue management and response generation).

Ethical Considerations
We proposed a more efficient way of building TOD systems by leveraging demonstrations in place of descriptions, leading to increased accuracy with minimal/no data preparation overhead. We conduct our experiments on publicly-available TOD datasets in English, covering domains which are popular for building conversational agents. We hope our work leads to building more accurate TOD systems with similar or less overhead, and encourages further research in the area.

A SDT Model Details
All T5 checkpoints used are publicly available.
For all experiments, we use a sequence length of 2048, dropout of 10%, and a batch size of 16. We used a constant learning rate of 1e-3 or 1e-4. All models were trained for 50k steps or until convergence, and each experiment was conducted on either 64 or 128 TPU v3 chips (Jouppi et al., 2017).
TransferQA is based on T5-large, and T5-ind and T5-seq are based on T5-XXL in this paper unless otherwise noted.

C Prompt Design
We experimented with various formats for the SDT prompt before arriving at the final format. Below, we list alternative designs that we tried and their impact on JGA, as evaluated on the SGD test set.

C.1 Categorical value strings vs. multiple choice answers
We found that JGA dropped by 2% when we tasked the model with decoding categorical values instead of multiple-choice answers, e.g. payment_method=debit card instead of payment_method=b (where b is linked to the value debit card in the prompt, as described in Section 2). When tasked with decoding categorical values, the model would often decode related yet invalid values, which we counted as incorrect in our evaluation. For example, instead of debit card, the model might decode bank balance.
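Under the multiple-choice scheme, a decoded letter is mapped back to its categorical value at evaluation time; an illustrative sketch (our own helper, not the paper's code), where letters outside the enumeration count as incorrect predictions:

```python
def resolve_categorical(decoded_letter, options):
    """Map a decoded multiple-choice letter (e.g. 'b') back to its
    categorical value; return None for letters outside the enumeration,
    which are then scored as incorrect."""
    index = ord(decoded_letter) - ord("a")
    return options[index] if 0 <= index < len(options) else None
```

This constrains the model's output space to the enumerated options, avoiding the related-but-invalid values (e.g. bank balance) observed with free-form value decoding.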

C.2 Slot IDs vs. slot names
When we delexicalized slot names into slot IDs, JGA dropped by 5%. One downside of this approach is that the model lost access to valuable semantic information conveyed by the slot name. Another downside is that the model could not distinguish two slots that had the same value in the prompt. For example, if the prompt was "I would like a pet-friendly hotel room with wifi" and the corresponding slots were 1=True (has_wifi) and 2=True (pets_allowed), it is ambiguous which ID refers to which slot. The potential upside of using slot IDs was to remove dependence on the choice of slot name, but this did not succeed for the reasons above.

C.3 Decoding active slots vs. all slots
We experimented with training the model to decode only active slots rather than all slots (with none values when inactive). JGA dropped by 0.4%, which we hypothesized could be a result of greater dissimilarity between the slot-value string in the prompt (which contained all slots by construction) and the target, which only contained a subset of slots.
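A sketch of the all-slots target serialization described above (a hypothetical helper; the none convention for inactive slots follows the text):

```python
def build_target(schema_slots, active_state):
    """Serialize the target string: every schema slot appears, with 'none'
    for inactive slots, mirroring the prompt's all-slots annotation format."""
    return " ".join(f"{slot}={active_state.get(slot, 'none')}"
                    for slot in schema_slots)
```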

C.4 In-line annotations vs. dialogue+slots concatenated
We hypothesized that bringing the slot annotation in the prompt closer to where it was mentioned in the dialogue might help the model better understand the slot's semantic meaning. We changed the format so that each slot-value annotation appeared in-line, immediately after the utterance mentioning it, rather than concatenated after the full dialogue. However, this decreased JGA by more than 20%. We hypothesized that this was likely due to a mismatch between the prompt's annotations and the target string format, which remained the same.

Table A1: Prompting with both descriptions and demonstrations (SDT-seq + desc) vs. demonstrations only (SDT-seq) on SGD; descriptions do not improve performance.

                 All        Seen       Unseen
SDT-seq + desc   88.6±0.9   95.7±0.5   86.2±1.0
SDT-seq          88.8±0.5   95.8±0.2   86.4±0.7

Figure 3: Results of secondarily finetuning T5-seq with dialogues, to help understand whether prompting or finetuning is more effective. The examples used for finetuning are derived from the set of dialogues used as prompts across the 5 trials of SDT-seq. From this, we observe that prompting outperforms finetuning.

Figure 4: The original schema for a Payment service and its closest (v1) and farthest (v5) SGD-X variants, as measured by linguistic distance functions. For the SGD-X benchmark, models are trained on the original SGD dataset and evaluated on the test set, where the original test set schemas are replaced by SGD-X variant schemas.