Prompter: Zero-shot Adaptive Prefixes for Dialogue State Tracking Domain Adaptation

A challenge in the Dialogue State Tracking (DST) field is adapting models to new domains without using any supervised data, i.e. zero-shot domain adaptation. Parameter-Efficient Transfer Learning (PETL) has the potential to address this problem due to its robustness. However, it has yet to be applied to zero-shot scenarios, as it is not clear how to apply it without supervision. Our method, Prompter, uses descriptions of target domain slots to generate dynamic prefixes that are concatenated to the keys and values at each layer's self-attention mechanism. This allows for the use of prefix-tuning in zero-shot settings. Prompter outperforms previous methods on both the MultiWOZ and SGD benchmarks. In generating prefixes, our analyses find that Prompter utilizes not only the semantics of slot descriptions but also how often the slots appear together in conversation. Moreover, Prompter's gains stem from its improved ability to distinguish 'none'-valued dialogue slots, compared against baselines.


Introduction
Task-oriented dialogue (TOD) systems serve users through several tasks, such as booking a table in a restaurant or suggesting tourist attractions. One crucial component of these systems, Dialogue State Tracking (DST), is responsible for extracting users' preferences (i.e. slot-values) over key attributes (i.e. slot-labels) of their service (Wu et al., 2019). DST has a significant role in TOD systems as it ensures that both the action taken in the back-end and the responses returned to the users are aligned with the preferences that the users indicate.
A challenging task in this field is to adapt an existing DST model to a new domain it has not seen before without using any supervised data, i.e. in the zero-shot scenario. This is important, as in many new scenarios it is hard to collect data, let alone annotate it; yet it remains essential for a TOD system to appropriately answer queries in new contexts. The challenge arises from the differences in dialogue context, slot values, and slot labels among different domains. For example, a model could be trained on the 'taxi-booking' domain and thus be capable of extracting the destination for a taxi; but when deployed to the 'train-booking' domain, the range of slot-values changes, resulting in a higher probability of a mistaken inference. We show an example (Figure 1) where, due to the superficial connections a baseline T5 model forms, it incorrectly predicts 'Ashley Hotel' as the train destination (bottom left). In many dialogue contexts, a large number of slots are left unspecified; these are known as 'none'-valued slots. When a model adapts to a new domain without any prior training, it often mis-predicts such none-valued slots, which makes addressing domain shift all the more important. Lin et al. (2021b) proposed to address this domain shift challenge via the language model's intrinsic ability to reason over prompts. Specifically, they concatenate the description of each slot as a hard prompt to the dialogue context and then generate the answers using the T5 model. While this does well for a naive baseline, it makes mistakes due to its superficial understanding of slot labels.
Meanwhile, another line of study has shown that Parameter-Efficient Transfer Learning (PETL) methods are effective at addressing domain shift. Because they introduce only a small number of parameters per task/instance, they mitigate overfitting in few-shot scenarios, outperforming earlier baselines. There have been various attempts to use these methods for DST within a few-shot, continual learning setting (Zhu et al., 2022; Madotto et al., 2021). However, a significant barrier to adopting PETL is that such methods cannot be directly applied in zero-shot settings, as they all require some form of supervised training.
In this study, we propose a new method to use prefix-tuning under a zero-shot scenario, to benefit from the robustness gains it brings even without supervised data. Rather than fine-tuning the prefixes during training, we add a new mechanism, called Prompter, into the T5 architecture. Prompter simply takes the description of a slot and generates the prefixes on the fly. We then append these prefixes at each layer of the encoder to represent the dialogue from the perspective of the subject slot label. This method makes minimal changes to LM parameters while generating unsupervised prefixes, ensuring both the preservation of general-purpose traits and extrapolation to new domains.
We conduct experiments with the MultiWOZ 2.1 and SGD datasets. Prompter improves average JGA across domains by 1.7 points for MultiWOZ and 9.1 points for SGD (considering the 4 domains reported in prior studies) compared to the strongest baseline. This shows that the robustness advantage of PETL methods also carries over to unsupervised domain adaptation scenarios. To the best of our knowledge, these are the highest results achieved so far using a small language model. Through further analysis, we discover that Prompter considers not only the semantic similarities of slot descriptions but also how frequently slots co-appear in the dialogue context. Furthermore, Prompter proves more effective than previous methods at identifying slots that have no value within a conversation.

Related Work
Dialogue State Tracking. DST has a long history of models working with a static, ontology-based problem definition (i.e. slot-values are fixed) (Balaraman et al., 2021). Static-ontology DST is a simplified classification problem where the model selects a value from each slot's value pool (Zhang et al., 2020; Rastogi et al., 2017; Zhong et al., 2018). Recently, dynamic ontologies have received attention, adding flexibility at inference time (Wu et al., 2019; Rastogi et al., 2019; Heck et al., 2020).
Low-resource Domain Adaptation. A dynamic ontology introduces slot-value level flexibility, but its ability to work with new slot-labels is limited. Domain adaptation of DST systems aims to make the model adaptable to new domains/slot-labels. Some studies have attempted to utilize language models' intrinsic reasoning abilities by mapping DST to a question-answering task (Zhou and Small, 2019). Shin et al. (2022), on the other hand, map DST to a dialogue summarization task, and Xie et al. (2022) map it to a structured-knowledge grounding task. Many use data augmentation to address the lack of supervision in the target domain (Qiu et al., 2022; Mi et al., 2021; Gritta et al., 2021; Aksu et al., 2022; Li et al., 2020). Finally, the remaining studies focus on improving the model's architecture and training strategies for robustness toward domain changes (Feng et al., 2022; Balaraman and Magnini, 2020; Madotto and Liu, 2020; Huang et al., 2020; Coope et al., 2020; Wu et al., 2019; Lei et al., 2018; Lin et al., 2021b). Wang et al. (2022) share our goal but use a different method: they create cross-slot dependency by combining multiple slot prompts into a final prompt, which encourages the model to apply what it has learned in one slot to other slots.
PETL for DST Domain Adaptation. Parameter-Efficient Transfer Learning (PETL) is a recently trending set of methods that aims to adapt models more efficiently by significantly reducing the number of parameters that need to be fine-tuned (Pfeiffer et al., 2020; Lester et al., 2021; Li and Liang, 2021; Houlsby et al., 2019). Many studies have found that PETL is advantageous in low-resource domain adaptation settings due to its efficient parameter training scheme, which minimizes changes to LM parameters and is thus believed to prevent over-fitting (Li and Liang, 2021). However, He et al. (2022) argue that tuning the entire language model does not negatively impact its robustness advantage. Researchers in the DST field have also utilized PETL methods for their robustness. Zhu et al. (2022) employed soft prompts and fine-tuned them for each domain in a continual learning setting, utilizing validation sets from target domains to decide which previous prompts to use for initialization. Madotto et al. (2021) also tackled continual learning, using unique adapters for each domain and relying on a classifier to select which adapter to use during inference. Both studies only explored PETL methods for DST with few-shot availability. In contrast, this study investigates a well-known PETL method, prefix-tuning (Li and Liang, 2021), for zero-shot domain adaptation of DST models.

Dialogue State Tracking Task
A task-oriented dialogue consists of a number of consecutive system and user utterances, together referred to as a turn, t_i = (s_i, u_i). Each turn is annotated with a belief state that captures the user's preferences over a number of attributes from various domains up to and including that turn, B_i = (D_0, D_1, ..., D_K), where D_j is the belief state for domain j and K is the total number of domains. The belief state for each domain is made up of a list of slot-label (e.g. 'restaurant-area') and slot-value (e.g. 'center') pairs, D_j = {s_0: v_0, s_1: v_1, ..., s_N: v_N}, where N is the number of slots within domain j. Each s_i is further annotated with a description that explains the attribute in the context of the domain (e.g. 'restaurant-area': 'The area of the city where the restaurant is located.'). For each v_i, if s_i is not discussed in the dialogue context, v_i is set to 'none'; otherwise, v_i is a sequence of tokens. The task of DST is to predict the belief state B_i for a given dialogue context DC, i.e. the dialogue turn history up to and including turn i, DC = (t_0, t_1, ..., t_i).
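As a concrete illustration of these structures, the following sketch represents a turn list, a per-domain belief state, and the dialogue context as plain Python objects (slot names and values here are invented examples, not the datasets' exact schema):

```python
# Minimal illustration of the DST data structures defined above.
# Slot names/values are invented examples, not the exact MultiWOZ schema.

# A turn t_i is a (system utterance, user utterance) pair.
turns = [
    ("", "I need a cheap restaurant in the centre."),
    ("Curry Garden is a cheap restaurant in the centre.", "Book it for 2 people."),
]

# The belief state per domain maps slot-labels to slot-values;
# slots not discussed in the dialogue take the value 'none'.
belief_state = {
    "restaurant": {
        "restaurant-area": "centre",
        "restaurant-pricerange": "cheap",
        "restaurant-bookpeople": "2",
        "restaurant-food": "none",  # not discussed in the dialogue
    }
}

# Each slot-label carries a natural-language description (used by Prompter).
slot_descriptions = {
    "restaurant-area": "The area of the city where the restaurant is located.",
}

# Dialogue context DC = (t_0, ..., t_i): all turns up to and including turn i.
def dialogue_context(turns, i):
    return turns[: i + 1]

print(len(dialogue_context(turns, 1)))  # 2
```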

Prefix-Tuning
Prefix-tuning is a parameter-efficient alternative to fine-tuning that optimizes a small continuous task-specific vector, called the prefix, for each new task. These tunable prefix vectors are prepended to the keys and values of the multi-head attention at every layer of the transformer (Li and Liang, 2021; He et al., 2021). Li and Liang (2021) also report that prefix-tuning improves extrapolation to unseen tasks in few-shot settings. However, there is no straightforward way to use this method in the zero-shot setting, as it requires supervision to fine-tune the prefixes.
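Concretely, prefix-tuning leaves the frozen attention computation untouched and only lengthens the key/value sequences. A minimal single-head NumPy sketch (illustrative dimensions only, not the actual transformer implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(q, k, v, k_prefix, v_prefix):
    """Single-head attention where tunable prefix vectors are prepended
    to the (frozen) keys and values, as in Li and Liang (2021)."""
    k = np.concatenate([k_prefix, k], axis=0)  # (P+T, d)
    v = np.concatenate([v_prefix, v], axis=0)  # (P+T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, P+T)
    return softmax(scores) @ v                 # (T, d)

rng = np.random.default_rng(0)
d, T, P = 8, 5, 3  # model dim, sequence length, prefix length
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
k_pre, v_pre = rng.normal(size=(P, d)), rng.normal(size=(P, d))
out = attention_with_prefix(q, k, v, k_pre, v_pre)
print(out.shape)  # (5, 8)
```

In standard prefix-tuning, `k_pre` and `v_pre` are the trainable parameters; Prompter instead generates them from slot descriptions, as described in the next section.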

Method
We propose to add a new mechanism, called Prompter, into the T5 architecture (Raffel et al., 2019) to take advantage of prefix-tuning's extrapolation capabilities without requiring supervision. Instead of fine-tuning the prefixes with source domain data, we generate them on the fly for each slot. However, we need a way to condition Prompter on a new domain without any supervised data. Task-oriented dialogue schemas provide a solution by annotating each slot-label with a slot description. Using these slot descriptions, Prompter can generate domain-specific prefixes, which allows it to adapt to any domain without the need for supervised data. The Prompter pipeline consists of three key parts: (1) Slot Prompt Generation, (2) Prefix Generation, and (3) Multi-head Self Attention.
Slot Prompt Generation. This step generates a prompt that is specific to each slot, using its unique description. Previous approaches to this problem, such as simply concatenating the description to the input, result in only a superficial understanding of the slots in zero-shot settings (Lin et al., 2021b). Additionally, using slot embeddings as soft prompts can cause unstable training and hinder zero-shot adaptation due to changes in the descriptions. Instead, we propose using a global prompt that is modified according to each slot's description. This modification is applied through a cross-attention mechanism that attends the global prompt to the slot description's embedding, c.f. Figure 2a. This approach ensures that each slot prompt shares the same initialization, addressing unstable training, while the modifications reflect changes in the slot-label, addressing domain shift. It also has the advantage of making the final prompt's length fixed, regardless of the length of the description. The slot prompt is calculated as:

S = softmax((G W_q)(E W_k)^T / sqrt(d)) (E W_v)

where W_q, W_k, and W_v ∈ R^(d×d) are the query, key, and value weights of the cross-attention mechanism and d is the model dimension; G ∈ R^(N×d) is the global prompt; E ∈ R^(K×d) is the slot description embedding, with K the length of the slot description; and S ∈ R^(N×d) is the resulting slot prompt.

Figure 2: The architecture of our proposed method, Prompter. Prompter leverages the prefix-tuning method to enable zero-shot learning without the need for supervised data and is composed of three parts: (a) Slot Prompt Generation, where the information from the description is fused with a global prompt to generate slot-specific prompts; (b) Prefix Generation, which feeds slot prompts through two linear layers and an activation function to generate per-layer key and value prefixes; (c) finally, these prefixes are concatenated to the keys and values at every layer of the T5 encoder.
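A NumPy sketch of this cross-attention step (illustrative dimensions; the real model uses the T5 hidden size and learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_prompt(G, E, W_q, W_k, W_v):
    """Cross-attention: the global prompt G (N x d) attends to the slot
    description embedding E (K x d), producing the slot prompt S (N x d).
    S has fixed length N regardless of the description length K."""
    scores = (G @ W_q) @ (E @ W_k).T / np.sqrt(G.shape[-1])  # (N, K)
    return softmax(scores) @ (E @ W_v)                       # (N, d)

rng = np.random.default_rng(1)
N, K, d = 4, 10, 8  # prompt length, description length, model dim
G = rng.normal(size=(N, d))
E = rng.normal(size=(K, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
S = slot_prompt(G, E, W_q, W_k, W_v)
print(S.shape)  # (4, 8) -- independent of K
```

Because every slot shares G, W_q, W_k, and W_v, each slot prompt starts from the same initialization, and only the description embedding E changes across slots and domains.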
Prefix generation. For the DST task, the dialogue context can make up the majority of the language model input (i.e. a 100-400 token dialogue context compared to a 10-15 token slot description). This poses a challenge for the prompt-tuning method, because the prompt's impact can easily vanish before decoding starts. This is why we opt for prefix-tuning: it injects prompts at each layer, so the generated value has higher exposure to the prompt.
Following the generation of slot prompts, the next step is to generate key and value prefixes for each layer. For this step, we tried several different architectural designs, such as a simple MLP or a whole transformer block. We empirically observed that while the former lags behind due to its small number of parameters, the latter results in overfitting. Thus, inspired by He et al. (2022), we use a sequence of down and up projections separated by an activation function as prefix generators, c.f. Figure 2b. Note that each transformer layer has a dedicated pair of prefix generators for the key and value prefixes:

K_i = f(S W_down,i^k) W_up,i^k
V_i = f(S W_down,i^v) W_up,i^v

where f is the activation function; K_i and V_i ∈ R^(N×d) are the key and value prefixes for the i-th layer; W_down,i^k, W_down,i^v ∈ R^(d×r) and W_up,i^k, W_up,i^v ∈ R^(r×d) are the respective down and up projectors for the i-th layer; and r is the bottleneck dimension, set to d/4 throughout our experiments.
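The per-layer prefix generators can be sketched as follows (ReLU is an assumed choice of activation; the text specifies only "an activation function"):

```python
import numpy as np

def prefix_generator(S, W_down, W_up, act=lambda x: np.maximum(x, 0.0)):
    """Down-project the slot prompt S (N x d) to the bottleneck r, apply
    an activation (assumed ReLU here), then up-project back to d. Each
    layer owns one such generator for keys and one for values."""
    return act(S @ W_down) @ W_up  # (N, d)

rng = np.random.default_rng(2)
N, d = 4, 8
r = d // 4  # bottleneck dimension, r = d/4 in the experiments
S = rng.normal(size=(N, d))
# Dedicated key/value generators for a single layer i:
K_i = prefix_generator(S, rng.normal(size=(d, r)), rng.normal(size=(r, d)))
V_i = prefix_generator(S, rng.normal(size=(d, r)), rng.normal(size=(r, d)))
print(K_i.shape, V_i.shape)  # (4, 8) (4, 8)
```

The bottleneck keeps the generator small enough to avoid the overfitting observed with a full transformer block, while remaining more expressive than a single linear map.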
Multi-head Self Attention. After obtaining K_i and V_i for each layer i, we split them into N_h head vectors K_i^j and V_i^j ∈ R^(N×d_h) for each head j, where d_h = d/N_h is the dimension per head. Finally, we concatenate these key and value prefixes into the self-attention mechanism at each layer of the transformer encoder, completing our modifications to the original T5 architecture, c.f. Figure 2c.
head_i^j = Attention(h_i W_q,i^j, [K_i^j; h_i W_k,i^j], [V_i^j; h_i W_v,i^j])

where head_i^j is the output of the j-th head of the self-attention mechanism at layer i; W_q,i^j, W_k,i^j, and W_v,i^j ∈ R^(d×d_h) are the query, key, and value weight matrices of the j-th head in the i-th layer; [·; ·] denotes concatenation along the sequence dimension; and h_i is the input to the i-th layer. The final output of the multi-head self-attention at layer i is the concatenation of all head outputs, followed by the layer's output projection:

MHA_i(h_i) = [head_i^1, ..., head_i^{N_h}] W_o,i

where W_o,i ∈ R^(d×d) is the output projection matrix of layer i.
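Putting the pieces together, a NumPy sketch of one encoder layer's multi-head self-attention with the generated prefixes (illustrative dimensions; real T5 layers add masking, dropout, and relative position biases):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_with_prefixes(h, W_q, W_k, W_v, W_o, K_pre, V_pre, n_heads):
    """Multi-head self-attention where the per-layer prefixes K_pre and
    V_pre (N x d) are split per head and prepended to keys/values."""
    T, d = h.shape
    d_h = d // n_heads
    q = (h @ W_q).reshape(T, n_heads, d_h)
    k = np.concatenate([K_pre, h @ W_k]).reshape(-1, n_heads, d_h)
    v = np.concatenate([V_pre, h @ W_v]).reshape(-1, n_heads, d_h)
    heads = []
    for j in range(n_heads):  # attend within each head separately
        scores = q[:, j] @ k[:, j].T / np.sqrt(d_h)  # (T, N+T)
        heads.append(softmax(scores) @ v[:, j])      # (T, d_h)
    return np.concatenate(heads, axis=-1) @ W_o      # (T, d)

rng = np.random.default_rng(3)
T, d, N, n_heads = 5, 8, 4, 2
h = rng.normal(size=(T, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
K_pre, V_pre = rng.normal(size=(N, d)), rng.normal(size=(N, d))
out = mha_with_prefixes(h, W_q, W_k, W_v, W_o, K_pre, V_pre, n_heads)
print(out.shape)  # (5, 8)
```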

Datasets
We conduct experiments with two well-known DST benchmarks: MultiWOZ and SGD (Budzianowski et al., 2018; Rastogi et al., 2019). MultiWOZ is a task-oriented dialogue dataset collected in a Wizard-of-Oz setting with human speakers. It contains 10k dialogues spanning 7 domains, with turn-level annotations and descriptions of each slot label. In line with previous studies, we limit our experiments to 5 domains, because the police and hospital domains do not have a sufficient number of examples in the test set. We use MultiWOZ version 2.1, which addresses the noisy state annotations of the original dataset (Eric et al., 2020). Similar to MultiWOZ, the SGD dataset also has turn-level annotations and descriptions, i.e. a schema, for each domain and slot. It contains over 20k annotated conversations between a human and a virtual assistant, spanning 20 domains. In addition, the SGD test set contains unseen domains specifically formed to evaluate zero-shot performance.

Baseline Models
We compare our method with a range of DST models, from early systems to the recent state of the art. The only models we use that do not depend on a pretrained language model are TRADE (Wu et al., 2019) and MA-DST. The former introduces a copy mechanism to ease predicting slots not seen during training, whereas the latter adds cross-attention to model relationships between the context and slots at different semantic levels, and self-attention to resolve cross-domain coreferences, on top of a base RNN layer. SUMBT is built with BERT and again uses an attention mechanism to learn relations between domains and slots. The SGD baseline (Rastogi et al., 2019) feeds slot, domain, and value embeddings into a BERT encoder to create schema embeddings and uses them to predict the dialogue state in the target domain under zero-shot conditions. Seq2seq-DU (Feng et al., 2021) formalizes DST as a sequence-to-sequence task where the dialogue history is transformed directly into semantic frames. Another approach uses GPT-2 and frames DST as generative question answering. TransferQA builds on a similar motivation but combines extractive and multiple-choice QA, enabling tracking of categorical and non-categorical slots simultaneously (Lin et al., 2021a). T5DST (Lin et al., 2021b) and Wang et al. (2022) both use the T5 architecture. The former concatenates slot descriptions with the dialogue context and generates slot values auto-regressively, whereas the latter proposes a design that models cross-slot dependency by composing multiple slots into the final prompt, so that the model is forced to learn the relations among slots.

Training Details
For all experiments, we used a Tesla V100 GPU. We use the small-sized PPTOD (Su et al., 2022), built on the T5 architecture, for both the T5DST baseline and our own Prompter. We empirically found PPTOD to be more suitable for prompt-tuning tasks, most probably due to the nature of its pretraining tasks. We set the batch size to 8 with gradient accumulation every 8 steps. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with an initial learning rate of 1e-4.

Semi-frozen Training Scheme
Contrary to what traditional PETL techniques typically recommend for limited-data scenarios, we discovered that freezing LM parameters does not improve performance in the zero-shot scenario. This is in line with what He et al. (2022) suggest. However, we also find that tuning all parameters is imperfect. In search of a better strategy, we experiment with different combinations of frozen layers and compare zero-shot performance on the train domain. We find that the best strategy is a semi-frozen (S.F.) training scheme, where all LM parameters are trained for 1k steps, after which all layers of the T5 model are frozen except the first and last layers of the encoder and decoder (see Appendix B for more details). We employ this strategy to train the models in the experiments below.
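The semi-frozen scheme can be sketched as a predicate over parameter names and training steps (a sketch assuming Hugging-Face-style T5 parameter names such as `encoder.block.0.`; the layer count and name matching are illustrative, not the authors' exact code):

```python
# Illustrative sketch of the semi-frozen (S.F.) scheme, assuming
# HuggingFace-style T5 parameter names; not the authors' exact code.
N_LAYERS = 6          # t5-small has 6 encoder and 6 decoder blocks
WARMUP_STEPS = 1000   # train everything for the first 1k steps

def trainable(param_name, step):
    """Return whether a parameter should receive gradient updates."""
    if step < WARMUP_STEPS:
        return True  # phase 1: all LM parameters are updated
    # Phase 2: keep only the first and last blocks of the encoder
    # and decoder trainable; freeze everything else.
    kept = {f"{side}.block.{i}."
            for side in ("encoder", "decoder")
            for i in (0, N_LAYERS - 1)}
    return any(param_name.startswith(p) for p in kept)

# Middle layers train during warmup, then freeze:
print(trainable("encoder.block.3.layer.0.SelfAttention.q.weight", 500))   # True
print(trainable("encoder.block.3.layer.0.SelfAttention.q.weight", 2000))  # False
```

In a real training loop, this predicate would drive `requires_grad` flags on the model's named parameters after the warmup phase.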

Evaluation
We evaluate the performance of all models using Joint Goal Accuracy (JGA), following prior studies. For MultiWOZ, a zero-shot setting is used where training occurs on four domains and the remaining domain is used for testing. For SGD, results are reported only on domains that appear in neither the training nor the validation sets, as those sets were already included in PPTOD's pretraining. We modified the official SGD evaluation script to reflect this change. Therefore, in our evaluation setting, unseen domains refer only to domains appearing solely in the test data, contrary to the original definition by Rastogi et al. (2019), which also counts domains showing up only in the validation data as unseen.
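Joint Goal Accuracy counts a turn as correct only when the predicted belief state matches the gold state exactly, across all slots. A minimal sketch:

```python
def joint_goal_accuracy(gold_states, pred_states):
    """Fraction of turns whose predicted belief state matches the gold
    state exactly (every slot-label/value pair, 'none' slots included)."""
    correct = sum(g == p for g, p in zip(gold_states, pred_states))
    return correct / len(gold_states)

gold = [{"train-destination": "cambridge", "train-day": "none"},
        {"train-destination": "cambridge", "train-day": "monday"}]
pred = [{"train-destination": "cambridge", "train-day": "none"},
        {"train-destination": "ashley hotel", "train-day": "monday"}]
print(joint_goal_accuracy(gold, pred))  # 0.5
```

Because a single wrong slot invalidates the whole turn, JGA is a strict metric, which is why over-predicted 'none' slots (analyzed below) hurt it so much.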

Results and Analysis
In MultiWOZ (Table 1), our addition of Prompter shows improvements in all domains except Hotel, boosting the average JGA by 1.7 points compared to the state-of-the-art model by Wang et al. (2022). We believe the lack of improvement in the hotel domain stems from its many unique slots (i.e. 'hotel-internet', 'hotel-parking', 'hotel-type', etc.), which make it harder to benefit from earlier domains, as they lack similar slots. This is also in line with the results of Wang et al. (2022), whose cross-slot dependency design likewise lags behind on the hotel domain.
We also present the results on the SGD dataset in Table 2, where Prompter shows improvements on average. We share results over 6 representative domains along with the official unseen-domain performance. Once more, Prompter demonstrates superior performance on average in unseen domains. Compared to the results reported in the original paper by Wang et al. (2022) for four domains (columns 1 through 4 of Table 2), Prompter shows an average improvement of 9.1 JGA points. The Alarm domain is excluded from the comparison, as PPTOD has been pretrained on it.

Ablation Study
We further conduct an ablation study to analyze the contribution of Prompter's components (Table 3). Adding the S.F. training scheme (second row) to the T5DST baseline yields a performance increase across all domains, demonstrating that this training scheme plays a significant role in the model's robustness. Switching the pre-trained model from T5 to PPTOD (third row) brings another round of improvement, though it is inconsistent across domains. Finally, the last row shows that adding Prompter improves the results by a further margin, clearly demonstrating its contribution.

Fine Grained Analysis
How does Prompter improve results? We define two new metrics to better understand Prompter's improvements: Miss-prediction (MP), where the model fails to correctly identify a gold slot-label, mistakenly labeling it as 'none' instead; and Over-prediction (OP), where the model incorrectly predicts a 'none'-valued slot-label as something else. We then combine these metrics into None Accuracy, a metric that measures the accuracy of the model's predictions regarding the "activeness" of a slot-label; in other words, it measures how often the model correctly predicts whether a slot-label has the value 'none' or not. The results over all 5 domains can be found in Table 4. It is evident that Prompter's improvement comes from the None Accuracy measure, as its results are in line with the change in JGA (i.e. improvements across all domains except Hotel). Moreover, we find that this is mostly due to the reduction of over-prediction mistakes: Prompter decreases this class of error in every domain.
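These metric definitions can be computed per slot as follows (a sketch of the definitions above, not the exact evaluation script):

```python
def none_metrics(gold, pred):
    """Miss-prediction (MP): gold slot is active but predicted 'none'.
    Over-prediction (OP): gold slot is 'none' but predicted active.
    None Accuracy: how often the 'activeness' (none vs. not-none) of a
    slot-label is predicted correctly."""
    mp = op = correct = total = 0
    for slot, g in gold.items():
        p = pred.get(slot, "none")
        if g != "none" and p == "none":
            mp += 1
        elif g == "none" and p != "none":
            op += 1
        else:
            correct += 1  # activeness predicted correctly
        total += 1
    return mp, op, correct / total

gold = {"hotel-area": "centre", "hotel-type": "none", "hotel-stars": "none"}
pred = {"hotel-area": "none", "hotel-type": "guesthouse", "hotel-stars": "none"}
print(none_metrics(gold, pred))  # one MP, one OP, None Accuracy 1/3
```

Note that None Accuracy only scores whether a slot is active, so a wrong but non-'none' value for an active slot still counts as correct here (it would still hurt JGA).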
How does Prompter connect slots? To better understand the benefits of using Prompter, we look at how it connects target domain slots with source domain slots. This is done by aggregating the key prefixes across each layer and attention head for every slot and then comparing them to the source domain slot prefixes from the training set using cosine similarity. Figure 3 highlights important similarities among some of the taxi and train domain slots (c.f. Appendix A for a comprehensive version that includes all domains and slots). Figure 3a shows that 'train-destination' has a high similarity with 'taxi-departure' and 'destination', as well as the 'attraction-name' slots. The first two connections are expected, but the latter is also relevant because the 'attraction-name' often appears as the 'taxi-destination' in training. This indicates that the model finds that the 'destination' slots can often contain named entities (such as locations) within the dialogue. For 'train-arriveby', the most similar slot is also the semantically closest: 'taxi-arriveby'. Finally, for the 'train-bookpeople' slot, the most similar slots are those related to booking from the hotel and restaurant domains, which makes sense as these often co-occur in the training data.

Figure 3b shows the results of adapting in the taxi domain. The similarity between the 'taxi-arriveby' slot and its train domain counterpart, 'train-arriveby', is high as expected. Moreover, for the 'taxi-departure' slot, the generated prefixes are most similar to slots for attraction, restaurant, and hotel names. This is likely because the 'train-departure' slot also has named entities as values.

Figure 3: (a) Source domain slots in close proximity to 'train-destination', 'train-arriveby', and 'train-bookpeople' slots, according to generated prefix similarities. (b) Source domain slots close to 'taxi-departure' and 'taxi-arriveby' slots, according to generated prefix similarities.

Table 5 dialogue excerpt: U: Hello, I am looking for places to go in the centre? S1: There are many attractions in the centre like museums, architecture, boating, and concert halls. What are you interested in? U1: How about a boating attraction? S2: There are 2 in the centre of town, Scudamores Punting Co. and the Cambridge Punter. Would either of those interest you? U2: Could you give me the address for the Cambridge Punter, please? I also need a place to stay, preferably somewhere cheap.
The findings show that Prompter not only utilizes slots with similar descriptions to create prefixes, but also accounts for other slots that co-occur in the same conversation with a similar source slot. This is important, as slots may have different descriptions but exhibit significant semantic overlap (e.g. 'taxi-departure' and 'hotel-name' both taking location named entities as values).
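The slot-connection analysis above can be sketched as follows (averaging over layers and heads is an assumed aggregation; the slot names and prefix tensors are synthetic):

```python
import numpy as np

def aggregate_prefix(prefixes):
    """Collapse a slot's key prefixes (layers x heads x N x d_h) into a
    single vector (assumed mean aggregation over layers and heads)."""
    return np.mean(prefixes, axis=(0, 1)).ravel()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(4)
L, H, N, d_h = 6, 8, 4, 8  # layers, heads, prefix length, head dim
target = rng.normal(size=(L, H, N, d_h))  # e.g. a 'train-destination' prefix
source = {
    "taxi-destination": target + 0.1 * rng.normal(size=target.shape),
    "hotel-stars": rng.normal(size=(L, H, N, d_h)),
}
t = aggregate_prefix(target)
sims = {s: cosine(t, aggregate_prefix(p)) for s, p in source.items()}
print(max(sims, key=sims.get))  # 'taxi-destination' is closest
```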

Case study
We use three dialogues from the MultiWOZ test set to demonstrate some of the phenomena observed in the previous analyses (Table 5). The first example shows how the T5DST baseline is susceptible to overgeneralization from training data. When the T5DST model encounters a hotel name during zero-shot inference on the train domain, it mistakenly assumes that the hotel is the departure point for the train, because it has been trained to associate location names with taxi departure/destination. Prompter avoids this mistake through its deeper understanding of cross-slot relations. In the second case, the baseline predicts values for the hotel type and area even though the dialogue does not mention a hotel. This happens because the model has learned to predict the same type of slots for the attraction domain and has overfitted to them during training. In contrast, Prompter ameliorates this form of over-prediction (§6.2).
Our model has a weakness when dealing with slots that are unique and have no similar slots in the source domains. In the third case, the model struggles to accurately predict the 'hotel-type' and 'hotel-internet' slots because they are dissimilar to all slots in the source domains.

Why Prefix-Tuning?
We also try implementing Prompter using soft prompt-tuning rather than prefix-tuning. Under this setting, the learned prompts are fed directly at the input layer instead of as prefixes to the attention mechanism at each layer. We compare the performance of this method with the baseline T5DST, using T5-small as the language model. We find that prompt-tuning is not even comparable to the fine-tuning baseline, let alone to prefix-tuning, c.f. Table 6. We believe this difference stems from the fact that prompts fed only at the initial layer of the transformer have a diminishing effect on the decoder's output. This is also evident in the original prefix-tuning paper, where Li and Liang (2021) report that prefix-tuning outperforms prompt-tuning on generation tasks.

Conclusion
Parameter-Efficient Transfer Learning methods are frequently used for their robustness in low-resource settings. However, there is no straightforward way to take advantage of these strengths in a zero-shot setting, because they require at least some supervised data during adaptation. The dialogue state tracking (DST) task, on the other hand, has just the right annotation for this scenario, as it contains schema annotations with slot-label descriptions. We propose Prompter, which uses these descriptions to enable prefix-tuning, a well-known PETL method, under a zero-shot domain adaptation setting. We show through experiments that this method improves the JGA metric on the two most common DST benchmarks. We further show through analyses and a case study that the reason behind Prompter's power is two-fold: (1) it is better at distinguishing 'none'-valued slots within the dialogue, and (2) it can digest the frequency of slot co-occurrences within the dialogue context into the prefix generation process. We believe this study shows PETL's hidden potential for DST domain adaptation under a zero-shot setting.

Acknowledgements
This research was supported by the SINGA scholarship from A*STAR. We would like to thank anonymous reviewers for their insightful feedback on how to improve the paper.

Limitations
One limitation of our study is that we only evaluated our method on the T5 architecture. Further experiments on other architectures would be useful to determine the generalizability of our findings. Additionally, as with the previous state of the art, our model did not produce better results for the hotel domain, even though it improved performance overall. We have attempted to explain why this domain is more difficult, but more research is needed to fully understand the reasons for this variability and to create methods that improve performance across all domains.

B Semi-Frozen Training
After discovering that completely freezing the parameters of the language model (LM) does not lead to improved performance in zero-shot adaptation, we conducted a series of initial experiments to determine the most effective configuration. These preliminary experiments focused on the train domain of MultiWOZ 2.1. Each experiment involved training all parameters for 1,000 steps, which consistently showed benefits. We then selectively froze layers, with the specific layers varying for each row in Table 7. For example, in the first row, we froze all layers except the first layer of the encoder and decoder after the initial 1,000 steps. Our findings revealed that the optimal approach is to freeze all layers except the first and last layers of both the encoder and decoder after 1,000 steps.