Generative Zero-Shot Prompt Learning for Cross-Domain Slot Filling with Inverse Prompting

Zero-shot cross-domain slot filling aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Existing models either encode slot descriptions and examples or design handcrafted question templates using heuristic rules, and therefore suffer from poor generalization capability or robustness. In this paper, we propose a generative zero-shot prompt learning framework for cross-domain slot filling that improves both generalization and robustness over previous work. We further introduce a novel inverse prompting strategy that distinguishes different slot types to avoid the multiple prediction problem, as well as an efficient prompt-tuning strategy that boosts performance by training only a small number of prompt parameters. Experiments and analysis demonstrate the effectiveness of our proposed framework, which yields especially large improvements (+13.44% F1) on unseen slots.


Introduction
Slot filling in a task-oriented dialogue system aims to extract task-related information such as hotel_name and hotel_address from user queries, and is widely applied in existing intelligent conversation applications (Tulshan and Dhage, 2019; Zhang et al., 2020). Traditional supervised methods (Zhang and Wang, 2016; Goo et al., 2018; Qin et al., 2019; Wu et al., 2020; He et al., 2020a,b) have shown remarkable performance, but they still rely on large-scale labeled data. Their lack of generalization to new domains hinders further application in practical industrial scenarios.
In this work, we focus on zero-shot cross-domain slot filling, which transfers knowledge from a source domain D_S to a target domain D_T without requiring any labeled training data of D_T. Conventional approaches (Bapna et al., 2017; Shah et al., 2019; He et al., 2020c; Wang et al., 2021) formulate slot filling as a sequence labeling task and use meta-information such as slot descriptions and slot examples to capture the semantic relationship between slot types and input tokens. However, these models only learn a surface mapping of the slot types between D_S and D_T and perform poorly on unseen slots in the target domain (Wang et al., 2021). Further, Lee and Jha (2019); Mehri and Eskenazi (2021); Du et al. (2021); Yu et al. (2021) propose a machine reading comprehension (MRC) framework for slot filling to enhance the semantic interaction between slot types and slot values. They first construct many well-designed question templates based on slot schemas or slot examples, then train an MRC model (Rajpurkar et al., 2018a) to predict the corresponding slot values for a given slot type question. But they rely on handcrafted question templates built from heuristic rules and pre-defined ontologies, which leads to poor model robustness. Besides, the additional pre-training on large-scale external MRC datasets is time-consuming and prohibitively expensive. To solve the above issues, in this paper we propose a Generative Zero-shot Prompt Learning (GZPL) framework for cross-domain slot filling. Instead of transforming slot filling into sequence labeling or MRC, we formulate it as a language generation task (see Fig 1). Specifically, we concatenate the question for each slot type, the names of all slot types, and the input query to construct the input sequence, and take the related slot values as the output sequence. The converted text-to-text format has two benefits for zero-shot slot filling:

(1) Compared to sequence labeling, our formulation enriches the deep semantic interaction between slot types and slot values via pre-trained language models (Raffel et al., 2020), which helps recognize unseen slots that exist only in the target domain. We find it significantly improves unseen slot F1 by 13.44% compared to the previous state-of-the-art (SOTA) model (see Section 4.2). This result demonstrates the strong generalization capability of our proposed framework to new domains. (2) Compared to MRC, our framework reduces the complexity of creating well-designed question templates and is more robust to different templates (see Section 4.2). Besides, we concatenate the names of all slot types into the input sequence to construct direct connections between different slot types, whereas MRC makes independent predictions for each slot type. Along with our proposed framework, we present an inverse prompting strategy to distinguish different slot types for a given entity, in order to avoid the multiple prediction problem (He et al., 2020d) where the model may predict multiple slot types for one entity span. In contrast to the main formulation, we take each slot value as input and the corresponding slot type as output to build a mapping from entity tokens to entity types. In this way, we force the model to learn explicit distinctions between different types. Inspired by recent parameter-efficient tuning work (Li and Liang, 2021a; Lester et al., 2021), we also introduce an efficient prompt-tuning strategy that boosts performance by training only a small number of prompt parameters instead of the whole PLM.
Our contributions are three-fold: (1) We propose a simple but strong generative zero-shot prompt learning framework for cross-domain slot filling, which has better generalization capability and robustness than previous work. (2) We present a novel inverse prompting strategy to distinguish different slot types and avoid the multiple prediction problem. Besides, we introduce an efficient prompt-tuning strategy that boosts performance while training only a small number of prompt parameters. (3) Experiments and analysis demonstrate the effectiveness of our proposed framework, especially its good generalization to unseen slots (F1 +13.44% ↑), strong robustness to different templates (∆ F1 +10.23% ↑), and parameter efficiency (10x fewer trainable parameters).

Methodology
Our model is shown in Fig 2. In our framework, we first construct several simple template sentences as the model input, where each sentence includes a slot type question, all slot types, and the original query. Then we use a PLM to generate the corresponding slot values. Along with the main task, we perform an inverse-prompting task to warm up the parameters and strengthen the relationship between entities and slot types.

Problem Definition
Given a user input sentence containing n words, X_input = {x_1, x_2, ..., x_n}, and a slot type set S = {s_1, s_2, ..., s_m}, the slot filling task aims to find all the entities in X_input. For the zero-shot setting in our paper, we train models on labeled data from the source domain and make predictions in the target domain.

Generative Zero-shot Prompt Learning Framework
We customize the entire task with a generative zero-shot prompt learning framework. Specifically, we concatenate the question for each slot type, the names of all slot types, and the input query to construct the input sequence, and take the related slot values as the output sequence. We formulate it as follows:

what is the slot_type ? {all slot types} x_1 x_2 ... x_n

where slot_type represents the queried slot type and {all slot types} represents all slot types across all domains. For slot types that do not exist in the input, we set the answer to the special token "none". For each original input query, we construct as many QA pairs as there are slot types.
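As an illustrative sketch of the input/output construction above (the helper name and argument layout are ours, not the paper's released code):

```python
def build_qa_pairs(query_tokens, all_slot_types, gold_slots):
    """For each slot type, build one (input_sequence, target) pair.

    query_tokens: list of words in the user query
    all_slot_types: slot names across all domains
    gold_slots: dict mapping slot type -> gold value string
    """
    pairs = []
    slot_list = " ".join(all_slot_types)
    query = " ".join(query_tokens)
    for slot in all_slot_types:
        # question for this slot type + all slot type names + original query
        source = f"what is the {slot} ? {slot_list} {query}"
        # slot types absent from the query get the special token "none"
        target = gold_slots.get(slot, "none")
        pairs.append((source, target))
    return pairs

pairs = build_qa_pairs(
    ["play", "hey", "jude"],
    ["music_item", "artist"],
    {"music_item": "hey jude"},
)
```

Each original query thus yields one pair per slot type, all sharing the same slot-name list and query tokens.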
Label Prompt Construction We do not focus on question template construction as previous works do (Du et al., 2021; Yu et al., 2021). Instead, we simply adopt the simplest question form, "what is the slot_type ?", to highlight the simplicity and effectiveness of our proposed framework. It is worth noting that we also include slot names from all domains in the prompt. The main purpose of this setting is to enhance the interaction between different slot types, so that the model can find the best answer in the original text.
Inverse Prompting Previous MRC works suffer from the multiple prediction problem (He et al., 2020d), where the model may predict multiple slot types for one entity span. To resolve such conflicts, we design an inverse prompting task to warm up the model parameters first. We invert the original QA pair, that is, we set the question to an entity and the answer to its corresponding slot type. This task enables the model to distinguish different slot types for slot entities. In this way, deep semantic relationships between slot types are learned, and the model acquires stronger entity-slot relations. We train both the main task and the inverse task in the same auto-regressive way. Experiments show that first pre-training with the inverse task and then training the main task gives the best performance.
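A minimal sketch of the inverse-pair construction, assuming a question wording symmetric to the main task (the exact inverse template is our assumption, not specified in the text):

```python
def build_inverse_pairs(query_tokens, entity_to_type):
    """Inverse task: the question contains the entity span, and the
    target is its slot type (a mapping from entity tokens to entity types).

    entity_to_type: dict mapping entity span text -> slot type
    """
    query = " ".join(query_tokens)
    return [
        (f"what is the {span} ? {query}", slot_type)
        for span, slot_type in entity_to_type.items()
    ]

inv = build_inverse_pairs(["play", "hey", "jude"], {"hey jude": "music_item"})
```

Because these pairs share the same text-to-text format as the main task, both can be trained with the same auto-regressive objective.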
In addition, since the result of the main task can be "none", we additionally use a negative sampling strategy to ensure consistency between the two tasks. We randomly sample spans in sentences that are not gold entities and set their answers to "none". This strategy also improves the model's robustness to noise. In our experiments, we set the ratio of positive to negative samples to 1:1.
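The negative sampling step might be sketched as follows (hypothetical helper; the distribution over span positions and lengths is our assumption):

```python
import random

def sample_negative_spans(query_tokens, gold_spans, num_negatives, seed=0):
    """Randomly pick spans that are not gold entities; their inverse-task
    answer is the special token "none". The paper uses a 1:1 ratio of
    positive to negative samples."""
    rng = random.Random(seed)
    negatives = []
    n = len(query_tokens)
    while len(negatives) < num_negatives:
        i = rng.randrange(n)            # random span start
        j = rng.randrange(i, n)         # random span end >= start
        span = " ".join(query_tokens[i:j + 1])
        if span not in gold_spans:      # keep only non-entity spans
            negatives.append((span, "none"))
    return negatives

negs = sample_negative_spans(["play", "hey", "jude"], {"hey jude"}, 2)
```

These "none"-labeled inverse pairs keep the warm-up task consistent with the main task, where absent slot types are also answered with "none".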
Training and Inference During training, we try two different strategies: fine-tuning and prefix-tuning (Li and Liang, 2021b). In the fine-tuning mode, we first use the inverse task to warm up the model parameters and then perform the main task; all the PLM parameters are fine-tuned. In prefix-tuning, the parameters of the pre-trained model are fixed during training, and only the parameters of the newly added prefix embeddings are trained. Specifically, we add a trainable prefix embedding matrix to each attention layer of the PLM. This method requires 10x fewer trainable parameters and is more parameter-efficient.
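As a back-of-envelope sketch of why prefix-tuning is parameter-efficient, the trainable-parameter count for a plain key/value prefix can be estimated as below (a simplification: real implementations, e.g. Li and Liang (2021), often reparameterize the prefix through an MLP, which adds further trainable parameters):

```python
def prefix_param_count(num_blocks, hidden_size, prefix_len):
    """Trainable parameters for a plain prefix: one key vector and one
    value vector per prefix position in every attention block, each of
    size hidden_size. The PLM's own weights stay frozen."""
    return num_blocks * 2 * prefix_len * hidden_size

# e.g. a T5-base-like model: 24 attention blocks, hidden size 768,
# prefix length 5 (the setting used in this paper)
count = prefix_param_count(24, 768, 5)
```

Even with reparameterization overhead, the trainable-parameter budget stays orders of magnitude below full fine-tuning of the PLM.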
During inference, we only perform the main task. We query all slot types, and the model directly generates the corresponding slot entities. Compared with the previous method (Yu et al., 2021), our model needs no additional span matching mechanism, making it more concise and intuitive. To ensure task consistency with MRC-based models, we add a post-processing step: if multiple slot types predict the same entity span, we choose the answer whose first word has the highest generation probability.
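The post-processing step can be sketched as follows (hypothetical helper; tie-breaking by first-token generation probability follows the description above):

```python
def resolve_conflicts(predictions):
    """predictions: list of (slot_type, predicted_span, first_token_prob).

    When several slot types generate the same span, keep only the slot
    type whose first generated token had the highest probability."""
    best = {}
    for slot_type, span, prob in predictions:
        if span == "none":
            continue  # "none" answers carry no entity and never conflict
        if span not in best or prob > best[span][1]:
            best[span] = (slot_type, prob)
    # final mapping: entity span -> single winning slot type
    return {span: st for span, (st, _) in best.items()}

preds = [
    ("music_item", "hey jude", 0.9),
    ("artist", "hey jude", 0.4),   # conflicting prediction, lower prob
    ("playlist", "none", 0.99),
]
resolved = resolve_conflicts(preds)
```

This guarantees each entity span receives exactly one slot type, matching the evaluation protocol of MRC-based models.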

Baselines
Sequence Tagging Models: Concept Tagger (CT), proposed by Bapna et al. (2017), which utilizes slot descriptions to boost performance on detecting unseen slots. Robust Zero-shot Tagger

Implementation Details
We use T5-base as the backbone in our experiments. T5 is a transformer-based pre-trained language model whose pre-training tasks include a text-to-text formulation; we select it for the consistency between its pre-training tasks and the downstream slot-QA task. Model parameters are optimized with the AdamW optimizer (Kingma and Ba, 2014) using a learning rate of 5e-05. We set the batch size to 8 and use early stopping with a patience of 10 to ensure training stability. The prefix length is set to 5 and the dropout rate to 0.1. Since RCSF uses the BERT-large model, we use the T5-large model to match its number of parameters. The numbers of parameters of T5-base, T5-large, and the prefix embeddings are 220 million, 770 million, and 20 million, respectively. For all experiments, we train and test our model on a 3090 GPU and use the F1-score as the evaluation metric. We apply prefix-tuning only to T5-base: the parameters of T5-base are fixed and only the prefix embeddings are tuned. We report the average F1 score over three runs as our final result.

Main Results
Results show that our proposed framework GZPL significantly outperforms previous SOTA models. Our base model GZPL (pt) outperforms PCLC by 15.00% and QASF by 13.37%. We do not directly compare our model with RCSF under the same setting because RCSF benefits from two advantages: it uses BERT-large as the backbone and pre-trains on the QA dataset SQuAD2.0 (Rajpurkar et al., 2018b). Nevertheless, our base model still outperforms RCSF by 2.06%. We also adopt another setting to compare with RCSF, changing the backbone to T5-large so that the model sizes are consistent: GZPL* (pt) with T5-large outperforms RCSF by 6.31%. Besides, we find that prefix-tuning performs better than traditional fine-tuning, which suggests prefix-tuning has better knowledge transferability.

We further analyze performance on seen and unseen slots. If a slot type does not exist in the source domains, it is categorized into the "unseen slot" part; otherwise, the "seen slot" part. The results are shown in Table 2. Our method outperforms previous methods by a large margin on unseen slots, while performing slightly worse than RCSF on seen slots. Our model focuses on generalizable knowledge transfer rather than overfitting the seen slots in the source domains, so it has stronger generalization ability than previous methods.

Analysis
Robustness Analysis To verify the robustness of our framework, we vary the original template "what is the slot_type ?" as in RCSF: we still use the complete template during training but delete some of its tokens during testing. The results are shown in Table 3. Our model drops only 4.2% on average when the template changes, while RCSF drops significantly by 15.6%. This demonstrates that our model is more robust to different input templates.
Effectiveness Analysis To further explore the effectiveness of GZPL in low-resource scenarios, we conduct several low-resource settings on the source domains, using only 20, 50, 100, 200, or 500 source-domain samples during training. As the SOTA model (RCSF) does not report few-shot results, we evaluate RCSF using its open-source code. As shown in Table 4, the performance of our model is much better than that of RCSF under low-resource conditions. Besides, with only 100 samples (5%), our model maintains 63.13% of the performance obtained with the complete source-domain data, and with 500 samples (25%) it maintains 82.08%. This demonstrates that our approach is more data-efficient than other slot filling models.
Ablation Studies To better prove the effectiveness of the label prompt strategy and the inverse prompting task, we conduct ablation experiments on these two components. Table 5 illustrates the results, where "w/o" denotes the model performance without the specific module. As we can see, the model has a slight performance drop (-2.35%) if the slot types in the template are removed, and the performance degrades significantly (-3.5%) without the inverse prompting task. Besides, when removing both the label prompt and inverse prompting jointly, the performance drops drastically (-4.69%). This suggests that both components play an important role in improving performance.

Conclusion
In this paper, we introduce a generative prompt learning framework for zero-shot cross-domain slot filling. Based on this, we introduce the label prompt strategy and inverse prompting to improve the generalization capability and robustness of the framework. A prefix-tuning mechanism is further applied to boost model training efficiency. Extensive experimental results show the effectiveness of our methods, and the qualitative analysis offers new insights into the related area. Generally, our framework can be applied to more complex situations, such as nested NER and discontinuous/multiple slots, which we leave to future work. Another interesting direction is to improve inference efficiency, e.g., by concatenating all slot questions together and obtaining the final results in one pass.

A Details about the Input and Output Formats
Table 6 shows an example of how slot filling is performed for a user query under our settings. Since we already know the slot type information for the domain the data belongs to, we customize a question for each slot type according to our template, and the model then generates the answer for each question. The answer can be one or more spans in the original sentence, or the special token "none". It is worth noting that when a slot type corresponds to multiple slot entities, the answers are separated by commas. However, this situation hardly exists in the SNIPS dataset, so multiple-span answers are rare at test time.

B Analysis of the Inverse-prompting Task
To further explore whether our auxiliary task alleviates the problem of repeated generation, we examine its effect with two metrics: precision and recall. Our reasoning is that repeated generation results in more entities being predicted, which on the one hand improves recall but on the other hurts precision. The experimental results are shown in Figure 3. After adding the inverse prompting task, the recall of the model decreases by 3% while the precision increases by 5.5%, raising the overall F1-score by 2.4%. We also conducted a case study on the model outputs, with results shown in Table 7: after the task is added, repeated generation is significantly reduced. These results illustrate that the proposed task enables the model to learn deep relationships between slot types, thereby reducing the problem of repeated generation.

C Limitations and Future Work
The current work achieves better performance than previous methods, but processing only one slot type at a time reduces the efficiency of the model. In the future, we will explore how to maximize model efficiency; generating answers for all slots at once without degrading performance would be an interesting challenge. We will also try to apply our framework to more scenarios, such as NER and other tasks, to explore the adaptability of the proposed method.

Figure 1: Illustration of different frameworks for zero-shot slot filling.

Figure 2: The overall architecture of our proposed GZPL framework with inverse prompting.
SNIPS (Coucke et al., 2018) is a public spoken language understanding dataset consisting of crowdsourced user utterances with 39 slots across 7 domains, with around 2,000 training instances per domain. To simulate the cross-domain scenario, we follow Liu et al. (2020) in splitting the dataset: each time, one domain is selected as the target domain and the other six domains serve as the source domains.

Figure 3: Impact of the proposed inverse prompting task on F1, precision, and recall scores.

Table 1: F1 scores (Liu et al., 2020) on SNIPS for different target domains under zero-shot settings. ft and pt stand for fine-tuning and prefix-tuning, respectively. * indicates the backbone model is the large version of the pre-trained model.

Robust Zero-shot Tagger (RZT), proposed by Shah et al. (2019), which is based on CT and leverages both slot descriptions and examples to improve the robustness of zero-shot slot filling. Coarse-to-fine Approach (Coach), proposed by Liu et al. (2020), which contains a coarse-grained BIO 3-way classification and a fine-grained slot type prediction. In this model, slot descriptions are used in the second stage to help recognize unseen slots, and template regularization is applied to further improve the slot filling performance on similar or identical slot types.

Table 2: Average F1 scores on seen and unseen slots across all target domains.

Table 5: Ablation studies. LP and RP stand for label prompt and inverse prompt, respectively.

Table 7: The case study of GZPL w/o inverse prompting.