QA-Driven Zero-shot Slot Filling with Weak Supervision Pretraining

Slot filling is an essential component of task-oriented dialog systems. In this work, we focus on the zero-shot slot-filling problem, where the model needs to predict slots and their values, given utterances from new domains, without training on the target domain. Prior methods directly encode slot descriptions to generalize to unseen slot types. However, raw slot descriptions are often ambiguous and do not encode enough semantic information, limiting the models' zero-shot capability. To address this problem, we introduce QA-driven slot filling (QASF), which extracts slot-filler spans from utterances with a span-based QA model. We use a linguistically motivated questioning strategy to turn descriptions into questions, allowing the model to generalize to unseen slot types. Moreover, our QASF model can benefit from weak supervision signals from QA pairs synthetically generated from unlabeled conversations. Our full system substantially outperforms baselines by over 5% on the SNIPS benchmark.


Introduction
Automatic slot filling, which extracts task-specific slot fillers (e.g., flight date, cuisine) from user utterances, is an essential component of spoken language understanding (Bapna et al., 2017). As shown in Figure 1, the model predicts the slot filler "Joe A. Pass" for the slot type "artist" given an input utterance. However, fully supervised slot filling models (Young, 2002; Goo et al., 2018) require labeled training data for each type of slot (Shah et al., 2019), which is even more of a problem for data-intensive models (Mesnil et al., 2014). This makes developing new domains for such systems challenging and resource-intensive.
This has motivated studies in cross-domain zero-shot learning for the slot-filling task (ZSSF), where the goal is to achieve good slot-filling performance on new domains without requiring additional training data. Previous work (Bapna et al., 2017; Shah et al., 2019) often uses a sequence tagging approach (similar to the upper part of Figure 1). To achieve zero-shot domain transfer, these methods directly encode raw slot descriptions or names, such as "playlist" or "music item", to enable models to generalize to slot types unseen at training time. However, slot descriptions are often ambiguous and typically do not encode enough semantic information by themselves.

Instead of directly encoding slot descriptions and examples, we introduce a QA-driven slot filling framework (QASF) (Figure 1). Inspired by the recent success of QA-driven approaches (McCann et al., 2018; Logeswaran et al., 2019; Gao et al., 2019; Li et al., 2020; Namazifar et al., 2020), we tackle the slot-filling problem as a reading comprehension task, where each slot type (e.g., "artist") is associated with a natural language question (e.g., "Who is the artist to play?"). A span-based reading comprehension model is then used to extract a slot filler span from the utterance by answering the question. In this work, we use a linguistically motivated question generation strategy for converting slot descriptions and example values into natural questions, followed by a BERT-based QA model that extracts slot fillers by answering the questions. As shown in our experiments, this QA-driven method is better at exploiting the semantic information encoded in the questions, and therefore generalizes better to new domains without any additional fine-tuning, as long as the questions are sufficiently meaningful.

To the best of our knowledge, we are the first to leverage weakly supervised synthetic QA pairs extracted from unlabeled conversations for a second-stage pretraining. Drawing insights from Mintz et al. (2009), we create a weakly supervised QA dataset from unlabeled conversations and an associated ontology. The synthetic QA pairs are constructed by matching unlabeled utterances against possible slot values in the ontology. This provides a general and cost-effective way to improve QA-based slot filling performance with easily obtainable data.
Experimental results show that (1) our QASF model significantly outperforms previous zero-shot systems on SNIPS (Coucke et al., 2018) and TOP (Gupta et al., 2018); and (2) encoding natural questions helps models better leverage weakly supervised signals in the pretraining phase, compared to encoding raw descriptions.

Task Definition
Given an input utterance u, a slot filling model extracts a set of (slot type, span) pairs (s_i, a_i), i = 1, ..., m, where s_i comes from a fixed set of slot types S, and each a_i = (j, k), 1 ≤ j < k ≤ |u|, is a span in u. Each slot type is accompanied by a short textual description of its semantic meaning (Table 1). We also assume that a small number of example slot values is given, following Shah et al. (2019).
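For illustration, the input/output format can be pictured as follows (a toy sketch; the utterance, slot names, and offsets are invented for this example):

```python
# A toy illustration of the task format (all values invented for this sketch).
utterance = "play the song memories by joe a. pass"
tokens = utterance.split()  # token offsets are 1-indexed in the formulation above

# Each slot type s_i comes from a fixed set S, with a short textual description.
slot_descriptions = {
    "artist": "artist name",
    "music_item": "the music item to play",
}

# Model output: (slot type, span) pairs; a span a_i = (j, k) indexes tokens of u.
predictions = [
    ("artist", (6, 8)),  # tokens 6..8 -> "joe a. pass"
]
```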
Our goal is to build a slot filling model that performs well on a new target domain with unseen slot types. Our training data consists of utterances from N source domains D_1, D_2, ..., D_N. Each domain D_i is associated with a set of predefined slot types S_i. At test time, utterances are drawn from a new domain D_{N+1}, which contains both slot types seen in the source domains and unseen ones. For example, in the SNIPS dataset (Coucke et al., 2018), the domains "GetWeather" and "BookRestaurant" both have a slot type called "city", while "condition_temperature" only appears in the "GetWeather" domain.

Methodology
In this section, we describe our framework for Question Answering-driven Slot Filling (QASF). The framework consists of (1) a question generation strategy that turns slot descriptions into natural language questions based on linguistic rules; (2) a generic span-extraction question answering model; and (3) an intermediate pretraining stage on synthetic QA pairs generated from unlabeled conversations, applied before task-specific training.

Question Generation Strategy
To benefit from both language model pretraining and QA supervision, we design a question generation strategy to turn slot descriptions into natural questions. During this process, a considerable amount of knowledge and semantic information is encoded (Heilman, 2011). A generated question consists of a WH word and a normalized slot description, following the template below:

WH_word is slot_description ?
Generating WH_word We draw insights from the literature on automatic question generation. Heilman and Smith (2010) propose to use linguistically motivated rules. In their more general case of question generation from sentences, answer phrases can be noun phrases (NP), prepositional phrases (PP), or subordinate clauses (SBAR). Complicated rules are designed with the help of a supersense tagger (Ciaramita and Altun, 2006).
For our spoken language understanding (SLU) tasks, slot fillers are mostly noun phrases. Therefore, we design a simpler set of conditions based on named entity types and part-of-speech (POS) tags. For each slot type, we sample 10 (utterance, slot value) examples from the validation set. Then we run an NER model and a POS tagger to obtain entity types and POS tags for each of the sampled answer spans. Finally, we select the WH_word based on the set of rules described in Table 6 in the Appendix.
[Figure 2 (referenced in Section 3.3): an unlabeled conversation example. USER: I am looking for a place to stay in the north of the city. I would prefer a 4-star hotel please. SYS: There are several guesthouses available. Do you have a price reference? USER: The restaurant should be in the moderate price range. ...]

Generating slot_description Instead of directly adding a raw description phrase in the question template, we normalize the phrase with the following simple rule: if the description is of the format "A of B", where both A and B are noun phrases (NP), we only keep B in the phrase if the WH_word is "How long" or "How many". Examples of generated questions for corresponding slots are presented in Table 1. Compared to slot descriptions, our questions are more precise and encode more semantic information.
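To make the strategy concrete, here is a minimal sketch of the two steps combined (a simplified stand-in for the Table 6 rules; the function names and the reduced rule set are assumptions, not the paper's code):

```python
# A minimal sketch of the question generation strategy (Section 3.1).
# The rule set is a simplified stand-in for Table 6 in the Appendix.

TEMPORAL_UNITS = {"second", "minute", "hour", "day", "week",
                  "month", "year", "decade", "century"}

def pick_wh_word(ner_tags, pos_tags, head_word=""):
    """Select the WH word from NER/POS evidence on sampled answer spans."""
    if "CARDINAL" in ner_tags or "QUANTITY" in ner_tags:
        # Table 6 distinguishes temporal vs. non-temporal quantified objects.
        return "How long" if head_word.rstrip("s") in TEMPORAL_UNITS else "How many"
    if "DATE" in ner_tags or "TIME" in ner_tags:
        return "When"
    if "PERSON" in ner_tags or "PRON" in pos_tags:
        return "Who"
    if "GPE" in ner_tags or "LOC" in ner_tags:
        return "Where"
    if "ORDINAL" in ner_tags or "DT" in pos_tags:
        return "Which"
    return "What"  # all other cases

def normalize_description(description, wh_word):
    """Keep only B in an "A of B" description for How long / How many."""
    if wh_word in ("How long", "How many") and " of " in description:
        return description.split(" of ", 1)[1]
    return description

def generate_question(description, ner_tags, pos_tags, head_word=""):
    wh = pick_wh_word(ner_tags, pos_tags, head_word)
    desc = normalize_description(description, wh)
    return f"{wh} is {desc} ?"  # template: WH_word is slot_description ?

# e.g. generate_question("the artist to play", ["PERSON"], ["PROPN"])
#      -> "Who is the artist to play ?"
```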

Question Answering Model
We use BERT (Devlin et al., 2019) to encode the concatenated question and utterance, obtaining contextual representations e_{1:M} = BERT(x_{1:M}), where x_{1:M} are the input tokens. The model then predicts answer spans with two binary classifiers on top of the BERT outputs e_{1:M}; the two classifiers are trained to predict whether each token is the start or the end of an answer span, respectively. For negative examples, where a question has no answer span in the utterance, we map both the start and end positions to the [CLS] token. During training, we minimize the negative log-likelihood loss, updating all parameters. During inference, predicting slot filler spans is more complex because there may be several spans, or none, to extract for each slot type. We first enumerate all possible spans and keep only spans satisfying certain constraints (Appendix Section B) as fillers.
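A minimal sketch of this span-extraction architecture follows, written against the Hugging Face transformers API rather than the paper's google-research/bert codebase; class and variable names are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class SpanExtractionQA(nn.Module):
    """BERT encoder with two per-token binary classifiers (start / end)."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.start_head = nn.Linear(hidden, 1)  # is token i a span start?
        self.end_head = nn.Linear(hidden, 1)    # is token i a span end?

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        e = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=token_type_ids).last_hidden_state
        return (self.start_head(e).squeeze(-1),  # (batch, M) start logits
                self.end_head(e).squeeze(-1))    # (batch, M) end logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("Who is the artist to play ?",
                "play the song memories by joe a. pass",
                return_tensors="pt")
model = SpanExtractionQA()
start_logits, end_logits = model(**enc)
# For no-answer questions, both start and end are mapped to [CLS]
# (position 0); training minimizes the negative log-likelihood.
```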

Pretraining with Weak Supervision
Pretrained masked language models do not have the capability of question answering before being fine-tuned on task-specific data. We hypothesize that adding a pretraining step with synthetic QA pairs before fine-tuning can contribute to the model's understanding of interactions between the question and the utterance. For example, improvements have been reported with QAMR (He et al., 2020) on SRL and textual entailment (TE). Previous studies (Wu et al., 2020; Gao et al., 2020) have used crowdsourced QA pairs, but typically the improvement margin is not significant (Wu et al., 2020) when the task-specific data is in a different domain (SQuAD vs. newswire). Therefore we introduce a method for collecting relevant, distantly supervised QA pairs and investigate their influence in pretraining. More specifically, we draw insights from Mintz et al. (2009) for creating a weakly supervised dataset. Figure 2 illustrates the process. Given an ontology or database of slot types and all possible values for each slot type, we find all utterances containing those value strings in a large set of unlabeled conversations. For example, in Figure 2, for the "hotel_price_range" slot, there are three possible values "expensive", "cheap" and "moderate" in the ontology. We then form question-answer-utterance triples using the question generation strategy proposed in Section 3.1.
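A hedged sketch of this distant-supervision matching, under the assumption that the ontology is a simple slot-to-values mapping (the helper names below are illustrative, not from the paper):

```python
import re

# Toy ontology fragment mirroring the Figure 2 example.
ontology = {
    "hotel_price_range": ["expensive", "cheap", "moderate"],
}

def question_for(slot_type):
    # Placeholder: the paper derives questions via the Section 3.1 strategy.
    return f"What is {slot_type.replace('_', ' ')} ?"

def synthesize_qa_pairs(utterances, ontology):
    """Match ontology values against unlabeled utterances (distant supervision)."""
    pairs = []
    for utt in utterances:
        for slot, values in ontology.items():
            for v in values:
                m = re.search(r"\b" + re.escape(v) + r"\b", utt.lower())
                if m:  # note: pure string matching is noisy by design
                    pairs.append({"question": question_for(slot),
                                  "utterance": utt,
                                  "answer_char_span": (m.start(), m.end())})
    return pairs

utts = ["The restaurant should be in the moderate price range."]
print(synthesize_qa_pairs(utts, ontology))
# -> one synthetic QA pair whose answer span covers "moderate"
```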
To obtain the pre-defined ontology and unlabeled conversations, we use MultiWOZ 2.2 (Zang et al., 2020), an improved version of MultiWOZ (Budzianowski et al., 2018). We do not use the annotations in the dataset, such as the (changes of) dialogue states, and we treat each utterance independently. We also remove slot types that overlap with the task-specific training/test data.

TOP (Gupta et al., 2018) is a task-oriented utterance parsing dataset based on a hierarchical annotation scheme that annotates utterances with nested intents and slots. Each slot type also comes with a description. In our setup, we train on all seven domains of SNIPS as well as varying amounts of training data from the TOP training set (0, 20, and 50 examples), and use the TOP test set as an out-of-distribution domain for evaluation. We report span-level F1 (micro-averaged).
We compare our method against a number of representative baselines. Concept Tagger (CT) (Bapna et al., 2017) is a slot-filling framework that directly uses original slot descriptions to generalize to unseen slot types. Robust Zero-shot Tagger (RZT) (Shah et al., 2019) is an extension of CT that incorporates example slot values to improve the robustness of the model's zero-shot capability. Coach (Liu et al., 2020) is a coarse-to-fine model for slot filling; it also encodes raw slot descriptions. We also include a Zero-Shot BERT Tagger (ZSBT) based on BERT (Devlin et al., 2019) as an additional baseline. ZSBT directly encodes raw slot descriptions and utterances and predicts a tag (B, I, or O) for each token in the utterance.

Results and Analysis
We report F-1 of the baselines and our model on each target-domain test set of SNIPS, as well as the average F-1 across domains. For each target domain, all models are trained on the other six domains. As shown in Table 2, our QA-driven slot filling framework (QASF) significantly outperforms all baselines in five of the seven domains, with slightly lower performance than ZSBT on BookRestaurant and lower performance than Coach on FindScreeningEvent. The average F-1 of QASF is around 7% higher than the prior published state-of-the-art Coach model, and about 2% higher than the ZSBT baseline. Adding the intermediate pretraining stage on weakly supervised data further improves performance over QASF in six of the seven domains, the exception being AddToPlaylist. On average, adding pretraining improves over QASF by 2.9% F-1. The zero-shot performance of all models is relatively worse on PlayMusic, RateBook, and FindScreeningEvent; a more detailed discussion is in Appendix Section C.

Table 3 summarizes the TOP test results: (1) In both the zero-shot and few-shot settings, our QASF outperforms ZSBT, with a bigger improvement in the zero-shot setting. (2) Pretraining on the weakly supervised QA pairs helps more in the zero-shot setting than in the few-shot setting, with a 20% relative improvement. This shows that QASF (w/ pretraining) is more robust to domain shift when there is no target-domain training data.
Impact of QG strategy and pretraining To understand the influence of question generation and the impact of pretraining with synthetic QA pairs, we perform ablation studies of both components on the SNIPS dataset. Table 7 in the Appendix shows the full ablation results (F-1). "w/o QG" refers to a model trained with raw slot descriptions and utterances.
Firstly, the question generation strategy consistently helps, with a 2.48% F-1 gain in the "w/o pretraining" setting and a 4.22% F-1 gain in the "w/ pretraining" setting. Secondly, the pretrained representations from additional weakly supervised data improve F-1 by 2.86% with QG and 1.12% without QG. More interestingly, the gain from the questioning strategy is larger when combined with pretraining (9.8% as compared to 5.9%). This demonstrates that the synthetic QA pairs also help the model acquire QA-aware representations before fine-tuning on the task-specific slot-filling data.

Error Analysis
We further conduct a manual error analysis of the models' predictions on SNIPS. We find several sources of errors:

Variance between the source and target domains. Sometimes even slot types with the same name refer to different kinds of objects in different domains. For example, the slot type "object_type" in the "RateBook" domain refers to object types like textbook, essay, and novel, while in "FindScreeningEvent" it refers to event types like movie times/schedules. In the two domains, they share the same raw descriptions. In the table below, we show the performance of models on utterances with "object_type" and "object_name" spans (according to gold annotations). We can see that the performance on these special slots is significantly lower than the general average on all examples (40-50%). Still, the questioning strategy helps transfer semantic information. Moreover, the variance in semantic meaning between slot types in SNIPS and TOP is even larger: for slots like "location_modifier" and "road_condition", there are no semantically similar slots in SNIPS or the pretraining dataset, which results in low performance. Having more specific/detailed slot descriptions and using them in question generation would likely help further (Brown et al., 2020; Du and Cardie, 2020).
Annotation artifacts of the SNIPS dataset and vocabulary sparsity for certain slot types. Our QASF framework does not perform well on the target domain "BookRestaurant", so we take a closer look at it. We find that there are only 25 possible values in total for the slot "restaurant_type", and over 51% of its mentions are the single token "restaurant" (see the table below). A very simple approach (assigning the type "restaurant_type" to all "restaurant" tokens) can obtain decent performance. This does not happen for other slot types in BookRestaurant (e.g., cuisine, restaurant_name), whose possible values are more diverse and more evenly distributed.

Slot Value | "restaurant" | "bar"  | "pub"  | "brasserie"
Proportion | 51.81%       | 7.26%  | 6.60%  | 6.23%

Conclusion
We propose a QA-driven method with weakly supervised pretraining for zero-shot slot filling. Our experimental results and analyses demonstrate the benefits of the QA formulation, especially when combined with pretraining on synthetic QA pairs.

A Details of Question Generation
The part-of-speech tagset is based on the Universal Dependencies scheme (universaldependencies.org/u/pos/). The named entity labels are based on OntoNotes 5.0 (Weischedel et al., 2013). In Table 6, we describe the set of rules for selecting the WH_word.

B Inference Constraints
At inference time, predicting the slot filler spans is more complex, as there can be several or no spans to be extracted for each slot type. After the output layer, we have the probability of each token x_i being the start (P_s(i)) or end (P_e(i)) of a span. We harvest all valid candidate spans for each slot type with the following heuristics: 1. Enumerate all possible combinations of start offset (start) and end offset (end) of the spans (M(M-1)/2 candidates in total); 2. Eliminate the spans not satisfying the constraints: (1) the start and end token must be within the utterance; (2) the length of the span must be shorter than a maximum length constraint; (3) a span must have a larger probability than the probability of "no answer" (which is represented by the [CLS] token), namely P_s(start) · P_e(end) > P_s([CLS]) · P_e([CLS]).
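A minimal sketch of this decoding procedure (the function and argument names are illustrative assumptions; probabilities are plain Python lists here):

```python
def decode_spans(p_start, p_end, in_utterance, max_span_len=10):
    """Harvest candidate spans for one slot type.

    p_start[i], p_end[i]: probability that token i starts / ends an answer.
    Position 0 is [CLS], which represents "no answer".
    in_utterance[i]: True if token i belongs to the utterance (not the question).
    """
    no_answer = p_start[0] * p_end[0]
    M = len(p_start)
    spans = []
    for start in range(1, M):  # 1. enumerate start offsets
        # constraint (2): max_span_len bounds the inner enumeration
        for end in range(start, min(start + max_span_len, M)):
            # constraint (1): both ends must lie inside the utterance
            if not (in_utterance[start] and in_utterance[end]):
                continue
            score = p_start[start] * p_end[end]
            # constraint (3): must beat the "no answer" probability at [CLS]
            if score > no_answer:
                spans.append((start, end, score))
    return sorted(spans, key=lambda s: -s[2])  # best candidates first
```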

C Further Analysis and Discussions
We conduct further analysis to understand how and why the models are effective.
C.1 Impact of question generation strategy and pretraining

Table 7 shows the full ablation results.

C.2 Analysis on Seen versus Unseen Slots
To understand the transfer capability of our models, we further split the SNIPS test set for each target domain into "seen" and "unseen" slots. An example is categorized as "unseen" as long as its gold annotation contains an unseen slot (i.e., a slot that does not exist in the remaining six source domains); otherwise, it counts as "seen". A full list of unseen slots for each target domain can be found in the Appendix. As shown in Table 5, we can see that (1) both the ZSBT baseline and our models perform better on the "seen" slots than on the "unseen" ones; the numbers drop substantially on the "unseen" slots. This shows that transferring from the source domains to unseen slots in the target domain is a hard problem.
(2) On the portion of examples with "seen" slots, our best model outperforms ZSBT by around a 2% margin. (3) On the "unseen" portion of examples, the margin is larger: our QASF and the pretraining step improve performance by over 4%. The second and third observations together demonstrate that the questioning strategy improves the model's capability of transferring between related but not identical slot types (e.g., "object_name" and "entity_name").

Table 5: Averaged F-1 scores over all target domains of the SNIPS dataset (for "unseen" and "seen" slots).

D Hyper-parameters and Training Details
We use the uncased version of the BERT-base model (Devlin et al., 2019) for both QA fine-tuning and pretraining. The model is fine-tuned for 5 epochs with a starting learning rate of 3e-5 on the SNIPS dataset, and pretrained for 5 epochs with a starting learning rate of 5e-7 on the synthetic QA dataset. Our implementation is based on https://github.com/google-research/bert/blob/master/run_squad.py
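For reference, the reported hyperparameters can be summarized in a small config sketch (the values come from this section; the dict names are illustrative, and the actual runs use run_squad.py flags):

```python
# Stage 1: pretraining on synthetic QA pairs from MultiWOZ 2.2.
PRETRAIN = {
    "model": "bert-base-uncased",
    "epochs": 5,
    "learning_rate": 5e-7,
}
# Stage 2: fine-tuning on QA-formatted slot filling (SNIPS).
FINETUNE = {
    "model": "bert-base-uncased",
    "epochs": 5,
    "learning_rate": 3e-5,
}
```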

Table 6: Rules for selecting the WH_word.

WH_word         | Conditions                                                                 | Answer Examples
How long        | The answer phrase is modified by a cardinal number (CARDINAL) or quantifier phrase (QUANTITY) whose object is a temporal unit, as defined in (Pan et al., 2011), i.e., second/minute/hour/day/week/month/year/decade/century. | nights
How many        | The answer phrase is modified by a cardinal number (CARDINAL) or quantifier phrase (QUANTITY) and the object is not a temporal unit. | 2 stars, 3 tickets
How + adjective | The answer phrase is an adjective (ADJ).                                   | moderate, expensive
When            | The answer phrase's head word is tagged DATE or TIME.                      | 1:30 PM, 1999
Who             | The answer phrase's head word is tagged PERSON or is a personal pronoun PRON (I, he, herself, them, etc.). | mother, Dr. Williams
Where           | The answer phrase is a prepositional phrase whose object is tagged GPE or LOC and whose preposition is one of the following: on, in, at, over, to. | amc theaters, fort point san francisco, east, west, ...
Which           | The answer phrase is a determiner DT (this, that) or an ordinal ORDINAL.   | this, first, current, last
What            | All other cases.                                                           |