Frustratingly Simple Few-Shot Slot Tagging

We propose a simple and effective few-shot model for slot tagging. Recent work shows that it is promising to extend standard few-shot classification methods to sequence labeling with CRF-specific augmentations. Such methods show strengths in encoding slot name semantics and slot dependencies. However, we find that these strengths can be obtained by a much simpler method, which casts slot tagging as machine reading comprehension (MRC). We fine-tune a standard BERT-based MRC model on a mixture of source domain and (few-shot) target domain data. This simple method outperforms state-of-the-art methods by a large margin on the SNIPS dataset.


Introduction
This paper considers the task of few-shot slot tagging. Slot tagging (Zhang and Wang, 2016; Haihong et al., 2019) is a core component of task-oriented dialog systems (Papineni et al., 2001), where the goal is to provide a fine-grained, structured description of the user's request for a given intent. Example (1) shows the input and output of a slot tagging module for the book restaurant intent, where the module yields a semantic analysis of the input query in terms of slot label-value pairs such as restaurant type is 'brasserie', time range is '15 minutes', etc.
(1) Query: I want to book a far brasserie that serves minestrone in PA for a party of 9 in 15 minutes.
Tagged Slots: {restaurant type: 'brasserie', time range: '15 minutes', state: 'PA', party size number: '9', served dish: 'minestrone'}

In real-world systems, slot taggers are required to rapidly cover new domains to address increasing user needs. A key challenge here is that labeled data are often scarce in new domains, and the high cost of manually annotating large-scale data becomes a major obstacle for domain adaptation. An attractive alternative is few-shot learning, which aims at achieving reasonably good results using only a few labeled instances in the new domain.
Although there are many successful few-shot classification methods, especially meta-learning ones (Bapna et al., 2017; Luo et al., 2018; Fritzler et al., 2019), directly adapting them to slot tagging often yields unsatisfactory results. This is due to the sentence-level, extractive nature of slot tagging, where token dependencies within sentences are important but are ignored by token-level classification models. Recent work (Hou et al., 2020) tackles this problem by extending meta-learning methods to sequence transduction within the BERT-CRF framework, using several CRF-specific augmentations. Such augmentations show strengths in both encoding slot name semantics and modeling slot dependencies, which are key elements for effective few-shot slot tagging.
This paper shows that we can enjoy similar strengths with a frustratingly simple method. Our approach is based on the idea of transforming few-shot slot tagging into supervised machine reading comprehension (MRC); the detailed formulation is described in Section 3, with an example shown in Table 1. The implementation of our method is incredibly simple: we fine-tune an off-the-shelf BERT-based (Devlin et al., 2019) MRC model on merged data from the source domain and the (few-shot) target domain, without any meta learning or extra engineered components.
Our simple method works for good reasons. As the MRC-based approach extracts the full span of each slot value based on the complete sentence, the model is aligned with the extractive nature of the task and implicitly considers slot dependencies. Moreover, the model can naturally encode label semantics by mentioning the label names in the constructed questions. For our MRC model, the slot labels and the sentences being tagged from both source and target domains all reside in the same semantic space, where training on the mixed data forces the model to generalize to a semantic space that is compatible with both domains. Through experiments, we also find that our model can better leverage the linguistic and world knowledge in pre-trained language models than previous BERT-based few-shot slot taggers.
The contributions of this paper are twofold: (1) we propose a simple and effective approach to few-shot slot tagging, based on training a supervised MRC model; (2) we empirically show the effectiveness of the proposed method, which outperforms the previous state-of-the-art by 4+ points on the SNIPS benchmark.

Related Work
Slot Tagging Intent detection and slot tagging are two key modules in spoken language understanding. Slot tagging is often cast as a sequence labeling problem (Zhang and Wang, 2016; Liu and Lane, 2016; Haihong et al., 2019; Qin et al., 2019). This paper adopts an MRC formulation and focuses on the few-shot learning setup.

Few Shot Learning
In NLP, few-shot learning methods mostly focus on classification tasks (Geng et al., 2019), while efforts on sequence labeling tasks like slot tagging are rare (Luo et al., 2018; Fritzler et al., 2019). Hou et al. (2020) explored few-shot slot tagging by considering both label dependency transfer and label name semantics. Our model enjoys similar strengths but is much simpler and more effective.
QA Format for NLP Tasks Question answering models, in particular machine reading comprehension (MRC) models, are typically trained to answer questions by extracting a text span from a given context. Recently, there has been a trend of casting non-QA NLP tasks, such as information extraction (Levy et al., 2017; Li et al., 2019), coreference resolution (Hou, 2020), and more (McCann et al., 2018), into MRC, which can achieve comparable or improved results. Our work is inspired by these works, but tackles a new task of slot tagging with a focus on few-shot learning.

Slot Tagging as MRC
MRC formulation Given a question $Q = q_1, q_2, \ldots, q_L$ and a context passage $C = c_1, c_2, \ldots, c_M$, where $|Q| = L$ and $|C| = M$ are their respective numbers of tokens, the question $Q$ is used to extract the required span from the context $C$. The task is to find the span between the start token $C_{\text{start}}$ and the end token $C_{\text{end}}$ in the context, for the question associated with each slot type. Some example questions and their context are shown in Table 1. For all questions associated with the same sentence, we provide one context $C$, which consists of the original sentence concatenated with the special tokens "NO ANSWER"; this sentinel is extracted when no answer span is available in the given sentence for that question (slot type).
Following the MRC setup of BERT (Devlin et al., 2019), we resort to the standard question-answering usage of BERT to find the span from $C_{\text{start}}$ to $C_{\text{end}}$, i.e., feeding the token sequence $[\text{CLS}], q_1, q_2, \ldots, q_L, [\text{SEP}], c_1, c_2, \ldots, c_M$ as the input to the BERT model, where the special tokens [CLS] and [SEP] delimit $Q$ and $C$. The hidden states $\mathbf{h}_i$ from the last layer of BERT are taken as the representations of the input tokens. The probabilities $P_{\text{start}}(i)$ and $P_{\text{end}}(i)$ of each token position $i = 1, 2, \ldots, M$ being the start and end of the answer span are computed as in Eq. (1), where $\mathbf{s}$ and $\mathbf{e}$ are learned start and end vectors:

$$P_{\text{start}}(i) = \frac{\exp(\mathbf{s}^{\top}\mathbf{h}_i)}{\sum_{j}\exp(\mathbf{s}^{\top}\mathbf{h}_j)}, \qquad P_{\text{end}}(i) = \frac{\exp(\mathbf{e}^{\top}\mathbf{h}_i)}{\sum_{j}\exp(\mathbf{e}^{\top}\mathbf{h}_j)} \quad (1)$$

At prediction time, the tokens between the positions with the highest $P_{\text{start}}(i)$ and $P_{\text{end}}(j)$ are predicted as the slot value for the slot type asked about in question $Q$.
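To make the span-extraction step concrete, here is a minimal sketch using an off-the-shelf BERT QA head, assuming the HuggingFace transformers library; the checkpoint name, the greedy decoding, and the constraint that the end follows the start are illustrative choices, not details specified in the paper.

```python
# Minimal sketch of MRC span extraction with a BERT QA head (HuggingFace
# transformers assumed). The QA head is randomly initialized here and would
# need fine-tuning on the converted triples before giving sensible spans.
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "book restaurant restaurant type"
context = ("i want to book a far brasserie that serves minestrone in pa "
           "for a party of 9 in 15 minutes NO ANSWER")

# Input takes the form [CLS] question [SEP] context [SEP]
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Eq. (1): softmax over positions for start and end probabilities
p_start = out.start_logits.softmax(dim=-1)[0]
p_end = out.end_logits.softmax(dim=-1)[0]

# Greedy decoding (simplified): best start, then best end at or after it
start = int(p_start.argmax())
end = start + int(p_end[start:].argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])
print(answer)
```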
To train the MRC model, we first convert the original dataset, i.e., sentences paired with their slot annotations, into a set of <question, answer, context> triples, similar to the format of the SQuAD 1.1 dataset (Rajpurkar et al., 2016). Triples from different samples are shuffled into batches and fed into the MRC model to predict the start and end position indices. Every slot type in the corresponding domain is asked about in turn to find the corresponding span, or 'NO ANSWER' is extracted if that slot type does not appear. We use the cross-entropy loss between predictions and ground truths.
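The conversion can be sketched as follows, assuming BIO-style slot annotations; the function names (bio_to_triples, make_question) are illustrative, and make_question is the question template sketched after the next paragraph.

```python
# A hedged sketch of converting slot-annotated sentences into SQuAD-style
# <question, answer, context> triples; names are illustrative, and
# `make_question` is the question-template function sketched below.
def bio_to_triples(tokens, bio_tags, domain, slot_types, make_question):
    # The sentinel "NO ANSWER" is appended so it can be extracted as the
    # answer for slot types absent from the sentence.
    context = " ".join(tokens) + " NO ANSWER"
    triples = []
    for slot in slot_types:                      # ask about every slot type
        span = [t for t, tag in zip(tokens, bio_tags)
                if tag in (f"B-{slot}", f"I-{slot}")]
        answer = " ".join(span) if span else "NO ANSWER"
        triples.append((make_question(domain, slot), answer, context))
    return triples
```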
Question Generation For each slot type $y \in \mathcal{Y}$ to be predicted, we use a unified template to generate a question $Q$ by joining the domain name with the slot type, both split at upper-case letters and underscores, since together they carry the necessary label semantics while staying short enough to keep the model focused on the context $C$. For example, for the slot restaurant_type of the book restaurant domain in example (1), the generated question is 'book restaurant restaurant type'.
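A possible rendering of this template, assuming SNIPS-style camel-case domain names and underscored slot names (the exact string handling is not specified in the paper):

```python
import re

# Illustrative question template from Section 3: join the domain name and
# slot type, splitting on upper-case letters and underscores.
def make_question(domain: str, slot: str) -> str:
    def split_name(name: str) -> str:
        parts = re.sub(r"([A-Z])", r" \1", name).replace("_", " ").split()
        return " ".join(p.lower() for p in parts)
    return f"{split_name(domain)} {split_name(slot)}"

# e.g. make_question("BookRestaurant", "restaurant_type")
# -> "book restaurant restaurant type"
```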

Few Shot Learning
For a few-shot learning task, we have a target domain $D_1 = \{(x_i, y_i)\}$ with few labeled data, and resource-rich source domains $D_2, \ldots, D_n$. The task is to discover the optimal hypothesis $h$ from $x$ to $y$ in domain $D_1$. To fit our MRC model into the $N$-way $k$-shot setting, we follow the data construction procedure of Hou et al. (2020), where the support set $S = \{(x_i, y_i)\}$ is constructed by ensuring that every slot type in the target domain appears approximately $k$ times and each entity appears only once in a sentence.
We randomly generate 100 such support sets. For each set, we pair it with a query set of 20 held-out samples to form an episode. With $D$ domains, our test set is thus made up of $D \times 100$ episodes. Note that our model is trained in one go on the mixture of the source domain data and the support set of the target domain in each episode, unlike the two-phase training in typical few-shot learning methods. Despite this difference, the amount and split of the data for training and evaluation are exactly the same as in previous work. For zero-shot learning, we directly evaluate the model trained on the source domains on the full target domain.
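As a rough illustration of the support-set construction, the following is a simplified greedy variant under the stated constraint that every slot type appears about $k$ times; it is not the exact published sampling algorithm.

```python
import random
from collections import Counter

# Simplified, illustrative k-shot support-set sampling (not the exact
# published procedure): greedily add sentences until every slot type
# observed so far appears at least k times.
def sample_support_set(sentences, k):
    """sentences: list of (tokens, {slot_type: value}) pairs."""
    random.shuffle(sentences)
    counts, support = Counter(), []
    for tokens, slots in sentences:
        if any(counts[s] < k for s in slots):  # still need one of its slots
            support.append((tokens, slots))
            counts.update(slots.keys())
        if counts and all(c >= k for c in counts.values()):
            break
    return support
```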

Setup
Dataset Our experiments are based on the SNIPS dataset (Coucke et al., 2018), a benchmark for slot tagging. It has data samples from 7 different domains, namely Weather (We), Playlist (Pl), Book (Bo), Music (Mu), Restaurant (Re), Screening Event (Se), and Creative Work (Cr). Following the few-shot setup in previous work, we split the SNIPS data by domain. Each time, we leave one domain for testing, one for development, and the others for training. This procedure is repeated 7 times for cross-validation.
Baselines Bi-LSTM (Schuster and Paliwal, 1997) is trained on the support set and tested on the query set using GloVe word embeddings (Pennington et al., 2014). Matching Network (MN) with BERT (Vinyals et al., 2016) builds on top of BERT and labels the sequence by token-level classification: for each word, the most similar token in the support set is chosen and its label is assigned accordingly. WarmProtoZero (WPZ) (Fritzler et al., 2019) adopts a similar strategy to MN, except that it replaces the matching network with a prototypical network (Snell et al., 2017). SimBERT also classifies each token by the most similar word in the support set and assigns the corresponding label, where BERT embeddings are used without fine-tuning. TransferBERT is a domain-transfer model based on vanilla BERT: it is pre-trained on source domain data and then fine-tuned on the support set of the target domain. L-TapNet+CDT (Hou et al., 2020) is a sequence labeling model based on BERT+CRF, where Collapsed Dependency Transfer is used for transferring label-to-label dependencies and TapNet is used for transferring label semantics.

Implementation Details
The pre-trained BERT-base uncased model is used for our method, with a batch size of 16 and a maximum sequence length of 512. We use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of $1 \times 10^{-5}$ during training. We train the model for 30 epochs for each episode of evaluation and select results according to the development domain.
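With the stated hyperparameters, the fine-tuning loop could look as follows; `dataset` (yielding tokenized question-context pairs with gold start/end positions) and `model` (the BERT QA model from the earlier sketch) are assumed, and the loop structure is illustrative rather than the authors' exact training script.

```python
import torch
from torch.utils.data import DataLoader

# Illustrative fine-tuning loop using the stated hyperparameters:
# batch size 16, Adam with lr 1e-5, 30 epochs. `dataset` yields dicts with
# input_ids, attention_mask, token_type_ids, start_positions, end_positions.
loader = DataLoader(dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

model.train()
for epoch in range(30):
    for batch in loader:
        # The QA head returns the cross-entropy loss over start/end indices
        loss = model(**batch).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```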

Experimental Results
Main Results for 5-Shot Learning Table 2 shows the results for 5-shot learning. Each column gives the per-domain results, where that domain is used as the target domain and the others as source domains. In most domains, our model achieves better results than all baselines, whether BERT-based or not. In particular, our method outperforms the previous SOTA, L-TapNet+CDT, by 4.16 points in average F1 score.

Analysis Figure 1 compares the performance of our model in the zero/one/five-shot setups. Since the training of our method is just fine-tuning on the mixture of source and target domain data, the one-shot setup uses only about 0.01% more (target domain) data than the zero-shot setup. Yet the boost is dramatic for most domains. We speculate that this tiny bit of target domain data has a catalyst effect, changing the optimization trajectory of the model. It might be the case that such training forces our MRC model to generalize to a semantic space that is compatible with both source and target domains. Note that our zero-shot model achieves an average F1 score of 52.5%, outperforming previous zero-shot SOTA results such as the 40.6% of Shah et al. (2019) and a previously reported 37.39%. As for the one-shot setting, the average F1 score of our model is 69.3%, on par with the 70.4% of the 1-shot SOTA (Hou et al., 2020).
Few-shot learners typically rely on source domain data to arrive at a good hypothesis. Figure 2 shows the sensitivity of our model to the scale of available source domain data. In each episode of evaluation, we select a subset (100, 1000, or 2000) of sentences from the source domains according to the rank of their text similarity to the support set. While "the more the better" holds in general, we see that 1000 source domain sentences suffice for competitive results.
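The similarity-based selection could be implemented as below; since the paper does not specify the similarity measure, TF-IDF cosine similarity is used here purely for illustration, and the function name is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hedged sketch: rank source-domain sentences by text similarity to the
# support set and keep the top k. TF-IDF cosine similarity is an assumed
# choice; the paper does not name the measure it uses.
def top_k_source(source_sents, support_sents, k=1000):
    vec = TfidfVectorizer().fit(source_sents + support_sents)
    src = vec.transform(source_sents)
    sup = vec.transform(support_sents)
    scores = cosine_similarity(src, sup).max(axis=1)  # best match per sentence
    ranked = sorted(zip(scores, source_sents), reverse=True)
    return [s for _, s in ranked[:k]]
```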

Conclusion
In this paper, we propose a BERT-based MRC approach to few-shot slot tagging. By casting slot tagging as an MRC problem, learning consists of fine-tuning the MRC model with labeled sentences from a mixture of source domain and few-shot target domain data. Such an MRC-based method can naturally encode the label semantics in the form of questions, while the training forces the model to generalize to a semantic space that is compatible with both domains. Experimental results show the effectiveness of our simple method, as it outperforms the previous SOTA on the SNIPS benchmark by a large margin. For future work, we plan to extend our approach to similar tasks, such as semantic role labeling and named entity recognition.