Few-shot Learning for Slot Tagging with Attentive Relational Network

Metric-based learning is a well-known family of methods for few-shot learning, especially in computer vision. Recently, such methods have been applied to many natural language processing tasks, but not yet to slot tagging. In this paper, we explore metric-based learning methods for the slot tagging task and propose a novel metric-based learning architecture, the Attentive Relational Network. Our proposed method extends relation networks, making them more suitable for natural language processing applications in general, by leveraging pretrained contextual embeddings such as ELMo and BERT and by using an attention mechanism. Results on the SNIPS data show that our proposed method outperforms other state-of-the-art metric-based learning methods.


Introduction
Neural networks have been successfully utilized in natural language processing (NLP) applications with a large amount of hand-labeled data, but they face a persistent challenge in low-resource settings. The approach of learning from few samples, known as few-shot learning, a branch of meta-learning (learning to learn), has recently been popularized in computer vision (Fei-Fei et al., 2006; Ravi and Larochelle, 2016; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018). Recently, few-shot learning has also been applied to NLP tasks, e.g. natural language understanding (Dou et al., 2019), text classification (Jiang et al., 2018; Rios and Kavuluru, 2018; Gao et al., 2019; Geng et al., 2019), machine translation (Gu et al., 2018) and relation classification (Obamuyide and Vlachos, 2019).
In the slot tagging task, we aim to predict task-specific values (e.g. artist, time) for slots (placeholders) in user utterances. Oguz and Vu (2020) propose a two-stage modeling approach that exploits domain-agnostic features to tackle low-resource domain challenges. Other state-of-the-art techniques, e.g. based on external memory (Peng and Yao, 2015), ranking loss (Vu et al., 2016), encoders (Kurata et al., 2016), and attention (Zhu and Yu, 2017), have also achieved promising results with a wide range of neural network methods.
However, as in many other NLP applications, the low-resource issue is a tremendous challenge for slot tagging in new domains, even when labeled samples exist in related domains. Many studies have recently proposed to overcome this challenge using different techniques, e.g. multi-task modeling (Jaech et al., 2016), adversarial training (Kim et al., 2017), and pointer networks (Zhai et al., 2017). In addition, zero-shot learning has influenced work on the domain scaling problem for slot prediction (Bapna et al., 2017), eliminating the need for labeled examples by transferring reusable concepts (Zhu and Yu, 2018; Lee and Jha, 2019) and conveying domain-agnostic concepts between intents (Shah et al., 2019) by exploiting label names and descriptions. Likewise, Hou et al. (2020) use label semantics within the few-shot classification method TapNet (Yoon et al., 2019).
We suggest using a small amount of annotated samples from different domains as training input instead of the slot descriptions and slot names used in previous zero-shot (Bapna et al., 2017; Lee and Jha, 2019; Shah et al., 2019) and few-shot (Hou et al., 2020) slot tagging studies, for two reasons: (1) the creation of slot descriptions requires qualified linguistic expertise and is thus expensive; (2) the relationship between slot names and the corresponding tokens is not constant. For example, the relationship between the slot name 'genre' and the token 'drama' is hypernymic, whereas the relationship between the slot name 'artist' and the token 'Tarkan' is instance-based. Hence, it may not be valid to learn a single function to represent these different relationships between names and tokens.
In this paper, we provide a new experimental design in which the slot tagging task must be solved for unseen slot labels. The experimental design mimics previous few-shot learning studies (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018): existing data sources from different domains are used to learn meta-knowledge, whereas unseen labels from low-resource domains are used to evaluate the models. Furthermore, we propose a novel modeling approach, the Attentive Relational Network, inspired by (Sung et al., 2018; Jiang et al., 2018; Jetley et al., 2018), that leverages contextual embeddings such as ELMo and BERT and extends the previous relation networks (Sung et al., 2018) by learning to attend to local and global features (Jetley et al., 2018). Experimental results on the SNIPS data show that the proposed model outperforms other few-shot learning networks.

Input
FastText (Mikolov et al., 2018) enriches word vectors with a bag of character n-gram vectors. ELMo (Peters et al., 2018) is a contextualized word representation method: it concatenates the outputs of two LSTMs independently trained on a bidirectional language modeling task and returns the hidden states for the given input sequence.
BERT (Devlin et al., 2019) uses a bidirectional transformer model trained on a masked language modeling task. Because of WordPiece embeddings (Wu et al., 2016), there are different choices for representing words; we use the first subtoken to represent each word, as proposed in (Devlin et al., 2019). Additionally, since the model consists of multiple successive layers, i.e., 24 layers, and as suggested in (Oguz and Vu, 2020), we select the 10th, 11th, 12th, and 13th layers, which focus on local context (Clark et al., 2019; Tenney et al., 2018), for slot tagging.
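As a rough sketch of this input step, the snippet below selects the first subtoken per word and the focused layers from a stack of hidden states. All shapes and data are illustrative, and combining the four layers by a simple average is an assumption made here for concreteness, not a detail stated above.

```python
import numpy as np

# Hypothetical shapes: a BERT-large forward pass yields 25 hidden-state
# arrays (embedding layer + 24 transformer layers), each of shape
# (num_subtokens, hidden_dim). Dummy data stands in for real activations.
num_layers, num_subtokens, hidden_dim = 25, 6, 8
hidden_states = np.random.rand(num_layers, num_subtokens, hidden_dim)

# WordPiece may split a word into several subtokens; each word is
# represented by its FIRST subtoken, as in Devlin et al. (2019).
first_subtoken_idx = [0, 2]  # e.g. positions of "play" (of "play ##ing") and "jazz"

# Combine the layers focused on local context (10th-13th transformer
# layers; index 0 holds the embedding layer) - here via a plain average.
focus = hidden_states[10:14].mean(axis=0)   # (num_subtokens, hidden_dim)
word_vectors = focus[first_subtoken_idx]    # (num_words, hidden_dim)
```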

Meta-learning strategy
Although the proposed methods differ in their learning strategies, episode-based training is the same in the meta-training and meta-testing phases for all of them. To apply episodic training in a robust way, we follow the procedures proposed in Vinyals et al. (2016) and Snell et al. (2017). In episodic training, each step, called an episode, is formed to compute gradients and update the model parameters. An episode consists of two components: a support set S and a query set Q. To construct an episode, C unique classes are randomly sampled, and for each of the C classes, K labeled examples are randomly drawn for the support set. The same episode composition strategy is applied in the meta-testing stage to evaluate the performance of the trained model on unseen classes.
Meta-training. The aim of this phase is to learn a meta-learner that maps from a few labeled samples to a classifier. In each episode, meta-training employs a two-stage process: (1) the first stage produces feature maps from the given inputs S and Q via an embedding function f_φ(x); (2) the second stage makes predictions conditioned on the few labeled examples in S. More formally, we define an episode E_train with S and Q selected from the training data D_train. The model is then trained to minimize the label prediction error on Q conditioned on S, i.e., P_θ(y_j | x_j, S), by utilizing the distance or relation, as also shown with an example in Figure 1.
Meta-testing. In this phase, we test the performance of the trained meta-learner on unseen labels by following the same steps as in the meta-training phase.
An episode E_test with S and Q is formed by random selection from the test data D_test. The overall accuracy is computed by averaging over the test episodes, acc = (1/|E_test|) Σ_i acc_i. We set C = 5 in the meta-training and meta-testing stages, except for the meta-testing stage of the SearchCreativeW. domain, where C = 3 because that domain has only three slots. We train all models for 10,000 episodes and evaluate with 1,000 test episodes after every 500 training episodes.
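The episode construction described above can be sketched as follows. The function name and the dictionary layout are illustrative, not from the paper; the sketch only shows C-way K-shot sampling of disjoint support and query sets per episode.

```python
import random

def sample_episode(data_by_label, C=5, K=5, Q=5, rng=random):
    """Build one C-way K-shot episode: K labeled support examples and
    Q further query examples per sampled class.
    data_by_label maps each slot label to a list of token vectors."""
    classes = rng.sample(sorted(data_by_label), C)
    support, query = [], []
    for label in classes:
        # Draw K + Q distinct examples so support and query never overlap.
        examples = rng.sample(data_by_label[label], K + Q)
        support += [(x, label) for x in examples[:K]]
        query += [(x, label) for x in examples[K:]]
    return support, query

# Toy data: 6 labels with 20 dummy "vectors" each.
data = {f"slot{i}": list(range(20)) for i in range(6)}
S, Qset = sample_episode(data, C=5, K=5, Q=5)
```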

Models
We focus on three metric-based few-shot learning methods: Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017), and Relation Networks (Sung et al., 2018). Each network consists of two consecutive modules. The first module, called the embedding function, learns transferable embeddings for support and query samples. The second module is the classifier, which identifies the corresponding classes from the defined metric scores, e.g., distance or relation.
Matching Networks (MatchingNets) compute the cosine distance between the query feature and each support feature and average it per class. Prototypical Networks (PrototypicalNets) compute the Euclidean distance between the query feature and the class mean of the support features. Relation Networks (RelationNets) propose a learnable non-linear relation module that outputs relation scores over the element-wise sum of each support and query feature pair.
In the one-shot scenario, MatchingNets and PrototypicalNets can be interpreted as identical, whereas RelationNets differ through the relation module used to calculate the relation score.
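The two fixed-metric classifiers can be sketched in a few lines (RelationNets are omitted here because their relation module is a trained network, not a closed-form metric). Function names and the toy 2-D features are illustrative only.

```python
import numpy as np

def matching_predict(query, support, labels):
    """MatchingNets-style: average cosine similarity to each class's supports."""
    sims = support @ query / (np.linalg.norm(support, axis=1)
                              * np.linalg.norm(query) + 1e-8)
    scores = {c: sims[labels == c].mean() for c in np.unique(labels)}
    return max(scores, key=scores.get)

def prototypical_predict(query, support, labels):
    """PrototypicalNets-style: nearest class mean in Euclidean distance."""
    scores = {c: -np.linalg.norm(support[labels == c].mean(axis=0) - query)
              for c in np.unique(labels)}
    return max(scores, key=scores.get)

# Toy 2-way 2-shot support set with 2-D features.
labels = np.array(["artist", "artist", "time", "time"])
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = np.array([0.95, 0.05])
```

With these toy features, both metrics assign the query to the "artist" class, since it lies close to that class's supports under both cosine and Euclidean geometry.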

Attentive Relational Networks
We propose a novel metric-learning approach, Attentive Relational Networks (AttentiveRelationalNets), which highlights relevant features and suppresses misleading ones between support and query samples.
AttentiveRelationalNets address the few-shot classification problem by learning to compare based on attention. This can be seen as extending the strategy of Sung et al. (2018) with a learnable attention module. A trainable attention module, inspired by Jetley et al. (2018), is added to the relation module of RelationNets. Besides, instead of using an embedding module, we make use of pretrained (contextual) embeddings, since they have proven strength in feature extraction for linguistic items (Figure 2). As shown in Figure 2, we implement two convolution blocks, as in RelationNets, with residual connections, as proposed in He et al. (2016). The convolution blocks produce local descriptors l_1 and l_2 as the output of the activation function and pass them to the attention estimator in order to compute the global feature vector g.
To compute the compatibility function, we define a convolution function over the two local features followed by an addition operation, c = ⟨u, l_1 + l_2⟩. Here, u represents the universal set of features relevant to the s and q pairs in the object categories. We normalize the compatibility scores with a sigmoid, a = σ(c). The global feature vector is then computed as an element-wise weighted average, g = l_1 ⊙ a. Afterwards, we concatenate the global features g with the learned compatibility scores c as the input to a linear classifier, which produces a scalar in the range 0 to 1 representing the similarity between s and q, called the relation score r. As proposed for RelationNets, we use mean squared error as the objective function of our model.
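The attention arithmetic above can be sketched numerically. The exact tensor shapes are not spelled out in the text, so the reading below (one compatibility score per local position, as in Jetley et al. (2018), and summation for the weighted average) is an assumption; u is random here, whereas in the model it is learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d = 5, 4                 # n local positions, d channels (toy sizes)
l1 = rng.random((n, d))     # local descriptors of a support sample s
l2 = rng.random((n, d))     # local descriptors of a query sample q
u = rng.random(d)           # "universal" feature vector (learned in the model)

c = (l1 + l2) @ u           # compatibility per position: c = <u, l1 + l2>
a = sigmoid(c)              # normalize scores with a sigmoid
g = (l1 * a[:, None]).sum(axis=0)        # attention-weighted global feature g
relation_input = np.concatenate([g, c])  # fed to the linear relation classifier
```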

Evaluation
As we use the same implementation details for the meta-training and meta-testing stages, we evaluate performance by few-shot classification accuracy, following previous few-shot learning studies (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018), with one small change: since meta-learning approaches are fast learners, we report the average accuracy over training epochs instead of the best accuracy.

Resources
In our study we apply few-shot learning approaches to recognize novel slot categories from very few examples in a new domain. To provide a thorough experimental analysis of the proposed networks and language models and to compare the models with each other, we set up various experimental scenarios with different data and different K-shot sizes. We use the SNIPS dataset (Coucke et al., 2018) as the base dataset in our experiments. SNIPS is an SLU dataset of crowd-sourced user utterances with 39 slots and 7 intents. It is a well-categorized dataset whose tasks are grouped into domains, which makes the setup realistic: learn to learn on a set of domains and test on new domains. We split SNIPS with the purpose of creating single-domain datasets: we combine the original training, testing, and development sets and separate them into domain sets in order to create new train and test data.

Few-shot Data Construction
Meta-learning models aim to learn from the training tasks, i.e. the train label space is disjoint from the test label space and the trained model is evaluated on unseen classes. Therefore, we use the data of the other domains as the training set, whereas the models are evaluated on the current domain. Thus, we created 7 different splits, each containing a training set that consists of 6 domains and a test set that includes only the remaining test domain.
As can be seen from Figure 3, we aggregate the data of six domains for the training set, whereas the remaining domain is used for testing, aiming at evaluating the performance of the models on unseen classes per domain. We then convert the train and test sets into train and test collections that contain triplets, mimicking the data organization of previous meta-learning studies. A triplet consists of three items: token, label, and vector. The vector of the corresponding token is produced using different (contextual) embeddings from randomly selected sentences for each label, separately for the train and test sets. Thus, the train collection is formed from triplets of slots from the other domains, whereas the test collection includes only triplets of labels from the corresponding domain.
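The leave-one-domain-out construction can be sketched as follows. The domain names are those of SNIPS; the dictionary layout and toy triplets are illustrative.

```python
# Leave-one-domain-out splits over the seven SNIPS domains.
domains = ["AddToPlaylist", "BookRestaurant", "GetWeather", "PlayMusic",
           "RateBook", "SearchCreativeWork", "SearchScreeningEvent"]

def make_splits(data_by_domain):
    """data_by_domain maps each domain to its (token, label, vector) triplets.
    Returns one (train, test) pair per held-out test domain."""
    splits = {}
    for test_domain in data_by_domain:
        # Train on the triplets of all other domains, test on the held-out one.
        train = [t for dom, triplets in data_by_domain.items()
                 if dom != test_domain for t in triplets]
        splits[test_domain] = (train, list(data_by_domain[test_domain]))
    return splits

# Toy data: three dummy triplets per domain.
toy = {d: [(f"tok{i}", f"{d}_slot", None) for i in range(3)] for d in domains}
splits = make_splits(toy)
```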
To investigate the efficiency of the different models according to data availability in the few-shot setting, we experiment with three collection sizes of slot values: 50, 100, and 200. Note that the collection size controls the total number of values that can be seen for each slot during training. In the meta-testing stage, we only use the test collection with a size of 200 in order to keep the comparative analyses controlled. Furthermore, we examine different numbers of shots K, namely 5-shot, 10-shot, and 15-shot.

Table 1 shows the performance in comparison with state-of-the-art metric-based learning models on the slot tagging task with different (contextual) embeddings. As can be seen from the scores, RelationNets with ELMo embeddings consistently give the best results, whereas MatchingNets give the lowest scores for each embedding variant. We assume that learning from distance or relation scores with individual support and query samples, instead of the class mean as in PrototypicalNets or the class sum as in RelationNets, decreases learning performance.

Different Network Architectures
Furthermore, Table 2 presents the results of AttentiveRelationalNets with different embedding methods and demonstrates that our proposed model with ELMo and BERT consistently outperforms the previous models. Additionally, AttentiveRelationalNets with BERT significantly improve over the results in Table 1. With FastText features, however, Table 2 shows lower results; comparing the FastText results of RelationNets and AttentiveRelationalNets, we observe that RelationNets outperform AttentiveRelationalNets.

Different Contextual Embeddings
We further inspect the wrong predictions to understand the reason for the success of ELMo on FindScreeningE. and observe that ELMo performs well on proper-noun labels such as city and location, as well as labels like object type and movie type. BERT, on the other hand, performs well on proper nouns such as artist and album, and hence outperforms ELMo on the AddToPlaylist and PlayMusic domains. In addition, the improvement of BERT with AttentiveRelationalNets mostly stems from an accuracy increase across all labels, but especially on slot labels that contain common nouns.

Different Collection and Shot Sizes
AttentiveRelationalNets show a clear positive correlation between performance and collection size. In addition, increasing the shot size mostly improves the overall results, apart from PrototypicalNets, which achieve their best result with 10-shot.

Conclusion
We presented a deep analysis of a wide variety of few-shot learning methods and pretrained (contextual) embeddings for slot tagging. Furthermore, we proposed a novel architecture that leverages an attention mechanism attending to both local and global features of the given support samples. Experimental results on the SNIPS dataset show that a) pretrained contextual embeddings contribute to high performance and b) our proposed approach consistently outperformed the other methods in all setups.