Learning to Bridge Metric Spaces: Few-shot Joint Learning of Intent Detection and Slot Filling

In this paper, we investigate few-shot joint learning for dialogue language understanding. Most existing few-shot models learn a single task each time with only a few examples. However, dialogue language understanding contains two closely related tasks, i.e., intent detection and slot filling, and often benefits from jointly learning the two tasks. This calls for new few-shot learning techniques that are able to capture task relations from only a few examples and jointly learn multiple tasks. To achieve this, we propose a similarity-based few-shot learning scheme, named Contrastive Prototype Merging network (ConProm), which learns to bridge the metric spaces of intent and slot on data-rich domains and then adapts the bridged metric space to a specific few-shot domain. Experiments on two public datasets, Snips and FewJoint, show that our model significantly outperforms strong baselines in the one-shot and five-shot settings.


Introduction
Few-Shot Learning (FSL), which aims to learn new problems with only a few examples (Miller et al., 2000; Vinyals et al., 2016), is a promising way to break the data shackles of current deep learning. Commonly, existing FSL methods learn a single few-shot task each time. But real-world applications, such as dialogue language understanding, usually contain multiple closely related tasks (e.g., intent detection and slot filling) and often benefit from jointly learning these tasks (Worsham and Kalita, 2020; Qin et al., 2019; Goo et al., 2018). In few-shot scenarios, such requirements of joint learning present new challenges for FSL techniques: capturing task relations from only a few examples and jointly learning multiple tasks. This paper explores few-shot joint learning in dialogue language understanding as an early attempt at this issue. As shown in Figure 1, FSL models are usually first trained on source training domains and then evaluated on an unseen target test domain. Although joint learning can improve dialogue language understanding by utilizing the relation between intents and slots, e.g., "Harry Potter" is a "film" under the "PlayVideo" intent and a "book" under the "PlayVoice" intent, it faces serious challenges when applied to the FSL setting. Firstly, it is hard to learn generalized intent-slot relations from only a few support examples. Secondly, because the intent-slot relation differs across domains, it is hard to directly transfer prior experience from source domains to target domains. For instance, the intent-slot relation "PlayVideo"-"film" may never have appeared in the source domains.
To tackle the aforementioned joint learning challenges in few-shot dialogue language understanding, we propose Prototype Merging, which learns the intent-slot relation on data-rich training domains and adaptively captures and utilizes it on an unseen test domain. The intent-slot relation is learned with cross-attention between intent and slot class prototypes, which are the mean embeddings of the support examples belonging to the same classes. This intent-slot relation adaptively connects the metric spaces of the two tasks.
Further, to jointly refine the intent and slot metric spaces bridged by Prototype Merging, we argue that related intents and slots, such as "PlayVideo" and "film", should be closely distributed in the metric space, and otherwise well separated. To achieve this, we propose Contrastive Alignment Learning, which exploits class prototype pairs of related intents and slots as positive samples and unrelated pairs as negative samples. With these samples, it regularizes the FSL process with a margined contrastive loss.
Overall, we name the above novel few-shot joint learning framework the Contrastive Prototype Merging network (ConProm), which connects the intent detection and slot filling tasks by bridging their metric spaces. Its two main components cooperate to accomplish this goal. As shown in Figure 2, Prototype Merging builds the connection between the two metric spaces, and Contrastive Alignment Learning refines the bridged metric space by properly distributing the prototypes.
Experiments on two public datasets show that both Prototype Merging and Contrastive Alignment Learning significantly boost few-shot joint learning performance and outperform strong baselines. In summary, our contribution is three-fold: (1) We investigate the few-shot joint dialogue language understanding problem, which is also an early attempt at the general few-shot joint learning problem. (2) We propose a novel Prototype Merging mechanism to build intent-slot connections adaptively. (3) We introduce a Contrastive Alignment Learning objective to jointly refine the metric spaces of intent detection and slot filling. For reproducibility, our code for this paper is publicly available at https://github.com/AtmaHou/FewShotJoint.

Background
Before starting, we introduce the background of dialogue language understanding and few-shot learning.

Dialogue Language Understanding
Dialogue language understanding contains two main components: intent detection and slot filling (Young et al., 2013). Intent detection is a sentence-level classification problem that classifies a user utterance into one of N intent categories. Different from intent detection, slot filling aims to extract key entities within user utterances; it is often achieved by assigning a slot tag to each token of the utterance and is usually formulated as a sequence labeling problem. Given an input utterance x = (x_1, x_2, ..., x_n) as a sequence of words, joint dialogue language understanding predicts the corresponding semantic frame y = (l, t), where l is the intent label and t = (t_1, t_2, ..., t_n) is the slot tag sequence of the utterance.
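A tiny, hypothetical data structure makes the semantic frame y = (l, t) concrete; the class name and example values are ours, not from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SemanticFrame:
    """Joint dialogue language understanding target y = (l, t):
    a sentence-level intent label plus one slot tag per token."""
    tokens: List[str]      # x = (x_1, ..., x_n)
    intent: str            # l
    slot_tags: List[str]   # t = (t_1, ..., t_n), e.g. BIO tags

    def __post_init__(self):
        # each token must carry exactly one slot tag
        assert len(self.tokens) == len(self.slot_tags)

# hypothetical example in the style of the paper's "PlayVideo" domain
frame = SemanticFrame(
    tokens=["play", "harry", "potter"],
    intent="PlayVideo",
    slot_tags=["O", "B-film", "I-film"],
)
```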

Few-shot Learning
Few-shot learning (FSL) extracts prior experience that allows quick adaptation to new problems. Therefore, FSL models are usually first trained on a set of source domains and then evaluated on another set of unseen target domains. Figure 1 shows an example of the training and testing process of few-shot learning for dialogue language understanding.
A target domain only contains a few labeled examples, which are called the support set S = {(x^(i), y^(i))}_{i=1}^{|S|}. S includes K examples (K-shot) for each of N classes (N-way). Taking the classification problem as an instance: given an input query example x = (x_1, x_2, ..., x_n) and a K-shot support set S as references, we find the most appropriate class y* of x:

y* = argmax_y p(y | x, S).

State-of-the-art few-shot learning methods are often similarity-based (Bao et al., 2020; Snell et al., 2017). These methods conquer the extreme lack of data by learning a general similarity metric space on data-rich source domains. Then, on few-shot target domains, they classify a query example according to example-class similarity, where class representations are obtained from the few support examples.
Prototypical Network (Snell et al., 2017) is one of the most classical similarity-based methods. It obtains each class representation as the mean embedding of the support examples belonging to that class, the so-called prototype:

c_i = (1 / |S_i|) Σ_{x ∈ S_i} E(x),

where S_i is the set of support examples of the i-th class and E(·) is the embedding function. The probability of x belonging to the i-th class is then computed as:

p(y = i | x, S) = exp(SIM(E(x), c_i)) / Σ_j exp(SIM(E(x), c_j)),

where SIM(·, ·) is a vector similarity function.
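The prototype computation and similarity-based classification above can be sketched in a few lines of NumPy; the toy 2-way 2-shot embeddings are illustrative (the paper uses BERT embeddings):

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """c_i: mean embedding of the support examples labeled i."""
    return np.stack([
        support_emb[support_labels == i].mean(axis=0)
        for i in range(n_classes)
    ])

def proto_classify(query_emb, protos):
    """p(y=i|x, S): softmax over SIM(E(x), c_i), with dot-product SIM."""
    sims = protos @ query_emb           # one similarity score per class
    e = np.exp(sims - sims.max())       # numerically stable softmax
    return e / e.sum()

# toy 2-way 2-shot episode with 2-d embeddings
emb = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(emb, labels, n_classes=2)
probs = proto_classify(np.array([0.9, 0.1]), protos)
```

The query is closest to class 0's prototype, so class 0 receives the highest probability.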

Proposed Method
In this section, we introduce the proposed Contrastive Prototype Merging network (ConProm). Firstly, we describe few-shot intent detection and slot filling with the Prototypical Network ( §3.1). Based on that, we present the two key components of ConProm: the Prototype Merging mechanism, which adaptively connects the two metric spaces of intent and slot ( §3.2), and Contrastive Alignment Learning, which jointly refines the metric space connected by Prototype Merging ( §3.3).

Few-shot Intent Detection and Slot Filling
We build our few-shot intent detection and slot filling model based on the Prototypical Network described in Section 2.2. Given a query sentence x and a support set S, we estimate the probability of x being associated with intent label l_i as:

p(l_i | x, S) = softmax_i( SIM(E_intent(x), C^intent_i) ),

and estimate the probability of the k-th token of x belonging to the i-th slot class as:

p(t_k = i | x, S) = softmax_i( SIM(E_slot(x)_k, C^slot_i) ),

where C^intent_i and C^slot_i are prototypes derived from the support examples, and E_intent(·) and E_slot(·) are the embedding functions for intent and slot respectively. We adopt BERT (Devlin et al., 2019) as the embedder, and the sentence embedding E_intent(x) is calculated as the average embedding of its tokens. We use dot-product similarity for the function SIM(·, ·).

(Figure 3 caption: thicker lines indicate higher cross-attention scores. For example, "PlayVideo" and "film" are more related, so the corresponding score is larger.)

Prototype Merging
To achieve few-shot joint learning and capture the intent-slot relation with the similarity-based method described above, we need to bridge the metric spaces of intent detection and slot filling. However, as mentioned in the introduction, the intent-slot relation differs across domains, so it is hard to directly transfer a bridged metric space learned on source domains to target domains.
To remedy this, we propose Prototype Merging, which can bridge the metric spaces adaptively. As shown in Figure 3, Prototype Merging adaptively estimates intent-slot relevance with cross-attention between intent and slot, and then merges the intent and slot prototypes with attentive information fusion. Such an attentive fusion process enables both intent and slot prototype representations to reflect the intent-slot relation, and improves domain transferability.
On an unseen target domain, we estimate the intent-slot cross-attention scores from the support set with two methods: (1) using statistics of the co-occurrence of different intents and slots; (2) estimating the intent-slot relevance score from the prototype representations.
Firstly, for the statistic-based attention score, we estimate the intent-slot attention scores A^S by counting the co-occurrence of different intents and slots, where A^S_{i,j} records the normalized number of co-occurrences of the i-th intent and the j-th slot (normalized by row).
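This statistic-based score can be sketched as follows, assuming each support example is reduced to an (intent, slot-set) pair; the representation and function name are ours:

```python
import numpy as np

def statistic_attention(support, n_intents, n_slots):
    """A^S[i, j]: how often intent i co-occurs with slot j in the
    support set, row-normalized so each intent's scores sum to 1."""
    counts = np.zeros((n_intents, n_slots))
    for intent, slots in support:        # one (intent, slot-set) per example
        for s in set(slots):
            counts[intent, s] += 1
    row = counts.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0                  # intents with no slots keep zero rows
    return counts / row

# toy support set: intent 0 always occurs with slot 0,
# intent 1 occurs with slots 0 and 1 equally often
support = [(0, [0]), (0, [0]), (1, [0, 1])]
A_s = statistic_attention(support, n_intents=2, n_slots=2)
```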
Secondly, for the representation-based attention score, we estimate the cross-attention scores with Additive Attention (Bahdanau et al., 2015): 1

A^R_{i,j} = W^T tanh(U C^intent_i + V C^slot_j),

where A^R is the attention matrix and A^R_{i,j} records the cross-attention score between the i-th intent and the j-th slot. U, V and W are parameters learned on the source domains, which preserve the general experience of estimating relevance from representations. C^intent_i and C^slot_j are the prototypes of the i-th intent and the j-th slot respectively. We normalize A^R by row with the softmax function.
We obtain the final cross-attention score matrix A by combining A^S and A^R:

A = λ A^S + (1 − λ) A^R,

where λ is the interpolation factor. After obtaining the cross-attention scores, we represent each intent by fusing in the information of its related slot prototypes, using the attention scores as fusing weights. Similarly, we use intent prototypes to represent slots (see Figure 3). The fusion process is as follows:

C^F-intent_i = Σ_j A_{i,j} C^slot_j,    C^F-slot_j = Σ_i Ã_{i,j} C^intent_i,

where Ã denotes A normalized by column, and C^F-intent_i and C^F-slot_j are the fused prototypes of the i-th intent and the j-th slot respectively.
At last, we obtain the merged prototype representation C' by combining the original prototype C with the fused prototype C^F:

C' = (1 − α) C + α C^F,

where α is a hyper-parameter that controls the importance of the intent-slot relation.

1 We adopt additive attention because we find it outperforms common product-based attention in our setting. This is mainly because additive attention interferes less with the product-based similarity calculation.
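The whole merging step can be sketched as below. The column-side normalization for slot fusion, the default values of λ and α, and the interpolation form are our assumptions, not the paper's exact equations:

```python
import numpy as np

def merge_prototypes(c_intent, c_slot, a_stat, a_rep, lam=0.5, alpha=0.5):
    """Sketch of Prototype Merging: combine the statistic- and
    representation-based scores with interpolation factor lam, fuse
    prototypes across tasks using the scores as weights, then
    interpolate the original and fused prototypes with alpha."""
    a = lam * a_stat + (1.0 - lam) * a_rep       # (n_intents, n_slots)
    fused_intent = a @ c_slot                    # intents absorb related slots
    col = np.clip(a.sum(axis=0, keepdims=True), 1e-8, None)
    fused_slot = (a / col).T @ c_intent          # slots absorb related intents
    merged_intent = (1.0 - alpha) * c_intent + alpha * fused_intent
    merged_slot = (1.0 - alpha) * c_slot + alpha * fused_slot
    return merged_intent, merged_slot

# toy domain: 2 intents, 2 slots, perfectly diagonal relatedness
c_intent = np.array([[1.0, 0.0], [0.0, 1.0]])
c_slot = np.array([[2.0, 0.0], [0.0, 2.0]])
eye = np.eye(2)
mi, ms = merge_prototypes(c_intent, c_slot, a_stat=eye, a_rep=eye)
```

With diagonal attention, each merged prototype is simply the midpoint of an intent prototype and its related slot prototype.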

Contrastive Alignment Learning
Similarity-based few-shot learning relies heavily on a good metric space, in which different classes should be well separated from each other (Hou et al., 2020a; Yoon et al., 2019). In joint-learning scenarios, there is the further requirement of connecting the metric spaces of the jointly learned tasks and optimizing them jointly.
In response, we argue that the distribution of prototypes in dialogue language understanding should fit two intuitions: (1) different intent prototypes should be far away from each other, and the same holds for slot prototypes (Intra-Contrastive); (2) slot prototypes should be close to their related intent prototypes and far away from unrelated intent prototypes (Inter-Contrastive). 2 To achieve this, we introduce a Margined Contrastive Loss to force the model to learn the separation and alignment of intent and slot prototypes.
Firstly, to encourage the separation of prototypes from the same task, we regularize the learning of intent and slot prototypes with the Intra-Contrastive loss L_Intra = 1/2 (L_Intra-intent + L_Intra-slot), where both L_Intra-intent and L_Intra-slot are calculated as:

L_Intra-* = (1 / N^2) Σ_i Σ_{j ≠ i} max(0, m − d(C_i, C_j)),

where m is the margin value, N is the number of prototypes, and d(·, ·) is the distance between prototypes. The margin m is important since it protects the metric space from excessive dispersion. Next, we learn the alignment (separation) between intent prototypes and slot prototypes with the Inter-Contrastive loss L_Inter:

L_Inter = (1 / N_I) Σ_i [ Σ_{j ∈ R_i} d(C^intent_i, C^slot_j) + Σ_{j ∈ U_i} max(0, m − d(C^intent_i, C^slot_j)) ],

where R_i is the set of slots related to the i-th intent, U_i is the set of slots not related to the i-th intent, and N_I is the number of intents.
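The two losses can be sketched as follows; the use of Euclidean distance and simple pair averaging are our assumptions about the unspecified details:

```python
import numpy as np

def intra_contrastive(protos, m=1.0):
    """Intra-Contrastive: penalize same-task prototype pairs that sit
    closer than margin m, encouraging separation within a task."""
    n, loss, pairs = len(protos), 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(protos[i] - protos[j])
            loss += max(0.0, m - d)
            pairs += 1
    return loss / max(pairs, 1)

def inter_contrastive(c_intent, c_slot, related, m=1.0):
    """Inter-Contrastive: pull related intent-slot pairs together and
    push unrelated pairs apart beyond margin m."""
    loss, terms = 0.0, 0
    for i in range(len(c_intent)):
        for j in range(len(c_slot)):
            d = np.linalg.norm(c_intent[i] - c_slot[j])
            loss += d if j in related[i] else max(0.0, m - d)
            terms += 1
    return loss / max(terms, 1)

# toy checks: well-separated prototypes incur no intra penalty,
# collapsed prototypes incur the full margin
li_good = intra_contrastive(np.array([[0.0, 0.0], [5.0, 0.0]]))
li_bad = intra_contrastive(np.array([[0.0, 0.0], [0.0, 0.0]]))
le = inter_contrastive(np.array([[0.0, 0.0]]),
                       np.array([[0.0, 0.0], [5.0, 0.0]]),
                       related={0: {0}})
```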
Here, we simply obtain the relatedness from the co-occurrence matrix A^S of Section 3.2. Finally, the Margined Contrastive Loss is calculated as:

L_Contrastive = L_Intra + L_Inter.

Learning Objective
In the dialogue language understanding task, we jointly learn intent detection and slot filling by optimizing both losses at the same time. Specifically, we use Cross-Entropy (CE) to calculate the losses for intent detection and slot filling. Combined with the loss of Contrastive Alignment Learning, we train the entire model with the following objective function:

L = L_CE-intent + L_CE-slot + L_Contrastive.

Experiments
We evaluate our method on the dialogue language understanding task under the 1-shot/5-shot setting, which transfers knowledge from source domains (training) to an unseen target domain (testing) containing only a 1-shot/5-shot support set.

Settings
Dataset We conduct experiments on two public datasets: Snips (Coucke et al., 2018) and FewJoint (Hou et al., 2020c). Snips is a widely used dataset for dialogue language understanding, containing seven single-intent domains and 53 slots. The other dataset, FewJoint, is a joint dialogue language understanding benchmark used in the few-shot learning contest of SMP2020-ECDT Task-1. 3 It contains 59 multi-intent domains, 143 different intents, and 205 different slots.
In the few-shot learning setting, we train models on several source domains and test them on unseen target few-shot domains. For Snips, we follow Krone et al. (2020a) and combine the single-intent domains into multi-intent domains to enable intent classification. After that, we split the Snips dataset into three parts: a training split with 3 intents, a development split with 2 intents, and a test split with 2 intents. FewJoint is already a few-shot learning benchmark; therefore, we follow the original data split, with 45 domains for training, 5 domains for development, and 9 domains for testing.

3 The Eighth China National Conference on Social Media Processing, https://smp2020.aconf.cn/smp.html

Few-shot Dataset Construction
To simulate the few-shot learning situation, we follow previous few-shot learning works (Vinyals et al., 2016; Krone et al., 2020a; Finn et al., 2017) and organize the data into a few-shot episode style, where the model is trained and evaluated with a series of few-shot episodes. Each episode contains a support set and a query set. However, different from single-task problems, joint-learning examples are associated with multiple labels. Therefore, we cannot guarantee that each label appears exactly K times while sampling examples for a K-shot support set. To remedy this, we build support sets with the Mini-Including Algorithm (Hou et al., 2020a), which is designed for such situations. It constructs the support set following two criteria: (1) all labels appear at least K times in the support set; (2) at least one label will appear fewer than K times if any support example is removed from the support set. For Snips, we construct 200 few-shot episodes for training, 50 for development, and 50 for testing. We set the query set size to 16 for training and development, and 100 for testing. For FewJoint, we use the few-shot episodes provided by the original dataset.
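The two criteria above can be approximated with a greedy construction pass followed by a removal pass. This is an illustrative sketch of the idea, not the reference Mini-Including implementation of Hou et al. (2020a); the (id, label-set) example representation is ours:

```python
import random

def mini_including(examples, k, seed=0):
    """Greedy sketch of Mini-Including: grow a support set until every
    label appears >= k times, then drop any example whose removal still
    preserves that guarantee."""
    rng = random.Random(seed)
    labels = set().union(*(ls for _, ls in examples))
    pool = list(examples)
    rng.shuffle(pool)

    def counts(sup):
        c = {l: 0 for l in labels}
        for _, ls in sup:
            for l in ls:
                c[l] += 1
        return c

    support = []
    for ex in pool:                      # criterion 1: cover every label k times
        if any(counts(support)[l] < k for l in ex[1]):
            support.append(ex)
        if all(v >= k for v in counts(support).values()):
            break
    for ex in list(support):             # criterion 2: drop redundant examples
        trial = [e for e in support if e is not ex]
        if all(v >= k for v in counts(trial).values()):
            support = trial
    return support

# toy joint-learning examples, each carrying multiple labels
examples = [(0, {"a"}), (1, {"a"}), (2, {"b"}), (3, {"a", "b"})]
sup = mini_including(examples, k=1)
```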
Evaluation We adopt three metrics for evaluation: Intent Accuracy, Slot F1-score, and Joint Accuracy. 4 For joint dialogue language understanding, Joint Accuracy is the most important of the three metrics (Hou et al., 2020c). It evaluates sentence-level accuracy, considering a sentence correct only when all its slots and intents are correct.
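Given (intent, slot-sequence) pairs for predictions and gold labels (a representation we assume for illustration), Joint Accuracy can be computed as:

```python
def joint_accuracy(preds, golds):
    """Sentence-level Joint Accuracy: a sentence counts as correct only
    when its predicted intent AND its full slot tag sequence match."""
    correct = sum(
        1 for (p_int, p_slots), (g_int, g_slots) in zip(preds, golds)
        if p_int == g_int and p_slots == g_slots
    )
    return correct / len(golds)

# toy: the second sentence gets the intent right but one slot wrong,
# so only one of the two sentences counts as jointly correct
preds = [("PlayVideo", ["O", "B-film"]), ("PlayMusic", ["O", "O"])]
golds = [("PlayVideo", ["O", "B-film"]), ("PlayMusic", ["O", "B-song"])]
acc = joint_accuracy(preds, golds)
```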
To conduct a robust evaluation under the few-shot setting, we validate the models on multiple few-shot episodes (i.e., support-query set pairs) from different domains and take the average score as the final result. To control for non-deterministic neural network training (Reimers and Gurevych, 2017), we report the average score over 5 random seeds for all results.

Baselines
We compare our model with two kinds of strong baselines: fine-tuning based transfer learning methods (JointTransfer, Meta-JOSFIN) and similarity-based FSL methods (SepProto, JointProto, LD-Proto).

(Table 1 caption: Scores on the 1-shot dialogue language understanding task on the Snips and FewJoint datasets. +FT denotes the fine-tuned model. +TR denotes using the transition-rule trick, which blocks illegal slot predictions, such as an "I" tag after an "O" tag. Results above the mid-line are from non-finetuned methods, and results below the mid-line are from fine-tuning based methods.)

Models
JointTransfer is a domain transfer model based on JointBERT. It consists of a shared BERT embedder with intent detection and slot filling layers on top. We pretrain it on the source domains and finetune it on the target-domain support sets.
Meta-JOSFIN (Bhathiya and Thayasivam, 2020) is a meta-learning model based on MAML (Finn et al., 2017). The meta-learner here is a BERT-based joint dialogue language understanding model similar to JointTransfer. It learns initial parameters that can quickly adapt to the target domain after only a few updates.
SepProto is a prototype-based dialogue language understanding model with BERT embeddings that learns intent detection and slot filling separately. It is pretrained on the source domains and then directly applied to the target domains without fine-tuning.
JointProto (Krone et al., 2020a) is the same as SepProto except that it jointly learns the intent and slot tasks by sharing the BERT encoder.

LD-Proto is also a prototypical model similar to JointProto. The only difference is that it is enhanced with the logit-dependency trick (Goo et al., 2018), where joint learning is achieved by conditioning the intent and slot predictions on the logits of the accompanying task.
Implementation For both our and baseline models, we determine the hyperparameters on the development set. We use ADAM (Kingma and Ba, 2015) for training, with a batch size of 4 and a learning rate of 10^-5. We adopt the embedding tricks of Pair-Wise Embedding (Gao et al., 2019; Hou et al., 2020a) and Gradual Unfreezing (Howard and Ruder, 2018). The λ and α of Section 3.2 are both set to 0.5. We implement both our and the baseline models with the few-shot platform MetaDialog. 5 Besides, to use the information in the target domains and make a fair comparison with the fine-tuning baselines, we also explore the performance of the similarity-based models under a fine-tuning setting (+FT), enhancing them with a fine-tuning process similar to Meta-JOSFIN. In addition, following the suggestions of Hou et al. (2020a), we investigate adding Transition Rules (+TR) between slot tags, which ban illegal slot predictions, such as an "I" tag after an "O" tag.
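As a concrete reading of such a rule, illegal "I" tags can be repaired in a single pass over a predicted BIO sequence. Rewriting the offending tag to "B-X" (rather than blocking it at decode time) is our simplified interpretation of +TR:

```python
def apply_transition_rule(tags):
    """Repair illegal BIO transitions: an "I-X" tag may only follow
    "B-X" or "I-X"; otherwise it is rewritten to "B-X"."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-"):
            typ = tag[2:]
            if prev not in (f"B-{typ}", f"I-{typ}"):
                tag = f"B-{typ}"    # e.g. illegal "I" right after "O"
        fixed.append(tag)
        prev = tag
    return fixed

repaired = apply_transition_rule(["O", "I-film", "I-film"])
```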

Main Results
In this section, we present the evaluation of the proposed method under both the 1-shot and 5-shot dialogue understanding settings.
Results of the 1-shot setting As shown in Table 1, our method (ConProm) achieves the best performance on Joint Accuracy, which is the most important metric. Among all metrics, ConProm only lags slightly behind LD-Proto on intent accuracy. We attribute this to the fact that many slots are shared by different intents, so representing an intent with slots may unavoidably introduce noise from other intents. Considering the huge improvements in Slot and Joint performance over LD-Proto, we argue that this limited loss is a worthy compromise. Since similarity-based models predict slot tags independently for each token, they tend to predict illegal tags. We employ a simple transition rule (+TR) to remedy such defects and further improve performance. For fairness, we also enhance LD-Proto with the TR trick, and our model still outperforms the enhanced baseline. Among the non-finetuned methods, ConProm outperforms LD-Proto by Joint Accuracy scores of 11.05 on Snips and 2.62 on FewJoint, which shows that our model better captures the relation between intent and slot. Our improvements on Snips are higher than those on FewJoint, mainly because there is a clearer intent-slot dependency in Snips. The performance of JointProto is even lower than SepProto, which demonstrates that few-shot joint learning is not a trivial issue solvable by simply sharing embeddings.

5 https://github.com/AtmaHou/MetaDialog

While finetuning brings significant improvements for all methods, our model (ConProm+FT) still achieves the best performance. Interestingly, we observe that finetuning often hurts intent prediction. This suggests that finetuning brings limited gains in sentence-level domain knowledge and leads to overfitting.

Results of the 5-shot setting Table 2 shows the 5-shot results. They are consistent with the 1-shot setting in general trend, and our method achieves the best performance.
While more learning shots improve the performance of all methods, the superiority of our model over the best-performing baseline is further strengthened. This shows that our model can better exploit the richer intent-slot relations hidden in 5-shot support sets.

Analysis
Ablation Test To inspect how each component of the proposed model contributes to the final performance, we conduct an ablation analysis. As shown in Table 3, we independently remove the two main components: Prototype Merging (PM) and Contrastive Alignment Learning (CAL).
When PM is removed, the intent and slot prototypes are represented only by their own support examples, and drops in Joint Accuracy are observed. The loss is larger on FewJoint, because many more slots are shared across different intents in FewJoint, and the attention mechanism of PM is important for identifying the relatedness between intents and slots.
For our model without CAL, we train with only the cross-entropy loss and obtain lower scores in all settings. The performance drop is larger on Snips, mainly because there is a much clearer intent-slot relation in Snips, which CAL can readily exploit.
In terms of contribution, CAL and PM show opposite trends on the two datasets, which indicates that PM and CAL complement each other and reach a balance across various situations.

Visual Analysis of Prototype Distribution To gain a further understanding of the model's effect on bridging the metric spaces of intent and slot, we visualize the prototype distributions in the metric space. As shown in Figure 4, our model successfully refines the prototype distribution by aligning slots to their related intents and keeping the prototypes properly separated.
Sentence-level slot accuracy analysis A seemingly confusing observation in Tables 1 and 2 is that Joint Accuracy scores differ greatly even when Intent Accuracy and Slot F1 scores are similar. We inspect this issue by evaluating Sentence-Level Slot Accuracy, which considers a sentence correct only when all its slots are correct. As shown in Table 4, there is a huge gap in slot accuracy between LD-Proto and ConProm, which explains the gap in the Joint score.

Related Work
Few-shot learning is one of the most important directions in machine learning (Fei-Fei, 2006; Fink, 2004) and is often achieved with similarity-based methods (Vinyals et al., 2016) or fine-tuning based methods (Finn et al., 2017). FSL in natural language processing has been explored for various tasks, including text classification (Geng et al., 2019; Yu et al., 2018), entity relation classification (Gao et al., 2020; Ye and Ling, 2019), and sequence labeling (Luo et al., 2018; Hou et al., 2018; Shah et al., 2019; Hou et al., 2020a). As an important part of dialog systems, dialogue language understanding has attracted much attention in the few-shot scenario. Dopierre et al. (2020), Vlasov et al. (2018) and Xia et al. (2018) explored few-shot intent detection techniques. Luo et al. (2018) and Hou et al. (2020a) investigated few-shot slot tagging using prototypical networks. Hou et al. (2020b) explored few-shot multi-label intent detection with an adaptive logit threshold. But all of these works focus on a single task.
Despite many works on joint dialogue understanding (Goo et al., 2018; Qin et al., 2019; E et al., 2019; Gangadharaiah and Narayanaswamy, 2019; Qin et al., 2020), few-shot joint dialogue understanding is less investigated. Krone et al. (2020b) and Bhathiya and Thayasivam (2020) made the earliest attempts by directly adopting general and classic few-shot learning methods such as MAML and prototypical networks. These methods achieve joint learning by sharing the embedding between the intent detection and slot filling tasks, which models the intent-slot relation only implicitly. By contrast, we explicitly model the interaction between intent and slot with attentive information fusion and a contrastive loss. Experimental results also demonstrate the superiority of our method on this task.

Conclusion
In this paper, we propose a similarity-based few-shot joint learning framework, ConProm, for dialogue understanding. To adaptively model the interaction between intents and slots, we propose Prototype Merging, which bridges the intent and slot metric spaces with cross-attention between intent and slot. To learn a better bridged metric space, we propose Contrastive Alignment Learning, which aligns related cross-task labels in the metric space and forces unrelated labels to be properly separated. Experimental results validate that both Prototype Merging and Contrastive Alignment Learning improve performance.