An Explicit-Joint and Supervised-Contrastive Learning Framework for Few-Shot Intent Classification and Slot Filling

Intent classification (IC) and slot filling (SF) are critical building blocks in task-oriented dialogue systems. The two tasks are closely related and can benefit each other. Since only a few utterances are available for identifying fast-emerging new intents and slots, the data scarcity issue often arises when implementing IC and SF, yet few IC/SF models perform well when the number of training samples per class is quite small. In this paper, we propose a novel explicit-joint and supervised-contrastive learning framework for few-shot intent classification and slot filling. Its highlights are as follows. (i) The model extracts intent and slot representations via bidirectional interactions, and extends the prototypical network to achieve explicit-joint learning, which guarantees that the IC and SF tasks can mutually reinforce each other. (ii) The model integrates supervised contrastive learning, which ensures that samples from the same class are pulled together and samples from different classes are pushed apart. In addition, the model follows an uncommon but practical way to construct episodes, which abandons the traditional setting of fixed way and shot and accommodates unbalanced datasets. Extensive experiments on three public datasets show that our model achieves promising performance.


Introduction
With the vigorous development of conversational AI, task-oriented dialogue systems have been widely used in many applications, e.g., virtual personal assistants such as Apple Siri and Google Assistant, and chatbots deployed in various domains (Liu et al., 2019a; Yan et al., 2020). Intent classification (IC) and slot filling (SF) are key components of task-oriented dialogue systems, and their performance directly affects the downstream dialogue management and natural language generation tasks (Xu and Sarikaya, 2013). Traditional IC/SF models have achieved impressive performance (Gupta et al., 2019), but they often require a large number of labeled instances per class, which is expensive and often unattainable in industry, especially in the initial phase of a dialogue system.

* Corresponding author.
Few-shot learning aims to solve the data scarcity issue: it recognizes novel categories effectively with only a handful of labeled samples by leveraging the prior knowledge learned from previous categories. Most few-shot learning studies concentrate on the computer vision domain (Fei-Fei et al., 2006; Finn et al., 2017; Jung and Lee, 2020). Recently, to handle the various new or unfamiliar intents that pop up quickly across different domains, some few-shot IC/SF models have been proposed (Geng et al., 2020; Hou et al., 2020). Nevertheless, these methods usually focus on a single task and do not attempt to address the two tasks simultaneously.
Intuitively, IC and SF are two complementary tasks, and the information from one task can be utilized in the other to improve performance. Existing joint IC and SF models have achieved impressive performance in supervised learning scenarios (Weld et al., 2021), but only a couple of methods are custom-designed for the few-shot joint IC and SF task. Krone et al. (2020) directly apply the popular few-shot learning models MAML and prototypical networks to explore few-shot joint IC and SF. During the same period, Bhathiya and Thayasivam (2020) also attempted to utilize MAML for this problem in a similar way. Though these models outperform single-task models, they only implicitly model the relationship between IC and SF. The mutual interaction between IC and SF in these methods remains opaque, like a black box (no concrete formula characterizes the interaction), making it difficult to analyze the internal mechanism.
In this paper, we propose to model the relationship between IC and SF precisely and clearly, as well as to integrate contrastive learning. As illustrated in Figure 1, our framework consists of two main components. First, we present an explicit-joint learning framework for few-shot intent classification and slot filling, which effectively utilizes the bidirectional connection between IC and SF by leveraging a slot-attention-based intent representation and an intent-attention-based slot representation. In addition, we integrate supervised contrastive learning to obtain more class-discriminative embeddings, which is a strong complementary part of our framework.

Figure 1: Illustration of our framework. In the training process, labeled utterances from the support set and query set are first encoded by the pre-processing module. Meanwhile, the intent and slot labels' descriptions are fed into the pre-processing module to generate the intent embedding matrix and slot embedding matrix. The two matrices and the utterance embedding are then fed into the explicit joint learning module, while the utterance embedding is also passed to the supervised contrastive learning module. In the explicit joint learning module, intent and slot extractors, which leverage the attention mechanism, extract intent and slot information, yielding the slot-attention-based intent representation and the intent-attention-based slot representation. Next, the prototypical network uses intent labels to guide slot embedding learning and vice versa. In the supervised contrastive learning module, we construct contrastive samples for each query instance using the support set, and the SCL loss pushes samples from the same class closer together and samples from different classes further apart. In the testing process, the prototypical network is used to predict intent and slot labels, while the supervised contrastive learning module is disabled.
To verify the effectiveness of the proposed model, we conduct extensive experiments on three public datasets. Catering to the unbalanced datasets and very limited labeled samples in real application scenarios, we adopt an uncommon but practical way to construct episodes for few-shot learning, i.e., in each episode, the way and shot are variable. The empirical study validates our proposal and shows promising results of our framework on the IC and SF tasks.

Related Work
Few-shot learning Few-shot learning aims to use the knowledge learned from seen classes, for which abundant labeled samples are available for training, to recognize unseen classes, for which only limited labeled samples are provided (Wang et al., 2020a). It has been widely studied in computer vision, e.g., classification (Fei-Fei et al., 2006; Wang et al., 2020b), segmentation (Wang et al., 2019; Rakelly et al., 2018) and generation (Liu et al., 2019b). Recently it has been extended to natural language processing tasks such as intent detection (Yu et al., 2021; Kumar et al., 2021).
Few-shot classification is an important and challenging task, and several methods have been proposed to tackle it. In particular, metric-based methods (Vinyals et al., 2016; Snell et al., 2017; Yu et al., 2018; Geng et al., 2019; Bao et al., 2020) first learn an embedding space and then use a metric to classify instances of new categories according to their proximities to the labeled examples. In addition to metric-based methods, optimization-based approaches (Ravi and Larochelle, 2017; Finn et al., 2017; Yoon et al., 2018) have also been explored for few-shot classification.
Contrastive learning Contrastive learning applied to self-supervised representation learning has seen a resurgence of interest in recent years, leading to state-of-the-art performance in the unsupervised training of deep image models. Khosla et al. (2020) extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing models to effectively leverage label information. Recently, Gunel et al. (2020) proposed a novel objective function that contains a supervised contrastive learning term for fine-tuning pre-trained language models, which significantly improves model generalization.
Joint intent classification and slot filling Due to the close relationship between IC and SF, Liu and Lane (2016), Zhang and Wang (2016), Goo et al. (2018) and Qin et al. (2019, 2021) propose joint models that consider the correlation between the two tasks. These models fall into two categories. One type of approach (Liu and Lane, 2016; Zhang and Wang, 2016) adopts a multi-task framework to solve the two tasks simultaneously. Although these models outperform single-task models, they only model the relationship implicitly by sharing encoder parameters. The other type of approach (Goo et al., 2018; Qin et al., 2019) explicitly adopts the intent information to guide the slot filling task. Qin et al. (2021) further propose a co-interactive transformer which considers the cross-impact between the two tasks. These explicit-joint learning models have achieved remarkable performance, but they mainly focus on the traditional supervised learning setting.

Problem Definition
A labeled utterance with T words (tokens) can be represented as (x, t, y), where x = (w_1, w_2, ..., w_T) is an utterance with T words, t = (t_1, t_2, ..., t_T) consists of the slot labels of each word in x, and y is the intent label of x. In this paper, few-shot classification is conducted via the episode learning strategy. In the training period, we partition the training set into multiple episodes, each consisting of a support set S and a query set Q. In particular, we randomly select N classes from the training classes to obtain a class set C for each episode. The support set is then formed by randomly selecting k_c labeled samples (utterances) from each class c of the N classes, and a fraction of the remaining samples of these N classes (k_q examples per class) serve as the query set, i.e., S = ∪_{c∈C} S_c and Q = ∪_{c∈C} Q_c. In the test period, we likewise partition the test set into multiple episodes, each containing a support set and a query set. There is no overlap between the training classes and the test classes. Table 1 summarizes the symbols in detail.

Table 1: Symbol explanations.
  C      set of intent classes in each episode
  S      support set of an episode
  Q      query set of an episode
  S_c    set of support data in the c-th class
  Q_c    set of query data in the c-th class
  x      an utterance with T words, x = (w_1, ..., w_T)
  t      slot labels of each word in x, t = (t_1, ..., t_T)
  y      intent label of utterance x
  k_c    number of supports in S_c
  k_q    number of queries in Q_c
  H      pre-processed utterance embedding
  E_I    intent label embedding
  E_S    slot label embedding
  H_I    slot-attention-based intent representation
  H_S    intent-attention-based slot representation
  c      sentence embedding of utterance x

Pre-processing
Given an utterance x = (w_1, w_2, ..., w_T) with T words (tokens), each word first obtains its word embedding e_t from BERT (Devlin et al., 2019). Each word is then further encoded by a recurrent neural network such as a bidirectional LSTM, i.e.,

→h_t = LSTM_fw(e_t, →h_{t−1}),   ←h_t = LSTM_bw(e_t, ←h_{t+1}),

where LSTM_fw and LSTM_bw denote the forward and backward LSTMs respectively, and →h_t ∈ R^{d_h} and ←h_t ∈ R^{d_h} are the hidden states of the t-th word learned from LSTM_fw and LSTM_bw respectively. The entire hidden state of the t-th word is h_t = [→h_t; ←h_t], and the hidden state matrix of the utterance is H = (h_1, ..., h_T). For conciseness, we use d = 2d_h to denote the dimension of the hidden state, obtaining H ∈ R^{T×d}.
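As a rough illustration of the encoding step, the following NumPy sketch replaces the BiLSTM with a plain Elman RNN (the function name, random weights, and dimensions are ours, not the paper's); the point is only how the forward and backward hidden states are concatenated into h_t:

```python
import numpy as np

def simple_birnn(E, d_h, seed=0):
    # Toy bidirectional Elman RNN standing in for the BiLSTM encoder.
    # E: (T, d_w) word embeddings (e.g. from BERT); returns H of shape (T, 2*d_h).
    rng = np.random.default_rng(seed)
    T, d_w = E.shape
    W_f = 0.1 * rng.standard_normal((d_h, d_w))
    U_f = 0.1 * rng.standard_normal((d_h, d_h))
    W_b = 0.1 * rng.standard_normal((d_h, d_w))
    U_b = 0.1 * rng.standard_normal((d_h, d_h))
    fwd, h = [], np.zeros(d_h)
    for t in range(T):                       # forward pass: h_t depends on h_{t-1}
        h = np.tanh(W_f @ E[t] + U_f @ h)
        fwd.append(h)
    bwd, h = [None] * T, np.zeros(d_h)
    for t in reversed(range(T)):             # backward pass: h_t depends on h_{t+1}
        h = np.tanh(W_b @ E[t] + U_b @ h)
        bwd[t] = h
    # entire hidden state of the t-th word: concatenation [->h_t ; <-h_t]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

E = np.random.default_rng(1).standard_normal((5, 8))  # T=5 words, toy embedding dim 8
H = simple_birnn(E, d_h=4)                            # H has shape (5, 8), i.e. T x 2*d_h
```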

Extracting Intent and Slot Representations via Bidirectional Interaction
To explicitly establish the interaction between intent classification and slot filling, for each utterance, we first use the attention mechanism over slot and intent label descriptions to get the initial intent and slot representations (Cui and Zhang, 2019;Qin et al., 2021). Then, these initial representations are concatenated with the utterance embedding matrix to produce the final slot-attention-based intent representation and intent-attention-based slot representation.
In particular, we first use the embeddings of the intent labels' descriptions to produce the intent embedding matrix E_I ∈ R^{|C_intent|×d}, and the embeddings of the slot labels' descriptions to produce the slot embedding matrix E_S ∈ R^{|C_slot|×d}, where |C_intent| is the number of intents in the episode, |C_slot| is the number of slots in the episode, and d is the dimension of the hidden state. E_I and E_S are initialized by pre-processing the intent and slot labels' descriptions, and they are learnable and updated during training. We then calculate the slot-attention-based intent representation and the intent-attention-based slot representation as follows.

Slot-attention-based Intent Representation
Each word attends over the slot label embeddings to obtain an initial slot-aware summary, which is concatenated with the utterance encoding:

H̃_S = softmax(H E_S^⊤) E_S,   H_I = [H; H̃_S].

Intent-attention-based Slot Representation
Symmetrically, each word attends over the intent label embeddings:

H̃_I = softmax(H E_I^⊤) E_I,   H_S = [H; H̃_I].

Here H_I ∈ R^{T×2d} and H_S ∈ R^{T×2d}, and they carry the corresponding intent and slot information respectively.
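A minimal NumPy sketch of the attention over label-description embeddings described above (the exact softmax form and the cross-direction pairing are our assumptions, in the style of label attention in Qin et al. (2021)):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_attention(H, E_label):
    # Attend each word over label-description embeddings.
    # H: (T, d) utterance encoding; E_label: (L, d) label embeddings.
    # Returns the (T, 2d) concatenation of H with the attention summary.
    A = softmax(H @ E_label.T, axis=-1)           # (T, L) attention weights
    summary = A @ E_label                         # (T, d) initial representation
    return np.concatenate([H, summary], axis=-1)  # (T, 2d)

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 16))    # T=6 words, d=16
E_S = rng.standard_normal((9, 16))  # 9 slot labels
E_I = rng.standard_normal((4, 16))  # 4 intent labels
H_I = label_attention(H, E_S)       # slot-attention-based intent representation (assumed pairing)
H_S = label_attention(H, E_I)       # intent-attention-based slot representation (assumed pairing)
```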

Explicit Joint Learning with Prototypical Networks
Inspired by Krone et al. (2020), we also extend prototypical networks to perform joint intent classification and slot filling. Different from Krone et al. (2020), when calculating the prototype of a slot label, instead of considering only the current word, we use a window strategy to take the contextual words into account simultaneously, which seems more reasonable. In general, for each intent class or slot class, the corresponding prototype is the mean vector of the sample embeddings in that class. Given a support set, let S_c = {(x_i, t_i, y_i) ∈ S | y_i = c} be the set of support data with intent label c, and let S_o be the set of words in the support set with slot label o. The prototype p_c of intent label c and the prototype p_o of slot label o are computed as follows:

p_c = (1 / |S_c|) Σ_{x_i ∈ S_c} c_i,   p_o = (1 / |S_o|) Σ_{(i,j) ∈ S_o} (1 / (2l+1)) Σ_{k=j−l}^{j+l} (h^S_k)_i,

where c_i = mean(H_I) ∈ R^{2d} is the embedding of the i-th utterance x_i, and (1 / (2l+1)) Σ_{k=j−l}^{j+l} (h^S_k)_i is the embedding of the j-th word (with slot label o) of utterance x_i, which considers the contextual words simultaneously with a window of size 2l + 1.
Given a query datum (x*, t*, y*) ∈ Q, we compute the conditional probability p(y = c | x*, S) to predict its intent based on the negative squared Euclidean distance:

p(y = c | x*, S) = exp(−‖c* − p_c‖²) / Σ_{c′∈C} exp(−‖c* − p_{c′}‖²).  (6)

Here c* is the embedding of x*. Similarly, we compute the conditional probability p(t_j = o | x*, S) to predict the slots. Finally, we apply the cross-entropy loss over all query instances to construct the IC and SF prototypical loss functions.
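The prototype computation and the Eq. (6)-style classification can be sketched as follows (a minimal NumPy sketch with toy data of our own; the clipping of the window at utterance edges is our assumption):

```python
import numpy as np

def intent_prototypes(sent_embs, intents):
    # p_c: mean sentence embedding over support utterances with intent c.
    return {c: sent_embs[intents == c].mean(axis=0) for c in np.unique(intents)}

def slot_prototype(labeled_words, l=1):
    # labeled_words: list of (H_S, j) pairs, where H_S is one support utterance's
    # word matrix and j is the position of a word carrying the slot label.
    # Each word contributes the mean of H_S over a window of size 2l+1
    # centered on position j (clipped at utterance edges -- our assumption).
    vecs = []
    for H_S, j in labeled_words:
        T = H_S.shape[0]
        vecs.append(H_S[max(0, j - l):min(T, j + l + 1)].mean(axis=0))
    return np.mean(vecs, axis=0)

def proto_probs(query_emb, protos):
    # Softmax over negative squared Euclidean distances to the prototypes (Eq. 6).
    labels = list(protos)
    s = np.array([-np.sum((query_emb - protos[c]) ** 2) for c in labels])
    p = np.exp(s - s.max())
    return dict(zip(labels, p / p.sum()))

embs = np.vstack([np.ones((3, 4)), -np.ones((3, 4))])  # 3 supports per intent class
intents = np.array([0, 0, 0, 1, 1, 1])
protos = intent_prototypes(embs, intents)
probs = proto_probs(0.9 * np.ones(4), protos)          # query near class 0's prototype
```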

Integrating with Supervised Contrastive Learning
Supervised contrastive learning has achieved great success in computer vision; it aims to maximize the similarities between instances from the same class and minimize the similarities between instances from different classes. Here we integrate supervised contrastive learning to generate better intent and slot representations. We first construct contrastive samples for each query instance using the support set: for a query instance x, the support instances with the same label as x serve as positive samples, and those with different labels serve as negative samples. Then, for an episode, the SCL loss of IC can be written as:

L^IC_scl = Σ_{(x_i, t_i, y_i) ∈ Q} (−1 / N_{y_i}) Σ_{(x_j, t_j, y_j) ∈ S, y_j = y_i} log [ exp(z_i · z_j / τ) / Σ_{(x_k, t_k, y_k) ∈ S} exp(z_i · z_k / τ) ],  (9)

where z_i · z_j denotes the inner product of the two vectors, (x_i, t_i, y_i) is a query instance in the query set Q, N_{y_i} is the number of utterances in the support set that have the same intent label y_i, z_i = mean(H) is the pre-processed embedding of x_i, and τ > 0 is an adjustable scalar parameter that controls the degree of separation between classes. To analyze Eq. (9), we can do some simple formula manipulation.
According to Eq. (9), minimizing L^IC_scl means maximizing each log-ratio: the numerator (the positive-pair term) must be maximized while the denominator, which sums over both positive and negative pairs, must stay small, so the negative-pair similarities are decreased. Intuitively, the supervised contrastive learning term pushes samples from the same class closer together and samples from different classes further apart.
In a similar manner, the SCL loss of SF for an episode can be written as:

L^SF_scl = Σ_{w_i ∈ Q_s} (−1 / N_{t_i}) Σ_{w_j ∈ S_s, t_j = t_i} log [ exp(h_i · h_j / τ) / Σ_{w_k ∈ S_s} exp(h_i · h_k / τ) ],  (10)

where h_i and h_j are the embedding representations of the words w_i and w_j, h_i · h_j denotes their inner product, Q_s is the set of words in the query set, S_s is the set of words in the support set, and N_{t_i} is the number of words in the support set that have the same slot label t_i. Here, the same word appearing in different utterances is counted repeatedly, and words with the slot label "Other" are ignored. Note that, different from the symbol t_i which represents the slot labels of all words in an utterance x_i, here t_i represents the slot label of the single word w_i. Combining Eq. (7), (8), (9) and (10), the overall loss function of the proposed framework is:

L = L_IC + λ L_SF + γ L^IC_scl + δ L^SF_scl,  (11)

where λ, γ and δ are trade-off hyperparameters.
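A minimal NumPy sketch of the episode-level SCL computation (the toy vectors are our own; the loss form follows the description above, with positives drawn from the support set and the denominator summing over all support points):

```python
import numpy as np

def scl_loss(z_query, y_query, z_support, y_support, tau=0.1):
    # Supervised contrastive loss in the style of Eq. (9): for each query,
    # positives are the support points sharing its label; the denominator
    # sums over all support points.
    total = 0.0
    for z_i, y_i in zip(z_query, y_query):
        sims = z_support @ z_i / tau                # inner products / temperature
        m = sims.max()                              # log-sum-exp for stability
        log_denom = m + np.log(np.exp(sims - m).sum())
        pos = sims[y_support == y_i]                # the N_{y_i} positive terms
        total += -(pos - log_denom).mean()          # mean implements the 1/N_{y_i}
    return total

z_support = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
y_support = np.array([0, 0, 1, 1])
good = scl_loss(np.array([[1.0, 0.0]]), np.array([0]), z_support, y_support)
bad = scl_loss(np.array([[0.0, 1.0]]), np.array([0]), z_support, y_support)
```

A query embedded near its own class's support points yields a smaller loss than one embedded near the other class, which is exactly the pull-together / push-apart behavior described above.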

Episode Construction
In this section, we outline the episode sampling method of Triantafillou et al. (2020) and Krone et al. (2020), which allows the "way" N and the "shots" k_c to vary across episodes, and thus caters to the unbalanced datasets and very limited labeled instances in real application scenarios. Given a data split containing |C_split| intent classes, an episode is constructed in two steps.
Step 1: Sampling the class set for each episode.
(i) We sample the class number N uniformly from the range [3, |C_split|].
(ii) We sample N intent classes from the data split at random.
Step 2: Sampling the samples for each episode.
(i) Compute the query set size of each class:

k_q = min_{c∈C} ⌊0.5 · |U(c)|⌋,

where C is the set of selected classes and U(c) denotes the set of utterances belonging to class c.
(ii) Compute the total support set size |S|:

|S| = min(U_max, Σ_{c∈C} ⌈β · (|U(c)| − k_q)⌉),

where β is a scalar sampled uniformly from the interval (0, 1], and U_max is the maximum support set size.
(iii) Compute the number of shots k_c of each class:

k_c = min(⌊R_c · (|S| − N)⌋ + 1, |U(c)| − k_q),

where the parameter R_c is computed by:

R_c = exp(α_c) · |U(c)| / Σ_{c′∈C} exp(α_{c′}) · |U(c′)|,

where α_c is sampled uniformly from the interval [log(0.5), log(2)).
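The two-step procedure can be sketched in plain Python. The exact formulas are our reconstruction of the Triantafillou et al. (2020) procedure that this section outlines, so treat the clipping and rounding details as assumptions:

```python
import math
import random

def sample_episode(class_utts, U_max=20, seed=0):
    # class_utts maps class name -> list of utterances for that class.
    # Returns the sampled class set C, per-class shots k_c, and query size k_q.
    rng = random.Random(seed)
    classes = list(class_utts)
    N = rng.randint(3, len(classes))                       # Step 1(i): sample the "way"
    C = rng.sample(classes, N)                             # Step 1(ii): pick N classes
    k_q = min(len(class_utts[c]) // 2 for c in C)          # Step 2(i): query size per class
    beta = rng.uniform(1e-9, 1.0)                          # beta ~ (0, 1]
    S_size = min(U_max, sum(math.ceil(beta * (len(class_utts[c]) - k_q)) for c in C))
    alpha = {c: rng.uniform(math.log(0.5), math.log(2.0)) for c in C}
    Z = sum(math.exp(alpha[c]) * len(class_utts[c]) for c in C)
    k = {}
    for c in C:                                            # Step 2(iii): per-class shots
        R_c = math.exp(alpha[c]) * len(class_utts[c]) / Z
        k[c] = min(int(R_c * (S_size - N)) + 1, len(class_utts[c]) - k_q)
    return C, k, k_q

data = {f"intent_{i}": [f"utt_{i}_{j}" for j in range(12)] for i in range(6)}
C, shots, k_q = sample_episode(data)
```

Because R_c is weighted by exp(α_c)·|U(c)|, larger classes tend to contribute more shots, which is what lets the sampler accommodate unbalanced datasets.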

Datasets
We conduct experiments on three benchmark datasets: ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), and TOP (Gupta et al., 2018). In the pre-processing procedure, we follow Krone et al. (2020) and modify each slot label name by adding the associated intent label name as a prefix. We divide each dataset into a train set (70%), development set (15%), and test set (15%). For the SNIPS dataset, we choose not to form a development set, because there are only 7 intents in SNIPS and we require a minimum of 3 intents per split. Table 2 provides the detailed dataset statistics.

Baselines
Following the work of Amazon AI (Krone et al., 2020), we compare our framework with some popular few-shot models: the first-order approximation of model-agnostic meta-learning (foMAML) (Finn et al., 2017), prototypical networks (Proto), and a fine-tuning method (Fine-tune) (Goyal et al., 2018). For each model, the embedding layer can be GloVe word embeddings (GloVe), GloVe word embeddings concatenated with ELMo embeddings (ELMo), or BERT embeddings (BERT).
Furthermore, the above models can be trained in two modes. One is to train and test the model on a single dataset; the other is to apply a joint training approach that trains the model on all three datasets and tests it on a single dataset. For example, SNIPS means we train and test the baseline on the SNIPS dataset, and SNIPS (joint) means we train the baseline on all three datasets but test it on SNIPS. The above baselines were evaluated by Krone et al. (2020), and we directly reuse their reported results. As the second training mode is time-consuming, we train our proposed model with the first mode only.
In addition, we compare with the latest method Retriever (Yu et al., 2021), which is a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective.

Implementation Details
Parameter Settings In this paper, the dimension of the hidden state is set to 1536 (d = 1536). We freeze 6 layers of BERT and train all models with the AdamW optimizer (Loshchilov and Hutter, 2019), an initial learning rate of 1 × 10^{-4}, and a dropout ratio of 0.1. All models are trained for 30 epochs. The hyperparameters λ, γ and δ are determined by grid search in the range (0, 1). The hyperparameter τ is set to 0.1 throughout.

Evaluation Metrics
We evaluate the performance of intent classification and slot filling with accuracy (Acc) and F1 score (F1), respectively.

Main Results

Table 3 summarizes the average IC accuracy over 100 test episodes when the maximum support set size U_max = 20, where the top 2 results are highlighted in bold. We make the following observations. (1) Compared with the baselines that use the same word embeddings (BERT), our framework (w, w) improves upon the strong baseline BERT+Proto by nearly 4%, 22% and 10% on SNIPS, ATIS and TOP respectively, which shows the superiority of our proposed model. (2) Compared with all the baselines, our framework (w, w) improves upon the strong baseline ELMo+Proto by nearly 15% and 12% on ATIS and TOP respectively. (3) On the SNIPS dataset, our framework (w, w) performs slightly worse than ELMo+Fine-tune with the joint training mode. This is because ELMo+Fine-tune with the joint training mode trains the model on all three datasets, whereas our framework only trains on SNIPS. In addition, the ELMo word embeddings seem more suitable for SNIPS.

Table 4 shows the average IC accuracy over 100 test episodes when U_max = 100, where the top 2 results are highlighted in bold. We make similar observations. (1) Our framework (w, w) performs the best compared with the baselines that use the same word embeddings. (2) Except for SNIPS, on which ELMo+Fine-tune and ELMo+Proto obtain the best two results, our framework (w, w) always performs better than the other baselines.

Table 5 and Table 6 summarize the average SF F1 score over 100 test episodes when U_max = 20 and U_max = 100 respectively, where the top 2 results are highlighted in bold. It can be seen that (1) compared with the baselines that use the same word embeddings (BERT), our framework (w, w) performs the best on all datasets.
(2) Compared with all the baselines, our framework (w, w) also obtains satisfactory performance in most cases.

Ablation Study
Explicit-Joint Learning To verify the effectiveness of the slot-attention-based intent representation and the intent-attention-based slot representation, we conduct an ablation study. The results when U_max = 20 are shown in Table 7. In the only slot-to-intent model, we replace the intent-attention-based slot representation with the pure slot representation; similarly, in the only intent-to-slot model, we replace the slot-attention-based intent representation with the pure intent representation. From the results, our framework (o, o) performs better than the other two baselines, which demonstrates the effectiveness of extracting intent and slot representations via bidirectional interaction.
Supervised Contrastive Learning Our proposed objective function includes cross-entropy (CE) terms from the prototypical network and supervised contrastive learning (SCL) terms; the latter aim to push samples in the same class closer together and samples in different classes further apart. By comparing the results of our framework (w, o) with our framework (o, o) in Table 3 and Table 4, we see that the term L^IC_scl brings nearly a 0.1% ∼ 4.3% improvement in IC accuracy. By comparing the results of our framework (w, w) with our framework (w, o) in Table 5 and Table 6, we see that the term L^SF_scl brings nearly a 0.6% ∼ 2.9% improvement in SF F1 score. These improvements demonstrate the effectiveness of the SCL loss for both the IC and SF tasks. Figure 2 visualizes the distribution of sentence embeddings on the TOP dataset, where data points with the same color come from the same class. The original distribution is random (Pic.1); CE separates the data of different classes to some extent (Pic.2); and the SCL term further encourages more compact clustering of data points within the same class (Pic.3, which shows sentence embeddings after training with both CE and SCL).

Conclusion
In this paper, we propose a new and practicable framework for few-shot intent classification and slot filling. The performance gains of our method come from two aspects: explicit-joint learning and supervised-contrastive learning. Through explicit-joint learning, we effectively utilize the close relationship between the IC and SF tasks. Through supervised-contrastive learning, we obtain more class-indicative representations. We thoroughly evaluate our framework on few-shot IC and SF tasks and achieve impressive performance on three public datasets: SNIPS, ATIS and TOP. In future work, we plan to explore more explicit-joint learning strategies and extend our framework to handle multi-intent classification.