Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning

In this work, we focus on a more challenging few-shot intent detection scenario where many intents are fine-grained and semantically similar. We present a simple yet effective few-shot intent detection schema via contrastive pre-training and fine-tuning. Specifically, we first conduct self-supervised contrastive pre-training on collected intent datasets, which implicitly learns to discriminate semantically similar utterances without using any labels. We then perform few-shot intent detection together with supervised contrastive learning, which explicitly pulls utterances from the same intent closer and pushes utterances across different intents farther. Experimental results show that our proposed method achieves state-of-the-art performance on three challenging intent detection datasets under 5-shot and 10-shot settings.


Introduction
Intent detection, which aims to identify intents from user utterances, is a key component in task-oriented dialog systems. In real systems such as Amazon Alexa, correctly identifying user intents is crucial for downstream tasks (Ham et al., 2020). A practical challenge is data scarcity: it is expensive to annotate enough examples for emerging intents, so accurately identifying intents with only a few labeled examples has attracted increasing attention.
Existing methods address few-shot intent detection mainly from two perspectives: (1) data augmentation and (2) task-adaptive training with pre-trained models. For the first category, Zhang et al. (2020a) and Mehri et al. (2020b) propose a nearest-neighbor classification schema that makes full use of the limited training examples in both the training and inference stages. Xia et al. (2020b) and Peng et al. (2020) propose to generate utterances for emerging intents based on a variational autoencoder (Kingma and Welling, 2013) and GPT-2 (Radford et al., 2019), respectively. For the second category, Casanueva et al. (2020) and Mehri et al. (2020a) conduct intent detection by leveraging related conversational models pre-trained on a few hundred million conversations. Meanwhile, they devise a task-adaptive training schema where the model is pre-trained with masked language modeling on all relevant intent datasets or the target intent dataset. (* Work done while the first author was an intern at Adobe Research.)
However, previous methods such as data-augmentation-based models (Liu et al., 2021c) are inefficient to train and hard to scale to tasks with many intents. Moreover, these models do not handle a common real-world scenario well: few-shot intent detection becomes more challenging when there are many fine-grained, semantically similar intents. For instance, BANKING77 (Casanueva et al., 2020) has a single domain with 77 intents, and CLINC150 (Larson et al., 2019) has ten domains with 150 intents; many intents in these datasets are similar to each other. Training models is therefore rather challenging when only limited examples are available. Inspired by the recent success of contrastive learning (He et al., 2020; Gunel et al., 2020; Radford et al., 2021; Liu et al., 2021a; Gao et al., 2021; Liu et al., 2021b), which aims to enhance the discrimination abilities of models, this work proposes improving few-shot intent detection via Contrastive Pre-training and Fine-Tuning (CPFT). Intuitively, we first learn to implicitly discriminate semantically similar utterances via contrastive self-supervised pre-training on intent datasets without using any intent labels. We then jointly perform few-shot intent detection and supervised contrastive learning. The supervised contrastive learning helps the model explicitly learn to pull utterances from the same intent close and push utterances from different intents apart.
Our contributions are summarized as follows: 1) We design a simple yet effective few-shot intent detection schema via contrastive pre-training and fine-tuning. 2) Experimental results verify the state-of-the-art performance of CPFT on three challenging datasets under 5-shot and 10-shot settings.

Related Work
Since this work is related to few-shot intent detection and contrastive learning, we review recent work from both areas in this section. The few-shot intent detection task typically covers three scenarios: (1) learn an intent detection model with only K examples for each intent (Zhang et al., 2020a; Mehri et al., 2020a; Casanueva et al., 2020); (2) jointly detect emerging intents with only K examples alongside existing intents with sufficient annotations (Xia et al., 2020a,b, 2021a); (3) learn to identify both in-domain and out-of-scope queries with only K examples for each intent (Zhang et al., 2020a, 2021; Xia et al., 2021b). In this work, we focus on the first scenario, for which several methods have been proposed. Specifically, Zhang et al. (2020a) propose a data augmentation schema that pre-trains a model on annotated pairs from natural language inference (NLI) datasets and uses a nearest-neighbor classification schema to transfer this knowledge to classifying user intents. However, the training is expensive and hard to scale to tasks with hundreds of intents (Liu et al., 2020). Mehri et al. (2020b) and Casanueva et al. (2020) propose task-adaptive training, which leverages models pre-trained on a few hundred million dialogues to tackle few-shot intent detection; it also includes an unsupervised masked language modeling loss on the target intent datasets and shows promising improvements.
Contrastive learning has shown superior performance in various domains, such as visual representation (He et al., 2020; Radford et al., 2021), graph representation (Qiu et al., 2020; You et al., 2020), and recommender systems (Liu et al., 2021b). Moreover, recent works also adopt contrastive learning in natural language processing tasks (Gunel et al., 2020; Liu et al., 2021a; Gao et al., 2021), employing it to train the encoder. Specifically, Gunel et al. (2020) design a supervised contrastive learning loss for fine-tuning. Gao et al. (2021) design a simple contrastive learning framework based on dropout, which shows state-of-the-art performance on unsupervised and full-shot supervised semantic textual similarity tasks. Liu et al. (2021a) design the self-supervised Mirror-BERT framework with two types of data augmentation: randomly erasing or masking parts of the input texts, and feature-level augmentation through dropout.
Our work differs from them in several respects: first, we specifically tackle the few-shot intent detection task rather than general full-shot learning; second, we design a schema that employs contrastive learning in both the self-supervised pre-training and supervised fine-tuning stages.

CPFT Methodology
We consider a few-shot intent detection task with C user intents, where the goal is to classify a user utterance u into one of the C classes. We adopt balanced K-shot learning for each intent (Zhang et al., 2020a; Casanueva et al., 2020), i.e., each intent includes only K examples in the training data. As such, there are C · K training examples in total.
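For concreteness, the balanced K-shot split described above can be built as in the following minimal sketch (a hypothetical helper, not the authors' code; `examples` is assumed to be a list of (utterance, intent) pairs):

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed=0):
    """Build a balanced K-shot training set: exactly k utterances
    per intent, so the result contains C * k examples in total."""
    by_intent = defaultdict(list)
    for utterance, intent in examples:
        by_intent[intent].append(utterance)
    rng = random.Random(seed)
    shot = []
    for intent in sorted(by_intent):
        # sample k utterances without replacement for this intent
        for utterance in rng.sample(by_intent[intent], k):
            shot.append((utterance, intent))
    return shot
```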
In the following section, we first describe the self-supervised contrastive pre-training for utterance understanding before introducing the supervised fine-tuning for few-shot intent detection.

Self-supervised Pre-training
We retrieve the feature representation h_i for the i-th user utterance through an encoder model, which in this paper is BERT (Devlin et al., 2019), i.e., h_i = BERT(u_i). We implicitly learn sentence-level utterance understanding and discriminate semantically similar utterances through the self-supervised contrastive learning method (Liu et al., 2021a; Gao et al., 2021):

L_uns-cl = -(1/N) Σ_{i=1}^{N} log [ exp(sim(h_i, h̄_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h̄_j)/τ) ]

where N is the number of sentences in a batch, τ is a temperature parameter that controls the penalty on negative samples, and sim(h_i, h̄_i) denotes the cosine similarity between the two input vectors h_i and h̄_i. Here h̄_i represents the representation of sentence ū_i, which is the same sentence as u_i but with a few (10%) tokens randomly masked (Devlin et al., 2019). Specifically, we dynamically mask tokens during batch training, i.e., a sentence has different masked positions across different training epochs, which we find beneficial to utterance understanding. The sentences u_i and ū_i are input together to a single encoder during batch training (Gao et al., 2021). Besides the sentence-level enhancement, we also add the masked language modeling loss (Devlin et al., 2019) to enhance token-level utterance understanding:

L_mlm = -(1/M) Σ_{m=1}^{M} log P(x_m)

where P(x_m) denotes the predicted probability of a masked token x_m over the full vocabulary, and M is the number of masked tokens in each batch. Our total loss for each batch is L_stage1 = L_uns-cl + λ L_mlm, where λ is a weight hyper-parameter.
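The self-supervised contrastive objective can be sketched numerically as follows (a minimal NumPy sketch under our own simplifications, not the authors' implementation: `h` and `h_bar` stand in for the BERT representations of the original and 10%-masked utterances, which are not computed here):

```python
import numpy as np

def unsup_contrastive_loss(h, h_bar, tau=0.1):
    """L_uns-cl over a batch of N utterances: each representation h_i is
    pulled toward the representation of its masked copy (h̄_i) and pushed
    away from the masked copies of the other N - 1 utterances."""
    # row-normalize so dot products become cosine similarities
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    h_bar = h_bar / np.linalg.norm(h_bar, axis=1, keepdims=True)
    sim = (h @ h_bar.T) / tau  # (N, N); positive pairs lie on the diagonal
    # log-softmax over each row, then average the diagonal (positive) terms
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

A lower loss means each utterance sits closer to its own masked view than to the other utterances in the batch; a smaller τ sharpens the softmax and thus the penalty on hard negatives.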

Supervised Fine-tuning
Through self-supervised learning in the first stage, the model efficiently utilizes many unlabeled user utterances. In the second stage, the model is given very limited examples, such as 5 or 10 examples for each intent. To better understand user intents, especially when intents are similar to each other, we utilize a supervised contrastive learning method (Gunel et al., 2020) and train it together with an intent classification loss. We treat two utterances from the same class as a positive pair and two utterances from different classes as a negative pair for contrastive learning. Unlike previous work, an utterance and itself can also form a positive pair, as we feed both copies together into the single encoder and their feature representations differ due to the dropout in BERT. The corresponding loss is:

L_s-cl = -(1/T) Σ_{i=1}^{N} Σ_{j=1}^{N} 1[y_i = y_j] log [ exp(sim(h_i, h_j)/τ) / Σ_{k=1}^{N} exp(sim(h_i, h_k)/τ) ]

where y_i is the intent label of utterance u_i, 1[·] is the indicator function, and T is the number of same-class pairs in the batch. Next is the intent classification loss:

L_intent = -(1/N) Σ_{i=1}^{N} log P(C_{y_i} | u_i)

where P(C_j | u_i) is the predicted probability of the i-th sentence belonging to the j-th intent class. We jointly train the two losses at each batch: L_stage2 = L_intent + λ' L_s-cl, where λ' is a weight hyper-parameter.
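The fine-tuning losses can be sketched similarly (again a NumPy sketch under our own simplifications: the paper obtains two dropout-based views of each utterance from the encoder, whereas here each utterance contributes a single row, so the self-pair degenerates to an exact match):

```python
import numpy as np

def sup_contrastive_loss(h, labels, tau=0.1):
    """Supervised contrastive loss: average -log softmax similarity over
    the T same-class pairs (i, j) in the batch, pulling same-intent
    utterances together and pushing different-intent utterances apart."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = (h @ h.T) / tau
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    positive = labels[:, None] == labels[None, :]  # same-intent indicator
    return -(log_prob[positive]).sum() / positive.sum()

def intent_loss(probs, labels):
    """Intent classification loss: cross-entropy over the predicted
    class probabilities P(C_j | u_i), one row per utterance."""
    probs = np.asarray(probs)
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels]))
```

The stage-two objective then combines these two terms using the weight hyper-parameter, as in the text above.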
Evaluation Datasets

To better study the more challenging fine-grained few-shot intent detection problem and compare with recent state-of-the-art baselines, we select three challenging intent detection datasets for evaluation: CLINC150 (Larson et al., 2019), BANKING77 (Casanueva et al., 2020), and HWU64 (Liu et al., 2019). CLINC150 contains 23,700 utterances across ten different domains, with 150 intents in total. BANKING77 contains 13,083 utterances from a single banking domain with 77 intents. HWU64 includes 25,716 utterances with 64 intents spanning 21 domains. We follow the setup of Mehri et al. (2020a), where a small portion of the training set is separated as a validation set and the test set is unchanged. Following previous work, we repeat our few-shot model training five times and report the average accuracy.

Model Training and Baselines
We utilize RoBERTa with the base configuration, i.e., roberta-base, as the encoder. In the contrastive pre-training stage, we pre-train on the combined intent datasets (excluding their test sets) for 15 epochs, with the batch size set to 64, τ to 0.1, and λ to 1.0. The pre-training phase takes around 2.5 hours on a single NVIDIA Tesla V100 GPU with 32GB memory. We then fine-tune the model under 5-shot and 10-shot settings. Table 2 reports testing accuracy on CLINC150, BANKING77, and HWU64 under both settings for our model and the baselines, which include RoBERTa+Classifier (Zhang et al., 2020a) and the following. CONVBERT+Combined (Mehri et al., 2020b): an intent detection model based on CONVBERT, with example-driven training based on similarity matching and observers for transformer attention; it also conducts task-adaptive self-supervised learning with masked language modeling (MLM) on the intent detection datasets, and Combined denotes the best MLM+Example+Observers setting in the referenced paper. DNNC (Zhang et al., 2020a): a discriminative nearest-neighbor model that finds the best-matched example from the training set through similarity matching; it conducts data augmentation during training and boosts performance by pre-training on three natural language inference tasks.

Experimental Results
We show the overall comparisons on the three datasets in Table 2. The proposed CPFT method achieves the best performance across all datasets under both the 5-shot and 10-shot settings. Specifically, CPFT outperforms DNNC by 1.32% and 1.57% on CLINC150 and HWU64 under the 5-shot setting, respectively, and improves over DNNC by more than 2% on BANKING77. These improvements indicate that our proposed method discriminates semantically similar intents better than a strong discriminative nearest-neighbor model with data augmentation. Moreover, DNNC training is expensive: on a single NVIDIA Tesla V100 GPU with 32GB memory, DNNC takes more than 3 hours for 10-shot learning on CLINC150, and it must be retrained for every new setting. CPFT only needs 2.5 hours for one-time pre-training, and fine-tuning takes only five minutes for each new setting. Compared with CONVBERT+MLM, which performs self-supervised pre-training with MLM on the intent detection datasets, CPFT improves performance by 1.43%, 3.21%, and 2.61% on CLINC150, BANKING77, and HWU64 under the 10-shot setting, respectively. CPFT also outperforms CONVBERT+Combined, which further adds example-driven training and a specific transformer attention design. We attribute the performance improvements to contrastive learning, which helps the model discriminate semantically similar intents.

Ablation Study and Analysis
Is the schema with both stages necessary? We conduct an ablation study to investigate the effects of self-supervised contrastive pre-training and supervised contrastive fine-tuning. Table 3 shows the testing accuracy (×100%) of CPFT and its variants on the three datasets under 5-shot and 10-shot settings. Experimental results indicate that both stages are necessary to achieve the best performance. The self-supervised contrastive pre-training in the first stage is essential, as performance drops significantly on all datasets without it. We hypothesize that contrastive pre-training on the intent datasets without using labels benefits the discrimination of semantically similar utterances. Performance also drops without supervised contrastive learning during the few-shot fine-tuning stage; specifically, it drops by 2% on BANKING77 under the 5-shot setting. The reason is that BANKING77 is a single-domain dataset with many similar intents, where supervised contrastive learning can explicitly discriminate semantically similar intents with very limited training examples. We also jointly train the first and second stages together and, compared with the proposed CPFT schema, observe minimal improvements. The joint training is also costly, as it requires retraining the model for every new setting.
Is contrastive pre-training beneficial to the target intent dataset? We additionally study whether contrastive pre-training benefits intent detection when the target dataset is excluded. Specifically, we pre-train the model on all datasets except HWU64 in the first stage and perform few-shot learning on HWU64 in the second stage. Compared to the model without contrastive pre-training in the first stage, performance improves by 1.98% and 1.21% under the 5-shot and 10-shot settings, respectively. The improvements indicate that contrastive pre-training helps transfer knowledge to new datasets. However, performance still drops compared to contrastive pre-training that includes the HWU64 dataset, which shows that it is beneficial to include the target dataset during self-supervised contrastive learning. We leave the question of whether self-supervised contrastive pre-training on the target intent dataset alone is beneficial to future work.
Is the training sensitive to hyper-parameters? We also study the effects of the contrastive learning hyper-parameters, i.e., the temperature τ and the weight λ'. We set τ ∈ {0.05, 0.1, 0.3, 0.5} and λ' ∈ {0.01, 0.03, 0.05, 0.1}. In our preliminary experiments, τ does not have a notable influence during the self-supervised contrastive pre-training in the first stage, and we find that a batch size larger than 32 works well in the pre-training phase. However, during the few-shot fine-tuning stage, performance drops significantly when τ is set to a small value (0.05), which heavily penalizes hard negative examples, and λ' is set to a large value (0.1), which increases the weight of the supervised contrastive loss. In addition, the batch size influences performance in this stage. Therefore, the few-shot supervised contrastive loss is sensitive to hyper-parameters when there are limited training examples. We leave further studies to future work.

Conclusion
In this paper, we improve the performance of few-shot intent detection via contrastive pre-training and fine-tuning. Our method first conducts self-supervised contrastive pre-training on collected intent detection datasets without using any labels, where the model implicitly learns to separate fine-grained intents. It then performs few-shot fine-tuning with a joint intent classification loss and supervised contrastive learning loss, where the supervised contrastive loss encourages the model to distinguish intents explicitly. Experimental results on three challenging datasets show that our proposed method achieves state-of-the-art performance.