Effectiveness of Pre-training for Few-shot Intent Classification

This paper investigates the effectiveness of pre-training for few-shot intent classification. While existing paradigms commonly further pre-train language models such as BERT on a vast unlabeled corpus, we find it highly effective and efficient to simply fine-tune BERT with a small set of labeled utterances from public datasets. Specifically, fine-tuning BERT with roughly 1,000 labeled utterances yields a pre-trained model, IntentBERT, which can easily surpass the performance of existing pre-trained models for few-shot intent classification on novel domains with very different semantics. The high effectiveness of IntentBERT confirms the feasibility and practicality of few-shot intent detection, and its high generalization ability across different domains suggests that intent classification tasks may share a similar underlying structure, which can be efficiently learned from a small set of labeled data. The source code can be found at https://github.com/hdzhang-code/IntentBERT.


Introduction
Task-oriented dialogue systems have been widely deployed in a variety of sectors (Yan et al., 2017; Zhang et al., 2020c; Hosseini-Asl et al., 2020), ranging from shopping (Yan et al., 2017) to medical services (Arora et al., 2020a; Wei et al., 2018), to provide an interactive experience. Training an accurate intent classifier is vital for the development of such task-oriented dialogue systems. However, an important issue is how to achieve this when only a limited number of labeled instances are available, which is often the case at the early development stage.
To tackle few-shot intent detection, some recent attempts employ induction networks (Geng et al., 2019), generation-based methods (Xia et al., 2020a,b), metric learning (Nguyen et al., 2020), or self-training (Dopierre et al., 2020). These works mainly focus on designing novel algorithms for representation learning and inference, which often come with complicated models. Most recently, large-scale pre-trained language models such as BERT (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020) have shown great promise in many natural language understanding tasks (Wang et al., 2019), and there has been a surge of interest in fine-tuning pre-trained language models for intent detection (Zhang et al., 2020a,b; Peng et al., 2020; Larson et al., 2019).
While fine-tuning pre-trained language models on large-scale annotated datasets has yielded significant improvements in many tasks including intent detection, it is laborious and expensive to construct large-scale annotated datasets in new application domains. Therefore, recent efforts have been dedicated to adapting pre-trained language models to a specific task such as intent detection by conducting continued pre-training (Gururangan et al., 2020;Gu et al., 2021) on a large unlabeled dialogue corpus with a specially designed optimization objective. Below we summarize the most related works in this line of research for few-shot intent detection.
• TOD-BERT further pre-trains BERT on a task-oriented dialogue corpus of 100,000 unlabeled samples with masked language modelling (MLM) and response contrastive objectives.
• USE-ConveRT investigates a dual encoder model trained with response selection tasks on 727 million input-response pairs.
• DNNC (Zhang et al., 2020a) pre-trains a language model with around 1 million annotated samples for natural language inference (NLI) and uses the pre-trained model for intent detection.
While these methods have achieved impressive performance, they rely heavily on the existence of a large-scale corpus (Mehri et al., 2020) that is close in semantics to the target domain or consists of similar tasks for continued pre-training, which requires a huge data collection effort and comes at a high computational cost. More importantly, they completely ignore the "free lunch": the publicly available, high-quality, manually annotated intent detection benchmarks. For example, the dataset OOS (Larson et al., 2019) provides labeled utterances across 10 different domains. Hence, our study in this paper centers around the following research question:
• Is it possible to utilize publicly available datasets to pre-train an intent detection model that can learn transferable task-specific knowledge to generalize across different domains?
In this paper, we provide an affirmative answer to this question. We fine-tune BERT using simple standard supervised training with approximately 1,000 labeled utterances from public datasets and obtain a pre-trained model, called IntentBERT. It can be directly applied to few-shot intent classification on a target domain that is drastically different from the pre-training data, and it significantly outperforms existing pre-trained models without further fine-tuning on target data (labeled or unlabeled). This simple "free-lunch" solution not only confirms the feasibility and practicality of few-shot intent detection, but also provides a ready-to-use, well-performing model for practical use, saving effort in algorithm design and data collection. Moreover, the high generalization ability of IntentBERT on cross-domain few-shot classification tasks, which are generally considered very difficult due to large domain gaps and the scarcity of data, suggests that most intent detection tasks probably share a common underlying structure that could be learned from a small set of data.
Further, to leverage unlabeled data in the target domain, we design a joint pre-training scheme, which simultaneously optimizes the classification error on the source labeled data and the language modeling loss on the target unlabeled data. This joint-training scheme can learn better semantic representations and significantly outperforms existing two-stage pre-training methods (Gururangan et al., 2020). A visualization of the embedding spaces produced by strong baselines and our methods is provided in Fig. 1, which clearly demonstrates the superiority of our pre-trained models.

Methodology
We present a continued pre-training framework for intent classification based on the pre-trained language model BERT (Devlin et al., 2019).
Our pre-training method relies on the existence of a small labeled dataset $\mathcal{D}^{\text{labeled}}_{\text{source}} = \{(x_i, y_i)\}$, where $y_i$ is the label of utterance $x_i$. Such data samples can be readily obtained from public intent detection datasets such as OOS (Larson et al., 2019) and HWU64 (Liu et al., 2021). As will be shown in the experiments, roughly 1,000 examples from either OOS or HWU64 are enough for the pre-trained intent detection model to achieve superior performance on drastically different target domains such as "Covid-19".
We further consider a scenario in which unlabeled utterances $\mathcal{D}^{\text{unlabeled}}_{\text{target}} = \{x_i\}$ in the target domain are available, and propose a joint pre-training scheme that is empirically shown to be highly effective.

Supervised Pre-training
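We employ a simple method to fine-tune BERT. Specifically, a linear layer is attached on top of BERT as the classifier, i.e.,

$$p(y \mid h_i) = \mathrm{softmax}(W h_i + b) \in \mathbb{R}^{N}, \tag{1}$$

where $h_i \in \mathbb{R}^{d}$ is the feature representation of $x_i$ given by the [CLS] token, and $W \in \mathbb{R}^{N \times d}$ and $b \in \mathbb{R}^{N}$ are the parameters of the linear layer. The model parameters $\theta = \{\phi, W, b\}$, with $\phi$ being the parameters of BERT, are trained on $\mathcal{D}^{\text{labeled}}_{\text{source}}$ with a cross-entropy loss:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}_{\text{ce}}(\mathcal{D}^{\text{labeled}}_{\text{source}}; \theta). \tag{2}$$

After training, the fine-tuned BERT is expected to have learned general intent detection skills, and hence we call it IntentBERT.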
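As a numerical illustration of the linear classifier and cross-entropy loss described above, the following minimal sketch computes class probabilities and the mean loss for a toy batch. Plain Python lists stand in for the [CLS] features; the function names and toy values are illustrative assumptions and are not taken from the released code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_head(h, W, b):
    """Class probabilities softmax(W h + b) for one feature vector h.

    h: list of d floats (stand-in for the [CLS] representation)
    W: N x d weight matrix as a list of rows; b: list of N biases
    """
    logits = [sum(w_j * h_j for w_j, h_j in zip(row, h)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)

def cross_entropy(features, labels, W, b):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(linear_head(h, W, b)[y])
              for h, y in zip(features, labels)]
    return sum(losses) / len(losses)

# Toy example: d = 2 features, N = 3 intent classes (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
features = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
probs = linear_head(features[0], W, b)
loss = cross_entropy(features, labels, W, b)
```

In actual pre-training, the gradient of this loss would be propagated through both the linear layer and the encoder, so that the encoder parameters are updated as well.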

Joint Pre-training
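Given unlabeled target data $\mathcal{D}^{\text{unlabeled}}_{\text{target}}$, we can leverage it to further enhance our IntentBERT by simultaneously optimizing a language modeling loss on $\mathcal{D}^{\text{unlabeled}}_{\text{target}}$ and the supervised loss in Eq. (2). The language modeling loss can help to learn semantic representations of the target domain while preventing overfitting to the source data.
Specifically, we use MLM as the language modeling loss, in which a proportion of input tokens are masked with the special token [MASK] and the model is trained to recover the masked tokens. The joint training loss is formulated as:

$$\mathcal{L}_{\text{joint}} = \mathcal{L}_{\text{ce}}(\mathcal{D}^{\text{labeled}}_{\text{source}}; \theta) + \lambda \, \mathcal{L}_{\text{mlm}}(\mathcal{D}^{\text{unlabeled}}_{\text{target}}; \theta), \tag{3}$$

where $\mathcal{L}_{\text{ce}}$ is the supervised cross-entropy loss, $\mathcal{L}_{\text{mlm}}$ is the MLM loss, and $\lambda$ is a hyperparameter that balances the supervised loss and the unsupervised loss.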
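The two ingredients of joint pre-training, token masking and the weighted sum of the two losses, can be sketched as follows. This is a simplified illustration: the masking probability of 0.15 is an assumed default that the text does not state, every selected token is replaced by [MASK] (BERT's original recipe also substitutes random or unchanged tokens), and the loss values are hypothetical placeholders rather than model outputs.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random subset of tokens with [MASK].

    Returns the corrupted sequence and a dict {position: original token}
    that the model would be trained to recover.
    mask_prob=0.15 is an assumption, not a value given in the text.
    """
    rng = rng or random.Random()
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

def joint_loss(ce_loss, mlm_loss, lam=1.0):
    """Supervised loss on source data plus lam times the MLM loss on target data."""
    return ce_loss + lam * mlm_loss

# Hypothetical target-domain utterance and placeholder loss values.
tokens = "how do i reset my card pin".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.5, rng=random.Random(0))
total = joint_loss(ce_loss=0.7, mlm_loss=2.1, lam=1.0)
```

In each training step, the supervised term would be computed on a batch of labeled source utterances and the MLM term on a batch of masked target utterances, with a single backward pass through the summed loss.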

Few-shot Intent Classification
After pre-training, the parameters of IntentBERT are fixed, and it can be immediately used as a feature extractor for novel few-shot intent classification tasks. The classifier can be a parametric one such as logistic regression or a non-parametric one such as nearest neighbor. A parametric classifier will be trained with the few labeled examples provided in a task and make predictions on the unlabeled queries. As will be shown in the experiments, a simple linear classifier suffices to achieve very good performance, thanks to the effective utterance representations produced by IntentBERT.

Experimental Setup
Datasets. To train our IntentBERT, we continue to pre-train BERT on either of the two datasets, OOS (Larson et al., 2019) and HWU64 (Liu et al., 2021), both of which contain multiple domains, providing rich resources to learn from. For evaluation, we employ three datasets: BANKING77 (Casanueva et al., 2020) is a fine-grained intent detection dataset focusing on "Banking"; MCID (Arora et al., 2020a) is a dataset for "Covid-19" chat bots; HINT3 (Arora et al., 2020b) contains 3 domains: "Mattress Products Retail", "Fitness Supplements Retail" and "Online Gaming". Dataset statistics are summarized in Table 2. Fig. 2 visualizes the vocabulary overlap between the source training data and target test data, which is calculated as the proportion of shared words in the combined vocabulary of any two datasets after removing stop words. The overlaps are quite small, indicating the existence of large semantic gaps.
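The vocabulary overlap reported in Fig. 2 (shared words divided by the combined vocabulary of two datasets, after removing stop words) can be sketched as below. Whitespace tokenization, lower-casing, the tiny stop-word list and the example utterances are all assumptions made for illustration; the text does not specify these details.

```python
def vocabulary(utterances, stop_words):
    """Set of lower-cased whitespace tokens, excluding stop words."""
    return {w for u in utterances for w in u.lower().split()
            if w not in stop_words}

def vocab_overlap(utts_a, utts_b, stop_words=frozenset()):
    """Shared words divided by the combined vocabulary of both datasets."""
    va = vocabulary(utts_a, stop_words)
    vb = vocabulary(utts_b, stop_words)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

# Hypothetical stop words and utterances.
STOP = {"i", "my", "to", "a", "the", "is", "do", "how"}
source = ["play my workout playlist", "set an alarm"]
target = ["how do i block my card", "set a new card pin"]
overlap = vocab_overlap(source, target, STOP)
```

Here the two toy vocabularies share only the word "set" out of ten distinct words, which mirrors the kind of small overlap the figure is meant to convey.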
Evaluation. The classification performance is evaluated on C-way K-shot tasks. For each task, we randomly sample C classes and K examples per class to train the classifier, and then sample 5 extra examples per class as queries for evaluation. The accuracy is averaged over 500 such tasks.
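The episodic evaluation protocol can be sketched as follows. To keep the example free of dependencies, it uses a nearest-neighbor classifier (the non-parametric option mentioned earlier) instead of the logistic regression used in the experiments, and synthetic 2-dimensional features instead of frozen encoder outputs; all names and toy data are assumptions.

```python
import random

def sample_episode(data, n_way, k_shot, n_query=5, rng=None):
    """Sample one C-way K-shot task from {label: [feature vectors]}.

    Returns (support, query) as lists of (feature, label) pairs with
    k_shot support and n_query query examples per sampled class.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(data[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query

def nearest_neighbor(x, support):
    """Label of the support example closest to x (squared Euclidean)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(support, key=lambda s: dist(x, s[0]))[1]

def evaluate(data, n_way, k_shot, n_tasks=500, seed=0):
    """Average query accuracy over n_tasks sampled episodes."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_tasks):
        support, query = sample_episode(data, n_way, k_shot, rng=rng)
        correct = sum(nearest_neighbor(x, support) == y for x, y in query)
        accs.append(correct / len(query))
    return sum(accs) / len(accs)

# Toy features: class c is clustered around (10 * c, 0). Real features
# would come from the frozen pre-trained encoder.
toy_rng = random.Random(1)
toy = {f"intent_{c}": [[10.0 * c + toy_rng.uniform(-1, 1), toy_rng.uniform(-1, 1)]
                       for _ in range(20)]
       for c in range(6)}
support, query = sample_episode(toy, n_way=5, k_shot=2, rng=random.Random(2))
acc = evaluate(toy, n_way=5, k_shot=2, n_tasks=50)
```

Because the toy clusters are far apart relative to their spread, every query is classified correctly; with real utterance features the averaged accuracy reflects the quality of the representations.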
Baselines. We compare IntentBERT to the following strong baselines. BERT-Freeze simply freezes the off-the-shelf BERT; TOD-BERT [...] (Liu et al., 2019) on fake intent detection data synthesized from wikiHow. All the baselines (except BERT-Freeze) adopt a second pre-training stage, but with different objectives and on different corpora. In our experiments, all the baselines (except DNNC) use logistic regression as the classifier. For DNNC, we strictly follow the original implementation to pre-train a BERT-style pairwise encoder to estimate the best matched training example for a query utterance.
Training details. We use BERT-base (the base configuration with d = 768) as the encoder, Adam (Kingma and Ba, 2015) as the optimizer, and the PyTorch library for implementation. The model is trained with Nvidia GeForce RTX 2080 Ti GPUs.
For supervised pre-training, we use a validation set to control early stopping and prevent overfitting. Specifically, we use HWU64 for validation when pre-training with OOS and vice versa. Training is stopped if no improvement in accuracy is observed in 3 epochs. For joint pre-training, λ is set to 1 and the number of training epochs is fixed to 10, since joint pre-training is not prone to overfitting.
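The early-stopping rule for supervised pre-training (stop when validation accuracy has not improved for 3 epochs) can be sketched as a small helper. The function name and the accuracy history are hypothetical; whether ties count as improvements is not stated in the text, and this sketch assumes they do not.

```python
def early_stop_epoch(val_accuracies, patience=3):
    """Return the 1-indexed epoch at which training stops.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation accuracy seen so far; if that
    never happens, the last epoch is returned.
    """
    best, since_best = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_accuracies)

# Hypothetical validation accuracies per epoch.
history = [0.61, 0.70, 0.74, 0.73, 0.74, 0.72, 0.71]
stop = early_stop_epoch(history)
```

With this history the best accuracy is reached at epoch 3, and epochs 4 to 6 bring no improvement, so training stops at epoch 6.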

Main Results
The main results are provided in Table 1. First, IntentBERT (pre-trained with either OOS or HWU64) outperforms all the baselines by a significant margin in most cases. Taking 5-way 2-shot classification on MCID as an example, IntentBERT (OOS) outperforms the strongest baseline, CONVBERT, by an absolute margin of 9.4%, demonstrating the high effectiveness of our pre-training method. The cross-domain transferability of IntentBERT indicates that despite semantic domain gaps, most intent detection tasks probably share a similar underlying structure, which could be learned with a small set of labeled utterances. Second, IntentBERT (OOS) seems to be more effective than IntentBERT (HWU64), which may be due to the semantic diversity of its training corpus. Nevertheless, the small difference in performance between them shows that our pre-training method is not sensitive to the training corpus.
Finally, our proposed joint pre-training scheme (Section 2.2) achieves significant improvement over IntentBERT (up to 9.2% absolute margin), showing the high effectiveness of joint pre-training when target unlabeled data is accessible. Our joint pre-training scheme can also be applied to other language models such as GPT-2 (Radford et al., 2019) and ELMo (Peters et al., 2018), which is left as future work.

Analysis
Amount of labeled data for pre-training. We reduce the data used for pre-training in two dimensions: the number of domains and the number of samples per class. We randomly sample 1, 2, 4 and 8 domains multiple times and report the averaged results in Fig. 3. We find that the training data can be dramatically reduced without harming the performance. The model trained on 4 domains and 20 samples per class performs on par with that trained on 8 domains and 150 samples per class. In general, we only need around 1,000 annotated utterances to train IntentBERT, which can be easily obtained from public datasets. This finding indicates that using small task-relevant data for pre-training may be a more effective and efficient fine-tuning paradigm.
Amount of unlabeled data for joint pre-training. We randomly sample a fraction of unlabeled utterances and re-run the joint training. As shown in Fig. 4, the accuracy keeps increasing when the number of unlabeled samples grows from 10 to 1,000 and tends to saturate after reaching 1,000. Surprisingly, 1,000 utterances in BANKING77 can yield performance comparable to that of the full dataset (13,083 utterances). Generally, joint pre-training does not need much unlabeled data to reach a high accuracy.
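To make the scale of this reduction concrete, the following back-of-the-envelope sketch counts utterances for the two configurations compared above. The number of classes per domain is not stated in the text; the value of 15 is an assumption based on the layout of the public OOS dataset, and the helper name is hypothetical.

```python
# Assumption: 15 intent classes per domain (OOS layout); not stated in the text.
CLASSES_PER_DOMAIN = 15

def n_utterances(n_domains, samples_per_class,
                 classes_per_domain=CLASSES_PER_DOMAIN):
    """Total labeled utterances for a given pre-training configuration."""
    return n_domains * classes_per_domain * samples_per_class

reduced = n_utterances(4, 20)    # 4 domains, 20 samples per class
full = n_utterances(8, 150)      # 8 domains, 150 samples per class
fraction = reduced / full
```

Under this assumption the reduced configuration uses 1,200 utterances, on the order of the "around 1,000" quoted above, against 18,000 for the full configuration, i.e. under 7% of the data.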
Ablation study on joint pre-training. First, we investigate a two-stage pre-training scheme (Gururangan et al., 2020) where we use BERT or IntentBERT as initialization and perform MLM in the target domain (the top two rows in Table 3). Both variants perform much worse than our joint pre-training scheme (the bottom row). Second, we use the source data instead of the target data for MLM in joint pre-training (the third row) and observe consistent performance drops, which shows the necessity of a domain-specific corpus.
Figure 4: Effect of the amount of unlabeled data used for joint pre-training in the target domain. The results are evaluated on 5-way 2-shot tasks with OOS as the source dataset.
Table 3: Ablation study on joint pre-training. BANK denotes BANKING77. → denotes moving to the next training stage. + denotes joint optimization of both loss functions. The data used for the experiment (either from "target" or "source") is shown in the brackets. The results are evaluated on 5-way 2-shot tasks with OOS as the source dataset.

Conclusion
We have proposed IntentBERT, a pre-trained model for few-shot intent classification, which is obtained by fine-tuning BERT on a small set of publicly available labeled utterances. We have shown that using small task-relevant data for fine-tuning is far more effective and efficient than the current practice of fine-tuning on a large labeled or unlabeled dialogue corpus. This finding may have wide implications for other tasks besides intent detection.