The Devil is in the Details: On Models and Training Regimes for Few-Shot Intent Classification

In task-oriented dialog (ToD), new intents emerge on a regular basis, with a handful of available utterances at best. This renders effective Few-Shot Intent Classification (FSIC) a central challenge for modular ToD systems. Recent FSIC methods appear to be similar: they use pretrained language models (PLMs) to encode utterances and predominantly resort to nearest-neighbor-based inference. However, they also differ in major components: they start from different PLMs, use different encoding architectures and utterance similarity functions, and adopt different training regimes. The coupling of these vital components, together with the lack of informative ablations, prevents the identification of the factors that drive the (reported) FSIC performance. We propose a unified framework to evaluate these components along the following key dimensions: (1) encoding architecture: Cross-Encoder vs. Bi-Encoder; (2) similarity function: parameterized (i.e., trainable) vs. non-parameterized; (3) training regime: episodic meta-learning vs. conventional (i.e., non-episodic) training. Our experimental results on seven FSIC benchmarks reveal three new important findings. First, the unexplored combination of the Cross-Encoder architecture and episodic meta-learning consistently yields the best FSIC performance. Second, episodic training substantially outperforms its non-episodic counterpart. Finally, we show that splitting episodes into support and query sets has a limited and inconsistent effect on performance. Our findings highlight the importance of ablations and fair comparisons in FSIC. We publicly release our code and data.


Introduction
Intent classification deals with assigning one label from a predefined set of classes, or intents, to user utterances. This task is vital for task-oriented dialog (ToD) systems, since the predicted intent of an utterance is an essential input to other modules (e.g., dialog management) in these systems (Ma et al., 2022; Louvan and Magnini, 2020; Razumovskaia et al., 2021). Although intent classification has been widely studied, it remains a challenge in settings where dialogue systems, including their intent classifiers, must be quickly adjusted to new domains and intent classes. The main challenge in training intent classifiers in such settings lies in the costly labeling of utterances (Zhang et al., 2022a; Wen et al., 2017; Budzianowski et al., 2018; Rastogi et al., 2020; Hung et al., 2022; Mueller et al., 2022). Few-shot intent classification (FSIC), which deals with adjusting intent classifiers to new intents given only a handful of labeled instances, is thus of paramount importance for ToD systems.
Various methods (§2) for FSIC have been proposed (Larson et al., 2019a; Casanueva et al., 2020a; Mehri et al., 2020; Krone et al., 2020; Casanueva et al., 2020b; Nguyen et al., 2020; Zhang et al., 2021; Dopierre et al., 2021; Zhang et al., 2022b). These methods are generally similar in that they utilize pretrained language models (PLMs) to encode utterances and resort to k-nearest-neighbor (kNN) inference: the label of a new instance is determined based on the labels of the instances with which it has the highest representational similarity, as encoded by the PLM. Despite these general similarities, FSIC methods differ in design choices across several crucial dimensions, including encoding architectures, utterance similarity scoring, and training regimes. These methods tie together what are, in principle, independent design decisions across these dimensions, hindering ablations and insights into what drives the (reported) FSIC performance.
In this work, we (1) induce a framework (PLM-based utterance encoding, utterance similarity scoring, and nearest-neighbor-based inference) that unifies most of the existing FSIC approaches (§3); and (2) focus on three key design decisions within this framework: (1) the model architecture for encoding utterances (or utterance pairs), where we contrast the less frequently adopted Cross-Encoder architecture against the more common Bi-Encoder architecture2 (e.g., Krone et al., 2020; Zhang et al., 2021); (2) the similarity function for scoring utterance pairs based on their joint or separate representations, contrasting parameterized (i.e., trainable) neural scoring components against cosine similarity as a simple non-parameterized scoring function; and (3) the training regime, comparing standard non-episodic training (adopted, e.g., by Zhang et al. (2021)) against episodic meta-learning (implemented, e.g., by Nguyen et al. (2020) or Krone et al. (2020)). Our framework lets us evaluate the impact of these three dimensions for different text encoders (e.g., BERT (Devlin et al., 2019) as a vanilla PLM and SimCSE (Gao et al., 2021) as a state-of-the-art sentence encoder) under the same evaluation setup (datasets, intent splits, evaluation protocols, and measures) while controlling for confounding factors that impede direct comparison between FSIC methods.
Our extensive experimental results on seven intent classification datasets reveal three new important findings. First, a Cross-Encoder coupled with episodic training, a previously unexplored FSIC combination, consistently yields the best performance across all datasets. Second, episodic meta-learning yields robust FSIC classifiers across the board: our results demonstrate that it is much more effective for FSIC than conventional non-episodic training. Finally, although episodic meta-learning entails splitting the utterances of an episode into a support and a query set during training, we show, for the first time, that this does not generally have a positive effect on FSIC performance.
In sum, our comparative evaluation of various design choices for the key components of modern FSIC approaches raises awareness of the importance of ablations and apples-to-apples comparisons between complex FSIC systems that conflate several key design decisions. We hope that our findings pave the way for more deliberation in research (and, in particular, evaluation) for this crucial ToD task.
2 Also known as Dual Encoder or Siamese Network.

Related Work
We focus on few-shot intent classification (FSIC) methods that infer the class of an utterance from the labels of its nearest neighbors (kNN), either directly in the representation space of the PLM or according to a trained scorer of utterance pairs. We first describe the existing FSIC inference paradigms and explain why we focus on kNN-based methods. We then categorize the literature on kNN-based FSIC approaches along the three key design dimensions.
Inference algorithms for FSIC. Classical methods (Xu and Sarikaya, 2013; Meng and Huang, 2018; Wang et al., 2019; Gupta et al., 2019) for FSIC use maximum likelihood inference, where a vector representation of an utterance is projected by the classifier into a probability distribution over the intent classes. Training such probability distribution functions, in particular when they are modeled by neural networks, typically requires a large number of utterances annotated with intent labels, which are infamously expensive to collect in scenarios where new intents emerge on a regular basis. More recent FSIC methods instead rely on pretrained language models, leveraging the language competence these models encode to alleviate the need to learn probability distributions over a large number of intent classes from only a few instances. These methods (Krone et al., 2020; Casanueva et al., 2020b; Nguyen et al., 2020; Zhang et al., 2021; Dopierre et al., 2021; Zhang et al., 2022b) exploit the similarities between utterance embeddings in the representation space of the (fine-tuned) PLM and infer the intents of new utterances from the labels of the nearest neighbors. Since kNN-based methods generally report state-of-the-art performance for FSIC, our comparative empirical evaluation focuses on the design choices of models that adopt this inference algorithm.
Model architectures for encoding utterance pairs. A central design decision within the kNN-based FSIC framework is the choice of the model architecture for encoding utterances. The majority of approaches (e.g., Krone et al., 2020; Zhang et al., 2021) leverage the Bi-Encoder architecture (Bromley et al., 1993; Reimers and Gurevych, 2019a; Zhang et al., 2022a). The core idea of Bi-Encoders is that, given a collection of utterances, each utterance is independently encoded by the PLM and mapped into a dense representation space. In this space, similarities between pairs of utterances can be computed with a parameterized (i.e., trainable) scoring function or with a non-parameterized function such as the dot product or cosine similarity. In contrast, some FSIC methods (e.g., Wang et al., 2021; Zhang et al., 2021) use the Cross-Encoder architecture, in which the two utterances are concatenated and encoded jointly by a pretrained text encoder such as BERT (Devlin et al., 2019). The idea is to represent a pair of utterances together using a PLM, where each utterance becomes a context for the other. A Cross-Encoder thus does not produce an embedding for a single utterance but for a pair of utterances. In general, Bi-Encoders are more computationally efficient than Cross-Encoders because they can cache the representations of candidate utterances. In return, Cross-Encoders, by allowing the tokens of one utterance to attend over the tokens of the other (and vice versa), better capture the semantic associations between utterances.
Similarity scoring function. A crucial component of nearest-neighbor-based methods for FSIC is the function that produces a similarity score for a pair of utterances. Along this dimension of analysis, we categorize FSIC methods into two groups: (1) approaches that use parameterized (i.e., trainable) neural layers to estimate the similarity score between utterances (e.g., Zhou et al., 2022); and (2) methods that rely on non-parameterized similarity metrics such as the dot product, cosine similarity, or Euclidean distance (Sauer et al., 2022; Zhang et al., 2022a; Krone et al., 2020; Zhang et al., 2022b; Xu et al., 2021; Zhang et al., 2021). Note that the Bi-Encoder architecture can be coupled with either, whereas the Cross-Encoder requires a parameterized scoring module.
Training strategy. To simulate FSIC, the best practice is to split an intent classification corpus into two disjoint sets of intent classes: one set contains high-resource intents for training an FSIC classifier, and the other contains low-resource intents for evaluating it. Concerning the training strategy on the high-resource intents, FSIC methods fall into two clusters. One cluster adopts meta-learning, or episodic training (Zhang et al., 2022a; Nguyen et al., 2020; Krone et al., 2020). Under this regime, the goal is to train a meta-learner that can quickly adapt to any few-shot intent classification task with very few labeled examples. To this end, the set of high-resource intents is split to construct many episodes, where each episode is a few-shot intent classification task over a small number of intents. The other cluster comprises methods (Zhang et al., 2021; Xu et al., 2021) that use conventional supervised (i.e., non-episodic) training, which simply fine-tunes the FSIC model on all samples from the high-resource intents of the training set.

Framework
We first unify formulations of the components we need for our framework. We then present their alternative configurations along our three central dimensions of comparison: (i) model architecture for encoding utterance pairs, (ii) functions for similarity scoring, and (iii) training regimes.

Nearest Neighbors Inference
Following previous work on FSIC, we cast the FSIC task as a sentence similarity task in which each intent is an implicit semantic class, captured by the representations of all the utterances associated with that intent. The task is then to find the most similar labeled utterances for a given query. During inference, the FSIC approach must solve an N-way k-shot intent classification task, where N is the number of intents and k is the number of labeled utterances given for each intent label.
Let q be a query utterance and C = {c_1, ..., c_n} the set of its labeled neighbors. Nearest neighbor inference relies on a similarity function, non-parameterized or trainable (learned on high-resource intents), to estimate a similarity score s_i between q and each c_i. The query's label ŷ_q is inferred as the ground-truth label of the neighbor with the maximum similarity score (i.e., k = 1 in kNN inference): ŷ_q = y_j, where j = argmax_i s_i.
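As a minimal illustration of this inference rule, the following sketch (with illustrative function names, not taken from the paper's released code) uses cosine similarity as the non-parameterized scoring function and returns the label of the single most similar neighbor:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_intent(query_vec, neighbors):
    """neighbors: list of (embedding, intent_label) pairs.
    Returns the label of the single most similar neighbor (k = 1)."""
    scores = [cosine(query_vec, emb) for emb, _ in neighbors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return neighbors[best][1]
```

In practice the embeddings would come from the (fine-tuned) PLM, and the scorer could equally be a trained similarity function.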

Model Architectures for Encoding Utterance Pairs
An encoder in an FSIC model represents a pair of a query and a neighbor (i.e., a labeled utterance) as a vector h_(q,c_i) ∈ R^d. We formulate the two recently used encoders: the Bi-Encoder and the Cross-Encoder.
Bi-Encoder (BE). BE encodes a pair of utterances independently, deriving separate representations of the query and the neighbor utterance. In particular, for each utterance x in a pair, we pass "[CLS] x" to a BERT-like PLM and use the representation of "[CLS]" to represent x. It is worth noting that the parameters of the PLM are shared in BE.
Cross-Encoder (CE). Different from BE, CE encodes a pair of query q and neighbor c i jointly.
We concatenate q with each of its neighbors to form a set of query-neighbor pairs P = {(q, c_1), ..., (q, c_n)}. We then pass each pair from P as a sequence of tokens to a language model that is pretrained to represent the semantic relation between utterances. More formally, we feed a pair of utterances, "[CLS] q [SEP] c_i", to a BERT-like PLM and use the representation of the "[CLS]" token as the representation of the pair.
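To make the contrast concrete, the following sketch (our own illustration; in a real pipeline the tokenizer inserts the special tokens) shows how BE and CE construct their PLM inputs for a query and its neighbors:

```python
def bi_encoder_inputs(query, neighbors):
    # BE: each utterance is encoded independently; the PLM weights are
    # shared across all single-utterance inputs.
    return ["[CLS] " + utt for utt in [query] + neighbors]

def cross_encoder_inputs(query, neighbors):
    # CE: one joint sequence per (query, neighbor) pair; no embedding of a
    # single utterance exists, only pair representations.
    return [f"[CLS] {query} [SEP] {c}" for c in neighbors]
```

BE can therefore encode and cache each candidate once, whereas CE must re-encode every pair, which is the efficiency trade-off discussed in §2.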

Similarity Scoring Function
Given the pair representation, we compute the similarity between a query and a neighbor utterance by a parameterized or non-parameterized function.
Parameterized (PA). A neural parameterized scoring function consists of a fully connected feed-forward network (FF) that transforms a pair representation into a score: s_i = σ(W h_(q,c_i) + b), where the weight W and bias b are trainable parameters, d is the size of the vector h_(q,c_i), and σ(·) denotes the sigmoid activation function.
Non-Parameterized (NP). In contrast to PA, NP uses vector-based similarity metrics as scoring functions, e.g., cosine similarity or Euclidean distance. Following prior work, we adopt the cosine similarity between h_q and h_c_i.

Model Configurations
Given the aforementioned components, we illustrate the resulting model configurations (Figure 1).

CE+PA. In this configuration, we feed the joint encoding of the utterance pair to a parameterized similarity scoring function. We note again that, due to the single representation vector for both utterances, CE cannot be coupled with non-parameterized scoring (NP).
BE+PA. In this configuration, we represent the pair by concatenating the representations of the two utterances with the vectors of their difference and element-wise product:

h_(q,c_i) = h_q ⊕ h_c_i ⊕ (h_q − h_c_i) ⊕ (h_q ⊙ h_c_i)   (1)

where ⊕ is the concatenation operation and ⊙ is the element-wise product. We motivate Equation 1 by the findings reported in Reimers and Gurevych (2019b). As in CE+PA, we use the sigmoid activation function on top of the feed-forward layer; the size of W is then 1 × 4d.
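The BE+PA pair representation and scorer can be sketched in NumPy as follows (function names are our own; the difference term follows Equation 1 as written, whereas SBERT-style variants use its absolute value):

```python
import numpy as np

def pair_features(h_q, h_c):
    # Equation 1: concatenate the two embeddings, their difference,
    # and their element-wise product -> a vector of size 4d.
    return np.concatenate([h_q, h_c, h_q - h_c, h_q * h_c])

def pa_score(h_q, h_c, W, b):
    # Feed-forward scorer on top: sigmoid(W @ features + b),
    # where W has shape (1, 4d) and b is a scalar bias.
    z = float(W @ pair_features(h_q, h_c) + b)
    return 1.0 / (1.0 + np.exp(-z))
```

W and b would be trained jointly with the PLM on the high-resource intents.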
BE+NP. We use cosine similarity to estimate the similarity between input utterances during prediction. During training, we compute the dot product between the query and each neighbor representation vector to directly estimate their similarity scores, s_i = σ(h_q · h_c_i), where · indicates the dot product and σ is the sigmoid function, applied to scale s_i to a value between 0 and 1.
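A minimal sketch (our own illustration) makes the asymmetry between the BE+NP training-time and inference-time scores explicit:

```python
import numpy as np

def np_train_score(h_q, h_c):
    # Training-time score: sigmoid of the dot product, squashed to (0, 1)
    # so it can feed a binary cross-entropy loss.
    return 1.0 / (1.0 + np.exp(-float(h_q @ h_c)))

def np_infer_score(h_q, h_c):
    # Inference-time score: cosine similarity between the two embeddings.
    return float(h_q @ h_c) / (np.linalg.norm(h_q) * np.linalg.norm(h_c))
```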

Training Regimes
To train the aforementioned model configurations, we formulate three training regimes (Figure 2): Non-Episodic Training (NE), Episodic Training (EP), and Episodic Training with Support and Query splits (EPSQ). All three rely on the same loss function for each query.
Loss per query sample. We adopt a loss function from prior work on FSIC. In particular, we define a ground-truth binary vector y_q for a query q given a set of neighbors C = {c_1, ..., c_n}.
If the query and its i-th neighbor belong to the same intent class, the corresponding label for the pair is y_q,i = 1; otherwise, y_q,i = 0. Given this ground-truth label vector over the n neighbors, y_q = [y_q,t | t = 1, ..., n], and the similarity scores estimated by a model configuration for all pairs, s_q = [s_q,t | t = 1, ..., n], we compute the binary cross-entropy loss for the query q as follows:

l_q(y_q, s_q) = − Σ_{t=1}^{n} [ y_q,t log s_q,t + (1 − y_q,t) log(1 − s_q,t) ]   (2)

NE. Under NE training, the classifier learns the semantic relations between all high-resource intent classes altogether. Let D represent a batch of utterances for high-resource intent classes. We take each utterance in D as a query q and predict its label with respect to the rest of the utterances as neighbors. More formally, the NE training loss is:

L_NE = Σ_{q ∈ D} l_q(y_q, s_q)

where l_q is the loss defined in Equation 2 between the ground-truth label vector y_q and the vector of scores s_q estimated by a model configuration.
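The per-query loss and its non-episodic aggregation can be sketched as follows (a plain-Python illustration; whether terms are summed or averaged over neighbors and queries is an implementation detail we assume here):

```python
import math

def query_loss(y_q, s_q, eps=1e-9):
    # Binary cross-entropy over the n (query, neighbor) pairs:
    # -sum_t [ y_t*log(s_t) + (1 - y_t)*log(1 - s_t) ]
    # eps guards against log(0) for saturated scores.
    return -sum(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
                for y, s in zip(y_q, s_q))

def ne_loss(batch):
    # NE training: every utterance in the batch acts once as the query,
    # with all remaining utterances as its neighbors. `batch` holds the
    # (y_q, s_q) pairs produced by a model configuration.
    return sum(query_loss(y_q, s_q) for y_q, s_q in batch)
```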
EP. An episode is a set of utterances for several intent classes; it formulates an N-way intent classification task, where N is the number of intent classes in the episode. The core idea behind meta-learning is to learn from a large set of high-resource intent classes by chunking the set into many episodes (Lee et al., 2022), known as training episodes (a.k.a. meta-training episodes). If the set I denotes the intent labels of a benchmark corpus, any N randomly selected intents from I can be used to construct a training episode. Let I_E denote the intents selected for episode E; then episode E contains the utterances whose intent labels are in I_E. It is worth noting that intent classes may overlap across training episodes, letting a classifier learn the semantic relations between all intent labels of the benchmark. In EP, we construct M episodes from the set of utterances for high-resource intent classes D and define the following loss function:

L_EP = Σ_{i=1}^{M} Σ_{q ∈ E_i} l_q(y_q, s_q)

where E_i is the i-th episode, y_q contains the ground-truth labels for the query given the neighbors in episode E_i, and s_q contains the similarity scores between the query and each neighbor in the episode.
EPSQ. Following common practice in meta-learning to imitate the few-shot setup, an episode is split into two disjoint sets: a support set and a query set (Lee et al., 2022). An episode's support set includes only a few utterances from each intent class in I_E; its query set includes the rest of the utterances in the episode. A classifier should classify the utterances in the query set using the utterances and intent labels in the support set. In kNN terminology, the support set is the set of neighbors and the query set is the set of query utterances. The main difference between EPSQ and EP is thus that the number of neighbors in EPSQ is limited to only a few examples of each intent in the support set. The EPSQ loss is defined as follows:

L_EPSQ = Σ_{i=1}^{M} Σ_{q ∈ Q_i} l_q(y_q, s_q)

where Q_i is the query set and S_i the support set of the i-th episode, and the neighbors of each query are drawn only from S_i.
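Episode construction for EP and EPSQ can be sketched as follows (an illustrative implementation with our own function names; sampling details may differ from the released code):

```python
import random

def make_episode(data_by_intent, n_way, rng):
    # Sample N intent classes and collect their utterances (an N-way task).
    # In EP, every utterance in the episode serves in turn as the query,
    # with all remaining utterances as its neighbors.
    intents = rng.sample(sorted(data_by_intent), n_way)
    return {intent: list(data_by_intent[intent]) for intent in intents}

def split_support_query(episode, k_shot, rng):
    # EPSQ: k utterances per intent form the support set (the neighbors);
    # the remaining utterances form the query set.
    support, query = {}, {}
    for intent, utts in episode.items():
        shuffled = utts[:]
        rng.shuffle(shuffled)
        support[intent] = shuffled[:k_shot]
        query[intent] = shuffled[k_shot:]
    return support, query
```

Because intents are sampled with replacement across episodes, the same intent may appear in many training episodes, which is what lets the classifier cover all high-resource intents over the course of training.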

Experiments
We conduct our experiments in two different setups: (i) balanced N-way k-shot and (ii) imbalanced classes in the support sets. The former is the typical few-shot learning setup, where the numbers of classes and examples per class are balanced. In contrast, the imbalanced setup randomly draws the numbers of classes and examples, imitating the imbalanced nature of some intent classification benchmarks. While arguably some utterances could be annotated to transform imbalanced episodes into balanced ones, imbalanced few-shot learning remains a substantial practical challenge in domains where annotation is expensive, e.g., those that require expert annotators (Krone et al., 2020).
Datasets, splits, and episodes. Table 1 summarizes the main statistics of the datasets and their splits as used in our experiments (e.g., the number of classes per data split for each dataset). For the balanced N-way k-shot setup, we use Clinc (Larson et al., 2019b), Banking (Casanueva et al., 2020b), and Hwu from DialoGLUE (Mehri et al., 2020), as well as Liu. For the sake of fair comparison, we use the exact splits and episodes used by Dopierre et al. (2021) for FSIC. For 5 folds, we randomly split the intents of each dataset into three sets to construct training, validation, and test episodes. We then generate 5-way k-shot episodes for each split in each fold, where k ∈ {1, 5}. For the imbalanced setup, we use ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), and TOP (Gupta et al., 2018), and follow Krone et al. (2020) to construct episodes for these datasets.
Settings. We use BERT-base-uncased and SimCSE as PLMs. We fine-tune them using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 2e-5. Both the batch size and the maximum sequence length are set to 64. See the Appendix for the full list of hyperparameters. For experiments on each fold of the balanced datasets, we train an FSIC classifier for a maximum of 10,000 5-way k-shot episodes. We evaluate the classifier on the validation set after every 100 updates and stop training if validation performance does not improve over 5 consecutive evaluation steps. To alleviate the impact of the random selection of few-shot samples, we report the average performance of a classifier over 600 test episodes, consistent with Dopierre et al. (2021). For the experiments on the imbalanced datasets, we follow Krone et al. (2020) and use a single fold due to the limited number of intents. The average number of shots per intent in the episodes of ATIS, SNIPS, and TOP is about 4, 5, and 4, respectively (see the Appendix for details). For both the N-way k-shot and imbalanced setups, the number of examples in the query sets is identical for all intents. For all experiments, we report accuracy averaged over all runs and folds.

Models in comparison.
Alongside the results of the model configurations (§3), we report the results of the following FSIC methods to put our results in context. Random assigns a random intent class from the support set to each query utterance. BE (fixed)+NP represents a generic configuration employed by the majority of PLM-based FSIC baselines, e.g., ConvBERT (Mehri and Eric, 2021), TOD-BERT, and DNNC-BERT, inter alia. These methods use pretrained BERT and further fine-tune it on other NLP tasks (e.g., NLI) or other dialogue datasets. ProtoNet (Dopierre et al., 2021) is inspired by the prototypical network (Snell et al., 2017), which has been shown to achieve state-of-the-art accuracy among meta-learning methods for few-shot learning tasks, including FSIC (Krone et al., 2020). This method is not based on instance-level similarity: it encodes an intent class as a prototype vector, the mean of the vector representations of the few utterances given for the intent. In any given episode, a prototype vector is computed for each intent, and the probabilities of the intents are then estimated from the distances between the query vector and the respective prototypes.

Table 2: BERT-based results for the balanced 5-way k-shot setup, k ∈ {1, 5}.

Results and Discussion
We compare the configurations described in §3 and the baselines (§4) for the balanced and imbalanced FSIC setups using BERT, as the most widely used pretrained language model, and SimCSE, a state-of-the-art sentence encoder. Our main experimental findings are as follows:

• The Cross-Encoder architecture with a parameterized similarity function and episodic training consistently yields the best FSIC accuracy.
• Episodic training yields more robust FSIC classifiers than non-episodic training for most of the examined setups and datasets.
• Splitting episode utterances into support and query (sub)sets, a commonly adopted practice in episodic training, does not give consistent performance gains.

Table 2 shows the accuracy of the examined FSIC approaches, based on BERT as the PLM, in the 1-shot and 5-shot settings. All model configurations consistently outperform the "BE (fixed)+NP" baseline. This demonstrates that fine-tuning BERT's parameters for intent classification on high-resource intent classes is paramount for generalization to unseen intents. For both 1-shot and 5-shot, CE+PA trained with either of the two episodic training regimes (EP and EPSQ, without and with support-query splitting, respectively) achieves higher accuracy (29% higher on average) than when trained non-episodically (NE), reaching, on average, the performance of ProtoNet, the state-of-the-art FSIC method. Both episodic training regimes are more effective than non-episodic training across the board, not just in combination with the CE architecture: BE+PA trained via EP achieves about 2% higher accuracy for 1-shot and 3% for 5-shot than when trained with NE, and for BE+NP, episodic learning (EP) yields 3.8% higher accuracy than NE for 1-shot. The only exception to this trend is BE+NP in the 5-shot setting, where EP trails NE by 1%.

Balanced FSIC
EPSQ tends to exhibit a similar average accuracy to EP (less than 1% difference on average across the CE+PA, BE+PA, and BE+NP setups). This leads to the conclusion that splitting the utterances of an episode into a support and a query set, a common practice in episodic (FSIC) learning (Dopierre et al., 2021; Krone et al., 2020), does not have a pronounced (positive) effect on performance. It thus does not seem to increase the ability to generalize to unseen intent classes, as has been commonly believed but, to the best of our knowledge, until now empirically untested.
As expected, more shots (5-shot vs. 1-shot) lead to consistently better FSIC accuracy: BE+NP trained with NE performs 16% better (and the other FSIC configurations about 10% better on average). This makes intuitive sense: more shots help classifiers better refine the boundaries between the new intents.
Given that utterances in task-oriented dialogue systems are short texts, we next investigate how intermediate training for sentence representations (Phang et al., 2018; Reimers and Gurevych, 2019a; Gao et al., 2021) changes the performance of FSIC models. To this end, we substitute BERT with SimCSE; Table 3 shows the results. Our three main findings hold for SimCSE-based FSIC models too. Importantly, unlike with BERT, only CE+PA trained episodically now outperforms the "BE (fixed)+NP" baseline (in which the PLM is not fine-tuned for intent detection). This confirms the effectiveness of coupling CE and episodic training for FSIC. It also indicates that intent detection fine-tuning is well aligned with learning sentence representations, which is why it generally brings lower gains (or no gains) over "BE (fixed)+NP" when we start from SimCSE, pretrained exactly for encoding sentence meaning.

Table 4 shows the results on the three imbalanced datasets. CE+PA with EP again substantially outperforms all its counterparts, confirming this previously uninvestigated configuration as a very effective approach to the FSIC task. On average, episodic training (EP) again outperforms non-episodic (NE) training. The CE+PA and BE+NP configurations generally yield higher performance when trained without splitting the support utterances from the query utterances (EP vs. EPSQ). This questions the common belief in episodic meta-learning that splitting episodes into support and query sets is (always) beneficial. Overall, the findings on the imbalanced datasets align well with the main findings of the central experiments on the balanced datasets, as reported in Table 2 and Table 3.

Conclusions
We shed light on the factors that contribute to the performance of models for few-shot intent classification (FSIC), a crucial task in modular dialogue systems. We categorize FSIC approaches along three essential dimensions: (1) Cross-Encoder vs. Bi-Encoder architectures; (2) parameterized (i.e., trainable) vs. non-parameterized utterance similarity scoring; and (3) episodic vs. non-episodic training. Our extensive evaluation, encompassing seven standard FSIC datasets, reveals that the previously unexplored combination of the Cross-Encoder architecture (with parameterized utterance similarity scoring) and episodic training consistently yields the best FSIC performance. We additionally find (i) that episodic meta-learning generally outperforms non-episodic training and (ii) that the widely adopted hypothesis in meta-learning that splitting episodes into support and query sets helps generalization and boosts performance may not hold for FSIC. We hope that our findings lead to more deliberation on FSIC evaluation protocols and more insightful apples-to-apples comparisons between competing models and model variants.

Limitations and Ethical Concerns.
In this paper, we shed light on few-shot intent classification in modular (task-oriented) dialogue systems. Dialog systems, given their direct interaction with human users, must be devoid of negative stereotypes and must not exhibit any behaviour that could be potentially harmful to humans. That said, our work does not address the generation component of dialog systems, but merely intent classification; as such, we do not believe it raises any ethical concerns. The main limitation of the work, conditioned primarily by the available computational resources, is the scope of our empirical comparison: we focus on FSIC methods that subscribe to pairwise similarity scoring of utterances and nearest-neighbour inference. While this subsumes many of the best-performing approaches in the literature, there is a fair body of recent work that does not fall into this group. Another limitation is the monolingual focus on English. We intend to extend our work to cross-lingual transfer to other languages, for which fewer labeled intent classification datasets exist.