Pre-training Intent-Aware Encoders for Zero- and Few-Shot Intent Classification

Intent classification (IC) plays an important role in task-oriented dialogue systems. However, IC models often generalize poorly when trained without sufficient annotated examples for each user intent. We propose a novel pre-training method for text encoders that uses contrastive learning with intent pseudo-labels to produce embeddings well-suited for IC tasks, reducing the need for manual annotations. Applying this pre-training strategy, we introduce the Pre-trained Intent-aware Encoder (PIE), which is designed to align encodings of utterances with their intent names. Specifically, we first train a tagger to identify key phrases within utterances that are crucial for interpreting intents. We then use these extracted phrases to create examples for pre-training a text encoder in a contrastive manner. As a result, our PIE model achieves up to 5.4% and 4.0% higher accuracy than the previous state-of-the-art text encoder in the N-way zero- and one-shot settings on four IC datasets.


Introduction
Identification of user intentions, a problem known as intent classification (IC), plays an important role in task-oriented dialogue (TOD) systems. However, it is challenging for TOD developers to collect data and re-train models when designing new intent classes. Recent studies have aimed to tackle this challenge by applying zero- and few-shot text classification methods and leveraging the semantics of intent label names (Liu et al., 2019a; Krone et al., 2020; Burnyshev et al., 2021; Mueller et al., 2022; Zhang et al., 2022; Lamanov et al., 2022; Liu et al., 2022). Dopierre et al. (2021a) compare various classification methods on few-shot IC tasks and find that Prototypical Networks (Snell et al., 2017) work well with transformer-based text encoders. Prototypical Networks use text encoders to construct class representations and retrieve correct classes given queries based on a similarity metric. Dopierre et al. (2021a) also stress that few-shot learning techniques and text encoders can have an orthogonal impact on classification performance. Thus, although some studies have focused on improving learning techniques for few-shot IC tasks (Dopierre et al., 2021b; Chen et al., 2022), better text encoder selection should also be considered as an important research direction. Ma et al. (2022) observe that sentence encoders pre-trained on paraphrase or natural language inference datasets serve as strong text encoders for Prototypical Networks. However, existing sentence encoders are not explicitly designed to produce representations for utterances that are similar to their intent names. Therefore, their abilities are limited in zero- and few-shot settings where predictions may heavily rely on the semantics of intent names. Pre-training encoders to align user utterances with intent names can mitigate this issue; however, it is typically expensive to obtain annotations for a diverse intent set.
In this paper, we propose a novel pre-training method for zero- and few-shot IC tasks (Figure 1). Specifically, we adopt intent role labeling (IRL) (Zeng et al., 2021), an approach for identifying and assigning roles to words or phrases that are relevant to user intents in sentences. Once we obtain the IRL predictions, we convert them into pseudo intent names for the query utterances and use them to pre-train the encoder in a contrastive learning fashion. This intent-aware contrastive learning aims not only to align utterances with their pseudo intent names in the semantic embedding space, but also to encourage the encoder to pay attention to the intent-relevant spans that are important for distinguishing intents. To the best of our knowledge, this work is the first to extract key information from utterances and use it as pseudo labels for pre-training intent-aware text encoders.
The contributions of our work are as follows: • First, we propose an algorithm for generating pseudo intent names from utterances across several dialogue datasets and publicly release the associated datasets.
• Second, by applying intent-aware contrastive learning on gold and pseudo intent names, we build Pre-trained Intent-aware Encoder (PIE), which is designed to align encodings of utterances with their intent names.
• Finally, experiments on four IC datasets demonstrate that the proposed model outperforms the state-of-the-art methods (Dopierre et al., 2021b; Ma et al., 2022) by up to 5.4% and 4.0% in the N-way zero- and one-shot settings, respectively.
Background: Prototypical Networks for Intent Classification

Prototypical Networks (Snell et al., 2017) is a meta-learning approach that enables classifiers to quickly adapt to unseen classes when only a few labeled examples are available. Several studies have demonstrated the effectiveness of Prototypical Networks when building intent classifiers with a few example utterances (Krone et al., 2020; Dopierre et al., 2021a; Chen et al., 2022). They first define a few-shot IC task, also known as an episode in the meta-learning context, with K example utterances from each of N intent classes (i.e., K×N utterances in a single episode). At training time, the intent classifiers are optimized on a series of these episodes. The example utterances for each intent class are called a support set, and are encoded and averaged to produce a class representation, called a prototype. This can be formulated as follows:

$$\mathbf{c}_n = \frac{1}{|S_n|} \sum_{x_{n,i} \in S_n} f_\phi(x_{n,i}) \quad (1)$$

where $S_n$ denotes the support set of the n-th intent class, $x_{n,i}$ denotes the i-th labeled example of the support set $S_n$, $f_\phi(\cdot)$ denotes a trainable encoder, and $\mathbf{c}_n$ denotes the n-th prototype. At inference time, the task is to map the query utterance representation to the closest prototype in a metric space (e.g., Euclidean) among the N prototypes. When there are N intent classes and each intent class has K example utterances, this setting is called N-way K-shot intent classification. Ma et al.
(2022) suggest that leveraging intent names as additional support examples is beneficial in few-shot IC tasks because the semantics of intent names can provide additional hints beyond the example utterances. When intent names are used as additional support examples, the new prototype representations can be formulated as follows:

$$\mathbf{c}^{\text{label}}_n = \frac{1}{|S_n| + 1} \Big( f_\phi(y_n) + \sum_{x_{n,i} \in S_n} f_\phi(x_{n,i}) \Big) \quad (2)$$

where $y_n$ is the intent name of the n-th support set, and $\mathbf{c}^{\text{label}}_n$ is the n-th prototype using intent names as support. By using intent names as support examples, it is possible to classify input utterances without example utterances, in a zero-shot fashion. Specifically, the prototypes in Equation (2) can be calculated as $\mathbf{c}^{\text{label}}_n = f_\phi(y_n)$ based solely on intent names, which facilitates zero-shot IC.
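To make Equations (1) and (2) concrete, here is a minimal numpy sketch of prototype construction and nearest-prototype classification; the function names are our own, not from the paper's implementation:

```python
import numpy as np

def prototypes(support_embs):
    """support_embs: dict mapping intent name -> (K, d) array of encoded
    support utterances. Returns intent -> (d,) prototype (Eq. 1: the mean
    of the support embeddings)."""
    return {n: e.mean(axis=0) for n, e in support_embs.items()}

def label_prototypes(support_embs, label_embs):
    """Eq. 2: the encoded intent name is treated as one extra support
    example. With an empty support set this reduces to the label embedding
    alone, which is the zero-shot case."""
    protos = {}
    for n, y in label_embs.items():
        e = support_embs.get(n)
        stacked = y[None, :] if e is None or len(e) == 0 else np.vstack([e, y[None, :]])
        protos[n] = stacked.mean(axis=0)
    return protos

def classify(query_emb, protos):
    """Assign the query to the nearest prototype (Euclidean metric)."""
    return min(protos, key=lambda n: np.linalg.norm(query_emb - protos[n]))
```

With an empty support set, `label_prototypes` reduces each prototype to the encoded intent name, which is exactly the zero-shot setting described above.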

Pseudo Intent Name Generation
To pre-train an encoder $f_\phi$ that works robustly in zero- or few-shot IC settings, a variety of predefined intent names is required. Because annotating them is expensive, we opt to automatically generate pseudo intent names from utterances in our pre-training data. To annotate pseudo intents, we employ a tagging method, intent role labeling (IRL). IRL can be considered similar to semantic role labeling (SRL), the task of assigning general semantic roles to words or phrases in sentences (Palmer et al., 2010). However, IRL focuses on providing an extractive summary of the intent expressed in a user's utterance, annotating roles that are important with respect to the goal of the user rather than to a predicate. Specifically, it tags words or phrases that are key to interpreting the intent.
IRL was first introduced by Zeng et al. (2021) for discovering intents from utterances, but their tagger covers only Chinese utterances. In this section, we outline the process of building an IRL tagger from scratch. We describe how we annotate IRL training data on English utterances (Section 3.1), the training procedure for the IRL tagger (Section 3.2), and the use of IRL predictions to generate pseudo intent names for pre-training our model (Section 3.3).

Annotating Intent Roles
We define six intent role labels, Action, Argument, Request, Query, Slot, and Problem, for extracting intent-relevant spans from utterances.Action is a word or phrase (typically a verb or verb phrase) that describes the main action relevant to an intent in an utterance.Argument is an argument of an action, or entity/event that is important to interpreting an intent.Request indicates a request for something such as a question or information-seeking verb.Query indicates the expected type of answer to a question or request for information, or a requested entity to be obtained or searched for.Slot is an optional/variable value provided by the speaker that does not impact the interpretation of an intent.Finally, Problem describes some problematic states or events, and typically makes an implicit request.
Based on these definitions, we manually annotate IRL labels on a subset of utterances from SGD (Rastogi et al., 2020).

Training the IRL Tagger
Using the manually curated IRL annotations, we formulate IRL as a sequence tagging problem. Specifically, we assign each token in an utterance one of 13 IRL labels under the Beginning-Inside-Outside (BIO) scheme (e.g., B-Action, I-Action, or O). The IRL model is then trained to predict the correct IRL labels of the tokens using the cross-entropy loss. We use RoBERTa-base (Liu et al., 2019c) as the initial model for the IRL tagger.
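As an illustration of this formulation, the sketch below builds the 13-tag BIO inventory from the six roles and computes a masked token-level cross entropy. The actual tagger is a fine-tuned RoBERTa-base, so this is only a toy stand-in, and the helper names are our own:

```python
import numpy as np

ROLES = ["Action", "Argument", "Request", "Query", "Slot", "Problem"]
# BIO scheme: B-/I- for each of the six roles, plus O -> 13 tags in total.
IRL_TAGS = ["O"] + [f"{p}-{r}" for r in ROLES for p in ("B", "I")]
TAG2ID = {t: i for i, t in enumerate(IRL_TAGS)}

def token_ce_loss(logits, gold_ids, ignore_id=-100):
    """Mean token-level cross entropy over an utterance, skipping padding or
    subword positions marked with ignore_id (the usual convention in token
    classification pipelines)."""
    logits = np.asarray(logits, dtype=float)
    losses = []
    for row, g in zip(logits, gold_ids):
        if g == ignore_id:
            continue
        row = row - row.max()                   # numerical stability
        logp = row - np.log(np.exp(row).sum())  # log-softmax
        losses.append(-logp[g])
    return float(np.mean(losses))
```

Six roles times two boundary tags plus the outside tag gives the 13 labels mentioned above.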

Generating Pseudo Intents
After obtaining the IRL tagger, we use it to predict IRL labels for tokens in utterances from the pre-training corpus described in Section 5.2. To generate pseudo intent names, we simply concatenate all spans that received IRL labels in each utterance.

Intent-Aware Contrastive Learning

We aim to build an encoder that produces similar representations for utterances and their corresponding intent names. In this section, we introduce the intent-aware contrastive learning approach, which uses triples of an utterance, a gold intent, and a pseudo intent from various dialogue datasets.
Our training objective is designed to align the representations of utterances and their intent names in the semantic embedding space. For this purpose, we use the InfoNCE loss (van den Oord et al., 2018), which pulls positive pairs close to each other and pushes away negative pairs. The loss for the i-th sample $x_i$ is formulated as follows:

$$\mathcal{L}_{\text{InfoNCE}}(x_i, \mathbf{y}) = -\log \frac{\exp(\text{sim}(f_\phi(x_i), f_\phi(y_i)))}{\sum_{j=1}^{N} \exp(\text{sim}(f_\phi(x_i), f_\phi(y_j)))}$$

where $\mathbf{y} = \langle y_1, y_2, \ldots, y_N \rangle$ are the pairs of the inputs $x_i$ within a batch of size N, and $\text{sim}(\cdot)$ denotes the cosine similarity between two embeddings. Again, $f_\phi(\cdot)$ denotes any text encoder that represents intent names or utterances in the embedding space. Note that pairs that are not positive in a batch are treated as negative pairs. We define three types of positive pairs, two of which are supervised and one semi-supervised. The first supervised positive pair is between the input utterances and their gold intent names annotated in the pre-training datasets:

$$\mathcal{L}_{\text{gold\_intent}} = \mathcal{L}_{\text{InfoNCE}}(x_i, \mathbf{y}^{\text{gold}})$$

where $y^{\text{gold}}_i$ is the gold intent name of $x_i$.
The second supervised positive pair is between the input utterances and their gold utterances. We define gold utterances as randomly sampled utterances that share the same gold intent names as the input utterances:

$$\mathcal{L}_{\text{gold\_utterance}} = \mathcal{L}_{\text{InfoNCE}}(x_i, \mathbf{x}^{\text{gold}})$$

where $x^{\text{gold}}_i$ is the gold utterance of $x_i$. Finally, the semi-supervised positive pairs are between the input utterances and their pseudo intent names:

$$\mathcal{L}_{\text{pseudo}} = \mathcal{L}_{\text{InfoNCE}}(x_i, \mathbf{y}^{\text{pseudo}})$$

where $y^{\text{pseudo}}_i$ denotes the pseudo intent name of $x_i$ constructed by the IRL tagger, as described in Section 3.3. Our final loss is a combination of these three losses:

$$\mathcal{L} = \mathcal{L}_{\text{gold\_intent}} + \mathcal{L}_{\text{gold\_utterance}} + \lambda \mathcal{L}_{\text{pseudo}}$$

where $\lambda$ is the weight of the semi-supervised loss term.
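A minimal numpy sketch of this objective, assuming in-batch negatives and a temperature value of our own choosing (the paper's exact hyperparameters may differ):

```python
import numpy as np

def info_nce(q, pos, tau=0.05):
    """In-batch InfoNCE: q[i] is pulled toward pos[i]; every other pos[j]
    in the batch serves as a negative. tau is a temperature whose value
    here is an assumption, not taken from the paper."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    pos = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    sims = q @ pos.T / tau                     # (N, N) scaled cosine similarities
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    logp = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

def pie_loss(utt, gold_intent, gold_utt, pseudo_intent, lam=2.0):
    """Final objective: L_gold_intent + L_gold_utterance + lam * L_pseudo,
    each computed over pre-encoded (N, d) embedding batches."""
    return (info_nce(utt, gold_intent)
            + info_nce(utt, gold_utt)
            + lam * info_nce(utt, pseudo_intent))
```

In actual pre-training the embeddings would come from the trainable encoder $f_\phi$ and the loss would be backpropagated; this sketch only shows the loss arithmetic.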

Baselines
To evaluate the effectiveness of our proposed PIE model, we compare it with the following state-of-the-art approaches for few-shot IC tasks: ProtoNet (Dopierre et al., 2021a) and ProtAugment (Dopierre et al., 2021b) as fine-tuning methods, and SBERT Paraphrase (Ma et al., 2022) as a pre-trained text encoder.
ProtoNet is a meta-training approach that fine-tunes encoders using a series of episodes constructed from task-specific training sets. ProtAugment is an advanced method derived from ProtoNet, which augments episodes with paraphrased utterances to mitigate overfitting caused by the biased distribution introduced by a limited number of training examples. The authors of ProtoNet and ProtAugment additionally pre-train BERT-base-cased (110M) on training utterances with the language model objective, and use it as their initial model. We refer to this model as BERT TAPT, inspired by task-adaptive pre-training (TAPT) (Gururangan et al., 2020).
SBERT Paraphrase is a text encoder pre-trained on large-scale paraphrase text pairs (Reimers and Gurevych, 2019). Ma et al. (2022) discover that this pre-trained text encoder can produce good utterance embeddings without any fine-tuning on task-specific datasets. Although the authors leverage SBERT Paraphrase solely at the inference stage of Prototypical Networks, we conduct additional experiments in which the encoder is fine-tuned with ProtoNet and ProtAugment as baselines. Note that we reproduce the performance of SBERT Paraphrase using paraphrase-mpnet-base-v2 (110M; https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2), which has the same number of parameters as BERT-base-cased, for a fair comparison.

Pre-training Datasets

We collect four dialogue datasets to pre-train and one to validate our encoder: TOP (Gupta et al., 2018), TOPv2 (Chen et al., 2020), DSTC11-T2 (Gung et al., 2023), SGD (Rastogi et al., 2020), and MultiWOZ 2.2 (Zang et al., 2020). Dialogues in the SGD and MultiWOZ 2.2 datasets consist of multi-turn utterances, and these utterances are often ambiguous when the context around them is not given (e.g., 'Can you suggest something else?' is labeled as 'LookupMusic'). To minimize this ambiguity, we use the first-turn utterance of each dialogue from these datasets. Furthermore, the number of utterances per intent in the raw datasets is highly imbalanced. To alleviate this imbalance, we cap the number of utterances per intent at 1,000 for the TOP and DSTC11-T2 datasets and at 100 for the SGD and MultiWOZ 2.2 datasets. We then annotate IRL labels on the utterances using the IRL tagger. Based on the IRL predictions, we filter out utterances in which no Action, Argument, or Query labels are detected, because they are likely to lack information for interpreting user intents. Finally, we treat MultiWOZ 2.2 as the validation set for tuning the hyperparameters of the pre-training stage. Table 4 summarizes the statistics of the datasets. We evaluate our PIE model and baseline models on four IC datasets (Table 5).
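The corpus construction steps above (first-turn selection, per-intent caps, and IRL-based filtering) can be sketched as follows; the `irl_tagger` interface is an assumption for illustration:

```python
from collections import defaultdict

def prepare_pretraining_data(dialogues, max_per_intent, irl_tagger):
    """Sketch of the corpus construction described above: keep only the
    first-turn utterance of each dialogue, cap the number of utterances per
    intent, and drop utterances whose IRL predictions contain no Action,
    Argument, or Query span. `irl_tagger(utterance)` is assumed to return
    a list of (role, phrase) spans."""
    per_intent = defaultdict(list)
    for dialogue in dialogues:
        utt, intent = dialogue[0]               # first turn only
        if len(per_intent[intent]) >= max_per_intent:
            continue                            # per-intent cap reached
        roles = {role for role, _ in irl_tagger(utt)}
        if roles & {"Action", "Argument", "Query"}:
            per_intent[intent].append(utt)
    return per_intent
```

The cap (1,000 or 100 depending on the dataset) and the filtering rule follow the description above; everything else in this helper is a simplification.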

Downstream Datasets
Banking77 (Casanueva et al., 2020) is an IC dataset in the banking domain. As there are many overlapping tokens between intent names (e.g., 'verify top up', 'top up limits', and 'pending top up'), fine-grained understanding is required to correctly classify intents in this dataset. HWU64 (Liu et al., 2019b) is a dataset spanning 21 different domains, such as alarm and calendar, for a home assistant robot. Liu54 (Liu et al., 2019b) is a dataset collected via Amazon Mechanical Turk, where workers designed utterances for given intents. Clinc150 (Larson et al., 2019) is a dataset that includes a wide range of intents from ten different domains, such as 'small talk' and 'travel'. Before proceeding, we examine the number of intent names that overlap between the pre-training data and the downstream data (Table 6). When comparing intent names, we first apply stemming ('restaurant reservation' → 'restaur reserv') and arrange the tokens in alphabetical order ('restaur reserv' → 'reserv restaur') for each intent name. This approach aims to maximize the recall of overlapping intent names. Consequently, we find that only 11 out of 345 intent names from the downstream data overlap (e.g., the 'reserve restaurant' intent in the pre-training data and the 'restaurant reservation' intent in the downstream data).
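A sketch of this normalization, using a crude suffix-stripping stemmer as a stand-in for a real stemmer (so the exact stemmed forms differ from the paper's 'restaur reserv' example):

```python
def normalize_intent(name):
    """Stem each token (crude suffix stripping stands in for a real stemmer
    such as Porter's) and sort the tokens alphabetically, so that
    e.g. 'restaurant reservation' and 'reserve restaurant' collide."""
    def stem(tok):
        for suf in ("ation", "ing", "ion", "e", "s"):
            if tok.endswith(suf) and len(tok) > len(suf) + 2:
                return tok[: -len(suf)]
        return tok
    return " ".join(sorted(stem(t) for t in name.lower().replace("_", " ").split()))

def overlapping_intents(pretrain_names, downstream_names):
    """Downstream intent names whose normalized form also appears in the
    pre-training data, i.e. the coincidental overlaps counted in Table 6."""
    pre = {normalize_intent(n) for n in pretrain_names}
    return [n for n in downstream_names if normalize_intent(n) in pre]
```

Sorting the stemmed tokens makes the comparison order-insensitive, which maximizes recall exactly as described above.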

Implementation Details
Here, we describe the details of pre-training our PIE model and employing it for zero- and few-shot IC tasks.
We use paraphrase-mpnet-base-v2, the same encoder used in the SBERT Paraphrase baseline, as an initial model for further pre-training in our approach.The hyperparameters are tuned based on the validation set described in Section 5.2.As a result, we set the training epochs to 1, the learning rate to 1e-6, the batch size to 50, and λ to 2.
After pre-training, we apply the model to zero- and few-shot IC tasks in the 5-way and N-way settings. The baselines we compare with experiment only with the 5-way setting, where the task is to predict the correct intent from among five candidate classes. We further include an N-way setting, where N can be much larger than five, because in practice it is often necessary to assign more than five intents when building TOD systems. When evaluating models in the N-way setting, we use all the intent classes in the test set as candidate intents, for example, 27-way for Banking77 and 50-way for Clinc150. We set K, the number of examples per intent in an episode, to 0 and 1 to experiment with zero- and one-shot IC. Finally, we treat intent labels as examples when creating prototypes for each intent. This enables experiments in the zero-shot setting and enhances performance in the few-shot setting. To denote the use of labels as examples, we append the 'L-' prefix to method names (e.g., L-PIE).

Table 7 shows 5-way K-shot IC performance on the four test sets. The results demonstrate that PIE achieves an average accuracy of 88.5%, surpassing SBERT Paraphrase, considered the strongest baseline model, by 2.8% in the one-shot setting. This highlights the effectiveness of our pre-training strategy. Additionally, when intent labels are used as examples, our L-PIE model achieves 89.1% and 93.7% in the zero-shot and one-shot settings, respectively, consistently outperforming L-SBERT Paraphrase by 2.9% and 1.8%. It is worth noting that the L-PIE model also outperforms L-BERT TAPT + ProtoNet, which fine-tunes an encoder on the target datasets, by a substantial margin of 4.0% and 2.9%. This shows that our proposed approach builds an effective intent classifier that performs well even prior to fine-tuning on task-specific data. Our L-PIE model shows further improvement when fine-tuned with ProtAugment, outperforming the strongest baseline, L-SBERT Paraphrase + ProtAugment, by 1.0% in zero-shot IC.
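The N-way zero-shot protocol with labels as prototypes (the 'L-' setting with K=0) can be sketched as follows; the function name is our own:

```python
import numpy as np

def nway_zero_shot_accuracy(query_embs, query_gold, label_embs):
    """N-way zero-shot IC: every encoded intent name serves as a prototype,
    and each query goes to the nearest prototype by cosine similarity."""
    names = list(label_embs)
    P = np.stack([label_embs[n] for n in names])
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    preds = [names[i] for i in (Q @ P.T).argmax(axis=1)]
    return float(np.mean([p == g for p, g in zip(preds, query_gold)]))
```

In the N-way setting described above, `label_embs` would contain every intent class in the test set (e.g., 50 entries for Clinc150).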

Table 8: N-way K-shot intent classification performance of pre-trained models with and without fine-tuning on the four test sets. Averaged accuracies and standard deviations across five class splits are reported. The 'L-' prefixes indicate the use of intent label names when creating prototypes. Highest scores are boldfaced. Fine-tuning is done in the 5-way setting due to memory constraints.

N-way K-shot intent classification We showcase the performance of our PIE model and the baselines in a more challenging and practical scenario (Table 8). In this scenario, the intent of a user utterance must be classified among a significantly larger number of intent classes (e.g., 10× larger for Clinc150). The results show that our L-PIE model achieves 73.3% and 82.0% in the zero- and one-shot settings, respectively, outperforming the baseline L-SBERT Paraphrase by 5.4% and 4.0%. These performance improvements are significantly higher than those observed in the 5-way K-shot IC task, indicating that our PIE model performs well in practical scenarios, as stated above.

We leverage dialogue datasets for building the PIE model, as described in Section 5.2. Here, we perform an ablation study over the pre-training datasets on N-way K-shot IC tasks (Table 9). The results show that using the TOP (+TOPv2) dataset, which has 31K utterances, 61 gold intents, and 23K pseudo intents, improves performance the most over L-SBERT Paraphrase (indicated as None in Table 9). Specifically, there is an improvement of 4.8% and 3.7% in the zero- and one-shot settings, respectively. Although using other datasets, such as SGD or DSTC11-T2, does not improve performance as much as using the TOP dataset, we observe that merging them further improves the overall performance on downstream tasks.

As described in Section 4, our intent-aware contrastive loss comprises three sub-losses: $\mathcal{L}_{\text{gold\_intent}}$, $\mathcal{L}_{\text{gold\_utterance}}$, and $\mathcal{L}_{\text{pseudo}}$. To assess the benefit of these losses during pre-training, we ablate each loss
function in the N-way K-shot IC tasks on HWU64 and Clinc150 (Table 10). The results indicate that the two sub-losses $\mathcal{L}_{\text{gold\_intent}}$ and $\mathcal{L}_{\text{gold\_utterance}}$ yield relatively marginal improvements. However, $\mathcal{L}_{\text{pseudo}}$ serves as the key sub-loss for PIE, highlighting the effectiveness of using pseudo intents. Specifically, removing $\mathcal{L}_{\text{pseudo}}$ from the final loss results in up to 1.8% and 1.1% degradation in performance in the zero- and one-shot settings, respectively.

Varying K and N
We visualize the performance of the PIE model in challenging N-way K-shot IC settings where the number of example utterances K or the number of candidate intent classes N varies. Plots of performance at varying K (Figure 2) show that our model consistently outperforms the baselines, and the improvement is largest when K is small (e.g., K=0). Plots of performance at varying N (Figure 3) show that the improvement of our model grows as the number of intents N increases (i.e., from N=5 to N=50). These visualizations reveal that the PIE model can be used in more practical and realistic settings where many user intents are used in the TOD system and only a few utterances are available.

As shown in Table 6, there are a few overlapping intent names between the pre-training data and the downstream data (except for Liu54). These coincidental overlaps can hinder an accurate evaluation of the generalization ability of our model. To understand the impact of intent overlaps, we also measure performance using only non-overlapping intents (Table 11). We observe that the impact is marginal enough to be neglected, and surprisingly, removing overlapping intents can actually lead to better performance on Banking77 and HWU64. Through further analysis, we discover that this is partly because of a bias towards the pairs of utterances and intents annotated in the pre-training datasets. For example, the utterance 'please play my favorite song' in the pre-training data has the intent 'play music'. Our model then incorrectly predicts 'play music' for the test utterance 'that song is my favorite', whose correct intent is 'music likeness'.

Conclusions
In this work, we propose a pre-training method that leverages pseudo intent names constructed with an IRL tagger in a semi-supervised manner, yielding the Pre-trained Intent-aware Encoder (PIE). Experiments on four intent classification datasets show that our model achieves state-of-the-art performance on all datasets, outperforming the strongest sentence encoder baseline by up to 5.4% and 4.0% in the N-way zero- and one-shot settings, respectively. Our analysis shows that PIE performs robustly compared to the baselines in challenging and practical settings with a large number of classes and a small number of support examples. In future work, we will explore the use of IRL and our PIE model in multi-label intent classification and out-of-scope detection tasks.

Limitations
One limitation of our method is that, while it leverages annotations from the IRL tagger, span detection for certain labels, such as 'Problem', is not sufficiently accurate (38.1% F1 score). This is likely due to the relatively small number of annotations of this type in the training set (45 annotations). To mitigate this limitation, we could annotate more instances of this label or apply techniques for handling imbalanced labels.
Another limitation is that we currently treat all IRL labels equally when constructing pseudo intents. However, the importance of each label in interpreting an intent can vary. To address this, we plan to investigate treating labels differently when pre-training the encoder (e.g., by giving more weight to 'Action' and 'Argument' labels and less weight to 'Slot' labels).

Ethics Statement
Our proposed method enhances zero- and few-shot intent classification and does not raise any ethical concerns. We believe that this research has valuable merits that can lead to more reliable task-oriented dialogue systems. All experiments in this study were carried out using publicly available datasets.

Figure 1: Overview of pre-training the intent-aware encoder (PIE). Given an utterance $x_1$ from the pre-training corpus, we generate a pseudo intent name $y^{\text{pseudo}}_1$ using labels from the intent role labeling (IRL) tagger. Our PIE model is then optimized by pulling the gold utterance $x^{\text{gold}}_1$, the gold intent $y_1$, and the pseudo intent $y^{\text{pseudo}}_1$ close to the input utterance $x_1$ in the embedding space.

Figure 2: Performance on N-way K-shot intent classification with varying K. PA refers to ProtAugment.

Figure 3: Performance on N-way 0-shot intent classification with varying N.

Table 1: Statistics and examples of each IRL label from 3,879 utterances.

Table 1 shows the statistics and examples of each IRL label from the 3,879 annotated utterances. To train and evaluate the IRL tagger, we split the annotations into training, validation, and test sets with 3,121 / 379 / 379 utterances, respectively (approximately an 80:10:10 ratio).

Table 2 shows the precision, recall, and F1 scores of each IRL label on the test set.

Table 2: Precision, recall, and F1 scores of each IRL label on the test set.

Table 3: Some examples of IRL predictions (tagged spans marked in brackets) from utterances, the extracted pseudo intent names, and the gold intent names annotated in the original dataset:

... like to open [ACT] a savings [SLT] account [ARG] please → open savings account
So I need to sign up [ACT] for a savings [SLT] account [ARG] → sign up savings account
Can you buy [ACT] me some movie tickets [ARG] → buy movie tickets (gold: buy movie tickets)
I am looking to book [ACT] movie tickets [ARG] → book movie tickets
I am looking to purchase [ACT] movie tickets [ARG] → purchase movie tickets
Delete [ACT] this song [ARG] from playlist [ARG] → delete song playlist (gold: remove from playlist music)
Erase [ACT] this track [ARG] from the playlist [ARG] → erase track playlist
Could you remove [ACT] this song [ARG] permanently → remove song

Table 3 shows some examples of IRL predictions from utterances and the corresponding pseudo intent names.

Table 4: Pre-training datasets for the PIE model.

Table 5: Statistics of the four IC datasets.

Table 6: Number of overlapping intent names between the pre-training data and downstream data.

Table 7: 5-way K-shot intent classification performance of pre-trained models with and without fine-tuning on the four test sets. Averaged accuracies and standard deviations across five class splits are reported. The 'L-' prefixes indicate the use of intent label names when creating prototypes, enabling zero-shot evaluation. Highest scores are boldfaced.