Template-based Approach to Zero-shot Intent Recognition

The recent advances in transfer learning techniques and pre-training of large contextualized encoders foster innovation in real-life applications, including dialog assistants. Practical needs of intent recognition require effective data usage and the ability to constantly update supported intents, adopting new ones and abandoning outdated ones. In particular, the generalized zero-shot paradigm, in which the model is trained on seen intents and tested on both seen and unseen intents, is taking on new importance. In this paper, we explore the generalized zero-shot setup for intent recognition. Following best practices for zero-shot text classification, we treat the task with a sentence pair modeling approach. We outperform the previous state-of-the-art F1 measure by up to 16% for unseen intents, using intent labels and user utterances and without accessing external sources (such as knowledge bases). Further enhancement includes lexicalization of intent labels, which improves performance by up to 7%. By transferring from other sentence pair tasks, such as Natural Language Inference, we gain additional improvements.


Introduction
User intent recognition is one of the key components of dialog assistants. With the advent of deep learning models, deep classifiers have been used throughout to recognize user intents. A common setup for the task (Chen et al., 2019; Wu et al., 2020; Casanueva et al., 2020) involves an omnipresent pre-trained language model (Devlin et al., 2018; Liu et al., 2019b; Sanh et al., 2019), equipped with a classification head, trained to predict intents. However, if the dialog assistant is extended with new skills or applications, new intents may appear. In this case, the intent recognition model needs to be re-trained. In turn, re-training the model requires annotated data, the scope of which is inherently limited. Hence, handling unseen events defies the common setup and poses new challenges. To this end, the generalized zero-shot (GZS) learning scenario (Xian et al., 2018), in which the model is presented at the training phase with seen intents and at the inference phase with both seen and unseen intents, becomes more compelling and relevant for real-life setups. The main challenge lies in developing a model capable of processing seen and unseen intents at comparable performance levels.
Recent frameworks for GZS intent recognition are designed as complex multi-stage pipelines, which involve detecting unseen intents (Yan et al., 2020), learning intent prototypes (Si et al., 2021), and leveraging commonsense knowledge graphs (Siddique et al., 2021). Such architectural choices may appear untrustworthy: using learnable unseen-intent detectors leads to cascading failures, and relying on external knowledge makes a framework hardly adjustable to low-resource domains and languages. Finally, interactions between the framework's different components may not be transparent, so it becomes difficult to trace back a prediction and guarantee the interpretability of results.
At the same time, recent works in general-domain GZS classification are centered on the newly established approach of Yin et al. (2019), who formulate the task as a textual entailment problem. The class description is treated as the hypothesis and the text as the premise. GZS classification becomes a binary problem: to predict whether the premise entails the hypothesis or not. Entailment-based approaches have been successfully used for information extraction (Haneczok et al., 2021; Lyu et al.; Sainz and Rigau, 2021) and for dataless classification (Ma et al., 2021). However, to the best of our knowledge, the entailment-based setup has not been properly explored for GZS intent recognition.
This paper aims to fill this gap and extensively evaluate entailment-based approaches for GZS intent recognition. Given a meaningful intent label, such as reset_settings, and an input utterance, such as I want my original settings back, the classifier is trained to predict whether the utterance should be assigned the presented intent or not. To this end, we make use of pre-trained language models, which encode a two-fold input (intent label and utterance) simultaneously and fuse it at intermediate layers with the help of the attention mechanism.
We adopt three dialog datasets for GZS intent recognition and show that sentence pair modeling outperforms competing approaches and establishes new state-of-the-art results. Next, we implement multiple techniques, yielding an even higher increase in performance. Noticing that in all datasets considered most intent labels are either noun or verb phrases, we implement a small set of lexicalizing templates that turn intent labels into plausible sentences. For example, the intent label reset_settings is re-written as The user wants to reset settings. Such lexicalized intent labels appear less surprising to the language model than intact intent labels. Hence, lexicalization of intent labels helps the language model learn correlations between inputs efficiently. Other improvements are based on standard engineering techniques, such as hard example mining and task transferring.
Last but not least, we explore two setups in which even less data is provided, by restricting access to various parts of the annotated data. First, if absolutely no data is available, we explore strategies for transferring from models pre-trained with natural language inference data. Second, in the dataless setup, where only seen intent labels are granted and there are no annotated utterances, we seek to generate synthetic data from them by using off-the-shelf models for paraphrasing. We show that the sentence pair modeling approach to GZS intent recognition delivers adequate results even when trained with synthetic utterances, but fails to transfer from other datasets.
The key contributions of the paper are as follows: 1. we discover that the sentence pair modeling approach to GZS intent recognition establishes new state-of-the-art results; 2. we show that lexicalization of intent labels yields further significant improvements; 3. we use task transferring, train in a dataless regime, and conduct error analysis to investigate the strengths and weaknesses of the sentence pair modeling approach.

Related Work
Our work is related to two lines of research: zero-shot learning with natural language descriptions and intent recognition. We focus on adapting existing ideas for zero-shot text classification to intent recognition.
Zero-shot learning has shown tremendous progress in NLP in recent years. The scope of tasks studied in the GZS setup ranges from text classification (Yin et al., 2019) to event extraction (Haneczok et al., 2021; Lyu et al.), named entity recognition (Li et al., 2020), and entity linking (Logeswaran et al., 2019). A number of datasets for benchmarking zero-shot methods have been developed. To name a few, Yin et al. (2019) create a benchmark for general-domain text classification. SGD (Rastogi et al., 2020) allows for zero-shot intent recognition.
Recent research has adopted a scope of novel approaches, utilizing natural language descriptions, aimed at the zero-shot setup. Text classification can be treated in the form of a textual entailment problem (Yin et al., 2019), in which the model learns to match features from the class description and the text, relying on early fusion between the inputs inside the attention mechanism. The model can be fine-tuned solely on the task's data or utilize pre-training with textual entailment and natural language inference (Sainz and Rigau, 2021). However, dataless classification with models pre-trained for textual entailment only appears problematic due to the models' high variance and instability (Ma et al., 2021). This justifies the rising need for learnable domain transferring (Yin et al., 2020) and self-training (Ye et al., 2020), aimed at leveraging unlabeled data and alleviating the domain shift between seen and unseen classes.
Intent recognition Supervised intent recognition requires training a classifier with a softmax layer on top. Off-the-shelf pre-trained language models or sentence encoders are used to embed an input utterance, which is fed further to the classifier (Casanueva et al., 2020). Augmentation techniques help to increase the amount of training data and improve performance (Xia et al., 2020). Practical needs require the classifier to support emerging intents. Re-training a traditional classifier may turn out resource-greedy and costly. This motivates work in (generalized) zero-shot intent recognition, i.e., handling seen and unseen intents simultaneously. Early approaches to GZS intent recognition adopted capsule networks to learn low-dimensional representations of intents. IntentCapsNet (Xia et al., 2018) is built upon three capsule modules, organized hierarchically: the lower module extracts semantic features from input utterances, while the two upper modules recognize seen and unseen intents independently from each other. ReCapsNet (Liu et al., 2019a) is built upon a transformation schema, which detects unseen events and makes predictions based on unseen intents' similarity to the seen ones. SEG (Yan et al., 2020) utilizes Gaussian mixture models to learn intent representations by maximising margins between them. One of the concurrent approaches, CTIR (Class-Transductive Intent Representations; Si et al., 2021), learns intent representations from intent labels to model inter-intent connections. CTIR is not a stand-alone solution but rather integrates existing models, such as BERT, CNN, or CapsNet. The framework expands the prediction space at the training stage to include unseen classes, with the unseen label names serving as pseudo-utterances. The current state-of-the-art performance belongs to RIDE (Siddique et al., 2021), an intent detection model that leverages common knowledge from ConceptNet. RIDE captures semantic relationships between utterances and intent labels by considering concepts in an utterance linked to those in an intent label.
Sentence Pair Modeling for Intent Recognition

Problem formulation
Let X be the set of utterances, S = {y_1, ..., y_k} be the set of seen intents, and U = {y_{k+1}, ..., y_n} be the set of unseen intents. The training data consists of annotated utterances {(x_i, y_j)}. At test time, the model is presented with a new utterance. In the GZS setup, the model chooses an intent from both seen and unseen intents: y_j ∈ S ∪ U.

Our approach
A contextualized encoder is trained to make a binary prediction: whether the utterance x_i is assigned the intent y_j or not. The model encodes the intent description and the utterance, concatenated by the separation token [SEP]. The representation of the [CLS] token is fed into a classification head, which makes the desired prediction P(1 | y_j, x_i). This approach follows the standard sentence pair (SP) modeling setup.
ID | Template | Example (book_hotel)
declarative templates
d1 | the user wants to ... | the user wants to book a hotel
d2 | tell the user how to ... | tell the user how to book a hotel
question templates
q1 | does the user want to ... | does the user want to book a hotel
q2 | how do I ... | how do I book a hotel

Given an intent y_j, the model is trained to make a positive prediction for an in-class utterance x_i^+ and a negative prediction for an out-of-class utterance x_i^-, sampled from another intent. At training time, the model is trained with seen intents only, y_j ∈ S.
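A minimal sketch of how such binary training pairs could be assembled; the data, intent names, and the `build_sp_examples` helper are illustrative, not from the paper:

```python
# Sketch: building sentence-pair (SP) training examples for the binary
# intent/utterance classification described above. Each triple would
# later be encoded as "intent [SEP] utterance" for the encoder.
import random

def build_sp_examples(data, intents, neg_per_pos=1, seed=0):
    """data: list of (utterance, intent) pairs over *seen* intents.
    Returns (intent, utterance, label) triples: label 1 for the true
    intent, 0 for a sampled negative intent."""
    rng = random.Random(seed)
    examples = []
    for utterance, intent in data:
        examples.append((intent, utterance, 1))            # positive pair
        negatives = [y for y in intents if y != intent]
        for neg in rng.sample(negatives, min(neg_per_pos, len(negatives))):
            examples.append((neg, utterance, 0))           # negative pair
    return examples

train = [("i want my original settings back", "reset_settings"),
         ("book me a room downtown", "book_hotel")]
seen = ["reset_settings", "book_hotel", "play_song"]
examples = build_sp_examples(train, seen)
```

In a full pipeline, these triples would be tokenized as sentence pairs and fed to the encoder's classification head.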
At test time, given an utterance x_i^test, we loop over all intents y_j ∈ S ∪ U and record the probability of the positive class. Finally, we assign to the utterance x_i^test the intent y* that yields the maximum probability of the positive class:

y* = argmax_{y_j ∈ S ∪ U} P(1 | y_j, x_i^test).

Contextualized encoders. We use RoBERTa-base (Liu et al., 2019b) as the main and default contextualized encoder in our experiments, as it shows superior performance to BERT (Devlin et al., 2018) in many downstream applications. RoBERTa's distilled version, DistilRoBERTa (Sanh et al., 2019), is used to evaluate lighter, less computationally expensive models. Also, we use a pre-trained task-oriented dialogue model, TOD-BERT (Wu et al., 2020), to evaluate whether domain models should be preferred.
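The test-time scoring loop described above can be sketched as follows; `score` is a stand-in for the encoder's P(1 | intent, utterance) (here a toy word-overlap scorer, purely for illustration):

```python
# Sketch of GZS inference: score every candidate intent (seen + unseen)
# for an utterance and pick the argmax.
def score(intent, utterance):
    # Toy stand-in for the encoder's positive-class probability.
    intent_words = set(intent.replace("_", " ").split())
    return len(intent_words & set(utterance.split())) / len(intent_words)

def predict_intent(utterance, seen, unseen):
    candidates = seen + unseen          # GZS: loop over all intents
    probs = {y: score(y, utterance) for y in candidates}
    return max(probs, key=probs.get)    # argmax over positive-class scores

pred = predict_intent("please reset my settings now",
                      seen=["book_hotel", "play_song"],
                      unseen=["reset_settings"])
```

Note that this per-utterance loop over all intents is exactly what makes the approach resource-greedy at inference time, as discussed in the conclusion.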
Negative sampling strategies include (i) sampling negative utterances for a fixed intent, denoted as (y_j, x_i^+), (y_j, x_i^-); and (ii) sampling negative intents for a fixed utterance, denoted as (y_j^+, x_i), (y_j^-, x_i). Both strategies support sampling with hard examples. In the first case (i), we treat an utterance x_i^- as a hard negative for intent y_j if there exists an in-class utterance x_i^+ such that the similarity between x_i^+ and x_i^- is higher than a predefined threshold. To compute semantic similarity, we make use of SentenceBERT (Reimers and Gurevych, 2019) cosine similarity. For a given positive in-class utterance, we select the top-100 most similar negative out-of-class utterances based on the values of cosine similarity. In the second case (ii), we use the same approach to sample hard negative intents y_j^-, given an utterance x_i assigned the positive intent y_j^+. Again, we compute semantic similarity between intent labels and sample an intent y_j^- with probability based on its similarity score with intent y_j^+. To justify the need to sample hard negative examples, we also experiment with random sampling, choosing (iii) negative utterances or (iv) negative intents at random.
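Strategy (i) can be sketched as follows; real embeddings would come from SentenceBERT, so the toy vectors and the `hard_negatives` helper here are illustrative only:

```python
# Sketch of hard-negative utterance mining: for an in-class utterance's
# embedding, rank out-of-class utterances by cosine similarity and keep
# the top-k as hard negatives.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hard_negatives(pos_vec, neg_utterances, embed, top_k=100):
    # Most similar out-of-class utterances are the hardest negatives.
    ranked = sorted(neg_utterances,
                    key=lambda utt: cosine(pos_vec, embed[utt]),
                    reverse=True)
    return ranked[:top_k]

embed = {"find a train to boston": [0.9, 0.1],   # toy 2-d "embeddings"
         "play some jazz":          [0.0, 1.0]}
hard = hard_negatives([1.0, 0.0], list(embed), embed, top_k=1)
```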
Lexicalization of intent labels utilizes simple grammar templates to convert intent labels into natural-sounding sentences. To this aim, we utilize two types of templates: (i) declarative templates ("the user wants to") and (ii) question templates ("does the user want to"). Most intent labels take the form of a verb phrase (VERB + NOUN+), such as book_hotel, or a noun phrase (NOUN+), such as flight_status. We develop a set of rules that parses an intent label, detects whether it is a verb phrase or a noun phrase,[1] and lexicalizes it with one of the templates using the following expression: template + VERB + a/an + NOUN+. If the intent label is recognized as a noun phrase, the VERB slot is filled with the auxiliary verb "get". This way, we obtain sentences such as the user wants to book a hotel and does the user want to get a flight status. The templates implemented are shown in Table 1.
Lexicalization templates were constructed from the most frequent utterance prefixes, computed for all datasets. This way, lexicalized intents sound natural and are close to the real utterances. We use declarative and question templates because the datasets consist of such utterance types. We experimented with a larger number of lexicalization templates, but as there was no significant difference in performance, we limited ourselves to two templates of each kind for the sake of brevity.
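The lexicalization rule can be sketched as follows. The paper uses an NLTK POS tagger to decide whether a label is a verb or noun phrase; here a small hard-coded verb list stands in for the tagger, so the `VERBS` set and the exact rule are illustrative rather than the paper's implementation:

```python
# Sketch of intent-label lexicalization with the d1 template
# ("the user wants to ..."): template + VERB + a/an + NOUN+.
VERBS = {"book", "reset", "find", "play", "get", "search", "reserve"}

def lexicalize(label, template="the user wants to"):
    tokens = label.lower().split("_")
    if tokens[0] not in VERBS:            # noun phrase: insert auxiliary "get"
        tokens = ["get"] + tokens
    verb, noun = tokens[0], " ".join(tokens[1:])
    article = "an" if noun[:1] in "aeiou" else "a"
    return f"{template} {verb} {article} {noun}"

s1 = lexicalize("book_hotel")        # verb phrase label
s2 = lexicalize("flight_status")     # noun phrase label
```

This reproduces the two examples from the text: the user wants to book a hotel and the user wants to get a flight status.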
Task transferring Task transferring from other tasks to GZS intent recognition allows us to estimate whether (i) pre-trained task-specific models can be used without any additional fine-tuning, reducing the need for annotated data, and (ii) pre-training on other tasks with further fine-tuning is beneficial for the final performance.

[1] We use a basic NLTK POS tagger to process intent labels.
There are multiple tasks and fine-tuned contextualized encoders which we may exploit for task transferring experiments. For the sake of time and resources, we did not fine-tune any models on our own, but rather adopted a few suitable models from the Hugging Face library, which were fine-tuned on the Multi-Genre Natural Language Inference (MultiNLI) dataset (Williams et al., 2018): BERT-NLI (textattack/bert-base-uncased-MNLI), BART-NLI (bart-large-mnli), RoBERTa-NLI (textattack/roberta-base-MNLI).

Dataless classification
We experiment with a dataless classification scenario, in which we train the models on synthetic data. To this end, we used three pre-trained paraphrasing models to paraphrase lexicalized intent labels. For example, the intent label get alarms is first lexicalized as tell the user how to get alarms and then paraphrased as What's the best way to get an alarm?. Next, we merge all sentences, paraphrased with different models, into a single training set. Finally, we train the GZS model with the lexicalized intent labels and their paraphrased versions without using any annotated utterances.
Datasets

SGD (Schema-Guided Dialog) (Rastogi et al., 2020) contains dialogues from 16 domains and 46 intents and provides an explicit train/dev/test split aimed at the GZS setup. Three domains are available only in the test set. This is the only dataset providing short intent descriptions, which we use instead of intent labels. To pre-process the SGD dataset, we keep utterances where users express an intent, selecting utterances in one of two cases: (i) the first utterance in the dialogue and (ii) an utterance that changes the dialogue state and expresses a new intent. We use pre-processed utterances from the original train/dev/test sets for the GZS setup directly, without any additional splitting.
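The pre-processing rule can be sketched as a simple filter over dialogue turns; the dict layout below is a simplification of the real SGD annotation format, and `intent_bearing_turns` is an illustrative helper, not the paper's code:

```python
# Sketch of the SGD pre-processing rule: keep a user turn if it is the
# first in the dialogue or if it expresses a new intent (a change of
# the dialogue-state intent).
def intent_bearing_turns(dialogue):
    kept, prev_intent = [], None
    for turn in dialogue:
        if turn["speaker"] != "USER":
            continue                       # only user turns express intents
        if prev_intent is None or turn["intent"] != prev_intent:
            kept.append(turn["utterance"])  # first turn or intent change
        prev_intent = turn["intent"]
    return kept

dialogue = [
    {"speaker": "USER", "intent": "find_restaurant",
     "utterance": "find me a place to eat"},
    {"speaker": "SYSTEM", "intent": None, "utterance": "what cuisine?"},
    {"speaker": "USER", "intent": "find_restaurant",
     "utterance": "italian please"},
    {"speaker": "USER", "intent": "reserve_restaurant",
     "utterance": "book a table for two"},
]
kept = intent_bearing_turns(dialogue)
```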
MultiWoZ 2.2 (Multi-domain Wizard of Oz) (Budzianowski et al., 2018) is treated the same way as SGD: we keep utterances that express an intent, obtaining 27.5K utterances spanning 11 intents from 7 different domains. We used 8 (out of 11) randomly selected intents as seen and hold out 30% of the utterances from seen intents for testing. All utterances implying unseen intents are used for testing. Test utterances for seen intents are sampled in a stratified way, based on their support in the original dataset.
CLINC (Larson et al., 2019) contains 23,700 utterances, of which 22,500 cover 150 in-scope intents, grouped into ten domains. We follow the standard practice of randomly selecting 3/4 of the in-scope intents as seen (112 out of 150) and 1/4 as unseen (38 out of 150). The random split was made the same way as for MultiWoZ.

Experiments
Baselines We use SEG, RIDE, and CTIR as baselines, as they show the current best results on the three chosen datasets. For the RIDE model, we use the base model with a Positive-Unlabeled classifier, as it gives a significant improvement on the SGD and MultiWoZ datasets. We use Zero-Shot DNN and CapsNets along with CTIR, since these two encoders perform best on unseen intents (Si et al., 2021).
Evaluation metrics commonly used for the task are accuracy (Acc) and F1. The F1 values are per-class averages weighted by their respective support. Following previous works, we report results on seen and unseen intents separately. Evaluation for the test set overall is presented in the Appendix.
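The evaluation protocol can be sketched as support-weighted per-class F1, averaged separately over seen and unseen intents; this is a pure-Python stand-in for sklearn's `f1_score(..., average="weighted", labels=subset)`, with illustrative toy labels:

```python
# Sketch: per-class F1, support-weighted and restricted to a subset of
# classes, so seen and unseen intents can be scored separately.
from collections import Counter

def weighted_f1(gold, pred, classes):
    support = Counter(g for g in gold if g in classes)
    total, score = sum(support.values()), 0.0
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1   # weight each class by its support
    return score

gold = ["a", "a", "b", "u"]   # "a", "b" seen; "u" unseen
pred = ["a", "b", "b", "u"]
f1_seen = weighted_f1(gold, pred, classes={"a", "b"})
f1_unseen = weighted_f1(gold, pred, classes={"u"})
```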
We report averaged results along with standard deviation for ten runs of each experiment.
Results of experiments are presented in Table 2 (see Appendix for standard deviation estimation).
Our approach, SP RoBERTa, when used with intent labels and utterances only, shows significant improvement over the state of the art on all three datasets, both on seen and unseen intents, by accuracy and F1 measures. The only exception is the unseen intents of CLINC, where our approach underperforms in terms of accuracy of unseen intent recognition compared to RIDE. At the same time, RIDE shows a lower recall score in this setup. Thus, our method is more stable and performs well even when the number of classes is high.
Similarly to other methods, our method recognizes seen intents better than unseen ones, reaching around 90% accuracy and F1 on the former. Next, with the help of lexicalized intent labels, our approach yields an even more significant improvement for all datasets. The gap between our approach and the baselines becomes wider, reaching 14% accuracy on SGD's unseen intents and getting closer to perfect detection on seen intents across all datasets. The difference between our base approach, SP RoBERTa, and its modification relying on intent lexicalization exceeds 7% on unseen intents for the SGD dataset and reaches 3% on MultiWoZ. Notably, SP RoBERTa does not overfit on seen intents and achieves a consistent increase both on unseen and seen intents compared to previous works.

Ablation study
We perform ablation studies for two parts of the SP RoBERTa approach and present the results for unseen intents in Table 3. In all ablation experiments we use the SP approach with intact intent labels to diminish the effect of lexicalization. First, we evaluate the choice of the contextualized encoder, which is at the core of our approach (see the top part of Table 3). We choose between BERT-base, RoBERTa-base, its distilled version DistilRoBERTa, and TOD-BERT. BERT-base provides poorer performance compared to RoBERTa-base, which may be attributed to the different pre-training setup. At the same time, TOD-BERT's scores are comparable with those of RoBERTa on two datasets, thus diminishing the importance of domain adaptation. A higher standard deviation on the MultiWoZ dataset makes these results less reliable. The performance of DistilRoBERTa is almost on par with its teacher, RoBERTa, indicating that our approach can be used with a less computationally expensive model almost without sacrificing quality.
Second, we experiment with the choice of negative sampling strategy (see the middle part of Table 3), in which we can sample either random or hard negative examples for both intents and utterances. The overall trend shows that sampling hard examples improves over random sampling (by up to 6% accuracy for the SGD dataset).

Choice of lexicalization templates Table 4 demonstrates the performance of SP RoBERTa with respect to the choice of lexicalization template. Regardless of which template is used, the results outperform SP RoBERTa with intact intent labels. The choice of lexicalization template only slightly affects performance: the gap between the best and the worst performing template across all datasets is about 2%. The only exception is q2, which drops the performance metrics for two datasets. In total, this indicates that our approach can use any of the lexicalization templates; which template exactly is chosen is not as important. What is more, there is no evidence that declarative templates should be preferred to questions or vice versa.

Further adjustments of intent lexicalization templates and their derivation from the datasets seem a promising part of future research. Other promising directions include using multiple lexicalized intent labels jointly to provide opportunities for off-the-shelf augmentation at test and train time.
Task transferring results are presented in the bottom part of Table 3. First, we experiment with zero-shot task transferring, using RoBERTa-NLI to make predictions only, without any additional fine-tuning on intent recognition datasets. This experiment leads to almost random results, except for the SGD dataset, where the model reaches about 30% correct predictions.
However, models pre-trained with MNLI and fine-tuned further for intent recognition gain a significant improvement of up to 7%. The improvement is even more notable in the performance of BART-NLI, which obtains the highest results, probably because of the model's size.
Dataless classification results are shown in Table 5. This experiment compares training on two datasets: (i) intent labels and original utterances, and (ii) intent labels and synthetic utterances, obtained by paraphrasing lexicalized intent labels. In the latter case, the only available data is the set of seen intent labels, used as input to SP RoBERTa and for further paraphrasing. Surprisingly, the performance declines only moderately: the metrics drop by up to 30% for seen intents and up to 10% for unseen intents. This indicates that (a) the model learns more from the original data due to its higher diversity, and (b) paraphrasing models can nevertheless re-create some of the correlations from which the model learns.
The series of experiments in transfer learning and dataless classification targets real-life scenarios in which different parts of annotated data are available. First, in zero-shot transfer learning, we do not access the training datasets at all (Table 2, Zero-shot RoBERTa NLI). Second, in the dataless setup, we access only seen intent labels, which we utilize both as class labels and as a source for creating synthetic utterances (Table 5). Third, our main experiments consider both seen intents and utterances available (Table 2, SP RoBERTa). In the second scenario, we were able to obtain scores reasonably close to those of the best-performing model. We believe that efficient use of intent labels in general, and for generating synthetic data in particular, is an important direction for future research.

Analysis
Error analysis shows that SP RoBERTa tends to confuse intents which (i) are assigned semantically similar labels or (ii) share a word.
For example, an unseen intent get_train_tickets gets confused with the seen intent find_trains.Similarly, pairs of seen intents play_media and play_song or find_home_by_area and search_house are hard to distinguish.
We checked whether errors in intent recognition are caused by utterances' surface or syntactic features. The following observations hold for the SGD dataset. Utterances that take the form of a question are more likely to be classified correctly: 93% of questions are assigned correct intent labels, while there is a drop for declarative utterances, of which 90% are recognized correctly. The model's performance is not affected by the frequency of the first words in the utterance. Of the 11,360 utterances in the test set, 4,962 start with 3-grams that occur more than 30 times. Of these utterances, 9% are misclassified, while of the remaining utterances, which start with rarer words, 10% are misclassified.
The top-3 most frequent 3-grams at the beginning of an utterance are I want to, I would like, I need to.
Stress test for NLI models (Naik et al., 2018) is a typology of the standard errors of sentence pair models, from which we picked several typical errors that can be easily checked without additional human annotation. We examine whether one of the following factors leads to an erroneous prediction: (i) word overlap between an intent label and an utterance; (ii) the length of an utterance; (iii) negation or double negation in an utterance; (iv) numbers, if used in an utterance. Additionally, we measured the semantic similarity between intent labels and user utterances using SentenceBERT cosine similarity to check whether it impacts performance. Table 6 displays the stress test results for one of the runs of SP RoBERTa, trained with the q1 template on the SGD dataset. This model shows reasonable performance, and its stress test results are similar to those of models trained with other templates. The results are averaged over the test set. An utterance is more likely to be correctly predicted if it shares at least one token with the intent label. However, the semantic similarity between intent labels and utterances matters less and is relatively low for both correct and incorrect predictions. Longer utterances, or utterances that contain digits, tend to be correctly classified more frequently. The latter may be attributed to the fact that numbers are important features for intents related to doing something on particular dates and with a particular number of people, such as search_house, reserve_restaurant, or book_appointment.
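The surface-level factors checked per utterance can be sketched as follows; the negation word list and the `stress_features` helper are illustrative choices, not the exact implementation:

```python
# Sketch of the stress-test factors computed per (intent, utterance)
# pair: token overlap with the intent label, utterance length,
# presence of negation, and presence of digits.
NEGATIONS = {"no", "not", "never", "don't", "won't"}

def stress_features(intent_label, utterance):
    label_tokens = set(intent_label.replace("_", " ").lower().split())
    utt_tokens = utterance.lower().split()
    return {
        "label_overlap": bool(label_tokens & set(utt_tokens)),
        "length": len(utt_tokens),
        "has_negation": bool(NEGATIONS & set(utt_tokens)),
        "has_digits": any(ch.isdigit() for ch in utterance),
    }

feats = stress_features("reserve_restaurant",
                        "reserve a table for 4 people")
```

Averaging each feature over correctly and incorrectly predicted utterances then yields a table of the kind shown in Table 6.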

Conclusion
Over the past years, there has been a trend of utilizing natural language descriptions for various tasks, ranging from dialog state tracking (Cao and Zhang, 2021) and named entity recognition (Li et al., 2020) to the most recent works in text classification employing Pattern-Exploiting Training (PET) (Schick and Schütze, 2020). Supervision expressed in natural language in most cases not only improves performance but also enables exploration of real-life setups, such as few-shot or (generalized) zero-shot learning. Such methods' success is commonly attributed to the efficiency of pre-trained contextualized encoders, which comprise enough prior knowledge to relate textual task descriptions with the text inputs to the model.
Task-oriented dialogue assistants require the resource-safe ability to support emerging intents without re-training the intent recognition head from scratch. This problem lies well within the generalized zero-shot paradigm. To address it, we present a simple yet efficient approach based on sentence pair modeling, suited to intent recognition datasets in which each intent is equipped with a meaningful intent label. We establish new state-of-the-art results using intent labels paired with user utterances as input to a contextualized encoder and conducting simple binary classification. Besides, to turn intent labels into plausible sentences, better accepted by pre-trained models, we utilized a simple set of lexicalization templates. This heuristic alone yields further improvement, increasing the gap to the previous best methods. Task transferring from other sentence pair modeling tasks leads to even better performance.
However, our approach has a few limitations: it becomes resource-greedy, as it requires looping over all intents for a given utterance. Next, the intent labels may not be available or may take the form of numerical indices. The first limitation might be overcome by adopting efficient ranking algorithms from the Information Retrieval area; abstractive summarization, applied to user utterances, might generate meaningful intent labels. These research questions open a few directions for future work.

Table 1 :
Lexicalization templates, applied to intent labels. Examples are provided for the intent label "book hotel".

Table 2 :
Comparison of different methods. SP stands for the Sentence Pair modeling approach. SP RoBERTa (ours) shows consistent improvements of F1 across all datasets for seen and unseen intents. The usage of lexicalized templates improves performance.

Table 3 :
Ablation study and task transferring: comparison on unseen intents. Top: comparison of different contextualized encoders; middle: comparison of negative sampling strategies, intent sampling (IS) and utterance sampling (US); bottom: task transferring from the MNLI dataset, using various fine-tuned models.

Table 4 :
Comparison of different lexicalization templates, improving the performance of SP RoBERTa. Metrics are reported on unseen intents only. Each row corresponds to experiments with a single lexicalization template, isolated from the others, i.e., the row "d1 templates" uses only the d1 form.

Table 5 :
Dataless classification. Metrics are reported on seen and unseen intents. Fine-tuning SP-RoBERTa on synthetic utterances (bottom) shows a moderate decline compared to training on real utterances (top).

Table 6 :
Stress test of SP RoBERTa predictions. An utterance is more likely to be correctly predicted if it shares at least one token with the intent label.