A Label-Aware BERT Attention Network for Zero-Shot Multi-Intent Detection in Spoken Language Understanding

With the early success of query-answer assistants such as Alexa and Siri, research on expanding system capabilities for service automation is now abundant. However, early systems quickly revealed the inadequacy of relying on simple classification techniques to accomplish the automation task effectively. The main challenge is that dialogue often involves complex user intents (or purposes) that are multi-pronged, subject to spontaneous change, and difficult to track. Furthermore, public datasets have not considered these complications, and general semantic annotations are lacking, which may result in a zero-shot problem. Motivated by the above, we propose a Label-Aware BERT Attention Network (LABAN) for zero-shot multi-intent detection. We first encode input utterances with BERT and construct a label embedding space by considering the semantics embedded in intent labels. An input utterance is then classified based on its projection weights on each intent embedding in this space. We show that the approach extends to the few/zero-shot setting, where some intent labels are unseen in the training data, by also taking into account the semantics of these unseen intent labels. Experimental results show that our approach is capable of detecting many unseen intent labels correctly. It also achieves state-of-the-art performance on five multi-intent datasets in the normal setting.


Introduction
In spoken language understanding (SLU) for task-oriented dialog systems, each utterance is often interpreted as a kind of action performed by the speaker, which we call a speech or dialog act (Abbeduto, 1983). These acts may commit speakers to some course of action, like asking or acknowledging, along with a series of distinctive semantic notions involved in a task. Usually the system forms semantic frames by identifying intents and slots to express dialog acts. For instance, given the sample utterance "Are there any accidents on my route to work at 10?", the intent detection task first identifies the intents, i.e., 'Get Info Traffic' and 'Get Location Work', and the slot-filling task then predicts a slot such as (time: 10). In this case, an 'intent label' for an utterance is defined as a purpose or goal that clearly states the user's act.
Dominant SLU systems have adopted several techniques to predict single intents by treating the task as a multi-class classification problem (Goo et al., 2018; Qin et al., 2019). However, in real-world scenarios, many utterances have multiple intents (Li et al., 2018b; Rastogi et al., 2019), as in the above example. Multi-intent SLU often requires more sophisticated reasoning over given utterances to disambiguate the different intents. Gangadharaiah and Narayanaswamy (2019) first explored the joint multi-intent and slot-filling task by treating multiple intents as a single context vector, but this approach does not scale to a large number of intents. Qin et al. (2020) further proposed a state-of-the-art model that considers each intent-slot interaction via adaptive graph attention. However, these approaches cannot successfully tackle more complex multi-intent scenarios in which sentences have no explicit conjunctions.
The second challenge in SLU intent detection is intent fluidity variation, by which we mean the degree of naturalness with which a dialogue progresses. Less stylized conversations usually contain a less bounded set of intents, which may change with the dialog context or state. Thus, the intents of some utterances may not be seen during training, and this problem worsens in the multi-intent scenario (Xia et al., 2020). In addition, there is no rigorous definition of an intent annotation format or of how many intents should be defined. Therefore, conventional models trained on one dataset with a fixed set of intent labels may fail to detect a new in-domain intent. We refer to this as the zero-shot problem. Larson et al. (2019) suggest a two-stage process that first classifies whether a query is in-scope and then assigns intents. However, it cannot scale easily to unseen intents in the multi-intent scenario.
To tackle the above two challenges, we found that leveraging the semantics embedded in intent labels is useful. In conventional intent classification, systems usually classify an utterance to a label that is represented by an indexed ID such as 0 (i.e., one-hot encoding). However, representing intents with indexed IDs ignores the semantics embedded in the labels. For instance, we can use the words 'get' and 'direction' in the intent label 'get direction' to help identify semantically equivalent words in an utterance, e.g., "I 'want direction' to SF". For a given set of intent labels within one domain, we can compare the semantic similarity between words in an utterance and words in these intents. Similarly, in the zero-shot setting, even if some intents are not visible during training, we can still compare the word semantics of these intents with a new utterance.
In this paper, we propose a new framework, the Label-Aware BERT Attention Network (LABAN), shown in Figure 1. We first introduce BERT to capture multiple intents when utterances do not have explicit conjunctions. Then, instead of treating intent labels only as indexed IDs, we use the words in each intent label in the training data to construct a label embedding space. After separately encoding an utterance and all intents in a given training set into embeddings, a label-aware layer generates scores of how likely the utterance is to belong to each intent. To accommodate the zero-shot case, we can additionally introduce the embeddings of unseen intents to jointly construct the embedding space. In contrast to prior works, which can predict only seen intents, our model removes this constraint by considering the semantics of intent labels to deal with new unseen labels. The code and resources are released at https://github.com/waynewu6250/LABAN. The paper makes the following contributions:
1. We extend the use of BERT to the multi-intent SLU scenario with a simple but powerful label-aware approach.
2. We demonstrate LABAN's effectiveness in dealing with unseen multiple intents, and show that it quickly adapts to the intent detection task when trained with a small amount of data for unseen intents.
3. We compare LABAN's performance on five extended and complex multi-intent datasets and show significant improvement over previous methods and baselines by considering the contextualized information from BERT and label semantics.

Related Work
Multi-intent Detection. Intent detection mainly aims to classify a given user utterance into its intents. Different approaches such as convolutional-LSTM and capsule networks have been proposed to solve the problem (Qian, 2017; Liu et al., 2017; Xia et al., 2018). Considering that intents are highly associated with slot filling, many joint models (Goo et al., 2018; Li et al., 2018a; Qin et al., 2019; E et al., 2019; Liu et al., 2019b) utilize intent information such as gradients or cross-impact networks to further reinforce the slot-filling prediction. However, these methods do not consider cases with multiple intents. Rychalska et al. (2018) first adopted hierarchical structures to identify multiple user intents. Gangadharaiah and Narayanaswamy (2019) and Qin et al. (2020) further exploited interactive relations between intents and slots. Wu et al. (2021) leveraged the dialog context to better handle the joint tasks. Our model follows the paradigm of these models and focuses on more complex cases: 1) multiple intents no longer appear in separate parts of the sentence, a case where introducing BERT can be beneficial, and 2) some testing intents are not available during training.
Zero-shot Learning. Zero-shot learning (ZSL) aims to recognize objects whose instances may not be seen during training (Lampert, 2014). Early works usually focused on computer vision (Lampert, 2014; Al-Halah et al., 2016; Norouzi et al., 2014). They adopted a two-stage approach that first identified an object's attributes and then estimated class posteriors based on similarity, which often suffered from domain shift between the intermediate and target tasks. Recent advances in ZSL directly learn a mapping between feature and semantic spaces (Palatucci et al., 2009; Akata et al., 2016; Frome et al., 2013) or build a common intermediate space (Zhang and Saligrama, 2015; Xian et al., 2017). Similar treatment can be applied to natural language. Chen et al. (2016) proposed CDSSM to consider the cosine similarity of deep semantics from utterances and intents. Xia et al. (2018) and Liu et al. (2019a) [...] et al. (2021) proposed disentangled intent representations for multi-task training. We follow these works and extend them to multi-intent detection with intent semantics and pretrained models.

Figure 1: (a) During the training phase, two BERT encoders encode both the utterance and all seen intent labels. The utterance embedding is then projected onto a constructed semantic embedding space T, with the projection weights used as scores. (b) During the testing phase, new unseen intents are also encoded and participate in constructing T to generate scores for a new utterance.

Problem Formulation
In this section, we formally state the multi-intent detection problem in the normal and zero-shot case.
Multi-Intent Detection. We are given a labeled training dataset in which each sample has the form $(x, y)$, where $x$ is an utterance and $y = (y_1, \ldots, y_i, \ldots, y_K) \in \{0, 1\}^K$ is a vector of binary intent labels. Each $y_i$ corresponds to one intent in a set $Y_s$ of $K$ seen intents. We aim to classify an utterance $x_{seen}$ into the seen intent classes $Y_s$.
Zero-shot Multi-Intent Detection. Given a labeled training dataset $(x, y)$ where $y \in Y_s$, at test time we aim to classify an utterance $x_{unseen}$ with its correct intent categories $y_{unseen} = (y_1, \ldots, y_i, \ldots, y_{K+L}) \in \{0, 1\}^{K+L}$ from the seen and unseen intent classes $Y = Y_s \cup Y_u$. Here $Y_u$ is a set of $L$ unseen intents that is given along with $Y_s$ as the domain ontology during testing but is not visible in training.

Utterance encoder
BERT is a multi-layer transformer-based encoder containing multi-head self-attention layers (Devlin et al., 2019). Models fine-tuned from BERT have achieved benchmark results on many natural language tasks (Sun et al., 2020). Therefore, we first adopt one BERT, $\text{BERT}_u$, to encode an input utterance $x = (w_1, \ldots, w_{T_u})$, which is padded to a maximum sequence length $T_u$:
$$h^u = \text{BERT}_u(x),$$
where $h^u \in \mathbb{R}^{T_u \times H}$ is the token-level representation of $x$ and $H$ is the hidden size of BERT. We then consider two methods to further encode it into a sentence embedding $r^u \in \mathbb{R}^H$. First, we can take the hidden state $h^u_1$ at the first time step, corresponding to [CLS], as $r^u = h^u_1$ (BERT-finetune). Alternatively, to better account for the importance of individual words to the overall sentence embedding, we follow Lin et al. (2017) and use a self-attentive network, in which each attention head $h$ yields a sentence representation $r^u_h$. Finally, we concatenate all heads to form the final representation $r^u$.
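The self-attentive pooling step can be illustrated with a short NumPy sketch. It uses the general form of Lin et al. (2017), a two-layer scoring network followed by a softmax over tokens, and is not taken from the released implementation. The weight names `W1` and `W2`, the attention size `d_a`, and the choice to split the hidden dimension evenly across heads (so that the concatenation has size H) are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pool(h_u, mask, W1, W2):
    """Pool token states into one sentence embedding.

    h_u:  (T, H) token-level states from the encoder.
    mask: (T,) 1 for real tokens, 0 for padding.
    W1:   (n_heads, d_a, H_head) hypothetical first-layer weights.
    W2:   (n_heads, d_a) hypothetical second-layer weights.
    Returns r_u of shape (H,), the concatenation of all heads.
    """
    n_heads, d_a, H_head = W1.shape
    T, H = h_u.shape
    assert H == n_heads * H_head
    heads = h_u.reshape(T, n_heads, H_head)
    pooled = []
    for k in range(n_heads):
        h_k = heads[:, k, :]                        # (T, H_head)
        scores = np.tanh(h_k @ W1[k].T) @ W2[k]     # (T,) one score per token
        scores = np.where(mask > 0, scores, -1e9)   # ignore padded positions
        a = softmax(scores)                         # attention weights over tokens
        pooled.append(a @ h_k)                      # weighted sum, (H_head,)
    return np.concatenate(pooled)
```

Each head output is a convex combination of the real token states, so padded positions do not influence the sentence embedding.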

Adaptive label-aware attentive layer
Inspired by few-shot learning works (Snell et al., 2017; Reimers and Gurevych, 2019), instead of classifying an utterance into a predefined set of intents, we leverage the linear approximation idea (del Pino and Galaz, 1995) to determine the intents of an utterance. The linear approximation problem is stated as follows: let $S$ be a Hilbert space and $T$ be a subspace of $S$; given a vector $z \in S$, we would like to find the point $\hat{z} \in T$ closest to $z$. It turns out that the solution $\hat{z} = \sum_{k=1}^{N} \beta_k v_k$ is a linear combination of a basis $v_1, \ldots, v_N$ of $T$.
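For reference, the coefficients of the closest point follow from the standard orthogonality condition of least-squares projection. The derivation below is the textbook argument for a real Hilbert space, written in the notation of this section; the symbols $G$, $b$, and $\beta$ denote the Gram matrix, the vector of inner products, and the coefficient vector.

```latex
\begin{aligned}
\hat{z} &= \arg\min_{u \in T} \lVert z - u \rVert,
  \qquad \hat{z} = \sum_{k=1}^{N} \beta_k v_k, \\
\langle z - \hat{z},\, v_j \rangle &= 0 \quad (j = 1, \ldots, N)
  \;\Longrightarrow\;
  \sum_{k=1}^{N} \beta_k \langle v_k, v_j \rangle = \langle z, v_j \rangle, \\
G \beta &= b, \qquad
  G_{jk} = \langle v_j, v_k \rangle, \quad
  b_j = \langle z, v_j \rangle, \qquad
  \beta = G^{-1} b .
\end{aligned}
```

The last step requires $G$ to be invertible, which holds exactly when $v_1, \ldots, v_N$ are linearly independent.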
To transfer the above idea to the multi-intent detection setting, we first construct an intent embedding subspace $T$ with a basis $\{r^l_1, \ldots, r^l_K\}$ given a set of $K$ intents $Y_s$. To obtain $\{r^l_1, \ldots, r^l_K\}$, we adopt another BERT, $\text{BERT}_l$, to encode the $K$ intents. Namely, every intent $y_i$ in a given set $Y_s$, which can be expressed as a word sequence $(w_1, \ldots, w_{T_l})$, is encoded by $\text{BERT}_l$ with the self-attentive layer mentioned in section 4.1 into an intent embedding $r^l_i$. The reason for using a BERT different from $\text{BERT}_u$ is that intents often have very different syntactic structures from utterances (e.g., no subjects).
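With this intent encoding, we obtain $K$ intent embeddings as the basis $\{r^l_1, \ldots, r^l_K\}$ of the intent embedding space $T$. Then, as shown in Figure 1, we can project an utterance embedding $r^u$ onto $T$ to obtain its linear approximation
$$\hat{r}^u = \sum_{i=1}^{K} w_i r^l_i, \qquad w = \sqrt{H}\, G^{-1} b,$$
where the Gram matrix $G$ and the vector $b$ are
$$G = \begin{bmatrix} \langle r^l_1, r^l_1 \rangle & \cdots & \langle r^l_1, r^l_K \rangle \\ \vdots & \ddots & \vdots \\ \langle r^l_K, r^l_1 \rangle & \cdots & \langle r^l_K, r^l_K \rangle \end{bmatrix}, \qquad b = \begin{bmatrix} \langle r^u, r^l_1 \rangle \\ \vdots \\ \langle r^u, r^l_K \rangle \end{bmatrix}.$$
Note that we assume $\{r^l_1, \ldots, r^l_K\}$ are linearly independent, since each vector represents the concept of an intent, which should not be a linear combination of other intent vectors. Hence, $G$ is guaranteed to be positive definite and to have an inverse. We further multiply by a scaling factor $\sqrt{H}$ when computing $w$ for empirical reasons, since $G^{-1}$ tends to drive the overall product toward small values.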
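The projection step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released code: the function name `label_aware_scores` is hypothetical, the embeddings are random stand-ins for BERT outputs, and the linear system is solved with `np.linalg.solve` instead of forming $G^{-1}$ explicitly, which is a numerical choice made for the example.

```python
import numpy as np

def label_aware_scores(r_u, R_l):
    """Projection weights of an utterance embedding on intent embeddings.

    r_u: (H,) utterance embedding.
    R_l: (K, H) intent embeddings; rows are assumed linearly independent.
    Returns w of shape (K,), scaled by sqrt(H) as described in the text.
    """
    H = r_u.shape[0]
    G = R_l @ R_l.T                     # (K, K) Gram matrix of the intent basis
    b = R_l @ r_u                       # (K,) inner products with the utterance
    return np.sqrt(H) * np.linalg.solve(G, b)

# Toy usage: the utterance is exactly twice the second intent embedding.
rng = np.random.default_rng(0)
R_l = rng.normal(size=(3, 8))           # stand-in intent embeddings (K=3, H=8)
r_u = 2.0 * R_l[1]
w = label_aware_scores(r_u, R_l)
print(np.round(w / np.sqrt(8), 6))      # approximately [0, 2, 0]
```

Because the toy utterance lies inside the intent subspace, the unscaled weights recover its coefficients; for a general utterance they give the least-squares fit.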
After obtaining $w$, these projection weights can be viewed as scores of how likely an utterance $x$ is to belong to each intent $y_i$. Following Qin et al. (2020), we treat the problem as a multi-label classification task and generate the probabilities $\hat{y} = \sigma(w)$ by passing $w$ through a sigmoid function $\sigma$. The intent detection objective is then a binary cross-entropy loss, where $N$ is the number of samples:
$$\mathcal{L} = -\sum_{n=1}^{N} \sum_{i=1}^{K} \left[ y^{(n)}_i \log \hat{y}^{(n)}_i + \left(1 - y^{(n)}_i\right) \log\left(1 - \hat{y}^{(n)}_i\right) \right].$$
During testing, after obtaining $\hat{y} \in \mathbb{R}^K$ as the probabilities that the utterance belongs to each intent, we set a threshold $t$ with $0 < t < 1.0$ as a hyperparameter to select the final predicted intents. For instance, if $\hat{y} = \{0.3, 0.6, 0.9, 0.1, 0.4\}$ and $t = 0.5$, the predicted intents are $\{2, 3\}$.
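The scoring, loss, and thresholding steps can be sketched as below. This is an illustrative example only: the function names are hypothetical, the loss is summed over samples and intents (a mean would differ only by a constant factor), and treating a probability strictly greater than `t` as a positive prediction is an assumption, since the text does not say how ties are handled.

```python
import numpy as np

def sigmoid(w):
    # Map projection weights to per-intent probabilities.
    return 1.0 / (1.0 + np.exp(-w))

def bce_loss(y_hat, y, eps=1e-12):
    # Binary cross entropy summed over all samples and intents.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def predict_intents(y_hat, t=0.5):
    # Return 1-based indices of intents whose probability exceeds t.
    return [i + 1 for i, p in enumerate(y_hat) if p > t]

# The example from the text.
print(predict_intents([0.3, 0.6, 0.9, 0.1, 0.4], t=0.5))  # [2, 3]
```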

Zero-shot setting
For normal multi-intent detection, after training on a given set $Y_s$ of $K$ seen intents, we can use the method in section 4.2 to calculate the scores of a new utterance $x_{seen}$ with respect to each intent. The method extends easily to the zero-shot setting. First, we train $\text{BERT}_u$ and $\text{BERT}_l$ with the training data of a given set $Y_s$ of $K$ seen intents. Then, during testing, given a new set $Y_u$ of $L$ unseen intents, we also encode these intents into intent embeddings $\{r^l_{K+1}, \ldots, r^l_{K+L}\}$ with the trained $\text{BERT}_l$. Finally, together with the seen intent set $Y_s$, we construct an extended intent subspace $T$ with the basis $\{r^l_1, \ldots, r^l_K, r^l_{K+1}, \ldots, r^l_{K+L}\}$ and similarly generate scores for all seen and unseen intents given a new utterance $x_{unseen}$.
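The zero-shot extension amounts to enlarging the basis at test time without retraining, which a short NumPy sketch makes concrete. The function name `zero_shot_scores` and the random embeddings are assumptions made for the example; in practice the rows would come from the trained label encoder.

```python
import numpy as np

def zero_shot_scores(r_u, R_seen, R_unseen):
    """Score an utterance against seen and unseen intents together.

    r_u:      (H,) utterance embedding.
    R_seen:   (K, H) embeddings of seen intents.
    R_unseen: (L, H) embeddings of unseen intents, encoded at test time.
    Returns (K+L,) scaled projection weights; the first K entries
    correspond to seen intents and the last L to unseen intents.
    """
    R = np.vstack([R_seen, R_unseen])   # extended basis with K+L rows
    G = R @ R.T                         # (K+L, K+L) Gram matrix
    b = R @ r_u
    return np.sqrt(r_u.shape[0]) * np.linalg.solve(G, b)

# Toy usage: an utterance mixing one seen and one unseen intent.
rng = np.random.default_rng(1)
R_seen = rng.normal(size=(3, 16))
R_unseen = rng.normal(size=(2, 16))
r_u = R_seen[0] + R_unseen[1]
w = zero_shot_scores(r_u, R_seen, R_unseen)
print(np.round(w / 4.0, 6))             # approximately [1, 0, 0, 0, 1]
```

One practical consequence is that the extended Gram matrix is invertible only if the K+L embeddings remain linearly independent, which in particular requires K+L to be at most H.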

Datasets
We use three widely used public multi-intent single-sentence datasets: MixATIS, MixSNIPS (Qin et al., 2020; Hemphill et al., 1990; Coucke et al., 2018) and the Facebook Semantic Parsing System (FSPS) dataset (Gupta et al., 2018), and two multi-intent dialogue datasets: the Microsoft dialogue challenge dataset (MDC) (Li et al., 2018b) and the Schema-Guided Dialogue dataset (SGD) (Rastogi et al., 2019). For MDC and SGD, we treat each utterance as an individual sample, with its multiple user and system acts as intents. We use all datasets for normal and zero-shot multi-intent detection and also include single-intent detection results on the ATIS (Hemphill et al., 1990) and SNIPS (Coucke et al., 2018) datasets. Detailed data statistics are shown in Table 1.
For the zero-shot task, we use the single-sentence datasets MixATIS, MixSNIPS and FSPS. We subsample each dataset 5 times with the same train/valid/test sizes and report the average results over the 5 random splits. In each split, we simulate the situation where the training data contain only a part of the intent labels and the test data contain all intent labels. For instance, MixATIS has 17 labels in total; we keep K < 17 intents seen in the training set, while the testing set has all 17 intents. In the experiments, we use 4 values of K for each of the three datasets. For the few-shot task, we move 5% or 10% of the testing data into the training data and evaluate on the remaining testing data. We also replace BERT with two variations, ALBERT and TOD-BERT, as the utterance encoder for additional baselines.
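The zero-shot split simulation can be sketched as follows. The text does not specify how training samples that contain a held-out intent are handled, so this example assumes they are dropped entirely; the function name `make_zero_shot_split` and the toy data are hypothetical.

```python
import random

def make_zero_shot_split(train, all_intents, k, seed=0):
    """Keep only k randomly chosen intents visible during training.

    train:       list of (utterance, list_of_intents) pairs.
    all_intents: iterable of every intent label in the dataset.
    k:           number of seen intents (k < total number of intents).
    Returns (seen, train_seen). The test set is left untouched, so it
    still covers all intents.
    """
    rng = random.Random(seed)
    seen = set(rng.sample(sorted(all_intents), k))
    # Assumption: drop any training sample that mentions an unseen intent.
    train_seen = [(x, y) for x, y in train if set(y) <= seen]
    return seen, train_seen

# Toy usage with three intents, two of which stay visible.
train = [("u1", ["a"]), ("u2", ["a", "b"]), ("u3", ["c"]), ("u4", ["b", "c"])]
seen, train_seen = make_zero_shot_split(train, {"a", "b", "c"}, k=2, seed=0)
```

Repeating the call with different seeds gives the several random splits over which results are averaged.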

Baselines
We compare the normal multi-intent detection results with three competitive baseline models:
1. Stack-Prop uses two stacked encoder-decoder structures for the joint intent detection and slot-filling tasks (Qin et al., 2019).
2. Joint MID-SF first considers the multi-intent detection task using BiLSTMs (Gangadharaiah and Narayanaswamy, 2019).
3. AGIF uses a graph interactive framework to consider fine-grained information (Qin et al., 2020).
We also compare zero-shot multi-intent detection results with seven competitive baselines:
1. BERT-finetune uses BERT as the encoder and increases the total output size of the final fully-connected layer on top of it (Devlin et al., 2019).
2. Zero-shot LSTM uses two LSTM encoders to encode utterances and intents, then obtains scores with a dot product (Kumar et al., 2017).
3. CDSSM uses a convolutional deep structured model to calculate cosine similarities between embeddings (Chen et al., 2016).
4. Zero-shot BERT uses BERT instead as the encoder for Zero-shot LSTM (Kumar et al., 2017).
5. CDSSM BERT uses BERT instead as the encoder for CDSSM (Chen et al., 2016).
6. ALBERT-LA uses ALBERT as the encoder along with our label-aware attentive layer (Lan et al., 2020).
7. TOD-BERT-LA uses TOD-BERT, a pretrained encoder for task-oriented dialogs, along with our label-aware attentive layer (Wu et al., 2020).

Experimental setting
We use the pretrained BERT with 12 hidden layers of 768 units and 12 self-attention heads. The model is trained for 50 epochs, and we save the checkpoint with the best performance on the validation set. For the zero/few-shot setting, we randomly pick a number of intents to be unseen in the training set, run experiments on 5 different splits, and report the average. We set the threshold t to 0.5 for multi-label classification. We follow the metrics used in Qin et al. (2020) for intent accuracy and F1 score.
Table 2 shows the normal multi-intent detection results on all five datasets. LABAN outperforms the baselines substantially in multi-intent detection, especially on MixATIS and FSPS. This supports the usefulness of fine-tuning BERT to capture more precise contextualized information for the downstream task. LABAN also considers the semantics of intent labels, and the improvement grows as the number of intents increases, e.g., a larger increase on MixATIS with 17 intents than on MixSNIPS with only 7 intents. For datasets whose sentences do not have explicit conjunction words, such as FSPS, MDC and SGD, our model shows a large increase in accuracy. Beyond multi-intent detection, Table 4 shows that LABAN also outperforms other baselines when dealing with just one intent.

Zero-shot Multi-intent detection
To further justify our model's main contribution in zero-shot cases, we compare LABAN with several competitive baselines. As shown in the results table, BERT-finetune, which simply enlarges the output layer for unseen intents, is not capable of predicting any unseen intent, yielding F1-u scores of 0.00. Non-BERT approaches such as Zero-shot LSTM and CDSSM, which use a dot product or cosine similarity, show improved but limited ability to predict unseen intents. By leveraging the power of pretraining, Zero-shot BERT better associates unseen and seen intents and achieves a higher F1 score, while the performance of CDSSM BERT, with its more complex structure, degrades due to overfitting. Finally, on all datasets (FSPS, MixATIS, MixSNIPS), the three models with our label-aware attentive layer and strong pretrained encoders (ALBERT-LA, TOD-BERT-LA, LABAN) outperform the baselines in predicting unseen labels by associating them with the input sequences, even though these intents are never seen in the training phase.
We also observe that ALBERT performs relatively poorly among the BERT-based models, which possibly results from its being a light version of BERT with a pretraining objective different from that of the conversation-oriented TOD-BERT. Note that the original BERT model has a slightly better F1 score for seen intents. This is reasonable, since it searches over only the seen intents and thus avoids wrongly predicting unseen labels. However, without sacrificing much, models with the label-aware attentive layer significantly boost the overall F1 scores on all three datasets.
We then comprehensively evaluate LABAN's performance in the zero/few-shot setting with different seen/unseen intent ratios in Figure 2. We make four main observations. (1) On average, LABAN predicts about half of the unseen intents correctly. (2) When the number of seen intents decreases, the F1 score drops for both seen and unseen intent labels, because the model has poorer knowledge of the seen intents. (3) For utterances with both seen and unseen intents, the F1 score for seen intents is lower than for utterances with only seen intents. The fewer seen intents the model is trained on, the more inclined it is to predict unseen intents for an utterance. (4) In the few-shot setting, when a small amount of data for unseen intents is used in training, both seen and unseen intent accuracy increase by a large margin, especially on MixSNIPS. This indicates that, despite scarce training data for some unseen labels, LABAN can exploit pretrained linguistic knowledge of label semantics to match the most relevant intents.

Table 5: Ablation analysis of different components in LABAN for normal multi-intent detection results on five datasets. We report accuracy (Acc) for all intents exact match and F1 scores based on individual intent calculation.
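The two reported metrics, exact-match accuracy over all intents of an utterance and F1 computed over individual intent decisions, can be sketched as follows. Interpreting the latter as micro-averaged F1 is an assumption, since the averaging scheme is not spelled out here; the function names and toy labels are hypothetical.

```python
def exact_match_accuracy(y_true, y_pred):
    # An utterance counts as correct only if its full intent set matches.
    hits = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def micro_f1(y_true, y_pred):
    # Pool true positives, false positives and false negatives over
    # every individual intent decision, then compute a single F1.
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        tp += len(t & p)
        fp += len(p - t)
        fn += len(t - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy usage: the first utterance misses one of its two intents.
y_true = [["a", "b"], ["c"]]
y_pred = [["a"], ["c"]]
acc = exact_match_accuracy(y_true, y_pred)   # 0.5
f1 = micro_f1(y_true, y_pred)                # precision 1, recall 2/3
```

The example shows why the two numbers can diverge: a partially correct prediction earns no exact-match credit but still contributes true positives to F1.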

Ablation Analysis
To better understand the effectiveness of LABAN's components in multi-intent detection, we conduct an ablation analysis by reporting two baseline variations of our model: BERT-finetune and BERT-attn. BERT-finetune uses the hidden state of the [CLS] token from BERT without the extra label-aware layer; BERT-attn adds a self-attentive layer to encode the sentence embeddings, also without the label-aware layer. Finally, LABAN refers to our full model: BERT with the self-attentive layer and the adaptive label-aware attentive layer.
In the experimental results shown in Table 5, we first observe that BERT with the additional self-attentive layer performs better on all five datasets, especially on MixATIS and FSPS. When the total number of intents increases, the self-attentive layer helps in understanding the importance of each word to the overall intent prediction. After introducing the label-aware layer, we see a further increase, especially on FSPS, which contains the largest number of intents (24). The layer helps LABAN better match the utterance with different intent semantics, particularly when the intent options are more complicated. Although the increase seems subtle when labeled data are abundant, the layer greatly helps in tackling unseen labels without sacrificing much performance in normal cases.

Error Analysis
We present a few error cases in the accompanying table. First, we found that some words in the utterances may mislead LABAN's prediction. For instance, in case MA1, LABAN predicts 'atis quantity' based on the keyword 'how many' when comparing the sentence and label semantics. In case MS1, the keyword 'play' also induces the model to predict the intent 'play music', whereas the utterance actually means to search for and play an album list. In this sense, 'creative work' may be less relevant to 'album' in our model's sentence-label pairing.
For FSPS, we found that most errors occur when the true labels are 'unsupported navigation', 'unsupported event' or 'unsupported', as in case FS1. Without an external ontology, it may be hard for the model to identify unsupported (out-of-scope) events. Therefore, in most cases, the model simply identifies 'get info traffic' and 'get location' as the closest intents. In case FS2, the model fails to predict 'get location' correctly. Without context, it may be hard for the model to associate 'ahead' with 'get location'.
Next, we show the errors in the zero-shot setting. Here, the model sees only 12/17 intents in MixATIS, 3/7 intents in MixSNIPS and 14/24 intents in FSPS during training. We found two distinctive phenomena. (1) The model tends to predict more labels when it is uncertain about unseen intents, as in case MA2, resulting in lower precision. (2) The model predicts seen intents well regardless of the presence of unseen intents in the same sentence. For unseen intent errors, the model tends to assign them to other unseen classes rather than to seen classes, which indicates that the model has a basic knowledge of what the seen intents should be. The explicit semantic pairing mechanism may be one of the reasons, and it shows an ability to separate known and unknown classes confidently.
In case MA2, 'atis cheapest' and 'atis airfare' are not seen in the training phase. However, the model is still capable of predicting 'atis airfare' accurately. Moreover, the keyword 'lowest' is matched with the predicted label 'atis cheapest', benefiting from our label-aware attentive layer. In case MS2, all predicted and true labels are unseen during training. The model still correctly predicts 'rate book' based on the keyword 'stars'. However, the model predicts 'search screening event' or 'search creative work' instead of 'play music', which happens frequently in other predictions as well. In FSPS cases such as FS3, the model tends to predict many unseen intents without matching any of the true intents. In case FS3, the model has only seen the intent 'get estimated arrival' during training, which makes it erroneously assign the sentence to 'arrival' rather than 'departure'. This effect could possibly be alleviated by introducing external knowledge embeddings that relate the keyword 'leave' to 'departure', an association that humans usually make.

Visualization
To better understand the classification results of LABAN, we perform t-SNE visualization (van der Maaten and Hinton, 2008) on the projected embeddings $\hat{r}^u = \sum_{i=1}^{K} w_i r^l_i$ of each utterance onto the intent subspace $T$, as shown in Figure 3. We also plot each intent embedding $r^l_i$ with its intent number. We observe that numerous clusters are formed with close semantic distances, and most intent embeddings, such as ids 0, 6, 9 and 12, are close to their respective clusters. This indicates that LABAN constructs an intent embedding space that reflects the semantic relations among intents and helps with the classification of a projected utterance embedding. Note that since some utterances have more than one intent, we randomly pick one of the intents of these utterances for visualization to simplify the graph. Therefore, some clusters, such as id 8, actually have two dominant sub-clusters. Some utterances in the right sub-cluster have other intents such as ids 3, 4, 12 and 17. Hence, they may be semantically close to these intent embeddings (3, 4, 12, 17) on the graph.

Conclusion
In this paper, we extend fine-tuned BERT and label-aware semantic interactions to the multi-intent detection task in SLU. The approach provides a solution to the zero/few-shot setting, where new utterances contain unseen labels. By considering label semantics, we can generate scores of how likely new utterances are to belong to these unseen intents. We compare the performance of our approach with previous methods and obtain significant improvements over the baselines. The results suggest that constructing a label semantic space helps the model better distinguish seen and unseen intents in utterances. They also provide guidance for future work on improving zero-shot multi-intent detection in SLU by considering dialogue contexts and external knowledge learning, or on the more challenging task of out-of-domain (OOD) detection, where unseen intents are not available.

Ethical Consideration and Impact
This work aims to remove the limitation of the intent granularity defined in task-oriented dialogue training datasets, which is often ill-posed for modeling precise and multiple intents in many previous works (Qin et al., 2019; Goo et al., 2018).
Multi-intent detection can be applied to a wide range of applications in many industries where the scenario requires a broader understanding of user requests. For example, customer service automation often requires clear intent identification at each utterance to support a flexible answer policy, whereas identifying only single intents may increase redundant and ambiguous dialogue turns. Second, zero-shot learning has long been studied to remove the need of deep learning models for large amounts of data. It can be applied to multiple domains where intent labels are significantly lacking and labeling is time-consuming. By transferring knowledge from existing labels, the model should be more robust in dealing with unseen labels, much as humans approach new things, which is very beneficial in dialogue system design, where much of the data is unlabeled. From an ethical perspective, the naturalness of the dialog structure heavily defines the scope of intent detection and usually changes during dialog state transitions. Capturing adequate intents from the user is critical in SLU and in subsequent tasks such as dialog state tracking. Wrong interpretation of intents may offend users and cause unsatisfactory answers. We should also avoid predicting sensitive labels related to user privacy. For this reason, we test our model only on publicly released datasets, which are widely regarded as unbiased across multiple domains and do not reveal specific user information.
Overall, we see great opportunities for research applying LABAN to investigate interactions between utterances and their latent intents. It gives good intuition for how the model understands the underlying human acts and improves transparency in decision-critical applications. To mitigate the risks associated with our model, we aim to anonymize sensitive user information in training data and focus on extracting domain-agnostic knowledge for better generalization and interpretability.