GL-GIN: Fast and Accurate Non-Autoregressive Model for Joint Multiple Intent Detection and Slot Filling

Multi-intent SLU can handle multiple intents in an utterance, which has attracted increasing attention. However, state-of-the-art joint models rely heavily on autoregressive approaches, which leads to two issues: slow inference speed and information leakage. In this paper, we explore a non-autoregressive model for joint multiple intent detection and slot filling that is both faster and more accurate. Specifically, we propose a Global-Locally Graph Interaction Network (GL-GIN), where a local slot-aware graph interaction layer models slot dependency to alleviate the uncoordinated-slots problem, while a global intent-slot graph interaction layer models the interaction between the multiple intents and all slots in the utterance. Experimental results on two public datasets show that our framework achieves state-of-the-art performance while being 11.5 times faster.


Introduction
Spoken Language Understanding (SLU) (Young et al., 2013) is a critical component in spoken dialog systems, which aims to understand user's queries. It typically includes two sub-tasks: intent detection and slot filling (Tur and De Mori, 2011).
Since intents and slots are closely tied, dominant single-intent SLU systems in the literature (Goo et al., 2018; Liu et al., 2019b; E et al., 2019; Qin et al., 2019; Teng et al., 2021; Qin et al., 2021b,c) adopt joint models to consider the correlation between the two tasks, and have obtained remarkable success.
Multi-intent SLU means that the system can handle an utterance containing multiple intents, which is more practical in real-world scenarios and has attracted increasing attention. To this end, Xu and Sarikaya (2013) and Kim et al. (2017) begin to explore multi-intent SLU. However, their models only consider multiple intent detection while ignoring the slot filling task. Recently, Gangadharaiah and Narayanaswamy (2019) make the first attempt at a multi-task framework to jointly model multiple intent detection and slot filling. Qin et al. (2020b) further propose an adaptive interaction framework (AGIF) to achieve fine-grained multi-intent information integration for slot filling, obtaining state-of-the-art performance. Despite this promising performance, the existing multi-intent SLU joint models rely heavily on an autoregressive fashion, as shown in Figure 1(a), leading to two issues:
• Slow inference speed. Autoregressive models must generate slot outputs in a left-to-right pass, which cannot be parallelized, leading to slow inference speed.
• Information leakage. Autoregressive models predict each word's slot conditioned only on the previously generated slots (from left to right), so bidirectional contextual information cannot be fully leveraged.
In this paper, we explore a non-autoregressive framework for joint multiple intent detection and slot filling, with the goal of accelerating inference while maintaining high accuracy, as shown in Figure 1(b). To this end, we propose a Global-Locally Graph-Interaction Network (GL-GIN), whose core modules are a proposed local slot-aware graph layer and a global intent-slot interaction layer, which generate the intent and slot sequences simultaneously and non-autoregressively. In GL-GIN, a local slot-aware graph interaction layer, in which the slot hidden states connect with each other, is proposed to explicitly model slot dependency, in order to alleviate the uncoordinated-slots problem (e.g., B-singer followed by I-song) (Wu et al., 2020) caused by the non-autoregressive fashion. A global intent-slot graph interaction layer is further introduced to perform sentence-level intent-slot interaction. Unlike prior works that only consider token-level intent-slot interaction, the global graph connects all tokens with the multiple intents, which allows the slot sequence to be generated in parallel and speeds up decoding.
Experimental results on two public datasets, MixSNIPS (Coucke et al., 2018) and MixATIS (Hemphill et al., 1990), show that our framework not only obtains state-of-the-art performance but also enables decoding in parallel. In addition, we explore a pre-trained model (i.e., Roberta (Liu et al., 2019c)) in our framework.
In summary, the contributions of this work are as follows: (1) To the best of our knowledge, we make the first attempt to explore a non-autoregressive approach for joint multiple intent detection and slot filling; (2) We propose a global-locally graph-interaction network, where the local graph handles the uncoordinated-slots problem while the global graph models sequence-level intent-slot interaction; (3) Experimental results on two benchmarks show that our framework not only achieves state-of-the-art performance but also considerably speeds up slot decoding (up to ×11.5); (4) Finally, we explore a pre-trained model in our framework, with which our model reaches a new state-of-the-art level.
For reproducibility, our code for this paper is publicly available at https://github.com/yizhen20133868/GL-GIN.

Problem Definition
Multiple Intent Detection Given an input sequence $x = (x_1, \dots, x_n)$, multiple intent detection can be defined as a multi-label classification task that outputs a sequence of intent labels $o^I = (o^I_1, \dots, o^I_m)$, where $m$ is the number of intents in the given utterance and $n$ is the length of the utterance.
Slot Filling Slot filling can be seen as a sequence labeling task that maps the input utterance $x$ into a slot output sequence $o^S = (o^S_1, \dots, o^S_n)$.

Approach
As shown in Figure 2(a), the proposed framework consists of a shared self-attentive encoder (§3.1), a token-level intent detection decoder (§3.2) and a global-locally graph-interaction decoder for slot filling (§3.3). Both intent detection and slot filling are optimized simultaneously via a joint learning scheme.

Self-attentive Encoder
Following Qin et al. (2019), we utilize a self-attentive encoder with a BiLSTM and a self-attention mechanism to obtain the shared utterance representation, which incorporates temporal features from word order as well as contextual information.
BiLSTM The bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) has been successfully applied to sequence labeling tasks (Li et al., 2020, 2021). We adopt a BiLSTM to read the input sequence $\{x_1, x_2, \dots, x_n\}$ forwardly and backwardly to produce context-sensitive hidden states $H = \{h_1, h_2, \dots, h_n\}$ by repeatedly applying $h_t = \mathrm{BiLSTM}(\phi^{emb}(x_t), h_{t-1}, h_{t+1})$, where $\phi^{emb}$ is the embedding function.

Self-Attention Following Vaswani et al. (2017), we map the matrix of input vectors $X \in \mathbb{R}^{n \times d}$ ($d$ represents the mapped dimension) to queries $Q$, keys $K$ and values $V$ by using different linear projections. Then, the self-attention output $C \in \mathbb{R}^{n \times d}$ is a weighted sum of values:
$$C = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V. \quad (1)$$
We concatenate the output of the BiLSTM and the self-attention as the final encoding representation $E = H \oplus C$, where $\oplus$ denotes the concatenation operation.
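To make the self-attention in Equation (1) concrete, the following is a minimal plain-Python sketch; identity projections stand in for the learned linear maps producing Q, K and V, so this is a toy illustration rather than the trained encoder.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X, d):
    # Identity projections keep the sketch minimal; the real encoder
    # uses separate learned linear projections for Q, K and V.
    Q, K, V = X, X, X
    out = []
    for qi in Q:  # one output row per token
        scores = [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K]
        weights = softmax(scores)  # attention over all tokens
        out.append([sum(w * vj[t] for w, vj in zip(weights, V))
                    for t in range(d)])
    return out

# Three tokens with 2-dimensional features (made-up values):
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = self_attention(X, d=2)
```

Each row of `C` is a convex combination of the value rows, so every entry stays within the range of the corresponding input column.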

Token-Level Intent Detection Decoder
Inspired by Qin et al. (2019), we perform token-level multi-label multi-intent detection, where we predict multiple intents on each token and the sentence-level results are obtained by voting over all tokens. Specifically, we first feed the contextual encoding $E$ into an intent-aware BiLSTM to enhance its task-specific representations: $h^I_t = \mathrm{BiLSTM}(e_t, h^I_{t-1}, h^I_{t+1})$. Then, $h^I_t$ is used for intent detection:
$$I_t = \sigma\big(W_I\,(\mathrm{LeakyReLU}(W_h\, h^I_t))\big),$$
where $I_t$ denotes the intent results at the $t$-th word; $\sigma$ denotes the sigmoid activation function; $W_h$ and $W_I$ are trainable matrix parameters. Finally, the sentence-level intent results are obtained by:
$$o^I = \Big\{ k \;\Big|\; \sum_{i=1}^{n} \mathbb{1}\big[I_{(i,k)} > 0.5\big] > n/2 \Big\},$$
where $I_{(i,k)}$ represents the classification result of token $i$ for intent $k$. We predict a label as an utterance intent when it gets positive predictions from more than half of the $n$ tokens. For example, suppose the predictions of three tokens over four intents (e.g., $I_1 = \{0.9, 0.8, 0.7, 0.1\}$ for the first token) yield $\{3, 2, 1, 0\}$ positive votes ($> 0.5$) for the four intents respectively. The intents obtaining more than half of the votes ($> 3/2$) are then $o^I_1$ and $o^I_2$, so we predict the intents $o^I = \{o^I_1, o^I_2\}$.
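The token-level voting step can be sketched in a few lines of Python; the function name `vote_intents` and the example probabilities (beyond the paper's first-token values) are illustrative.

```python
def vote_intents(token_probs, threshold=0.5):
    """token_probs: list over tokens of per-intent probabilities.
    An intent is predicted for the utterance when more than half
    of the n tokens vote for it (probability > threshold)."""
    n = len(token_probs)
    num_intents = len(token_probs[0])
    votes = [0] * num_intents
    for probs in token_probs:
        for k, p in enumerate(probs):
            if p > threshold:
                votes[k] += 1
    return [k for k, v in enumerate(votes) if v > n / 2]

# Three tokens, four candidate intents (mirrors the paper's example;
# the second and third tokens' probabilities are made up):
token_probs = [
    [0.9, 0.8, 0.7, 0.1],
    [0.8, 0.6, 0.2, 0.3],
    [0.7, 0.4, 0.3, 0.2],
]
# votes = [3, 2, 1, 0]; with n = 3, intents with > 1.5 votes survive.
print(vote_intents(token_probs))  # [0, 1]
```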

Slot Filling Decoder
One main advantage of our framework is the proposed global-locally graph interaction network for slot filling, a non-autoregressive paradigm that decodes all slots in parallel. In the following, we first describe the slot-aware LSTM (§3.3.1), which produces the slot-aware representations, and then show how to apply the global-locally graph interaction layer (§3.3.2) for decoding.

Slot-aware LSTM
We utilize a BiLSTM to produce the slot-aware hidden representation $S = (s_1, \dots, s_n)$. At each decoding step $t$, the decoder state $s_t$ is calculated by $s_t = \mathrm{BiLSTM}(I_t \oplus e_t, s_{t-1}, s_{t+1})$, where $e_t$ denotes the aligned encoder hidden state and $I_t$ denotes the predicted intent information.

Global-locally Graph Interaction Layer
The proposed global-locally graph interaction layer consists of two main components: one is a local slot-aware graph interaction network that models the dependency across slots, and the other is a global intent-slot graph interaction network that models the interaction between intents and slots.
In this section, we first describe the vanilla graph attention network. Then, we illustrate the local slot-aware and global intent-slot graph interaction network, respectively.
Vanilla Graph Attention Network A graph attention network (GAT) (Veličković et al., 2018) is a variant of graph neural network that fuses graph-structured information and node features within the model. Its masked self-attention layers allow a node to attend to neighborhood features and learn different attention weights, which can automatically determine the importance and relevance between the current node and its neighborhood.
In particular, for a given graph with $N$ nodes, a one-layer GAT takes the initial node features $\tilde{H} = \{\tilde{h}_1, \dots, \tilde{h}_N\}$, $\tilde{h}_n \in \mathbb{R}^F$, as input, aiming to produce a more abstract representation $\tilde{H}' = \{\tilde{h}'_1, \dots, \tilde{h}'_N\}$, $\tilde{h}'_n \in \mathbb{R}^{F'}$, as its output. The attention mechanism of a typical GAT can be summarized as:
$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W_h \tilde{h}_i \oplus W_h \tilde{h}_j]\big)\big)}{\sum_{j' \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W_h \tilde{h}_i \oplus W_h \tilde{h}_{j'}]\big)\big)},$$
$$\tilde{h}'_i = \big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij} W^k_h \tilde{h}_j\Big),$$
where $W_h \in \mathbb{R}^{F' \times F}$ and $a \in \mathbb{R}^{2F'}$ are the trainable weight matrix and vector; $\mathcal{N}_i$ denotes the neighbors of node $i$ (including $i$); $\alpha_{ij}$ are the normalized attention coefficients; $\sigma$ represents the nonlinear activation function; and $K$ is the number of heads.
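A single-head version of this attention mechanism can be sketched in plain Python as follows; the toy dimensions and the choice of tanh as the nonlinearity σ are assumptions for illustration, not the paper's exact configuration.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(H, neighbors, W, a):
    """One single-head GAT layer on toy inputs.
    H: node features (list of vectors); neighbors[i] includes i itself;
    W: F' x F weight matrix; a: attention vector of length 2F'."""
    # Project all node features: Wh[i] = W @ H[i]
    Wh = [[sum(W[r][c] * h[c] for c in range(len(h)))
           for r in range(len(W))] for h in H]
    out = []
    for i in range(len(H)):
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) over neighbors j
        es = []
        for j in neighbors[i]:
            concat = Wh[i] + Wh[j]
            es.append(leaky_relu(sum(av * cv for av, cv in zip(a, concat))))
        # softmax over the neighborhood -> alpha_ij
        m = max(es)
        exps = [math.exp(e - m) for e in es]
        z = sum(exps)
        alphas = [e / z for e in exps]
        # h'_i = sigma( sum_j alpha_ij W h_j ), with sigma = tanh here
        agg = [sum(al * Wh[j][r] for al, j in zip(alphas, neighbors[i]))
               for r in range(len(W))]
        out.append([math.tanh(v) for v in agg])
    return out

# Two nodes, fully connected (including self-loops), identity W:
H = [[1.0, 0.0], [0.0, 1.0]]
neighbors = [[0, 1], [0, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.1, 0.2, 0.3, 0.4]
out = gat_layer(H, neighbors, W, a)
```

Multi-head attention would run K copies of this layer with separate W and a, then concatenate the outputs per node.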
Local Slot-aware Graph Interaction Layer Given the slot decoder hidden representations $S = (s_1, \dots, s_n)$, we construct a local slot-aware graph in which each slot hidden node connects to its surrounding slots. This allows the model to capture the dependency across slots, alleviating the uncoordinated-slots problem. Specifically, we construct the graph $G = (V, E)$ in the following way. Vertices We define $V$ as the vertex set. Each word's slot is represented as a vertex, initialized with the corresponding slot hidden representation. Thus, the first-layer state vector for all nodes is $S^1 = S = (s_1, \dots, s_n)$.
Edges Since we aim to model the dependency across slots, we construct the slot-aware graph so that dependency relations can be propagated from neighbor nodes to the current node. Each slot connects to other slots within a window: for node $s_i$, only $\{s_{i-m}, \dots, s_{i+m}\}$ are connected, where $m$ is a hyper-parameter denoting the size of the sliding window that controls how much utterance context is utilized.
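The window-limited edge construction above amounts to a simple neighbor-list computation; `local_slot_edges` is an illustrative name, and indices are clipped to the sequence boundaries.

```python
def local_slot_edges(n, m):
    """Neighbors of each of n slot nodes under a sliding window of size m:
    node i connects to nodes i-m .. i+m (clipped to [0, n), including itself)."""
    return [list(range(max(0, i - m), min(n, i + m + 1))) for i in range(n)]

# Five slots, window size 1:
print(local_slot_edges(5, 1))
# [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4]]
```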

Information Aggregation
The aggregation process at the $l$-th layer can be defined as:
$$s^{l+1}_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W_h\, s^{l}_j\Big),$$
where $\mathcal{N}_i$ is the set of vertices corresponding to the connected slots.

Global Slot-Intent Graph Interaction Layer
To achieve sentence-level intent-slot interaction, we construct a global slot-intent interaction graph in which all predicted intents and the sequence of slots are connected, allowing the slot sequence to be output in parallel. Specifically, we construct the graph $G = (V, E)$ in the following way. Vertices As we model the interaction between intents and slot tokens, the graph has $n + m$ nodes, where $n$ is the sequence length and $m$ is the number of intent labels predicted by the intent decoder. The input slot-token features are $G^{[S,1]} = S^{L+1} = \{s^{L+1}_1, \dots, s^{L+1}_n\}$, produced by the local slot-aware graph interaction network, while the input intent features are the embeddings $G^{[I,1]} = \phi^{emb}(o^I)$, where $\phi^{emb}$ is a trainable embedding matrix. The first-layer state vector for the slot and intent nodes is thus $G^1 = [G^{[S,1]}, G^{[I,1]}]$. Edges There are three types of connections in this graph network.
• intent-slot connection: Since slots and intents are highly tied, we construct the intent-slot connection to model the interaction between the two tasks. Specifically, each slot connects all predicted multiple intents to automatically capture relevant intent information.
• slot-slot connection: We construct the slot-slot connection, where each slot node connects to other slots within the window size, to further model the slot dependency and incorporate bidirectional contextual information.
• intent-intent connection: We connect all the intent nodes to each other to model the relationship among intents, since all of them express the intent of the same utterance.
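The three connection types can be realized as neighbor lists over $n + m$ nodes, as in the sketch below. The node indexing (slots first, then intents) and the assumption that intent nodes also attend back to all slots are illustrative choices, not necessarily the paper's exact adjacency.

```python
def global_graph_neighbors(n, m, window):
    """Nodes 0..n-1 are slot tokens, n..n+m-1 are predicted intents
    (hypothetical indexing). Returns a neighbor list per node realizing
    the slot-slot, intent-slot and intent-intent connections."""
    nbrs = [[] for _ in range(n + m)]
    for i in range(n):
        # slot-slot: window-limited context (includes self)
        nbrs[i].extend(range(max(0, i - window), min(n, i + window + 1)))
        # intent-slot: every slot attends to all predicted intents
        nbrs[i].extend(range(n, n + m))
    for k in range(n, n + m):
        # intent-intent: intents fully connected (includes self)
        nbrs[k].extend(range(n, n + m))
        # intent-slot: intents also see all slot tokens (an assumption here)
        nbrs[k].extend(range(n))
    return nbrs

# Three slots, two predicted intents, window size 1:
nbrs = global_graph_neighbors(3, 2, 1)
```

Running the GAT aggregation over these neighbor lists updates every slot node in one pass, which is what makes the decoding parallel.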

Information Aggregation
The aggregation process of the global GAT layer can be formulated as:
$$g^{[S,l+1]}_i = \sigma\Big(\sum_{j \in \mathcal{G}_S} \alpha_{ij} W_h\, g^{[S,l]}_j + \sum_{j \in \mathcal{G}_I} \alpha_{ij} W_h\, g^{[I,l]}_j\Big),$$
where $\mathcal{G}_S$ and $\mathcal{G}_I$ are the vertex sets denoting the connected slots and intents, respectively.

Slot Prediction
After $L$ layers of propagation, we obtain the final slot representation $G^{[S,L+1]}$ for slot prediction:
$$o^S_t = \arg\max\big(\mathrm{softmax}(W_s\, g^{[S,L+1]}_t)\big),$$
where $W_s$ is a trainable parameter and $o^S_t$ is the predicted slot of the $t$-th token in the utterance.
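Because every token's final representation is available at once, the argmax step runs over all tokens independently, as this small sketch shows (`predict_slots` and the toy shapes are illustrative).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict_slots(final_slot_states, W_s):
    """Per-token slot decisions, computable in parallel:
    o_t = argmax softmax(W_s @ g_t). W_s: N_S x F matrix (toy values)."""
    preds = []
    for g in final_slot_states:
        logits = [sum(w * x for w, x in zip(row, g)) for row in W_s]
        dist = softmax(logits)
        preds.append(max(range(len(dist)), key=dist.__getitem__))
    return preds

# Two tokens, two slot labels, identity W_s:
W_s = [[1.0, 0.0], [0.0, 1.0]]
states = [[2.0, 0.0], [0.0, 3.0]]
print(predict_slots(states, W_s))  # [0, 1]
```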

Joint Training
Following Goo et al. (2018), we adopt joint training to consider the two tasks, updating parameters by joint optimization. The intent detection objective is:
$$\mathcal{L}_1 = -\sum_{i=1}^{n}\sum_{j=1}^{N_I} \Big( \hat{y}^{[j,I]}_i \log\big(y^{[j,I]}_i\big) + \big(1-\hat{y}^{[j,I]}_i\big)\log\big(1-y^{[j,I]}_i\big) \Big).$$
Similarly, the slot filling objective is:
$$\mathcal{L}_2 = -\sum_{i=1}^{n}\sum_{j=1}^{N_S} \hat{y}^{[j,S]}_i \log\big(y^{[j,S]}_i\big),$$
where $N_I$ is the number of single intent labels and $N_S$ is the number of slot labels. The final joint objective is formulated as:
$$\mathcal{L} = \alpha\,\mathcal{L}_1 + \beta\,\mathcal{L}_2,$$
where $\alpha$ and $\beta$ are hyper-parameters.
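The joint objective can be computed as in the following sketch, which takes already-normalized probabilities (per-token sigmoid outputs for intents, softmax distributions for slots); the function name and toy inputs are illustrative.

```python
import math

def joint_loss(intent_probs, intent_gold, slot_probs, slot_gold,
               alpha=1.0, beta=1.0):
    """intent_probs: per-token sigmoid outputs over N_I intents;
    intent_gold: matching 0/1 targets (binary cross-entropy, L1);
    slot_probs: per-token softmax distributions over N_S labels;
    slot_gold: gold label ids (cross-entropy, L2)."""
    l_intent = 0.0
    for probs, gold in zip(intent_probs, intent_gold):
        for p, y in zip(probs, gold):
            l_intent -= y * math.log(p) + (1 - y) * math.log(1 - p)
    l_slot = -sum(math.log(probs[y]) for probs, y in zip(slot_probs, slot_gold))
    return alpha * l_intent + beta * l_slot

# One token, two intents, two slot labels (made-up probabilities):
intent_probs = [[0.9, 0.1]]
intent_gold = [[1, 0]]
slot_probs = [[0.8, 0.2]]
slot_gold = [0]
loss = joint_loss(intent_probs, intent_gold, slot_probs, slot_gold)
```

Confident, correct predictions drive both terms toward zero; α and β trade off the two tasks.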

Datasets
We conduct experiments on two publicly available multi-intent datasets. One is MixATIS (Hemphill et al., 1990; Qin et al., 2020b), which includes 13,162 utterances for training, 756 utterances for validation and 828 utterances for testing. The other is MixSNIPS (Coucke et al., 2018; Qin et al., 2020b), with 39,776, 2,198 and 2,199 utterances for training, validation and testing, respectively.

Experimental Settings
The dimensionality of the embedding is 128 and 64 on MixATIS and MixSNIPS, respectively. The dimensionality of the LSTM hidden units is 256. The batch size is 16. The number of attention heads is 4 and 8 on the MixATIS and MixSNIPS datasets, respectively. The number of graph attention network layers is set to 2. We use Adam (Kingma and Ba, 2015) to optimize the parameters of our model. For all experiments, we select the model that works best on the dev set and evaluate it on the test set. All experiments are conducted on a GeForce RTX 2080Ti and a TITAN Xp.

Baselines
We compare our model with the following baselines: (1) Attention BiRNN: Liu and Lane (2016) propose an attention-based RNN model for joint intent detection and slot filling; (5) Stack-Propagation: Qin et al. (2019) adopt a stack-propagation framework to explicitly incorporate intent detection for guiding slot filling; (6) Joint Multiple ID-SF: Gangadharaiah and Narayanaswamy (2019) propose a multi-task framework with a slot-gated mechanism for multiple intent detection and slot filling; (7) AGIF: Qin et al. (2020b) propose an adaptive interaction network to achieve fine-grained multi-intent information integration, achieving state-of-the-art performance.

Main Results
Following Goo et al. (2018), we evaluate the performance of slot filling using F1 score, intent prediction using accuracy, and sentence-level semantic frame parsing using overall accuracy. Overall accuracy measures the ratio of sentences for which both the intents and the slots are predicted correctly. Table 1 shows the results; we have the following observations: (1) On the slot filling task, our framework outperforms the best baseline AGIF in F1 score on both datasets, which indicates that the proposed local slot-aware graph successfully models the dependency across slots, improving slot filling performance. (2) More importantly, compared with AGIF, our framework achieves +2.7% and +1.2% improvements in overall accuracy on MixATIS and MixSNIPS, respectively. We attribute this to the fact that our proposed global intent-slot interaction graph better captures the correlation between intents and slots, improving SLU performance.

Speedup
One of the core contributions of our framework is that the decoding process of slot filling can be significantly accelerated by the proposed non-autoregressive mechanism. We evaluate speed by running the models over the MixATIS test data for one epoch, fixing the batch size to 32. The comparison results are shown in Table 2. We observe that our model achieves ×8.2, ×10.8 and ×11.5 speedups compared with the SOTA models Stack-Propagation, Joint Multiple ID-SF and AGIF, respectively. This is because these models use an autoregressive architecture that performs slot filling word by word, while our non-autoregressive framework conducts slot decoding in parallel. In addition, it is worth noting that as the batch size gets larger, GL-GIN achieves even better acceleration: our model achieves a ×17.2 speedup compared with AGIF when the batch size is 64.

Effectiveness of the Local Slot-aware Graph Interaction Layer
We study the effectiveness of the local slot-aware graph interaction layer with the following ablation: we remove the local graph interaction layer and directly feed the output of the slot LSTM into the global intent-slot graph interaction layer. We refer to this as w/o local GAL in Table 3. We can clearly observe that slot F1 drops by 1.5% and 1.2% on the MixATIS and MixSNIPS datasets, respectively. We attribute this to the fact that the local slot-aware GAL captures the slot dependency for each token, which helps to alleviate the uncoordinated-slots problem. A qualitative analysis can be found in Section 4.5.6.

Effectiveness of Global Slot-Intent Graph Interaction Layer
In order to verify the effectiveness of the global slot-intent interaction graph layer, we remove the global interaction layer and use the output of the local slot-aware GAL module for slot filling. This is named w/o Global Intent-slot GAL in Table 3. We observe that slot F1 drops by 0.9% and 1.3%, which demonstrates that the intent-slot graph interaction layer captures the correlation between intents and slots, benefiting the semantic performance of the SLU system. To verify that the improvements come from the proposed global-locally graph interaction layer rather than the added parameters, we replace it with multiple LSTM layers (2 layers). Table 3 (more parameters) shows the results. We observe that our model outperforms the more-parameters variant by 1.6% and 2.4% overall accuracy on the two datasets, which shows that the improvements come from the proposed global-locally graph interaction layer rather than the extra parameters.

Effectiveness of the Global-locally Graph Interaction Layer
Instead of using the whole global-locally graph interaction layer for slot filling, we directly leverage the output of the slot-aware LSTM to predict each token's slot, to verify the effect of the global-locally graph interaction layer. We name this experiment w/o Global-locally GAL in Table 3. From the results, we can observe that the absence of the module leads to 3.0% and 5.2% overall accuracy drops on the two datasets. This indicates that the global-locally graph interaction layer encourages our model to leverage slot dependency and intent information, which improves SLU performance.

Visualization
To better understand how the global-locally graph interaction layer contributes to the final result, we visualize the attention values of the global intent-slot GAL. As shown in Figure 3, we visualize the dependence of the word "6" on context and intent information. We can clearly observe that the token "6" obtains information from all contextual tokens. The information from "and 10" helps to predict the slot, something the prior autoregressive models cannot achieve because they generate word by word from left to right.

Qualitative analysis
We conduct a qualitative analysis via a case study comparing two slot sequences generated by AGIF and by our model. From Table 4, for the word "6", AGIF incorrectly predicts the slot label "O". This is because AGIF only models its left context, which makes it hard to infer that "6" is a time slot. In contrast, our model predicts the slot label correctly. We attribute this to the fact that our proposed global intent-slot interaction layer models bidirectional contextual information. In addition, our framework predicts the slot of the word "am" correctly while AGIF predicts it incorrectly (I-airport_name following B-depart_time), indicating that the proposed local slot-aware graph layer successfully captures the slot dependency.

Effect of Pre-trained Model
Following Qin et al. (2019), we explore a pre-trained model in our framework. We replace the self-attentive encoder with Roberta (Liu et al., 2019c) using the fine-tuning approach. We keep the other components identical to our framework and follow Qin et al. (2019) in using the first sub-word's label when a word is broken into multiple sub-words. Figure 4 compares AGIF, GL-GIN and the two models with Roberta on the two datasets. We have two interesting observations. First, the Roberta-based models perform remarkably well on both datasets. We attribute this to the fact that pre-trained models provide rich semantic features, which helps SLU. Second, GL-GIN + Roberta outperforms AGIF + Roberta on both datasets and reaches a new state-of-the-art performance, which further verifies the effectiveness of our proposed framework.

Related Work
Slot Filling and Intent Detection Recently, joint models (Zhang and Wang, 2016; Hakkani-Tür et al., 2016; Goo et al., 2018; Xia et al., 2018; E et al., 2019; Liu et al., 2019b; Qin et al., 2019; Zhang et al., 2019; Wu et al., 2020; Qin et al., 2021b; Ni et al., 2021) that consider the strong correlation between intent detection and slot filling have obtained remarkable success. Compared with their work, we focus on jointly modeling multiple intent detection and slot filling, while they only consider the single-intent scenario.
More recently, multiple intent detection, which handles utterances with multiple intents, has attracted increasing attention. To this end, Xu and Sarikaya (2013) and Kim et al. (2017) begin to explore multiple intent detection. Gangadharaiah and Narayanaswamy (2019) first apply a multi-task framework with a slot-gate mechanism to jointly model multiple intent detection and slot filling. Qin et al. (2020b) propose an adaptive interaction network to achieve fine-grained multiple-intent information integration for token-level slot filling, achieving state-of-the-art performance. These models adopt an autoregressive architecture for joint multiple intent detection and slot filling. In contrast, we propose a non-autoregressive approach that achieves parallel decoding. To the best of our knowledge, we are the first to explore a non-autoregressive architecture for multiple intent detection and slot filling.
Graph Neural Network for NLP Graph neural networks operate directly on graph structures to model structural information, and have been applied successfully to various NLP tasks. Linmei et al. (2019) and Huang and Carley (2019) explore the graph attention network (GAT) (Veličković et al., 2018) for classification tasks to incorporate dependency-parser information. Cetoli et al. (2017) and Liu et al. (2019a) apply graph neural networks to model non-local contextual information for sequence labeling tasks. Yasunaga et al. (2017) and Feng et al. (2020a) successfully apply graph networks to model discourse information for summarization, achieving promising performance. Graph structures have also been applied successfully in the dialogue domain (Feng et al., 2020b; Fu et al., 2020; Qin et al., 2021a). In our work, we apply a global-locally graph interaction network to model the slot dependency and the interaction between the multiple intents and the slots.

Conclusion
In this paper, we investigated a non-autoregressive model for joint multiple intent detection and slot filling. To this end, we proposed a global-locally graph interaction network in which the uncoordinated-slots problem is addressed by the proposed local slot-aware graph, while the interaction between intents and slots is modeled by the proposed global intent-slot graph. Experimental results on two datasets show that our framework achieves state-of-the-art performance while being up to 11.5 times faster than prior work.