MISCA: A Joint Model for Multiple Intent Detection and Slot Filling with Intent-Slot Co-Attention

Research on detecting multiple intents and filling slots is becoming more popular because of its relevance to complicated real-world situations. Recent advanced approaches, which are joint models based on graphs, might still face two potential issues: (i) the uncertainty introduced by constructing graphs based on preliminary intents and slots, which may transfer intent-slot correlation information to incorrect label node destinations, and (ii) the direct incorporation of multiple intent labels for each token w.r.t. token-level intent voting, which might potentially lead to incorrect slot predictions, thereby hurting the overall performance. To address these two issues, we propose a joint model named MISCA. Our MISCA introduces an intent-slot co-attention mechanism and an underlying layer of label attention mechanism. These mechanisms enable MISCA to effectively capture correlations between intents and slot labels, eliminating the need for graph construction. They also facilitate the transfer of correlation information in both directions: from intents to slots and from slots to intents, through multiple levels of label-specific representations, without relying on token-level intent information. Experimental results show that MISCA outperforms previous models, achieving new state-of-the-art overall accuracy on the two benchmark datasets MixATIS and MixSNIPS. This highlights the effectiveness of our attention mechanisms.


Introduction
Spoken language understanding (SLU) is a fundamental component in various applications, ranging from virtual assistants to chatbots and intelligent systems. In general, SLU involves two tasks: intent detection to classify the intent of user utterances, and slot filling to extract useful semantic concepts (Tur and De Mori, 2011). A common approach to tackling these tasks is through sequence classification for intent detection and sequence labeling for slot filling. Recent research on this topic, recognizing the high correlation between intents and slots, shows that a joint model can improve overall performance by leveraging the inherent dependencies between the two tasks (Louvan and Magnini, 2020; Zhang et al., 2019a; Weld et al., 2022). A number of joint models have been proposed to exploit the correlations between single-intent detection and slot filling tasks, primarily by incorporating attention mechanisms (Goo et al., 2018; Li et al., 2018; E et al., 2019; Qin et al., 2019; Zhang et al., 2019b; Chen et al., 2019; Dao et al., 2021).
However, in real-world scenarios, users may often express utterances with multiple intents, as illustrated in Figure 1. This poses a challenge for single-intent systems, potentially resulting in poor performance. Recognizing this challenge, Kim et al. (2017) are the first to explore the detection of multiple intents in an SLU system, followed by Gangadharaiah and Narayanaswamy (2019), who first propose a joint framework for multiple intent detection and slot filling. Qin et al. (2020) and Qin et al. (2021) have further explored the utilization of graph attention networks (Veličković et al., 2018) to explicitly model the interaction between predicted intents and slot mentions. The recent state-of-the-art models Co-guiding (Xing and Tsang, 2022a) and Rela-Net (Xing and Tsang, 2022b) further incorporate guidance from slot information to aid intent prediction, and design heterogeneous graphs to facilitate more effective interactions between intents and slots.
We find that two potential issues might still persist within these existing multi-intent models: (1) The lack of "gold" graphs that accurately capture the underlying relationships and dependencies between intent and slot labels in an utterance. Previous graph-based methods construct a graph for each utterance by predicting preliminary intents and slots (Qin et al., 2021; Xing and Tsang, 2022a,b). However, utilizing such graphs to update representations of intents and slots might introduce uncertainty, as intent-slot correlation information could be transferred to incorrect label node destinations.
(2) The direct incorporation of multiple intent labels for each word token to facilitate token-level intent voting (Qin et al., 2021; Xing and Tsang, 2022a,b). To be more specific, due to the absence of token-level gold intent labels, the models are trained to predict multiple intent labels for each word token in the utterance. Utterance-level intents depend on these token-level intents and require at least half of all tokens to support an intent. In Figure 1, for instance, a minimum of 12 tokens is needed to support the "atis_ground_fare" intent, while only 8 tokens (enclosed by the left rectangle) support it. Note that each token within the utterance context is associated with a specific intent. Thus, incorporating irrelevant intent representations from each token might potentially lead to incorrect slot predictions, thereby hurting the overall accuracy.
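As a concrete illustration, the token-level voting rule described above can be sketched in a few lines. This is our own simplified rendering (the function name and inputs are hypothetical), not code from any of the cited models:

```python
from collections import Counter

def vote_utterance_intents(token_intents, num_tokens):
    # token_intents: one list of predicted intent labels per token.
    # An utterance-level intent is kept only if at least half of all
    # tokens in the utterance support it.
    counts = Counter(label for labels in token_intents for label in set(labels))
    threshold = num_tokens / 2
    return sorted(label for label, c in counts.items() if c >= threshold)
```

Under this rule, an intent supported by only 8 of the required 12 tokens, as in the Figure 1 example, would be discarded even if it is a gold intent.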
To overcome the above issues, in this paper, we propose a new joint model with an intent-slot co-attention mechanism, named MISCA, for multi-intent detection and slot filling. Our novel co-attention mechanism serves as an effective replacement for the graph-based interaction module employed in previous works, eliminating the need for explicit graph construction. By enabling seamless intent-to-slot and slot-to-intent information transfer, our co-attention mechanism facilitates the exchange of relevant information between intents and slots. This mechanism not only simplifies the model architecture, but also maintains the crucial interactions between intent and slot representations, thereby enhancing the overall performance.
In addition, MISCA also presents a label attention mechanism as an underlying layer for the co-attention mechanism. This label attention mechanism operates independently of token-level intent information and is designed specifically to enhance the extraction of slot label- and intent label-specific representations. By capturing the characteristics of each intent/slot label, the label attention mechanism helps MISCA obtain a deep understanding of, and fine-grained information about, the semantic nuances associated with different intent and slot labels. This, in turn, ultimately helps improve the overall results of intent detection and slot filling.
Our contributions are summarized as follows: (I) We introduce a novel joint model called MISCA for multiple intent detection and slot filling tasks, which incorporates label attention and intent-slot co-attention mechanisms (our MISCA implementation is publicly available at: https://github.com/VinAIResearch/MISCA). (II) MISCA effectively captures correlations between intents and slot labels and facilitates the transfer of correlation information in both the intent-to-slot and slot-to-intent directions through multiple levels of label-specific representations. (III) Experimental results show that our MISCA outperforms previous strong baselines, achieving new state-of-the-art overall accuracies on two benchmark datasets.

Problem Definition and Related Work
Given an input utterance consisting of n word tokens w_1, w_2, ..., w_n, the multiple intent detection task is a multi-label classification problem that predicts multiple intents of the input utterance. Meanwhile, the slot filling task can be viewed as a sequence labeling problem that predicts a slot label for each token of the input utterance. Kim et al. (2017) show the significance of the multiple-intent setting in SLU. Gangadharaiah and Narayanaswamy (2019) then introduce a joint approach for multiple intent detection and slot filling, which models relationships between slots and intents via a slot-gated mechanism. However, this slot-gated mechanism represents multiple intents using only one feature vector, and thus incorporating this feature vector to guide slot filling could lead to incorrect slot predictions.
To generate fine-grained intent information for slot label prediction, Qin et al. (2020) introduce an adaptive interaction framework based on graph attention networks. However, the autoregressive nature of the framework restricts its ability to use bidirectional information for slot filling. To overcome this limitation, Qin et al. (2021) propose a global-locally graph interaction network that incorporates both a global graph to model interactions between intents and slots and a local graph to capture relationships among slots.
More recently, Xing and Tsang (2022a) propose two heterogeneous graphs, namely slot-to-intent and intent-to-slot, to establish mutual guidance between the two tasks based on preliminarily predicted intent and slot labels. Meanwhile, Xing and Tsang (2022b) propose a heterogeneous label graph that incorporates statistical dependencies and hierarchies among labels to generate label embeddings. They leverage both the label embeddings and the hidden states from a label-aware inter-dependent decoding mechanism to construct decoding processes between the two tasks. See the Introduction for a discussion of potential issues with these models, and Section 4.3 for more related work.

Our MISCA model
Figure 2 illustrates the architecture of our MISCA, which consists of four main components: (i) task-shared and task-specific utterance encoders, (ii) label attention, (iii) intent-slot co-attention, and (iv) intent and slot decoders.
The encoders component aims to generate intent-aware and slot-aware task-specific feature vectors for intent detection and slot filling, respectively. The label attention component takes these task-specific vectors as input and outputs label-specific feature vectors. The intent-slot co-attention component utilizes the label-specific vectors and the slot-aware task-specific vectors to simultaneously learn correlations between intent detection and slot filling through multiple intermediate layers. The output vectors generated by this co-attention component are used to construct input vectors for the intent and slot decoders, which predict multiple intents and slot labels, respectively.
Task-shared encoder: Given an input utterance consisting of n word tokens w_1, w_2, ..., w_n, our task-shared encoder creates a vector e_i to represent the i-th word token w_i by concatenating (denoted by ∘) the contextual word embeddings e_i^{BiLSTM_word} and e_i^{SA}, and the character-level word embedding e_{w_i}^{BiLSTM_char}:

e_i = e_i^{BiLSTM_word} ∘ e_i^{SA} ∘ e_{w_i}^{BiLSTM_char}    (1)

Here, we feed a sequence e_{w_1:w_n} of real-valued word embeddings e_{w_1}, e_{w_2}, ..., e_{w_n} into a single bidirectional LSTM (BiLSTM_word) layer (Hochreiter and Schmidhuber, 1997) and a single self-attention layer (Vaswani et al., 2017) to produce the contextual feature vectors e_i^{BiLSTM_word} and e_i^{SA}, respectively. In addition, the character-level word embedding e_{w_i}^{BiLSTM_char} is derived by applying another single BiLSTM (BiLSTM_char) to the sequence of real-valued embedding representations of characters in each word w_i, as done in Lample et al. (2016).
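The concatenation in Equation 1, together with a single-head self-attention layer standing in for the one producing e^SA, can be sketched in numpy. All parameter shapes here are illustrative, and the single-head form is our simplification of the layer actually used:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n, d) word embeddings; scaled dot-product attention over tokens.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def concat_token_features(h_word, h_sa, h_char):
    # Equation 1: e_i concatenates the BiLSTM_word output, the
    # self-attention output, and the character-level BiLSTM output.
    return np.concatenate([h_word, h_sa, h_char], axis=-1)
```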
Task-specific encoder: Our task-specific encoder passes the sequence of vectors e_{1:n} as input to two different single BiLSTM layers to produce task-specific latent vectors e_i^I = BiLSTM_I(e_{1:n}, i) and e_i^S = BiLSTM_S(e_{1:n}, i) ∈ R^{d_e} for intent detection and slot filling, respectively. These task-specific vectors are concatenated column-wise to formulate task-specific matrices E^I and E^S as follows:

E^I = [e_1^I, e_2^I, ..., e_n^I] ∈ R^{d_e×n}    (2)
E^S = [e_1^S, e_2^S, ..., e_n^S] ∈ R^{d_e×n}    (3)

Label attention
The word tokens in the input utterance might make different contributions to each of the intent and slot labels (Xiao et al., 2019; Song et al., 2022), motivating our extraction of label-specific vectors representing intent and slot labels. In addition, most previous works show that the slot labels might share some semantics of hierarchical relationships (Weld et al., 2022), e.g. "fine-grained" labels toloc.city_name, toloc.state_name and toloc.country_name can be grouped into a more "coarse-grained" label type toloc. We thus introduce a hierarchical label attention mechanism, adapting the attention mechanism from Vu et al. (2020), to take such slot label hierarchy information into account when extracting the label-specific vectors. Formally, our label attention mechanism takes a task-specific matrix (here, E^I from Equation 2 and E^S from Equation 3) as input and computes a label-specific attention weight matrix (here, A^I ∈ R^{|L^I|×n} and A^{S,k} ∈ R^{|L^{S,k}|×n} at the k-th hierarchy level of slot labels) as follows:

A^I = softmax(W^I tanh(U^I E^I))    (4)
A^{S,k} = softmax(W^{S,k} tanh(U^{S,k} E^S))    (5)

where softmax is performed at the row level to make sure that the summation of weights in each row is equal to 1; W^I ∈ R^{|L^I|×d_a}, U^I ∈ R^{d_a×d_e}, W^{S,k} ∈ R^{|L^{S,k}|×d_a} and U^{S,k} ∈ R^{d_a×d_e} are weight matrices; and L^I and L^{S,k} are the intent label set and the set of slot label types at the k-th hierarchy level, respectively.
Here, k ∈ {1, 2, ..., ℓ}, where ℓ is the number of hierarchy levels of slot labels, and thus L^{S,ℓ} is the set of "fine-grained" slot label types (i.e. all original slot labels in the training data).
After that, label-specific representation matrices V^I and V^{S,k} are computed by multiplying the task-specific matrices E^I and E^S with the attention weight matrices A^I and A^{S,k}, respectively, as:

V^I = E^I (A^I)^⊤    (6)
V^{S,k} = E^S (A^{S,k})^⊤    (7)

Here, the j-th columns v_j^I of V^I ∈ R^{d_e×|L^I|} and v_j^{S,k} of V^{S,k} ∈ R^{d_e×|L^{S,k}|} are referred to as vector representations of the input utterance w.r.t. the j-th label in L^I and L^{S,k}, respectively.
To capture slot label hierarchy information, at k ≥ 2, taking v_j^{S,k−1}, we compute the probability p_j^{S,k−1} of the j-th slot label at the (k−1)-th hierarchy level given the utterance, using a corresponding weight vector w_j^{S,k−1} ∈ R^{d_e} and the sigmoid function:

p_j^{S,k−1} = sigmoid((w_j^{S,k−1})^⊤ v_j^{S,k−1})    (8)

We project the vector p^{S,k−1} of label probabilities p_j^{S,k−1} using a projection matrix Z^{S,k−1} ∈ R^{d_p×|L^{S,k−1}|}, and then concatenate the projected vector output with each slot label-specific vector of the k-th hierarchy level:

v_j^{S,k} = v_j^{S,k} ∘ (Z^{S,k−1} p^{S,k−1})    (9)

The slot label-specific matrix V^{S,k} at k ≥ 2 is now updated with more "coarse-grained" label information from the (k−1)-th hierarchy level.
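The core label attention computation above can be sketched in numpy. The exact factorization of the attention parameters (a tanh bottleneck of width d_a) is our assumption, adapted from the Vu et al. (2020) style of label attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_attention(E, U, W):
    # E: (d_e, n) task-specific matrix; U: (d_a, d_e) and W: (|L|, d_a)
    # are the attention parameters (their factorization is assumed).
    A = softmax(W @ np.tanh(U @ E), axis=-1)  # (|L|, n); each row sums to 1
    V = E @ A.T                               # (d_e, |L|) label-specific vectors
    return A, V
```

The same function serves both tasks: called with E^I it yields V^I, and called with E^S at each hierarchy level k it yields V^{S,k}.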

Intent-slot co-attention
Given that intents and slots presented in the same utterance share correlation information (Louvan and Magnini, 2020; Weld et al., 2022), it is intuitive to consider modeling interactions between them. For instance, utilizing intent context vectors could enhance slot filling, while slot context vectors could improve intent prediction. We thus introduce a novel intent-slot co-attention mechanism that extends the parallel co-attention from Lu et al. (2016). Our mechanism allows for simultaneous attention to intents and slots through multiple intermediate layers.
Our co-attention mechanism creates a matrix S ∈ R^{d_s×n} in which each column represents a "soft" slot label embedding for each input word token, based on its task-specific feature vector:

S = W^S softmax(U^S E^S)    (10)

where W^S ∈ R^{d_s×(2|L^{S,ℓ}|+1)}, U^S ∈ R^{(2|L^{S,ℓ}|+1)×d_e}, and 2|L^{S,ℓ}| + 1 is the number of BIO-based slot tag labels (including the "O" label), as we formulate the slot filling task as a BIO-based sequence labeling problem. Recall that L^{S,ℓ} is the set of "fine-grained" slot label types without "B-" and "I-" prefixes, not including the "O" label. Here, softmax is performed at the column level.
For notational simplicity, the input feature matrices of our mechanism are orderly referred to as Q_1 = V^I, Q_{k+1} = V^{S,k} for k ∈ {1, 2, ..., ℓ}, and Q_{ℓ+2} = S; and d_t × m_t is the size of the corresponding matrix Q_t, in which each column is referred to as a label-specific vector. As each intermediate layer's matrix Q_t has different interactions with the previous layer's matrix Q_{t−1} and the next layer's matrix Q_{t+1}, we project Q_t into two vector spaces to ensure that all label-specific column vectors have the same dimension:

→Q_t = →W_t Q_t    (11)
←Q_t = ←W_t Q_t    (12)

where →W_t and ←W_t ∈ R^{d×d_t} are projection weight matrices, and thus →Q_t and ←Q_t ∈ R^{d×m_t}. We also compute a bilinear attention between two matrices Q_{t−1} and Q_t to measure the correlation between their corresponding label types:

C_t = (Q_{t−1})^⊤ X_t Q_t    (13)

where X_t ∈ R^{d_{t−1}×d_t}, and thus C_t ∈ R^{m_{t−1}×m_t}. Our co-attention mechanism allows the intent-to-slot and slot-to-intent information transfer by computing attentive label-specific representation matrices recursively in the two directions, starting from the base cases:

→H_1 = →Q_1 and ←H_{ℓ+2} = ←Q_{ℓ+2}    (14)
←H_t = ←Q_t + ←H_{t+1} (C_{t+1})^⊤    (15)
→H_t = →Q_t + →H_{t−1} C_t    (16)

We use ←H_1 ∈ R^{d×|L^I|} and →H_{ℓ+2} ∈ R^{d×n}, as computed following Equations 15 and 16, as the matrix outputs representing intents and slot mentions, respectively.
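One plausible reading of this bidirectional transfer can be sketched in numpy. The additive recursion used below is our assumption about how the per-layer projections and bilinear correlations combine, not necessarily MISCA's exact update; the shapes, however, follow the dimensions stated above:

```python
import numpy as np

def co_attention(Qs, W_fwd, W_bwd, Xs):
    # Qs: input matrices Q_1..Q_{l+2}, each of shape (d_t, m_t).
    # W_fwd[t], W_bwd[t]: (d, d_t) projections; Xs[t]: (d_{t-1}, d_t) bilinear weights.
    n = len(Qs)
    C = [None] * n
    fwd = [W_fwd[0] @ Qs[0]]                    # forward: intent-to-slot direction
    for t in range(1, n):
        C[t] = Qs[t - 1].T @ Xs[t] @ Qs[t]      # (m_{t-1}, m_t) label correlations
        fwd.append(W_fwd[t] @ Qs[t] + fwd[-1] @ C[t])
    bwd = [None] * n                            # backward: slot-to-intent direction
    bwd[-1] = W_bwd[-1] @ Qs[-1]
    for t in range(n - 2, -1, -1):
        bwd[t] = W_bwd[t] @ Qs[t] + bwd[t + 1] @ C[t + 1].T
    return bwd[0], fwd[-1]                      # (d, |L_I|) intents, (d, n) slots
```

Note that the correlation matrices C_t are shared between the two directions, so intent-to-slot and slot-to-intent transfer rely on the same learned label-type correlations.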

Decoders
Multiple intent decoder: We formulate the multiple intent detection task as a multi-label classification problem. We concatenate V^I (computed as in Equation 6) and ←H_1 (computed following Equation 15) to create an intent label-specific matrix H^I ∈ R^{(d_e+d)×|L^I|}, in which the j-th column vector v_j^I ∈ R^{d_e+d} is referred to as the final vector representation of the input utterance w.r.t. the j-th intent label in L^I. Taking v_j^I, we compute the probability p_j^I of the j-th intent label given the utterance by using a corresponding weight vector and the sigmoid function, following Equation 8.
We also follow previous works to incorporate an auxiliary task of predicting the number of intents given the input utterance (Chen et al., 2022b; Cheng et al., 2022; Zhu et al., 2023). In particular, we compute the number y^{INP} of intents for the input utterance as:

y^{INP} = argmax(softmax(W^{INP} (V^I)^⊤ w^{INP}))

where W^{INP} ∈ R^{z×|L^I|} and w^{INP} ∈ R^{d_e} are a weight matrix and vector, respectively, and z is the maximum number of gold intents for an utterance in the training data. We then select the top y^{INP} highest probabilities p_j^I and consider their corresponding intent labels as the final intent outputs.
Our intent detection objective loss L_ID is computed as the sum of the binary cross-entropy loss based on the probabilities p_j^I for multiple intent prediction and the multi-class cross-entropy loss for predicting the number y^{INP} of intents.
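The decoding step, per-label sigmoid scoring followed by selecting the top-y^{INP} intents, can be sketched as follows. The pooling that feeds the intent-count head here is an assumption chosen to match the stated parameter shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_intents(H_I, label_weights, W_inp, w_inp, labels):
    # H_I: (d, |L_I|) final intent label-specific matrix; label_weights holds
    # one weight vector per intent label (same shape as H_I).
    p = sigmoid(np.einsum("dj,dj->j", H_I, label_weights))
    # Auxiliary head: predict the number of intents (the pooling via w_inp
    # is assumed); classes are 1..z, so add 1 to the argmax index.
    y_inp = int(np.argmax(W_inp @ (H_I.T @ w_inp))) + 1
    top = np.argsort(-p)[:y_inp]
    return sorted(labels[j] for j in top), y_inp
```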

Slot decoder:
We formulate the slot filling task as a sequence labeling problem based on the BIO scheme. We concatenate E^S (computed as in Equation 3) and →H_{ℓ+2} (computed following Equation 16) to create a slot filling-specific matrix H^S ∈ R^{(d_e+d)×n}, in which the i-th column vector v_i^S ∈ R^{d_e+d} is referred to as the final vector representation of the i-th input word w.r.t. slot filling. We project each v_i^S into the R^{2|L^{S,ℓ}|+1} vector space by using a projection matrix X^S ∈ R^{(2|L^{S,ℓ}|+1)×(d_e+d)} to obtain the output vector h_i^S = X^S v_i^S. We then feed the output vectors h_i^S into a linear-chain CRF predictor (Lafferty et al., 2001) for slot label prediction.
A cross-entropy loss L_SF is calculated for slot filling during training, while the Viterbi algorithm is used for inference.
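The emission-score side of this decoder can be sketched as below. For brevity this sketch replaces the linear-chain CRF with an independent per-token argmax, which ignores the tag-transition scores the CRF would add:

```python
import numpy as np

def decode_slots_greedy(H_S, X_S, bio_tags):
    # H_S: (d_e + d, n) slot filling-specific matrix; X_S: (num_tags, d_e + d)
    # projection. MISCA feeds these projected scores to a linear-chain CRF;
    # the independent per-token argmax below is a simplification.
    logits = X_S @ H_S                          # (num_tags, n) emission scores
    return [bio_tags[j] for j in logits.argmax(axis=0)]
```

In the full model, the Viterbi algorithm would decode the jointly most likely tag sequence from these emission scores plus learned transition scores, which, among other things, prevents invalid transitions such as "O" followed by "I-x".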

Joint training
The final training objective loss L of our model MISCA is a weighted sum of the intent detection loss L_ID and the slot filling loss L_SF:

L = λ L_ID + (1 − λ) L_SF

where λ ∈ [0, 1] is the loss mixture weight, tuned on the validation set.
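As a one-line sketch, with the complementary (1 − λ) weighting being our reading of how the mixture weight λ is applied:

```python
def joint_loss(l_id, l_sf, lam):
    # Weighted sum of the intent detection and slot filling losses;
    # the (1 - lam) complement form is an assumption.
    return lam * l_id + (1.0 - lam) * l_sf
```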

Datasets and evaluation metrics
We conduct experiments using the "clean" benchmarks MixATIS (Hemphill et al., 1990; Qin et al., 2020) and MixSNIPS (Coucke et al., 2018; Qin et al., 2020). MixATIS contains 13,162, 756 and 828 utterances for training, validation and test, while MixSNIPS contains 39,776, 2,198 and 2,199 utterances for training, validation and test, respectively. We employ standard evaluation metrics: the intent accuracy for multiple intent detection, the F1 score for slot filling, and the overall accuracy, which is the percentage of utterances whose intents and slots are all correctly predicted (reflecting real-world scenarios). The overall accuracy is thus referred to as the main metric for comparison.
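The overall accuracy metric can be made precise with a short sketch; an utterance contributes only if both its intent set and its full slot sequence are exactly right:

```python
def overall_accuracy(gold, pred):
    # gold, pred: parallel lists of (intent_labels, slot_sequence) pairs,
    # one pair per utterance. Intents are compared as sets (order-free),
    # slot sequences as exact token-by-token matches.
    correct = sum(
        set(g_int) == set(p_int) and list(g_slots) == list(p_slots)
        for (g_int, g_slots), (p_int, p_slots) in zip(gold, pred)
    )
    return correct / len(gold)
```

This is why overall accuracy is systematically lower than intent accuracy and slot F1: a single wrong slot tag or one missing intent zeroes out the whole utterance.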

Implementation details
The ℓ value is 1 for MixSNIPS because its slot labels do not share any semantics of hierarchical relationships.On the other hand, in MixATIS, we construct a hierarchy with ℓ = 2 levels, which include "coarse-grained" and "fine-grained" slot labels.The "coarse-grained" labels, placed at the first level of the hierarchy, are label type prefixes that are shared by the "fine-grained" slot labels at the second level of the hierarchy (illustrated by the example in the first paragraph in Section 3.2).
In the encoders component, we set the dimensionality of the self-attention layer output to 256 for both datasets. In BiLSTM_word, the dimensionality of the LSTM hidden states is fixed at 64 for MixATIS and 128 for MixSNIPS. Additionally, in BiLSTM_char, the LSTM hidden dimensionality is set to 32, while in BiLSTM_I and BiLSTM_S, it is set to 128 for both datasets (i.e., d_e is 128 * 2 = 256). In the label attention and intent-slot co-attention components, we set the following dimensional hyperparameters: d_a = 256, d_p = 32, d_s = 128, and d = 128.
To optimize L, we utilize the AdamW optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 1e-3 and a batch size of 32. Following previous work, we randomly initialize the word embeddings and character embeddings in the encoders component. The size of character vector embeddings is set to 32. We perform a grid search to select the word embedding size ∈ {64, 128} and the loss mixture weight λ ∈ {0.1, 0.25, 0.5, 0.75, 0.9}.
Following previous works (Qin et al., 2020, 2021; Xing and Tsang, 2022a,b), we also experiment with another setting that employs a pre-trained language model (PLM). Here, we replace our task-shared encoder with the RoBERTa_base model (Liu et al., 2019). That is, e_i from Equation 1 is now computed as e_i = RoBERTa_base(w_{1:n}, i).
For both the original (i.e. without PLM) and with-PLM settings, we train for 100 epochs and calculate the overall accuracy on the validation set after each training epoch. We select the model checkpoint that achieves the highest overall accuracy on the validation set and use it for evaluation on the test set.

Baselines
For the first setting without PLM, we compare our MISCA against the following strong baselines: (1) AGIF (Qin et al., 2020): an adaptive graph interactive framework that facilitates fine-grained intent information transfer for slot prediction; (2) GL-GIN (Qin et al., 2021): a non-autoregressive global-local graph interaction network; (3) SDJN (Chen et al., 2022a): a weakly supervised approach that utilizes multiple instance learning to formulate multiple intent detection, along with self-distillation techniques; (4) GISCo (Song et al., 2022): an integration of global intent-slot co-occurrence across the entire corpus; (5) SSRAN (Cheng et al., 2022): a scope-sensitive model that focuses on the intent scope and utilizes the interaction between the intent detection and slot filling tasks; (6) Rela-Net (Xing and Tsang, 2022b): a model that exploits label typologies and relations through a heterogeneous label graph to represent statistical dependencies and hierarchies in rich relations; and (7) Co-guiding (Xing and Tsang, 2022a): a two-stage graph-based framework that enables the two tasks to guide each other using the predicted labels.
For the second setting with PLM, we compare MISCA with the PLM-enhanced variants of the models AGIF, GL-GIN, SSRAN, Rela-Net and Co-guiding. We also compare MISCA with the following PLM-based models: (1) DGIF (Zhu et al., 2023), which leverages the semantic information of labels; and (2) SLIM (Cai et al., 2022).
Experimental results

Main results
Results without PLM: In general, aligning the correct predictions between intent and slot labels is challenging, resulting in the overall accuracy being much lower than the intent accuracy and the F1 score for slot filling. Compared to the baselines, MISCA achieves better alignment between the two tasks due to our effective co-attention mechanism, while maintaining competitive intent accuracy and slot filling F1 scores (here, MISCA also achieves the highest and second highest F1 scores for slot filling on MixATIS and MixSNIPS, respectively). Compared to the previous model Rela-Net, our MISCA obtains a 0.8% and 1.8% absolute improvement in overall accuracy on MixATIS and MixSNIPS, respectively. In addition, MISCA also outperforms the previous model Co-guiding by 1.7% and 0.4% in overall accuracy on MixATIS and MixSNIPS, respectively. The consistent improvements on both datasets result in a substantial gain of 1.1+% in the average overall accuracy across the two datasets, compared to both Rela-Net and Co-guiding.

We find that MISCA produces a higher improvement in overall accuracy on MixATIS compared to MixSNIPS. One possible reason is that MISCA leverages the hierarchical structure of slot labels in MixATIS, which is not present in MixSNIPS. For example, semantically similar "fine-grained" slot labels, e.g. fromloc.city_name, city_name and toloc.city_name, might cause ambiguity for some baselines in predicting the correct slot labels. However, these "fine-grained" labels belong to different "coarse-grained" types in the slot label hierarchy. Our model could distinguish these "fine-grained" labels at a certain intent-to-slot information transfer layer (from the intent-slot co-attention in Section 3.3), thus enhancing the performance.
State-of-the-art results with PLM: Following previous works, we also report the overall accuracy with PLM on the test set. The PLM consistently benefits the baselines as well as our MISCA: for example, RoBERTa helps produce a 6% accuracy increase on MixATIS and an 8% accuracy increase on MixSNIPS for Rela-Net, Co-guiding and MISCA.
Here, MISCA+RoBERTa also consistently outperforms all baselines, producing new state-of-the-art overall accuracies on both datasets: 59.1% on MixATIS and 86.2% on MixSNIPS.

Ablation study
We conduct an ablation study with two ablated models: (i) w/o "slot" label attention: a variant where we remove all the slot label-specific representation matrices V^{S,k}. That is, our intent-slot co-attention component now only takes 2 input matrices Q_1 = V^I and Q_2 = S. (ii) w/o co-attention: a variant where we remove the intent-slot co-attention component. That is, without utilizing ←H_1 and →H_{ℓ+2}, we only use E^S from the task-specific encoder for the slot decoder, and employ V^I from the label attention component for the multiple intent decoder (i.e. this can be regarded as a direct adoption of the prior multi-label decoding approach (Vu et al., 2020)). For each ablated model, we also select the model checkpoint that obtains the highest overall accuracy on the validation set to apply to the test set.
Table 3 presents results obtained for both ablated model variants. We find that the model performs substantially poorer when it does not use the slot label-specific matrices in the intent-slot co-attention mechanism (i.e. w/o "slot" label attention). In this case, the model only considers correlations between intent labels and input word tokens, lacking the slot label information necessary to capture intent-slot co-occurrences. We also find that the largest decrease is observed when the intent-slot co-attention mechanism is omitted (i.e. w/o co-attention). Here, the overall accuracy drops by 9% on MixATIS and 5.2% on MixSNIPS. Both findings strongly indicate the crucial role of the intent-slot co-attention mechanism in capturing correlations and transferring intent-to-slot and slot-to-intent information between intent and slot labels, leading to notable improvements in the overall accuracy.
Figure 3 showcases a case study demonstrating the effectiveness of our co-attention mechanism. The baseline MISCA w/o co-attention fails to recognize the slot airline_name for "alaska airlines" and produces an incorrect intent atis_airline. However, by implementing the intent-slot co-attention mechanism, MISCA accurately predicts both the intent and the slot. It leverages information from the slot toloc.city_name to enhance the probability of the intent atis_flight, while utilizing intent label-specific vectors to incorporate information about airline_name. This improvement is achievable due to the effective co-attention mechanism that simultaneously updates intent and slot information without relying on preliminary results from one task to guide the other.

Conclusion
In this paper, we propose a novel joint model MISCA for multiple intent detection and slot filling tasks. Our MISCA captures correlations between intents and slot labels and transfers the correlation information in both the intent-to-slot and slot-to-intent directions through multiple levels of label-specific representations. Experimental results on two benchmark datasets demonstrate the effectiveness of MISCA, which outperforms previous models in both settings: with and without using a pre-trained language model encoder.

Limitations
It should also be emphasized that our intent-slot co-attention mechanism functions independently of token-level intent information. This mechanism generates |L^I| vectors for multiple intent detection (i.e. multi-label classification). In contrast, the token-level intent decoding strategy is based on token classification, using n vector representations. Recall that L^I is the intent label set, and n is the number of input word tokens. Therefore, integrating the token-level intent decoding strategy into the intent-slot co-attention mechanism is not feasible.
This work primarily focuses on modeling the interactions between intents and slots using intermediate attentive layers. We do not specifically emphasize leveraging label semantics, i.e. the meaning of labels in natural language. Although MISCA consistently outperforms previous models, its performance might be further enhanced by estimating the semantic similarity between words in an utterance and words in the labels.

Figure 1: An example of an utterance with multiple intents and slots.

Figure 2: Illustration of the architecture of our joint model MISCA.

Table 1: Obtained results without PLM. The best score is in bold, while the second best score is underlined.
Table 2 presents the obtained results comparing our MISCA+RoBERTa with various strong baselines under the PLM setting.