X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing

Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries and serves as an essential component of virtual assistants. Current TCSP models rely on large amounts of training data to achieve decent performance but fail to generalize to low-resource target languages or domains. In this paper, we present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP. Unlike previous models that learn to generate the hierarchical representations for nested intents and slots, we propose to predict intents and slots separately and cast both prediction tasks into sequence labeling problems. We further propose a fertility-based slot predictor that first learns to detect the number of labels for each token and then predicts the slot types. Experimental results illustrate that our model can significantly outperform existing strong baselines in cross-lingual and cross-domain settings, and that it also generalizes well to target languages in target domains. Furthermore, we show that our model can reduce the latency by up to 66% compared to the generation-based model.


Introduction
Virtual assistants can perform a wide variety of tasks for users, such as setting reminders, searching for events, and sending messages. Task-oriented compositional semantic parsing (TCSP), which comprehends users' intents and detects the key information (slots) in the utterance, is one of the core components of virtual assistants. Existing TCSP models rely heavily on large amounts of training data that usually only exist in high-resource domains and languages (e.g., English), and they generally fail to generalize well in low-resource scenarios. Given that collecting enormous training data is expensive and time-consuming, we aim to develop a transferable model that can quickly adapt to low-resource target languages and domains. The code will be released at https://github.com/zliucr/X2Parser.
Traditional semantic parsing can be treated as a simple joint intent detection and slot filling task (Liu and Lane, 2016; Goo et al., 2018; Zhang et al., 2019), while compositional semantic parsing has to cope with complex nested queries, which requires more sophisticated models. Current state-of-the-art TCSP models (Rongali et al., 2020; Li et al., 2020a) are generation-based models that learn to directly generate the hierarchical representations, which contain nested intent and slot labels. We argue that the hierarchical representations are relatively complex: the models need to learn when to generate the starting intent or slot label, when to copy tokens from the input, and when to generate the end of a label. Hence, large quantities of training data are necessary for the models to learn these complicated skills (Rongali et al., 2020), and they cannot generalize well when large datasets are absent (Li et al., 2020a). Moreover, the inference speed of generation-based models is greatly limited by the output length.
In this paper, we propose a transferable cross-lingual and cross-domain parser (X2Parser) for TCSP. Instead of generating hierarchical representations, we convert the nested annotations into flattened intent and slot representations (as shown in Figure 2) so that the model can learn to predict the intents and slots separately. We cast the nested slot prediction problem into a special sequence labeling task in which each token can have multiple slot labels. To tackle this task, our model first learns to predict the number of slot labels for each token, which helps it capture the hierarchical slot information in user queries. Then, it copies the corresponding hidden state for each token and uses those hidden states to predict the slot labels. For the nested intent prediction, we cast the problem into a normal sequence labeling problem in which each token has only one intent label, since the nested cases for intents are simpler than those for slots. Compared to generation-based models (Li et al., 2020a), X2Parser simplifies the problem by flattening the hierarchical representations and tackles the task in a non-autoregressive way, which strengthens its adaptation ability in low-resource scenarios and greatly reduces the latency.
As shown in Figure 1, we conduct experiments in three low-resource settings: cross-lingual, cross-domain, and a combination of both. Results show that our model remarkably surpasses existing strong baselines in all the low-resource scenarios, by more than 10% exact match accuracy, and reduces the latency by up to 66% compared to generation-based models. We summarize the main contributions of this paper as follows:

• We provide a new perspective on the TCSP task: flattening the hierarchical representations and casting the problem into several sequence labeling tasks.
• X2Parser can significantly outperform existing strong baselines in different low-resource settings and notably reduce the latency compared to the generation-based model.
• We conduct extensive experiments in different few-shot settings and explore the combination of cross-lingual and cross-domain scenarios.
Related Work

Task-Oriented Semantic Parsing
The majority of works on task-oriented semantic parsing focused on non-compositional user queries (Mesnil et al., 2013; Liu and Lane, 2016; Goo et al., 2018; Zhang et al., 2019), which turns the parsing task into a combination of intent detection and slot filling. Recently, Gupta et al. (2018) introduced a new dataset, called TOP, annotated with complex nested intents and slots, and proposed to use hierarchical representations to model the task. After that, Rongali et al. (2020) showed that leveraging a sequence-to-sequence model based on a copy mechanism (See et al., 2017) to directly generate the hierarchical representations was effective at parsing nested queries. Taking this further, Chen et al. (2020) and Li et al. (2020a) extended the TOP dataset into multiple domains and multiple languages, and Li et al. (2020a) conducted zero-shot cross-lingual experiments using the combination of multilingual pre-trained models (Conneau et al., 2020; Tran et al., 2020) and the copy mechanism method proposed in Rongali et al. (2020). Lately, Babu et al. (2021) and Shrivastava et al. (2021), which are concurrent works to X2Parser, proposed to tackle the TCSP task in a non-autoregressive way. Different from them, we propose to flatten the hierarchical representations and cast the problem into several sequence labeling tasks.

Figure 3: The architecture of X2Parser. We consider the TCSP task as a combination of the coarse-grained intent classification, fine-grained intent prediction, and slot filling tasks.

Task Decomposition
In this section, we first introduce the intuition of decomposing compositional semantic parsing into intent prediction and slot filling. Then, we describe how we construct the intent and slot labels.

Intuition of Task Decomposition
We argue that hierarchical representations containing nested annotations for intents and slots are relatively complex: large enough training data are needed to train a good model based on such representations, and the model's performance is greatly limited in low-resource scenarios. Therefore, instead of incorporating intents and slots into one representation, we propose to predict them separately. This simplifies the parsing problem and enables the model to easily learn the skills for each decomposed task, which finally allows our model to achieve a better adaptation ability in low-resource scenarios. As illustrated in Figure 2, we obtain the coarse-grained intent, flattened fine-grained intents, and flattened slot labels from the hierarchical representations, and train the model on these three categories in a multi-task fashion. Note that we can always reconstruct the hierarchical representations from the labels in these three categories, which means that the decomposed labels and the hierarchical labels are equivalent.

Label Constructions
Slot Labels We extract nested slot labels from the hierarchical representations and assign the labels to the corresponding tokens based on the BIO (begin-inside-outside) structure. As we can see from Figure 2, there can exist multiple slot labels for one token, and we preserve the order of the labels so as to be able to reconstruct the hierarchical representations. Specifically, we put the more fine-grained slot label at the later position. For example, "message" (in Figure 2) has the B-TODO and B-METHOD-MESSAGE labels, and B-METHOD-MESSAGE comes after B-TODO since it is the more fine-grained slot label.
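To make the label construction concrete, the following is a minimal sketch (not the authors' released code) of how flattened slot labels could be built from nested token-level spans. The span format and the toy example are hypothetical, but the ordering rule (coarser labels first, finer labels later) follows the description above; for strictly nested spans, sorting by descending span length recovers the outer-to-inner order.

```python
from typing import List, Tuple

def build_slot_labels(tokens: List[str],
                      spans: List[Tuple[int, int, str]]) -> List[List[str]]:
    """Flatten nested slot spans into per-token BIO label lists.

    `spans` holds (start, end, slot_type) token spans, where nesting means
    containment. Sorting by descending span length puts the coarser (outer)
    label before the finer (inner) one in each token's label list, matching
    the "more fine-grained label at the later position" rule.
    """
    ordered = sorted(spans, key=lambda s: s[0] - s[1])  # longest (outer) first
    labels: List[List[str]] = [[] for _ in tokens]
    for start, end, slot_type in ordered:
        for i in range(start, end):
            prefix = "B-" if i == start else "I-"
            labels[i].append(prefix + slot_type)
    # Tokens outside every span get the single label "O".
    return [lab if lab else ["O"] for lab in labels]

# Hypothetical fragment: a TODO span covering both tokens, with a nested
# METHOD-MESSAGE span on "message" (cf. the Figure 2 example).
print(build_slot_labels(["message", "Bob"],
                        [(0, 2, "TODO"), (0, 1, "METHOD-MESSAGE")]))
# -> [['B-TODO', 'B-METHOD-MESSAGE'], ['I-TODO']]
```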
Intent Labels Each data sample has one intent label for the whole user utterance, and we extract it as an individual coarse-grained intent label. For the intents expressed by partial tokens (i.e., fine-grained intents), we use the BIO structure to label the corresponding tokens. Note that we only need to assign one intent label to each token, since the nested cases for intents are relatively simple. Therefore, the fine-grained intent classification becomes a standard sequence labeling task.

X2Parser
The model architecture of our X2Parser is illustrated in Figure 3. To enable the cross-lingual ability of our model, we leverage the multilingual pre-trained model XLM-R (Conneau et al., 2020) as the sequence encoder. Let us define X = {x_1, x_2, ..., x_n} as the user utterance and H = {h_1, h_2, ..., h_n} as the hidden states (denoted as Emb in Figure 3) from XLM-R.

Slot Predictor
The slot predictor consists of a fertility classifier, a slot encoder, and a slot classifier. Inspired by Gu et al. (2018), the fertility classifier learns to predict the number of slot labels for each token, and then it copies the corresponding number of hidden states. Finally, the slot classifier is trained to conduct the sequence labeling based on the slot labels we constructed. The fertility classifier not only helps the model identify the number of labels for each token but also guides the model to implicitly learn the nested slot information in user queries. It also relieves the burden of the slot classifier, which would otherwise have to predict multiple slot entities for certain tokens.
Fertility Classifier (FC) We add a linear layer (FC) on top of the hidden states from XLM-R to predict the number of labels (fertility) for each token, which we formulate as follows:

F = {f_1, f_2, ..., f_n} = FC(H),   (1)

where FC is a k-way classifier (k being the maximum number of labels for a token) and f_i is a positive integer representing the number of labels for x_i.
Slot Filling After obtaining the fertility predictions, we copy the corresponding number of hidden states from XLM-R:

H' = {h_1 × f_1, h_2 × f_2, ..., h_n × f_n},   (2)

where h_i × f_i denotes that h_i is repeated f_i times. Then, we add a transformer encoder (Vaswani et al., 2017) (slot encoder (SE)) on top of H' to incorporate the sequential information into the hidden states, followed by a linear layer (slot classifier (SC)) to predict the slots, which we formulate as follows:

P_slot = SC(SE(H')),   (3)

where P_slot is a sequence of slot labels whose length equals the sum of the fertility numbers.
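As a concrete illustration, below is a minimal PyTorch sketch of the fertility-based slot predictor: a linear fertility classifier, the copy step realized with torch.repeat_interleave, a one-layer transformer slot encoder, and a linear slot classifier. This is a sketch under our own assumptions, not the released implementation; in particular, how the paper's reported hidden dimension of 400 and filter size of 64 map onto the encoder layer is our guess, and any projection between XLM-R's dimension and the slot encoder is omitted.

```python
import torch
import torch.nn as nn

class SlotPredictor(nn.Module):
    """Sketch of the fertility-based slot predictor (Eqs. (1)-(3))."""

    def __init__(self, hidden_dim: int = 1024, max_fertility: int = 3,
                 num_slot_labels: int = 20):
        super().__init__()
        # Fertility classifier (FC): a k-way classifier over each token.
        self.fc = nn.Linear(hidden_dim, max_fertility)
        # Slot encoder (SE): one transformer layer with 4 heads; the
        # feed-forward size here is a placeholder (see lead-in).
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                           dim_feedforward=64, batch_first=True)
        self.se = nn.TransformerEncoder(layer, num_layers=1)
        # Slot classifier (SC) over the copied hidden states.
        self.sc = nn.Linear(hidden_dim, num_slot_labels)

    def forward(self, H: torch.Tensor):
        # H: (1, seq_len, hidden_dim) hidden states from XLM-R; a batch of
        # one keeps the variable-length copy step simple.
        fert_logits = self.fc(H)                  # Eq. (1)
        fertility = fert_logits.argmax(-1) + 1    # at least one label per token
        # Copy h_i exactly f_i times (Eq. (2)).
        H_copied = torch.repeat_interleave(H[0], fertility[0], dim=0)
        # Encode the copied states and predict one slot per copy (Eq. (3)).
        slot_logits = self.sc(self.se(H_copied.unsqueeze(0)))
        return fert_logits, slot_logits

# Usage with fake encoder output for a 5-token utterance:
H = torch.randn(1, 5, 1024)
fert_logits, slot_logits = SlotPredictor()(H)
print(fert_logits.shape, slot_logits.shape)  # (1, 5, 3), (1, sum(f_i), 20)
```

At training time the gold fertilities would supervise the fertility classifier; the argmax above only illustrates inference.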

Intent Predictor
Coarse-Grained Intent The coarse-grained intent is predicted based on the hidden state of the "[CLS]" token from XLM-R, since it can serve as the representation of the whole sequence. We add a linear layer (coarse-grained intent classifier (CGIC)) on top of this hidden state to predict the coarse-grained intent:

p_cg = CGIC(h_[CLS]),   (4)

where p_cg is a single intent prediction.
Fine-Grained Intent We add a linear layer (fine-grained intent classifier (FGIC)) on top of the hidden states H to produce the fine-grained intents:

P_fg = FGIC(H),   (5)

where P_fg is a sequence of intent labels that has the same length as the input sequence.

Experimental Settings

We conduct the experiments on the MTOP dataset proposed by Li et al. (2020a), which contains six languages: English (en), German (de), French (fr), Spanish (es), Hindi (hi), and Thai (th), and 11 domains: alarm, calling, event, messaging, music, news, people, recipes, reminder, timer, and weather. The data statistics are reported in Appendix B.

Cross-Lingual Setting In the cross-lingual setting, we use English as the source language and the other languages as target languages. In addition, we consider a zero-shot scenario where we only use English data for training.

Cross-Domain Setting In the cross-domain setting, we only consider training and evaluation in English. We choose ten domains as source domains and the other domain as the target domain. Different from the cross-lingual setting, we consider a few-shot scenario where we first train the model using the data from the ten source domains, and then fine-tune it using a few data samples (e.g., 10% of the data) from the target domain.
We consider the few-shot scenario because zero-shot adaptation to the target domain is extremely difficult due to the unseen intent and slot types, while zero-shot transfer to target languages is easier with multilingual pre-trained models.
Cross-Lingual Cross-Domain Setting This setting combines the cross-lingual and cross-domain settings. Specifically, we first train the model on the English data from the ten source domains, and then fine-tune it on a few English data samples from the other (target) domain. Finally, we conduct the zero-shot evaluation on all the target languages of the target domain.

Baselines
Seq2Seq w/ XLM-R Rongali et al. (2020) proposed a sequence-to-sequence (Seq2Seq) model using a pointer-generator network (See et al., 2017) to handle nested queries, and achieved new state-of-the-art results in English. Li et al. (2020a) adopted this architecture for zero-shot cross-lingual adaptation. They replaced the encoder with XLM-R (Conneau et al., 2020) and used a customized decoder to learn to generate intent and label types and copy tokens from the inputs.

Table 3: Exact match accuracies (averaged over three runs) for the cross-lingual cross-domain setting. The result for each domain is the averaged performance over all target languages. We use 10% of training samples in the English target domain, and do not use any data in the target languages.
Figure 4: Full cross-lingual cross-domain results (across all target languages of target domains) for Table 3.
NLM We replace the fertility-based slot predictor in X2Parser with the neural layered model (NLM) (Ju et al., 2018), while keeping the other modules the same. Unlike our fertility-based slot predictor, NLM uses several stacked layers to predict entities of different levels. We use this baseline to verify the effectiveness of our fertility-based slot predictor.

Training Details
We use XLM-R Large (Conneau et al., 2020) as the sequence encoder. For a word (in an utterance) with multiple subword tokens, we take the representation of the first subword token to predict the labels for this word. The transformer encoder (slot encoder) has one layer with a head number of 4, a hidden dimension of 400, and a filter size of 64.
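The first-subword selection can be implemented, for example, with the word-to-subword alignment exposed by HuggingFace's fast tokenizers; the snippet below is a sketch under that assumption (the utterance and variable names are our own) and is not tied to the authors' code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
encoder = AutoModel.from_pretrained("xlm-roberta-large")

words = ["set", "an", "alarm", "tomorrow"]  # a hypothetical utterance
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state[0]  # (num_subwords, 1024)

# Keep only the hidden state of the first subword of each word.
first_subword_idx, seen = [], set()
for i, word_id in enumerate(enc.word_ids()):
    if word_id is not None and word_id not in seen:
        seen.add(word_id)
        first_subword_idx.append(i)
word_reprs = hidden[first_subword_idx]  # (num_words, 1024), one row per word
```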
We set the fertility classifier as a 3-way classifier since the maximum label number for each token in the dataset is 3. We train X2Parser using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-5 and a batch size of 32. We follow Li et al. (2020a) and use the exact match accuracy to evaluate the models. For our model, the prediction is considered correct only when the predictions for the coarse-grained intent, fine-grained intents, and the slots are all correct. To ensure a fair comparison, we use the same three random seeds to run each model and calculate the averaged score for each target language and domain.
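As a sanity check of the metric, here is a minimal sketch of this all-or-nothing exact match criterion (the dictionary field names are our own, not the dataset's):

```python
def exact_match(pred: dict, gold: dict) -> bool:
    """A prediction counts only if the coarse-grained intent, the
    fine-grained intent sequence, and the slot labels are all correct."""
    return (pred["coarse_intent"] == gold["coarse_intent"]
            and pred["fine_intents"] == gold["fine_intents"]
            and pred["slots"] == gold["slots"])

def exact_match_accuracy(preds: list, golds: list) -> float:
    assert len(preds) == len(golds)
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```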

Main Results
Cross-Lingual Setting As we can see from Table 1, X2Parser achieves similar performance in English compared to the Seq2Seq-based models, while it significantly outperforms them in the zero-shot cross-lingual setting, with a ∼10% accuracy improvement on average. In the English training process, the Seq2Seq-based models can learn well, from numerous training data, which spans of tokens need to be copied and assigned to which label types. However, these models easily lose effectiveness when the input sequences are in target languages, due to the inherent variances across languages and the difficulty of generating hierarchical representations. X2Parser separates the TCSP task into individual intent and slot predictions, which lowers the task difficulty and boosts its zero-shot adaptation ability to target languages. Interestingly, we find that, compared to Seq2Seq w/ XLM-R, X2Parser greatly boosts the performance on target languages that are topologically close to English (e.g., French (fr)), by more than 10 points, while the improvements for languages that are topologically distant from English (e.g., Thai (th) and Hindi (hi)) are relatively limited. We argue that the large discrepancies between English and these languages lower the quality of the representation alignment between English and Thai (or Hindi) in XLM-R, and that their different language patterns lead to unstable slot and intent predictions. These factors limit X2Parser's improvement when adapting to topologically distant languages.
From Table 1, although NLM achieves marginally lower performance in English compared to Seq2Seq w/ XLM-R, it produces significant improvements in target languages. This can be attributed to the fact that NLM leverages the same task decomposition as X2Parser, which further indicates the effectiveness of decomposing the TCSP task into intent and slot predictions for low-resource scenarios. Additionally, X2Parser surpasses NLM by ∼2% exact match accuracy on average in target languages. We conjecture that the stacked layers in NLM could make the model confused about which layer needs to generate which entity types, and this confusion is aggravated in the zero-shot cross-lingual setting where no training data are available. In contrast, our fertility-based method helps the model implicitly learn the structure of hierarchical slots by predicting the number of labels for each token, which allows the slot classifier to predict the slot types more easily in the cross-lingual setting.

Cross-Domain Setting As shown in Table 2, X2Parser and NLM notably surpass the Seq2Seq model, with ∼15% improvements on the averaged scores. This can be largely attributed to the effectiveness of our proposed task decomposition for low-resource scenarios. Seq2Seq models need to learn when to generate a label, when to copy tokens from the inputs, and when to produce the end of a label in order to generate hierarchical representations. This generation process requires a relatively large number of data samples to learn, which leads to the Seq2Seq model's weak few-shot cross-domain performance. Furthermore, X2Parser outperforms NLM by a ∼2% averaged score. We conjecture that our fertility classifier guides the model to learn the inherent hierarchical information from the user queries, making it easier for the slot classifier to predict slot types for each token. In contrast, NLM's slot classifier, which consists of multiple stacked layers, needs to capture the hierarchical information and correctly assign slot labels of different levels to the corresponding stacked layers, which requires relatively more data to learn.
Cross-Lingual Cross-Domain Setting From Table 3 and Figure 4, we can further observe the effectiveness of our proposed task decomposition and X2Parser in the cross-lingual cross-domain setting. X2Parser and NLM consistently outperform the Seq2Seq model in all target languages of the target domains and boost the averaged exact match accuracy by ∼20%. Additionally, from Table 3, X2Parser also consistently outperforms NLM on all 11 domains and surpasses it by 3.84% accuracy on average. From Figure 4, X2Parser greatly improves on NLM in topologically distant languages (i.e., Hindi and Thai). This illustrates the powerful transferability and robustness of the fertility-based slot prediction, which enables X2Parser to retain good zero-shot cross-lingual performance after it is fine-tuned on the target domain.

Table 4: Zero-shot cross-lingual exact match accuracies for nested and non-nested (NN) cases.

Model      es NN / Nested   fr NN / Nested   de NN / Nested   hi NN / Nested   th NN / Nested   Avg. NN / Nested
Seq2Seq    56.21 / 29.38    48.11 / 32.83    46.02 / 20.25    37.84 / 22.30    33.27 / 13.56    44.29 / 23.66
NLM        65.65 / 41.95    61.02 / 42.91    56.90 / 37.94    36.48 / 24.36    34.15 / 15.70    50.84 / 32.57
X2Parser   66.69 / 39.19    63.45 / 44.28    58.43 / 39.71    42.64 / 28.55    35.96 / 16.67    53.43 / 33.68

Few-shot Analysis
We conduct few-shot experiments using different sample sizes from the target domain for the cross-domain and cross-lingual cross-domain settings. The few-shot results on the Event, News, and Recipe target domains for both settings are shown in Figures 5 and 6. We find that the performance of the Seq2Seq model is generally poor in both settings, especially when only 1% of the data samples are available. With the help of the task decomposition, NLM and X2Parser remarkably outperform the Seq2Seq model in various target domains for both the cross-domain and cross-lingual cross-domain settings across different few-shot scenarios (from 1% to 10%). Moreover, X2Parser consistently surpasses NLM in both settings across different few-shot scenarios, which further verifies the strong adaptation ability of our model. Interestingly, we observe that the improvement of X2Parser over Seq2Seq grows as the number of training samples increases. For example, in the cross-lingual cross-domain setting of the Event domain, the improvement goes from 20% to 30% as the training data increase from 1% to 10%. We hypothesize that in the low-resource scenario, the effectiveness of X2Parser is greatly boosted when a relatively large number of data samples are available, while the Seq2Seq model needs much more training data to achieve good performance.

Analysis on Nested & Non-Nested Data
To further understand how our model improves the performance, we split the test data of the MTOP dataset (Li et al., 2020a) into nested and non-nested samples. We consider user utterances that have neither fine-grained intents nor nested slots as non-nested data samples, and the rest of the data as nested data samples. As we can see from Table 4, X2Parser significantly outperforms the Seq2Seq model on both nested and non-nested user queries, with an average of ∼10% accuracy improvement in both cases. In addition, X2Parser also consistently surpasses NLM on all target languages in both the nested and non-nested scenarios, except for the Spanish nested case, which further illustrates the stable and robust adaptation ability of X2Parser.

Latency Analysis
We can see from Figure 7 that, as the output length increases, the latency discrepancy between the Seq2Seq-based model (Seq2Seq) and the sequence labeling-based models (NLM and X2Parser) becomes larger, and when the output length reaches 40 tokens (around the maximum length in MTOP), X2Parser achieves an up to 66% reduction in latency compared to the Seq2Seq model. This can be attributed to the fact that the Seq2Seq model has to generate the outputs token by token, while X2Parser and NLM generate all the outputs directly. In addition, the inference speed of X2Parser is slightly faster than that of NLM. This is because NLM uses several stacked layers to predict slot entities of different levels, and a higher-level layer has to wait for the predictions from the lower-level layer, which slightly decreases the inference speed.
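The effect of output length on autoregressive decoding can be illustrated with a toy benchmark; the stand-in module below is obviously not the real decoder, it merely contrasts 40 sequential calls with one parallel call, which is the essence of the latency gap:

```python
import time
import torch
import torch.nn as nn

step = nn.Linear(1024, 1024)   # stand-in for one decoding step / forward pass
x = torch.randn(1, 40, 1024)   # 40 output positions (~MTOP maximum)

def avg_time(fn, runs: int = 100) -> float:
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs

ar = avg_time(lambda: [step(x[:, i]) for i in range(x.size(1))])  # token by token
nar = avg_time(lambda: step(x))                                   # all at once
print(f"autoregressive: {ar * 1e3:.2f} ms, non-autoregressive: {nar * 1e3:.2f} ms")
```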
Conclusion

In this paper, we develop a transferable and non-autoregressive model (X2Parser) for the TCSP task that can better adapt to target languages and domains with a faster inference speed. Unlike previous TCSP models that learn to generate hierarchical representations, we propose to decompose the task into intent and slot predictions so as to lower its difficulty, and then we cast both prediction tasks into sequence labeling problems. After that, we further propose a fertility-based method to cope with the slot prediction task, where each token could have multiple labels. Results illustrate that X2Parser significantly outperforms strong baselines in all low-resource settings. Furthermore, our model is able to reduce the latency by up to 66% compared to the generation-based model.

A Intent Label Construction
In this section, we further describe how we convert the fine-grained intent prediction into a sequence labeling task (where each token has only one label), using a few examples to illustrate our intent label construction method. As illustrated in Figure 8, when there are no nested intents in the input utterance, we follow the BIO structure to assign intent labels. From Figure 9, we can see that "call Grandma" expresses a CREATE-CALL intent and "Grandma" a GET-CONTACT intent; hence, the GET-CONTACT intent is nested in the CREATE-CALL intent. We use a special intent label with a "NESTED" suffix for the GET-CONTACT intent (B-GET-CONTACT-NESTED) to represent that this intent is nested in another intent, and hence the scope of the CREATE-CALL intent is automatically expanded from "call" to "call Grandma". Note that we cannot apply this labeling method to the slot prediction, since one token in the user utterance could be the starting token of more than one slot entity; in that case, we would have to use more than one slot label for this token to denote the starting position of each slot entity. Given that, in the MTOP dataset, one token is never the starting token of more than one intent, we can apply this method for the intent label construction. In the future, when more complex and sophisticated datasets are collected for task-oriented compositional semantic parsing, where there could exist more than one intent label per token, we can always use the fertility-based method (currently applied to the slot prediction) for the intent prediction.
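A minimal sketch of this intent labeling scheme follows (the span format and helper are hypothetical; the rule that nested spans overwrite the outer intent's labels with a "-NESTED" suffix follows the description above):

```python
from typing import List, Tuple

def build_intent_labels(tokens: List[str],
                        spans: List[Tuple[int, int, str, bool]]) -> List[str]:
    """Assign exactly one BIO intent label per token.

    `spans` holds (start, end, intent_type, is_nested) token spans. Nested
    spans are applied last so they overwrite the outer intent's labels,
    which implicitly expands the outer intent's scope over them.
    """
    labels = ["O"] * len(tokens)
    for start, end, intent, nested in sorted(spans, key=lambda s: s[3]):
        suffix = "-NESTED" if nested else ""
        for i in range(start, end):
            labels[i] = ("B-" if i == start else "I-") + intent + suffix
    return labels

# The "call Grandma" example from Figure 9, with hypothetical spans.
print(build_intent_labels(
    ["call", "Grandma"],
    [(0, 2, "CREATE-CALL", False), (1, 2, "GET-CONTACT", True)]))
# -> ['B-CREATE-CALL', 'B-GET-CONTACT-NESTED']
```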

B Data Statistics
The data statistics for MTOP are shown in Table 5.

C Few-shot Cross-Domain Results
Full few-shot cross-domain results across all 11 target domains are shown in Figure 10 and Table 6.

D Few-shot Cross-Lingual Cross-Domain Results
Full few-shot cross-lingual cross-domain results across all 11 target domains are shown in Figure 11 and Tables 7, 8, 9, 10, and 11.

Figure 1: Illustration of the cross-lingual task, cross-domain task, and the combination of both (X2 task).

Figure 2: One data example with the illustration of our proposed flattened intent and slot representations, as well as the hierarchical representations used in Li et al. (2020a).

Figure 5: Few-shot exact match results on the cross-domain setting for the Event, News, and Recipe target domains.

Figure 6: Few-shot exact match results on the cross-lingual cross-domain setting for the Event, News, and Recipe target domains. The results are averaged over all target languages.

Figure 7: Averaged latencies for our model and baselines on different output lengths of the MTOP dataset.

Figure 8: A labeling example for a non-nested intent.

Figure 9: A labeling example for a nested intent.

Figure 10: Few-shot exact match accuracies for the cross-domain setting across all 11 target domains.

Figure 11: Few-shot exact match accuracies for the cross-lingual cross-domain setting across all 11 target domains. The results are averaged over all target languages.

Table 2: Exact match accuracies (averaged over three runs) for the cross-domain setting in English. The scores represent the performance for the corresponding target domains. We use 10% of training samples in the target domain. "Seq2Seq" denotes the "Seq2Seq w/ XLM-R" baseline (same for the following tables and figures). Columns: Model, Alarm, Call., Event, Msg., Music, News, People, Recipe, Remind, Timer, Weather, Avg.

Table 5: Data statistics of the MTOP dataset. The data are roughly divided into a 70:10:20 percent split for train, eval, and test.

Table 6: Complete results of the cross-domain setting.

Table 7: Complete results of the cross-lingual cross-domain setting in Spanish.

Table 8: Complete results of the cross-lingual cross-domain setting in French.

Table 9: Complete results of the cross-lingual cross-domain setting in German.

Table 10: Complete results of the cross-lingual cross-domain setting in Hindi.

Table 11: Complete results of the cross-lingual cross-domain setting in Thai.