MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue

Task-oriented dialogue (TOD) systems have been widely deployed in many industries as they deliver more efficient customer support. These systems are typically constructed for a single domain or language and do not generalise well beyond this. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++ extends the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages (Spanish, Marathi, Turkish, and Amharic), in two domains (BANKING and HOTELS). Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals, and therefore allows us to measure the realistic performance of TOD systems in a varied set of the world's languages. We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the NLU tasks of intent detection and slot labelling for TOD systems in the multilingual setting. The results demonstrate the challenging nature of the dataset, particularly in the low-resource language setting, offering ample room for future experimentation in multi-domain multilingual TOD setups.


Introduction
Task-oriented dialogue (TOD) systems (Gupta et al., 2006; Young et al., 2013), in which conversational agents assist human users to achieve their specific goals, have been used to automate telephone-based and online customer service tasks in a range of domains, including travel (Raux et al., 2003, 2005), finance and banking (Altinok, 2018), and hotel booking (Li et al., 2019).
Figure 1: Example from the MULTI3NLU++ dataset for the BANKING domain, demonstrating the complex NLU task of multi-label intent detection across multiple languages. Intent labels consist of generic and domain-specific intents.

TOD systems are often implemented as a pipeline of dedicated modules (Raux et al., 2005; Young et al., 2013). The Natural Language Understanding (NLU) module performs two crucial tasks:
1) intent detection and 2) slot labelling. In the intent detection task, the aim is to identify or classify the goal of the user's utterance from several pre-defined classes (or intents) (Tur et al., 2010). These intents are then used by the policy module (Gašić et al., 2012; Young et al., 2013) to decide the conversational agent's next move, in order to mimic the flow of a human-human dialogue. In the slot labelling task, each token in an utterance is assigned a label describing the type of semantic information represented by the token. During this process, relevant information (e.g., named entities, times/dates, quantities) that constitutes the crucial content of the user's utterance is identified.
Although intent detection models can reach impressive performance and have been deployed in many commercial systems (Altinok, 2018; Li et al., 2019), they are still unable to fully capture the variety and complexity of natural human interactions, and as such do not meet the requirements for deployment in more complex industry settings (Casanueva et al., 2022). This is due in part to the limitations of existing datasets for training and evaluating TOD systems. As highlighted by Casanueva et al. (2022), these datasets are 1) predominantly limited to detecting a single intent, 2) focused on a single domain, and 3) restricted to a small set of slot types (Larson and Leach, 2022). Furthermore, the success of task-oriented dialogue is 4) often evaluated on a small set of higher-resource languages (typically English), which does not test how generalisable systems are to the diverse range of the world's languages (Razumovskaia et al., 2022a).
Arguably one of the most serious limitations, hindering their deployment to more complex conversational scenarios, is the inability to handle multiple intents. In many real-world scenarios, a user may express multiple intents in the same utterance, and TOD systems must be able to handle such scenarios (Gangadharaiah and Narayanaswamy, 2019). For example, in Figure 1, the user expresses two main intents: (i) informing that they have forgotten their pin, and thus (ii) requesting a new debit card instead. A single-intent detection system can detect either of the two intents (but not both), resulting in partial completion of the user's request. Casanueva et al. (2022) recently proposed a multi-label intent detection dataset to capture such complex user requests. They further propose using intent modules as intent labels that can act as sub-intent annotations. In this example, "pin" and "don't_know" compose the first intent while "request_info", "new", "debit", and "card" compose the second. The use of intent modules, thanks to their combinatorial power, can support more complex conversational scenarios, and also allows reusability of the annotations across multiple domains. For example, "request_info", "new", and "membership" can be reused for gyms, salons, etc., to request information about new memberships at the respective institutions.
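The modular annotation scheme can be illustrated with a short sketch. Label names follow the Figure 1 example above; the flattening into a single multi-label target is our illustration, not necessarily the dataset's exact storage format.

```python
# Each intent is a set of reusable "intent modules"; one utterance may
# carry several composed intents at once.
utterance = "I forgot my pin, so I'd like to request a new debit card instead."

intents = [
    {"pin", "don't_know"},                     # "I forgot my pin"
    {"request_info", "new", "debit", "card"},  # "request a new debit card"
]

# Generic modules recombine across domains, e.g. for a gym:
gym_intent = {"request_info", "new", "membership"}

# For multi-label classification, the annotation can be flattened into
# the union of all active modules.
labels = set().union(*intents)
```

Here the generic modules "request_info" and "new" are shared between the banking intent and the hypothetical gym intent, which is what makes the ontology reusable across domains.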
Furthermore, TOD systems are typically constructed for a single language. Their extension to other languages is restricted by the lack of available training data for many of the world's languages. Whilst the construction of multilingual TOD datasets has been given some attention (Razumovskaia et al., 2022a; Majewska et al., 2022; Xu et al., 2020), these datasets often include synthetic translations in the form of post-edited Machine Translation output (Ding et al., 2022; Zuo et al., 2021; Bellomaria et al., 2019). Post-editing may introduce undesirable effects and result in texts that are simplified, normalised, or exhibit interference from the source language as compared with manually translated texts (Toral, 2019).
To address all of the limitations discussed above, we propose MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset for training and evaluating TOD systems.
MULTI3NLU++ extends the recent monolingual English-only dataset NLU++, which is a multi-intent, multi-domain dataset for the BANKING and HOTELS domains. MULTI3NLU++ adds the element of multilinguality, and thus enables simultaneous cross-domain and cross-lingual training and experimentation for TOD NLU as its unique property.
MULTI3NLU++ includes expert manual translations of the 3,080 utterances in NLU++ into four languages of diverse typology and data availability: Spanish, Marathi, Turkish, and Amharic. The selection of languages covers a range of language families and scripts, and includes high-, medium-, and low-resource languages. Capturing language diversity is particularly important if we wish to design multilingual TOD systems that are robust to the variety of expressions used across languages to represent the same value or concept. Using MULTI3NLU++, we demonstrate the challenges involved in extending existing state-of-the-art Machine Translation systems and multilingual language models for NLU in TOD systems. MULTI3NLU++ is publicly available at https://huggingface.co/datasets/uoe-nlp/multi3-nlu.
Background and Related Work

NLU++ (Casanueva et al., 2022), which serves as the base for MULTI3NLU++, covers two domains: BANKING and HOTELS. The intent and slot ontologies include intents and slots which are general, cross-domain types, as well as domain-specific ones. The intent ontology includes a total of 62 intents, of which 23 and 14 are BANKING- and HOTELS-specific, respectively. The slot ontology includes 17 slot types, of which three and four are BANKING- and HOTELS-specific, respectively. Such intent and slot ontology construction allows for easier extension to new domains. In this work, we inherit monolingual NLU++'s core benefits over previous datasets, with the additional layer of multilinguality: 1) it is based on real customer data, which addresses the issue of low lexical diversity in crowd-sourced data (Larson and Leach, 2022); 2) it supports the requirement for production systems to capture multiple intents from a single utterance; and 3) it is slot-rich: it combines a large set of fine-grained intents with a large set of fine-grained slots to facilitate more insightful evaluation of models that jointly perform the two NLU tasks (Chen et al., 2019; Gangadharaiah and Narayanaswamy, 2019).
Multilingual NLU Datasets. Prior work has demonstrated the importance and particular challenges posed by low-resource languages (Goyal et al., 2022; Magueresse et al., 2020; Xia et al., 2021), while an increasing number of multilingually pretrained models (Xue et al., 2021; Conneau et al., 2020; Feng et al., 2022; Liu et al., 2020) enable significant improvements in processing them. While some tasks (e.g., NER (Adelani et al., 2021) or NLI (Ebrahimi et al., 2022)) already have benchmarks to evaluate on low-resource languages, dialogue Natural Language Understanding (NLU) is still lagging behind in this respect. The reasons for this include the high cost of data collection, as well as specific challenges posed by dialogue, e.g., the colloquial speech or tone used in live conversations. Additionally, while we have observed increased interest in few-shot methods for multilingual dialogue NLU (Bhathiya and Thayasivam, 2020; Moghe et al., 2021; Feng et al., 2022; Razumovskaia et al., 2022b, among others), none of the existing datasets (Xu et al., 2020; FitzGerald et al., 2022; van der Goot et al., 2021) allow for reproducible comparison between few-shot methods. In other words, no dataset to date has provided predefined splits for few-shot experiments.
Until recently, resources for multilingual NLU have been scarce (Razumovskaia et al., 2022a). The vast majority were built upon the English ATIS dataset (Price, 1990), which has been extended to ten target languages (Upadhyay et al., 2018; Dao et al., 2021; Xu et al., 2020). However, ATIS covers only one domain (airline booking) and was claimed to be almost solved more than a decade ago (Tur et al., 2010). More recent datasets cover multiple domains and a broader linguistic geography (Majewska et al., 2022; Schuster et al., 2019; FitzGerald et al., 2022), including domains such as music or alarm, which are in frequent use in production systems.
All of the existing multilingual NLU datasets label every user utterance with a single intent, although current production-level systems often rely on multi-intent labelling (Gangadharaiah and Narayanaswamy, 2019; Qin et al., 2020), allowing for faster development cycles and updates if a new, previously unseen intent is observed (Casanueva et al., 2022). MULTI3NLU++ is the first multilingual, multi-intent dataset with modular intent annotations. In comparison to the other multilingual NLU datasets presented in Table 1, MULTI3NLU++ has a larger intent set and is natively multi-intent/multi-label. Unlike xSID, which is an evaluation-only dataset, it contains both training and evaluation data for all languages. Further, while MASSIVE is based on utterances specifically generated for the dataset (Bastianelli et al., 2020), MULTI3NLU++ is based on real user inputs to a system in industrial settings (Casanueva et al., 2022). MULTI3NLU++ enables systematic comparisons of dialogue NLU systems in few-shot setups for cross-lingual and cross-domain transfer for low-, medium-, and high-resource languages.

Dataset Collection
Our data collection process focuses on creating datasets with natural and realistic conversations, avoiding many artefacts that arise from crowdsourcing or automatic translation. We ask professional translators to manually translate each source-language utterance into the four target languages in the dataset; this promotes equal opportunity for future research in all four languages and enables comparative cross-language analysis.
Choice of Base Dataset. This work aims to create a multilingual dataset which would be useful for testing production-like systems in multiple languages. Thus, we chose NLU++ (Casanueva et al., 2022) as our base dataset because i) it consists of real-world examples, and ii) every example is labelled with multiple intents, which has proven useful in production settings (Qin et al., 2020; Casanueva et al., 2022).
Prior work on English multi-intent classification often uses MixSNIPS and MixATIS as benchmarks (Qin et al., 2020). However, the multi-intent examples in these datasets are synthetic, i.e., they are obtained through a simple concatenation of two single-intent examples, e.g., "Play this song and book a restaurant". This process leads to repetitive content and unnatural examples in the datasets. In contrast, NLU++ consists of intrinsically multi-intent examples such as "I cannot login to my account because I don't remember my pin.", and is much more semantically variable.
Languages. We selected Spanish, a widely-spoken Romance language, as our high-resource language. Marathi, an Indo-Aryan language predominantly spoken in the Indian state of Maharashtra, is our medium-resource language. Turkish, an agglutinative Turkic language, may be regarded as a low-resource language from a Machine Translation perspective, or as a medium-resource language based on the amount of training data in XLM-R (Conneau et al., 2020), which contains more Turkish data than, e.g., Marathi (which we consider to be a medium-resource language). Amharic, an Ethiopian Semitic language belonging to the Afro-Asiatic language family, is our low-resource language. Spanish and Turkish are written in Latin script, Marathi in Devanagari, and Amharic in Ge'ez script.
Manual Translation. The use of crowd-sourcing agencies to collect multilingual datasets has often resulted in crowd workers using a Machine Translation (MT) API (plus post-editing) to complete the task, or even simply transliterating the given sentence (Goyal et al., 2022). In the case of post-editing MT output, translations may exhibit post-editese: post-edited sentences are often simplified, normalised, or exhibit a higher degree of interference from the source language than manual translations (Toral, 2019). In our case, we wish to preserve the register of the original utterances, in particular with respect to the colloquial nature of many of the utterances in the original English NLU++ dataset. Furthermore, we wish to collect high-quality, natural translations. We therefore opted to recruit professional translators to perform manual translation, via two online platforms: Proz.com (www.proz.com/), a platform for recruiting freelance translators who self-quote their remuneration, and Blend Express (www.getblend.com/online-translation/language). We instructed our translators to treat the task as a creative writing task (Ponti et al., 2020) and to maintain the colloquial nature of the utterances. We also provided instructions to annotate the slot spans in the generated translations. We provide the instructions given to our translators in Appendix B. We recruited three translators per language: Spanish translators were recruited via Proz.com, and translators for the remaining languages were assigned to us by Blend Express.

We first conducted a pilot task in which we asked the professional translators to translate and annotate 50 sentences per domain. We conducted an in-house evaluation by native speakers of the respective languages to verify that the translations were colloquial, that named entities were appropriately translated, and that the translations were of high quality. These evaluations were communicated to the translators. After the pilot, we asked the same translators to complete the translation of the remaining utterances in the dataset. We ran an automatic checker to ensure that the slot values marked by the translators were present within their translated sentences. Further corrections, such as incorrect annotations, were also communicated to the translators.
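The automatic slot-span check mentioned above can be approximated in a few lines. This is a simplified sketch: the actual checker and data format are not described here, and the Spanish example is invented for illustration.

```python
def missing_slot_values(translation: str, slot_values: list[str]) -> list[str]:
    """Return annotated slot values that do not occur verbatim in the
    translated sentence; an empty list means the annotation passes."""
    return [value for value in slot_values if value not in translation]

# Hypothetical Spanish utterance with one annotated date slot.
sentence = "Quiero reservar una habitación para el 5 de mayo"

ok = missing_slot_values(sentence, ["5 de mayo"])      # [] -> passes
flagged = missing_slot_values(sentence, ["6 de mayo"])  # value not in text
```

Any flagged values would then be sent back to the translators for correction, as described above.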
Duration and Cost. Data collection was carried out over five months and involved (i) selecting translation agencies, (ii) running the pilot task, (iii) providing feedback to the translators, and (iv) the full-fledged data collection phase. The professional translators were compensated at £0.06/word for Spanish and £0.07/word for the remaining languages. The total cost of MULTI3NLU++ is £7,624; see Appendix C for further details.
Slot Span Verification. Initially, the translators performed slot labelling simultaneously with translation. To ensure the quality of the annotation, the translated data was revised and annotated by three native speakers of the target language. Similar to Majewska et al. (2022), we used an automated inter-annotator reliability method to verify the annotation quality. We collected slot span labels for 200 examples in Spanish, covering both the BANKING and HOTELS domains, from three native Spanish annotators. The accuracy score for the given sub-sample was 86.8%, revealing high agreement between the annotators.

Baseline Experiments
We benchmark several state-of-the-art approaches on MULTI3NLU++ to 1) provide reference points for future dataset use and 2) demonstrate various aspects of multilingual dialogue NLU systems which can be evaluated using the dataset, such as cross-lingual and cross-domain generalisation. We provide baseline numbers for intent detection and slot labelling and analyse the performance across languages and methods, and for different sizes of training data. We follow the main experimental setups from Casanueva et al. (2022) where possible, but we extend them to multilingual contexts.
Training Data Setups. We follow an N-fold cross-validation setup, following Casanueva et al. (2022). The experiments were run in three setups: low, mid, and large. The low data setup corresponds to 20-fold cross-validation, where the model is trained on 1/20th of the dataset and tested on the remaining 19 folds. The mid and large setups correspond to 10-fold cross-validation: in the mid setup the model is trained on 1/10th of the data and tested on the other nine folds, while in the large setup the model is trained on nine folds and tested on the remaining one.

Domain Setups. MULTI3NLU++ contains training and evaluation data for two domains: BANKING and HOTELS. It thus enables evaluation of NLU systems in three domain setups: in-domain, cross-domain, and all-domain. In the in-domain setup, the model is trained and evaluated on the same domain; that is, we test how well the model can generalise to unseen user utterances while operating in the same intent space as the training data and without any domain distribution shift. In the cross-domain setup, the model is trained on one domain and tested on the other. We evaluate on the union of the label sets of the two domains, rather than on the intersection as done by Casanueva et al. (2022). In this setup, we test how well a model can generalise to a new, unseen domain, including intents unseen in training. In the all-domain setup, we train and evaluate the models on data from both domains. In this setup, we test how models perform on the larger label set (where some labels are shared between the domains) when examples are provided for all classes.
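The three training data setups can be made concrete with a small sketch. The contiguous fold construction here is only illustrative; the released dataset provides its own predefined folds.

```python
def make_folds(n_examples: int, n_folds: int) -> list[list[int]]:
    """Split example indices into n_folds equal-sized contiguous folds."""
    indices = list(range(n_examples))
    size = n_examples // n_folds
    return [indices[i * size:(i + 1) * size] for i in range(n_folds)]

N = 3080  # utterances per language in MULTI3NLU++

# low: 20-fold CV -- train on one fold, test on the remaining 19.
low_train = make_folds(N, 20)[0]

# mid: 10-fold CV -- train on one fold, test on the other nine.
mid_train = make_folds(N, 10)[0]

# large: 10-fold CV -- train on nine folds, test on the remaining one.
large_train = [i for fold in make_folds(N, 10)[:9] for i in fold]
```

This yields 154 training examples in the low setup, 308 in the mid setup, and 2,772 in the large setup.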
Language Setups. MULTI3NLU++, offering comparable sets of annotated data points across languages, allows for systematic comparisons of multilingual dialogue NLU systems on languages with different amounts of resources and diverse typological properties. We evaluate NLU systems in two setups: in-language and cross-lingual. Cross-lingual benchmarking is conducted with the established approaches: (i) direct transfer using multilingually pretrained large language models (e.g., XLM-R; Conneau et al., 2020), and (ii) transfer via translation, i.e., where the test utterances are translated into the source language (Translate-Test). We source our translations from the M2M100 translation model (Fan et al., 2021).

Classification-Based Methods
Experimental Setup and Hyperparameters. We evaluate two standard classification approaches to intent detection: (i) MLP-based classification with a fixed encoder, and (ii) full-model fine-tuning. Prior work has demonstrated that strong intent detection results can be attained without fine-tuning the full encoder model, both in monolingual (Casanueva et al., 2020) and multilingual setups (Gerz et al., 2021). The idea is to use a fixed, efficient sentence encoder to encode sentences and to train only a multi-layer perceptron (MLP) classifier on top of it to identify the intents. As we are dealing with multi-label classification, a sigmoid layer is stacked on top of the classifier. Intent classes for which the probability scores are higher than a (predefined) threshold are considered active. As in Casanueva et al. (2022), we use a threshold of 0.3. In the experiments, we evaluate two state-of-the-art multilingual sentence encoders: 1) mpnet, a multilingual sentence encoder trained using multilingual knowledge distillation (Reimers and Gurevych, 2020), and 2) LaBSE, a language-agnostic BERT sentence encoder (Feng et al., 2022) trained using dual-encoder training. LaBSE was especially tailored to produce improved sentence encodings for low-resource languages. The models were loaded from the sentence-transformers library (Reimers and Gurevych, 2019).
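The multi-label decision rule of the MLP-based approach can be sketched as follows. The 512-dimensional tanh hidden layer, the 62-intent label set, and the 0.3 threshold follow the setup described in this section; the random weights stand in for a trained classifier, and the random vector stands in for the output of a fixed sentence encoder such as LaBSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a fixed multilingual sentence encoder:
# in practice, embedding = encoder.encode(utterance).
embedding = rng.normal(size=768)

# MLP head: one 512-dim hidden layer with tanh, then per-intent logits.
W1, b1 = 0.01 * rng.normal(size=(512, 768)), np.zeros(512)
W2, b2 = 0.01 * rng.normal(size=(62, 512)), np.zeros(62)  # 62 intents

hidden = np.tanh(W1 @ embedding + b1)
logits = W2 @ hidden + b2
probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per intent

THRESHOLD = 0.3  # following Casanueva et al. (2022)
active_intents = np.flatnonzero(probs > THRESHOLD)  # multi-label prediction
```

Because each intent gets an independent sigmoid rather than a shared softmax, any number of intents (including zero) can be active for one utterance.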
In the full fine-tuning setup, the full encoder model (XLM-R in our experiments) is fine-tuned together with the classification head.

Table 3: In-language in-domain intent detection performance for Amharic and Spanish (F1 × 100). Results for other languages are provided in Appendix D, Table 12.
All MLP-based models were trained with the same hyperparameters, following the suggested values from Casanueva et al. (2022). The MLP classifier comprises one 512-dimensional hidden layer with tanh as the non-linear activation. The learning rate is 0.003 for MLP-based models and 2e-5 for full-model fine-tuning, with linear weight decay in both setups. For all setups, the models were trained with AdamW (Loshchilov and Hutter, 2019) for 500 and 100 epochs for intent detection and slot labelling, respectively. We used a batch size of 32. The evaluation metric is micro-F1.
Results and Discussion. Table 2 presents the comparison between full fine-tuning based on XLM-R and the MLP-based classification approach. The results demonstrate that for the in-domain in-language 10-fold setup, the MLP-based approach works consistently better than full fine-tuning across domains and languages. We assume that the reason is that the MLP-based approach is more parameter-efficient than full fine-tuning, making it better suited to such low-data setups. Due to their computational efficiency combined with competitive performance, we focus on MLP-based models for the remainder of this section.
The main MLP-based intent detection results are presented in Tables 3 and 4 for in-domain intent detection in the in-language and zero-shot cross-lingual setups, respectively. When we compare the performance on low- and high-resource languages, although the sizes and content of the training data are the same across languages, we observe a large gap between the performance of the same models on Spanish and Amharic. In addition, the results in Table 3 reveal the properties of the multilingual sentence encoders with respect to the intent detection task. While mpnet performs consistently better on Spanish (our high-resource language), LaBSE is a much stronger encoder for low-resource Amharic. The differences are especially pronounced in the low-data setups (20-Fold and 10-Fold). While for Spanish this difference can be recovered with more training data (cf. the Large setup), for Amharic these differences persist and are amplified across larger training data setups.
We consider zero-shot transfer from two source languages: English (the most commonly used high-resource source language) and Amharic (our low-resource option). Surprisingly, the results in Table 4 suggest that using English as the default source language, as typically done in work on cross-lingual transfer, might not be optimal. In fact, using Amharic as the source language leads to stronger transfer results across languages. One trend we observed is that the lower-resource the source language is, the stronger the performance on diverse target languages. We speculate that training on lower-resource language data might 'unearth' the sentence encoders' multilingual capabilities.
The main results for slot labelling are presented in Table 5 for the in-domain in-language and zero-shot cross-lingual setups. The comparison of in-language results for the high-resource (en) and low-resource (am) languages demonstrates a similar trend to intent classification: in-language performance on the high-resource language is stronger than on the low-resource language. However, unlike for intent detection, the high-resource language serves as a stronger source for cross-lingual transfer than the low-resource language.

QA-Based Methods
Experimental Setup and Hyperparameters. Reformulating TOD tasks as question-answering (QA) problems has achieved state-of-the-art performance (Namazifar et al., 2021). As these methods were the best performing on the NLU++ dataset, we now investigate this approach in the multilingual setting for both NLU tasks. To formulate intent detection as an extractive QA task, the utterance is prefixed to form the context "yes. no. [UTTERANCE]". Intent labels are converted into questions of the form "Is the intent to ask about [INTENT_DESCRIPTION]?", where INTENT_DESCRIPTION is the free-form description of the intent. The QA model must learn to predict the span "yes" for all questions corresponding to the intents expressed in the utterance, and "no" otherwise. During the evaluation of transfer performance to other languages, the utterance is in the target language while the question is in the source language. For slot labelling, the context template is "none. [UTTERANCE]". We follow the strict evaluation approach for slot labelling: a span is marked as correct only if it exactly matches the ground-truth span (Namazifar et al., 2021).
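The QA reformulation can be illustrated with template construction. This is a sketch based on the description above: the exact spacing of the templates and the wording of the slot question are our assumptions, and the utterance is invented.

```python
def intent_as_qa(utterance: str, intent_description: str) -> dict:
    """Cast one intent check as extractive QA: the context offers the
    literal answer spans "yes" and "no" ahead of the utterance."""
    return {
        "context": f"yes. no. {utterance}",
        "question": f"Is the intent to ask about {intent_description}?",
    }

def slot_as_qa(utterance: str, slot_question: str) -> dict:
    """Cast slot labelling as extractive QA: the gold answer is the slot
    value span inside the utterance, or "none" if the slot is absent."""
    return {
        "context": f"none. {utterance}",
        "question": slot_question,
    }

example = intent_as_qa(
    "I forgot my pin, send me a new debit card",
    "getting a new card",
)
# A correctly trained model should extract the span "yes" from the context.
```

One QA instance is built per intent label, so a single utterance yields as many question-context pairs as there are intents in the ontology.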
To build an extractive QA model for these tasks, we first fine-tune a multilingual language model, mDeBERTa (He et al., 2021), on a general-purpose QA dataset, SQuADv2 (Rajpurkar et al., 2018), and then on the respective languages from our dataset. We fine-tune for 10 epochs (5 for the Large setting) with a learning rate of 1e-5, a weight decay of 0, and a batch size of 4.
Results and Discussion. We report the results for intent detection in the 20-Fold setup using English and Amharic in Table 6; the remaining results are in Appendix F. We also compare cross-lingual transfer with translation-based methods. We find that mDeBERTa-based QA models achieve results for English multi-intent detection comparable even with the monolingual models in Casanueva et al. (2022). We notice a drastic drop in zero-shot transfer performance across all languages and domains. This is in line with recent findings that zero-shot transfer is harder for dialogue tasks (Ding et al., 2022; Hung et al., 2022; Majewska et al., 2022) than for other cross-lingual tasks (Hu et al., 2020). Unlike the observation in §4.1, using a higher-resource language is beneficial for transfer learning in the QA setup. We find that improvement in transfer performance depends more on the matching of the scripts of the source and target languages, followed by the amount of training data present per language during the pre-training of the mDeBERTa model. The exception is Amharic as the source language, where transfer performance is poor across the board. The scores also indicate that QA models are better at this task than MLP methods, cf. Table 5. When comparing zero-shot cross-lingual transfer with Translate-Test, we find that the latter is poorer (except in the case of EN-MR), opposing the findings in Majewska et al. (2022). We attribute this to the poor quality of the translations, as repetition and nonsensical generation are rampant in them. Further, translation-based methods incur the additional overhead of translating the sentences.
We find that the standard deviation is quite variable across both tasks. We check the standard deviations across different data setups and find that the trend is consistent irrespective of the underlying training data. There is also no visible trend relating to language similarity or the amount of training data present during the pre-training of mDeBERTa.
Table 7 reports the slot labelling results. Unlike the intent detection results, slot labelling results are comparable across the languages in the same data settings. Similar to the intent detection results, using mDeBERTa QA models yields better results than those reported in Table G for high-resource languages. Overall, we find that QA-based methods are indeed a promising research avenue (Liu et al., 2021b), even in the multilingual setting (Zhao and Schütze, 2021).

Discussion and Future Work

High-Resource and Low-Resource Languages. Language models such as mDeBERTa and XLM-R are pretrained on ∼100 languages. However, their representational power is uneven for high- and low-resource languages (Lauscher et al., 2020; Ebrahimi et al., 2022; Wu et al., 2022). MULTI3NLU++ includes the same training and evaluation data for all languages, allowing us to systematically analyse model performance on dialogue NLU for both high- and low-resource languages. We compare the performance for different languages in the in-domain setup. As seen in Figure 2a, the overall trend in performance is the same across languages: with more training data, we gain higher performance overall. Interestingly, the absolute numbers are indicative of the resources available in pre-training for a given language. For instance, Amharic (am) has the lowest performance while Spanish (es) has the highest.

Cross-Domain Generalisation. We now consider the intent detection task in the cross-domain in-language setting. The results in Figure 2b corroborate the findings from the in-domain experiments: the lower-resource the language, the lower the performance on the task. It is noticeable that performance in the cross-domain setup is much lower than in the in-domain setup, additionally exposing the complexity of MULTI3NLU++. Additionally, in Figure 2b we observe that in the cross-domain setup, high-resource languages benefit more from the increase in training data size than low-resource languages. This shows that the gap in performance between low- and high-resource languages is rooted not only in the amount of in-task training data available but also in the representational power of multilingual models for low-resource languages.
Future Directions. Our work focuses on collecting high-quality parallel data through expert translators. With recent advances in MT (Kocmi et al., 2022), it would be worth investigating whether the quality of datasets collected with machine translation plus human post-editing (Hung et al., 2022; Ding et al., 2022) is on par with that of human translation, especially for higher-resource language pairs. Another possible direction for future work would be to further diversify the dataset by including additional languages, e.g., with a focus on increasing the coverage of language families, branches, and/or scripts, or of properties that pose particular challenges in multilingual settings (e.g., free word order in Machine Translation).
We believe our resource will have an impact beyond multilingual multi-intent multi-domain systems.We hope the community addresses interesting questions in data augmentation (generating paraphrases with multiple intents), analysing representation learning in multilingual models, translation studies, and MT evaluation.

Conclusion
We collected MULTI3NLU++, a dataset that facilitates multilingual, multi-label, multi-domain NLU for task-oriented dialogue. Our dataset incorporates core properties from its predecessor NLU++ (Casanueva et al., 2022): a multi-intent and slot-rich ontology, a mixture of generic and domain-specific intents for reusability, and utterances that are based on complex scenarios. We investigated these properties in a multilingual setting for Spanish, Marathi, Turkish, and Amharic.
We implemented MLP-based and QA-based baselines for intent detection and slot labelling across different data setups, transfer-learning setups, and multilingual models. From a wide set of observations, we highlight that (i) there is a significant drop in performance across all languages compared to NLU++, with the drop growing as we move from high- to low-resource languages; (ii) zero-shot performance in the MLP setup improves when the source language is lower-resource; and (iii) cross-lingual transfer in QA-based intent detection depends on matching the script of the source language and on the amount of data seen during pretraining.
We hope that the community finds this dataset valuable while working on advancing research in multilingual NLU for task-oriented dialogue.

Limitations
MULTI3NLU++, like NLU++ on which it is based, comprises utterances extracted from real dialogues between users and conversational agents, as well as synthetic human-authored utterances constructed to introduce additional combinations of intents and slots. The utterances therefore lack the wider context that would be present in a complete dialogue. As such, the dataset cannot be used to evaluate systems with respect to discourse-level phenomena present in dialogue.
Our source dataset includes scenarios and names that are Anglo-centric and will not capture the nuances of intent detection at the regional and cultural level. Future efforts in collecting multilingual datasets should focus on appropriate localisation of the covered domains, intents, and scenarios (Liu et al., 2021a; Majewska et al., 2022).

Ethics Statement
Our dataset builds on the NLU++ dataset (Casanueva et al., 2022), which was collected with all personal information removed. All names in the dataset were generated by randomly combining first names and surnames from a list of the top 10K names in the US registry.
Our data collection process was thoroughly reviewed by the School of Informatics, University of Edinburgh under the number 2019/59295. Our translators are on legal contracts with the respective translation agencies (see Appendix C for details).
Although we have carefully vetted our datasets to exclude problematic examples, the larger problem of unethical use and unfairness in conversational systems cannot be neglected (Dinan et al., 2021). Our work also uses multilingual language models, which have been shown to harm marginalised populations (Kaneko et al., 2022). Our dataset is publicly available under the Creative Commons Attribution 4.0 International (CC-BY-4.0) licence.

B Translation Guidelines
We are an academic team interested in evaluating modern automatic machine translation systems for building better multilingual chatbots. The sentences that you create should be colloquial. We do not want exact translations, but rather what you would naturally say if expressing the given content in Spanish. Please do not use any form of machine translation during the process. The sentences in this document are spoken to a customer service bot that provides banking services (e.g., making transfers, depositing cheques, reporting lost cards, requesting mortgage information) and hotel 'bell desk' reception tasks (e.g., booking rooms, asking about pools or gyms, requesting room service).
Please translate the sentences under 'source_text'.
Please write the translation under 'target_text'. In the translation, please:
• maintain the meaning and style as close to the English text as possible.
Example: "Exactly, it was declined". It is important that the colloquial word "Exactly" is reflected in the translation.
• if the example includes pronouns (e.g., "this one"), maintain the pronouns in the translation as well.
• you are encouraged to translate proper names and time values in the most natural form for the target language.
Example: "cancel the one at 2:35 p.m. on Dec 8th". The "2:35 p.m." can be translated in whatever way time is usually expressed in the target language, e.g., "25 minutes to 3" or "2:35 in the day", if "p.m." is non-existent in the target language.
• if there is no exact translation for a concept, or the concept is absent from the culture, feel free to substitute it with a description of the concept or with a similar concept familiar to the population.
Example: "book a hotel via Booking.com". If there is no access to Booking.com, feel free to substitute it with "book a hotel by phone".
You may observe that some of the sentences have spans under slot_1, slot_2, slot_3, and so on. After you have written the Spanish sentence under target_text, please replace the values in these columns with the corresponding spans from your Spanish sentence.
1. For example, in the sentence "how much more have I spent on take out since last week?", there is "last week" under slot_1 and "take out" under slot_2. We would expect you to copy the corresponding phrases (i.e., the translations of "last week" and "take out", respectively) from your written Spanish sentence into 'slot_1' and 'slot_2'. If there is no exact phrase that matches the English span, copy its closest equivalent into the columns.
2. Please do not change the order of the values while filling in the corresponding value columns for the target sentence.
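The slot-copying rule above can be sketched as a simple validation pass. This is only an illustration: the field names ("target_text", "slot_1", ...) follow the guideline text, but the row format itself is an assumption, not the actual annotation spreadsheet schema.

```python
# Sketch: validate that every value written in a slot column appears
# verbatim in the translated sentence, mirroring the instruction to copy
# the corresponding spans from the target sentence into the slot columns.
# The row format here is an illustrative assumption.

def check_slots_present(row):
    """Return True if every slot_i value occurs verbatim in target_text."""
    target = row["target_text"]
    i = 1
    while f"slot_{i}" in row:
        if row[f"slot_{i}"] not in target:
            return False
        i += 1
    return True

row = {
    "target_text": "cuánto más he gastado en comida para llevar desde la semana pasada",
    "slot_1": "la semana pasada",    # translation of "last week"
    "slot_2": "comida para llevar",  # translation of "take out"
}
# check_slots_present(row) -> True: both slot values are copied verbatim
# from the Spanish sentence, as the guidelines require.
```

A check like this catches the most common annotation error, paraphrasing a slot value in the column instead of copying it exactly from the written sentence.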

C Dataset Collection Details
Our annotators for Spanish are based in Spain, for Amharic in Ethiopia, for Marathi in India, and for Turkish in Turkey. The rates for translation were fixed by the translators or the translation agencies, ensuring fair pay for the translators. Our internal annotators were compensated at £15/hour for their work. We provide details on the dataset collection costs in Table 9.

D Full Results for MLP-Based Intent Detection Baselines
We provide full results for the MLP-based baseline in Tables 12 to 17. Our implementation uses the Transformers library (Wolf et al., 2019) and the SentenceBERT library (Reimers and Gurevych, 2019). The models were trained on NVIDIA Titan Xp GPUs. Approximate training times for every fold are provided in Table 10.
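The MLP-based setup can be sketched as follows: a frozen sentence encoder produces a fixed embedding per utterance, and a small MLP head with sigmoid outputs makes an independent binary decision per intent. This is a minimal NumPy illustration; the dimensions, weights, and threshold are placeholders, not our trained configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_predict(embeddings, w1, b1, w2, b2, threshold=0.5):
    """One hidden layer + sigmoid outputs: an independent probability per
    intent, so an utterance can receive several intent labels (multi-label)."""
    hidden = np.tanh(embeddings @ w1 + b1)
    probs = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))
    return (probs >= threshold).astype(int)

# Placeholder "sentence embeddings" standing in for frozen encoder output
# (e.g., mpnet or LaBSE vectors); dimensions are illustrative.
emb_dim, hidden_dim, n_intents = 8, 16, 5
x = rng.normal(size=(3, emb_dim))          # 3 utterances
w1 = rng.normal(size=(emb_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, n_intents)) * 0.1
b2 = np.zeros(n_intents)

labels = mlp_predict(x, w1, b1, w2, b2)
# labels has shape (3 utterances, 5 intents) with independent 0/1 decisions.
```

Keeping the encoder frozen and training only the MLP head is what makes this setup cheap to run across many languages, folds, and data-size settings.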

E In-Domain Cross-Lingual Results
We compare the in-domain cross-lingual results using mpnet and LaBSE as the underlying encoders for all domains in Figures 3 and 4, respectively.

F Full Results for QA-Based Intent Detection
We provide full results for the QA-based baseline in Tables 18 to 22. For both slot labelling and intent detection, we performed a hyperparameter search over fold 0 of the HOTELS domain in the 20-Fold setup. The hyperparameters varied include the learning rate [1e-5, 2e-5, 3e-5], batch size [4, 8, 16], number of epochs [5, 10, 15], and the multilingual models XLM-R and mDeBERTa. The other hyperparameters are the same as in Casanueva et al. (2022). Approximate training times for every fold on an NVIDIA GeForce RTX 2080 are provided in Table 11. Our implementation of the QA model uses the QA fine-tuning scripts from the Transformers library (Wolf et al., 2019).
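The sweep described above is a full grid over the listed values. The following sketch shows its structure; the `evaluate` function is a stand-in for training a QA model on fold 0 of the HOTELS domain and returning its F1 score, not a real scoring routine.

```python
from itertools import product

# Sketch of the hyperparameter sweep: a full grid over learning rate,
# batch size, and number of epochs, scored on a single held-out fold.
learning_rates = [1e-5, 2e-5, 3e-5]
batch_sizes = [4, 8, 16]
epoch_counts = [5, 10, 15]

def evaluate(lr, batch_size, epochs):
    # Placeholder: in practice this trains a QA model on fold 0 of the
    # HOTELS domain and returns its F1; here it is a toy monotone score.
    return lr * 1e5 + batch_size * 0.01 + epochs * 0.001

best = max(
    product(learning_rates, batch_sizes, epoch_counts),
    key=lambda cfg: evaluate(*cfg),
)
# 3 * 3 * 3 = 27 configurations are tried; `best` holds the top scorer.
```

With 27 configurations per base model, selecting on a single fold keeps the search tractable while still covering the full grid.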

G Full Results for Slot Labelling Baselines
We provide full results for slot labelling in the token classification setup in Tables 23 and 24. The results for the QA-based setup are provided in Tables 25 and 26.

Table 13 :
Cross-lingual in-domain results for intent detection with the MLP-based setup and LaBSE as the fixed sentence encoder (F 1 × 100).

Table 15 :
In-language cross-domain results for intent detection with the MLP-based setup (F 1 × 100).

Figure 2 :
Figure 2: Comparison of in-language (a) in-domain and (b) cross-domain results for intent detection (F 1 ). The model is the MLP-based baseline with mpnet as the underlying encoder.

Figure 3 :
Figure 3: Comparison of in-domain cross-lingual results. The results are for the MLP-based baseline with mpnet as the underlying encoder. English is used as the source language in the experiments.

Figure 4 :
Figure 4: Comparison of in-domain cross-lingual results. The results are for the MLP-based baseline with LaBSE as the underlying encoder. English is used as the source language in the experiments.

Table 1 :
Statistics of representative multilingual dialogue NLU datasets. MultiATIS++ contains some utterances with more than one intent label, but the vast majority are single-label.

Table 2 :
Comparison between full fine-tuning and MLP-based setups for intent detection (F 1 × 100). The experiments are run in the 10-Fold in-language in-domain setup. H is the HOTELS domain and B is the BANKING domain.

Table 4 :
Cross-lingual in-domain results for intent detection with the MLP-based setup and LaBSE as the fixed sentence encoder (F 1 × 100). The results are presented for transfer from English and Amharic for the 20-Fold setup, while the results for other source languages and training data setups are available in Appendix D, Table 15.

Table 5 :
Cross-lingual in-domain results for slot labelling with full fine-tuning of XLM-R (F 1 × 100). The results are presented for transfer from English and Amharic for the 20-Fold setup, while the results for other source languages and training data setups are available in Appendix G, Table 24.

Table 6 :
F 1 × 100 for the QA-based intent detection models in the 20-Fold setup using EN and AM for training.The remaining results are in Appendix F.

Table 7 :
F 1 × 100 for cross-lingual zero-shot results of QA-based slot labelling models when trained with English data for the 20-Fold setup.The results for other languages and data setups are in Appendix G.

Table 8 :
Language codes used in the paper

Table 9 :
Dataset collection cost breakdown

Table 10 :
Approximate GPU training times for the MLP-based setup for every fold (in mins).

Table 11 :
Approximate GPU training times for the QA-based setup for every fold (in hours).

Table 14 :
Cross-lingual in-domain results for intent detection with MLP-based and mpnet as the fixed sentence encoder (F 1 × 100).

Table 16 :
Cross-lingual cross-domain results for intent detection with MLP-based and LaBSE as the fixed sentence encoder (F 1 × 100).

Table 17 :
Cross-lingual cross-domain results for intent detection with MLP-based and mpnet as the fixed sentence encoder (F 1 × 100).

Table 18 :
In-language in-domain results for intent detection with QA-based setup and mDeBERTa as the base model (F 1 × 100).

Table 19 :
Cross-lingual in-domain results for intent detection with QA-based setup and mDeBERTa as the base model (F 1 × 100).

Table 20 :
Translate-test in-domain results for intent detection with QA-based setup and mDeBERTa as the base model (F 1 × 100).

Table 22 :
Cross-lingual cross-domain results for intent detection with QA-based setup and mDeBERTa as the base model (F 1 × 100).

Table 25 :
In-domain in-language slot labelling with QA models (F 1 × 100)