Multilingual and Cross-Lingual Intent Detection from Spoken Data

We present a systematic study of multilingual and cross-lingual intent detection (ID) from spoken data. The study leverages a new resource put forth in this work, termed MInDS-14: the first training and evaluation resource for the ID task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties. Our key results indicate that combining machine translation models with state-of-the-art multilingual sentence encoders (e.g., LaBSE) yields strong intent detectors in the majority of target languages covered in MInDS-14. We also offer comparative analyses across different axes: e.g., translation direction, impact of speech recognition, and data augmentation from a related domain. We see this work as an important step towards more inclusive development and evaluation of multilingual ID from spoken data, hopefully covering a much wider spectrum of languages than prior work.


Introduction and Motivation
A crucial functionality of Natural Language Understanding (NLU) components in task-oriented dialogue systems is intent detection (ID) (Young et al., 2002; Tür et al., 2010; Coucke et al., 2018). In order to understand the user's current goal, the system must classify their utterance into one of several predefined classes termed intents. (For instance, in the banking domain, utterances referring to cash withdrawal or currency exchange rates should be classified into the respective intent classes; an error in intent detection is typically the first point of failure for any task-oriented dialogue system.) Scaling dialogue systems in general, and intent detectors in particular, to support a multitude of new dialogue tasks and domains is a challenging, time-consuming, and resource-intensive process (Wen et al., 2017; Rastogi et al., 2019). This problem is further exacerbated in multilingual setups: it is extremely expensive to annotate sufficient task data in each of more than 7,000 languages (Bellomaria et al., 2019; Xu et al., 2020). As a consequence, current ID work has largely been constrained to English, and standard ID benchmarks also exist only in English (Hemphill et al., 1990; Larson et al., 2019; Liu et al., 2019b; Larson et al., 2020, inter alia). The need to widen the reach of dialogue technology to other languages has been recognised only recently, and thus even text-based multilingual ID datasets are still few and far between: Schuster et al. (2019) provide NLU data in three languages (English, Spanish, Thai), while the more recent MultiATIS++ dataset (Xu et al., 2020) manually translates the ATIS dataset (Hemphill et al., 1990) from English to 8 target languages, extending the work of Upadhyay et al. (2018), which translated portions of the English ATIS data to Hindi and Turkish. Despite these efforts, prominent gaps remain: 1) a large number of (even major) languages are still uncovered; 2) there are no multilingual data for specialized and well-defined domains such as e-banking; and 3) most importantly, all intent detection datasets to date are text-based. In other words, current work completely ignores the fact that many conversational systems are inherently voice-based, and that telephony quality and errors in automatic speech recognition (ASR), introduced even before intent detection takes place, may have a fundamental impact on the final intent detection performance. Consequently, the impact of ASR on multilingual intent detection has not been studied before.
Contributions. Inspired by these gaps, 1) we present the MInDS-14 dataset (Multilingual Intent Detection from Speech), a first multilingual evaluation resource for ID from spoken data. It originates from the use of a commercial voice assistant and real-life industry needs: it covers 14 intents in the banking domain in 14 different language varieties, making it the most comprehensive multilingual ID dataset to date. 2) We present a systematic evaluation and comparison of current state-of-the-art multilingual and cross-lingual ID models, which rely on machine translation and cutting-edge multilingual sentence encoders: multilingual USE (Chidambaram et al., 2019) and LaBSE (Feng et al., 2020). 3) We provide additional analyses to further profile the potential and current gaps of ID in multilingual voice-based contexts, including augmentation with data from a similar domain, target-only versus multilingual training, and aggregations of n-best ASR hypotheses.
Our results demonstrate that strong ID results can be achieved for all languages represented in MInDS-14, but we also indicate the crucial importance of in-domain model fine-tuning and few-shot learning, reporting strong gains over zero-shot transfer models. In the hope of motivating and inspiring further work on multilingual and voice-based ID, and future extensions to lower-resource languages, we release MInDS-14. The release includes the original speech data as well as the ASR data, and is available online at: s3://poly-public-data/MInDS-14/MInDS-14.zip.

MInDS-14: Dataset Collection
Final Dataset and Languages Covered. The final MInDS-14 dataset covers 14 intents in the banking domain with accompanying spoken and "ASR-ed" text utterances. The intents were sampled from a set of 90+ fine-grained intents used by a commercial banking voice assistant, so that all intents have clear and non-overlapping semantics and are easy to understand by non-experts, i.e., crowdworkers.

Disclaimer: We acknowledge that our language sample is typologically less diverse than in some recent evaluation sets for text-based multilingual language understanding (Ponti et al., 2020; Hu et al., 2020). We consider the proposed dataset as only a first step towards more equitable research in this area; our goal in this work was to establish and validate the data collection and benchmarking methodology with higher-resource languages before extending the focus to lower-resource ones.
Spoken Data Collection. The spoken data has been collected via crowdsourcing, relying on the Prolific platform (www.prolific.co/). We experimented with two different data collection protocols, which eventually yielded very similar data quality. With both protocols, human subjects are first provided with the particular intent class, a description of the intent, and three examples for the intent class. The task is then to provide new spoken utterances associated with the intent class.
As the first collection protocol, we implement a full-fledged phone-based voice assistant that participants could call and talk to. This approach makes the data collection setup as realistic as possible: it is affected by the (phone) audio quality and directly captures the way people speak on the phone. The IT data and parts of the DE, PT, PL, and EN-AU data have been collected via this approach. The second, simpler study design instead relies on online recording software. We use Phonic (www.phonic.ai/) to collect the recordings, where data collection for each intent class is set up as a dedicated task on Prolific. We collect all the other data items via this approach.

In order to ensure native pronunciation and data quality with both data collection protocols, the pool of participants has been restricted to native speakers from the relevant regions. A detailed task description with a consent form was provided to all human participants: it informed the participants that the results of the data collection will be used for experimental research purposes, and that their participation is voluntary and will remain fully anonymous (PolyAI is ISO27k-certified and fully GDPR-compliant). The participants were offered fair compensation, pro rata around the average hourly wage in the UK. After the initial collection step, the data were additionally inspected and cleaned manually to remove empty, nonsensical, and extremely long utterances. We also manually removed all personal names and other content that might contain private or sensitive information. The dataset is open-sourced to the research community to facilitate the progress of multilingual NLU research; there are no IP-related issues.

Multilingual ID: Methodology
A standard transfer learning paradigm (Ruder et al., 2019) fine-tunes a pretrained language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019a) on the annotated task data. For the intent classification task in particular, Casanueva et al. (2020) have recently shown that full fine-tuning of the large pretrained model is not needed at all. Instead, they propose a more efficient feature-based approach to intent detection: fixed universal sentence encoders such as USE (Cer et al., 2018; Chidambaram et al., 2019) or ConveRT (Henderson et al., 2020) are used "off-the-shelf" to encode utterances, and a standard multi-layer perceptron (MLP) classifier is then learnt on top of the sentence encodings. Casanueva et al. (2020) demonstrate that the feature-based approach to intent classification yields performance on par with full-model fine-tuning, while offering improved training efficiency. Therefore, due to the large number of executed experiments and comparisons in this work, and preliminary results which corroborated the findings from prior work, we opt for this efficient approach to ID.
We evaluate two widely used state-of-the-art multilingual sentence encoders, but remind the reader that decoupling MLP from the encoder allows for a wider exploration of other available multilingual sentence encoders (Reimers and Gurevych, 2020;Litschko et al., 2021, inter alia).
In what follows, we provide only brief descriptions of each encoder in our evaluation; for more details we refer the reader to the original work.
mUSE (Chidambaram et al., 2019) is a multilingual version of the Universal Sentence Encoder (USE) model for English (Cer et al., 2018). It relies on a standard dual-encoder neural framework (Henderson et al., 2019; Reimers and Gurevych, 2019; Humeau et al., 2020), features 16 languages, and learns a shared cross-lingual semantic space via translation-bridging tasks (Chidambaram et al., 2019).

LaBSE (Feng et al., 2020) uses the standard self-supervised objectives from the pretraining of mBERT and XLM: masked and translation language modeling (Conneau and Lample, 2019).
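For illustration, below is a minimal sketch of the first step of the feature-based pipeline: encoding utterances with a fixed multilingual sentence encoder. Using the sentence-transformers release of LaBSE is our assumption for concreteness; the paper does not specify which implementation of the encoders was used.

```python
# Minimal sketch: fixed sentence encodings with LaBSE via the
# sentence-transformers library (an illustrative assumption; the exact
# encoder implementation used in the experiments is not stated here).
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

utterances = [
    "I would like to withdraw some cash, please.",
    "Wie ist der aktuelle Wechselkurs?",  # German: "What is the current exchange rate?"
]

# The encoder stays fixed ("off-the-shelf"); only the MLP on top is trained.
encodings = encoder.encode(utterances)
print(encodings.shape)  # (2, 768) -- LaBSE produces 768-dim sentence vectors
```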
ID Model. For ID, we pass the sentence encoding $s_x$ through a 2-layer MLP. We first apply dropout (Srivastava et al., 2014) on the encoding, followed by one layer with ReLU as nonlinear activation (Nair and Hinton, 2010), yielding the hidden representation $h = \mathrm{ReLU}(W_1 s_{dp} + b_1)$, where $W_1$ is a trainable weight matrix, $s_{dp}$ is the encoding after applying dropout, and $b_1$ denotes bias parameters.
We then detect the intent using a sigmoid ($\sigma$) activation and softmax: $p_{\mathrm{intent}} = \mathrm{softmax}(\sigma(W_2 h + b_2))$, where $W_2$ is another trainable weight matrix, and $b_2$ are bias parameters.
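The classifier above can be sketched in PyTorch as follows (our choice of framework; the hidden dimensionality is an illustrative assumption, as it is not reported here):

```python
import torch
import torch.nn as nn

class IntentMLP(nn.Module):
    """2-layer MLP intent classifier over fixed sentence encodings,
    following the formulas above. hidden_dim is an assumption."""

    def __init__(self, enc_dim: int = 768, hidden_dim: int = 512,
                 num_intents: int = 14, dropout: float = 0.3):
        super().__init__()
        self.dropout = nn.Dropout(dropout)             # dropout on s_x
        self.fc1 = nn.Linear(enc_dim, hidden_dim)      # W_1, b_1
        self.fc2 = nn.Linear(hidden_dim, num_intents)  # W_2, b_2

    def forward(self, s_x: torch.Tensor) -> torch.Tensor:
        s_dp = self.dropout(s_x)
        h = torch.relu(self.fc1(s_dp))                 # h = ReLU(W_1 s_dp + b_1)
        # p_intent = softmax(sigmoid(W_2 h + b_2)), as in the text
        return torch.softmax(torch.sigmoid(self.fc2(h)), dim=-1)

model = IntentMLP()
probs = model(torch.randn(32, 768))  # batch of 32 encodings
print(probs.shape)                   # torch.Size([32, 14])
```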

Experimental Setup
Speech Transcription. For all language varieties, we run the respective Google ASR model to obtain n-best written transcriptions (i.e., ASR hypotheses). Unless noted otherwise, we work with the top (i.e., 1-best) transcription.
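As a hedged illustration, n-best hypotheses can be requested from the Google Cloud Speech-to-Text Python client roughly as follows; the language code, file name, and value of n are placeholder assumptions, and the exact ASR configuration used for MInDS-14 is not specified here.

```python
# Sketch: retrieving n-best ASR hypotheses with the Google Cloud
# Speech-to-Text Python client (configuration values are illustrative).
from google.cloud import speech

client = speech.SpeechClient()

with open("utterance.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    language_code="de-DE",  # per-language model, e.g. German
    max_alternatives=10,    # request the 10-best hypotheses
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    for rank, alt in enumerate(result.alternatives, start=1):
        print(rank, alt.confidence, alt.transcript)
```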
Auxiliary English Data. We also conduct experiments where we leverage additional English data from the related banking domain (termed AUX-EN henceforth). It comprises a total of 660 English utterances, extracted from a commercial voice assistant, and annotated with the same 14 intent classes. It allows us to run cross-lingual transfer and training data augmentation experiments and analyses later in §5. It also helps us establish the extent to which related-domain data can be reused to bootstrap a conversational system prior to any in-task data collection efforts.
Monolingual versus Multilingual Training. We train and run the ID models from §3 in the following setups. First, in translate-to-EN, for all "non-English" languages, we translate the transcriptions into English via Google Translate (GT). This effectively enables us to train and evaluate monolingual models directly in English, a translate-to-EN strategy that has proven strong across language understanding tasks (Hu et al., 2020; Ponti et al., 2021). The second approach works directly in the native language of the transcriptions, and we discern between two variants: a) target-only uses only the data available in the current language to train the ID model; b) the multilingual setup leverages the multilinguality of mUSE and LaBSE and trains on the transcribed data of all languages, while we evaluate on the test data of each individual language.
Training and Evaluation Data and Setups. We can also translate the auxiliary AUX-EN dataset (see §2) to other languages via Google Translate, yielding AUX-TARGET data. We then discern between the following training data setups. In a) aux-only, we use only the AUX-TARGET (or AUX-EN) data to train the ID models; this setup allows us to estimate the ID performance before any additional in-language data collection. In b) the standard setup, we do 3-fold cross-validation, where we randomly split the transcribed data (translate-to-EN, target-only, or multilingual) into 60% training data and 40% test data, and always add the auxiliary data to the training subset. We also evaluate c) the no-aux setup, where we train only on the 60% of the in-domain data, without any auxiliary data. A simple illustration of these different setups is provided in Figure 3 in the Appendix. Note that we always use cross-validation for all setups, and always test on randomly generated splits of the collected data of the same size, in order to ensure a fair comparison across the setups. In the aux-only variant we still sample 40% of the entire dataset for testing. For multilingual training, in order to maintain the same multilingual training set for all test languages, we also sample 60% of all transcribed data in all languages and use that plus all AUX-TARGET data for training, with the remaining 40% in each language used for testing. A sketch of the three setups follows below.
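The following sketch summarises how the three training-data setups differ; the helper name and data layout are hypothetical, and the actual split logic used in the experiments may differ in details such as stratification.

```python
import random

def make_setup(target_data, aux_data, setup: str, seed: int = 0):
    """Sketch of the three training-data setups described above.
    `target_data`: (transcription, intent) pairs for one language;
    `aux_data`: AUX-EN or its translation, AUX-TARGET."""
    rng = random.Random(seed)
    data = target_data[:]
    rng.shuffle(data)
    split = int(0.6 * len(data))
    train_60, test_40 = data[:split], data[split:]

    if setup == "aux-only":   # no in-language training data at all
        return aux_data, test_40
    if setup == "standard":   # 60% in-domain + auxiliary data
        return train_60 + aux_data, test_40
    if setup == "no-aux":     # 60% in-domain only
        return train_60, test_40
    raise ValueError(f"unknown setup: {setup}")
```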
ID: Hyperparameters. We train with Adam (Kingma and Ba, 2015) with a learning rate of 0.001, in batches of size 32, for 10,000 steps. The dropout rate is set to 0.3. We report accuracy as the main evaluation measure for all experimental runs, always averaged over 3 independent runs.
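A minimal training loop with these hyperparameters might look as follows; the loss function is our assumption, as it is not stated in the text.

```python
import torch

# Training sketch with the reported hyperparameters: Adam, learning rate
# 0.001, batch size 32, 10,000 steps (dropout 0.3 is set inside IntentMLP
# above). The loss is an assumption: negative log-likelihood over the
# probabilities the model outputs.
def train(model, encodings, labels, steps: int = 10_000):
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(encodings, labels),
        batch_size=32, shuffle=True)
    optim = torch.optim.Adam(model.parameters(), lr=0.001)
    nll = torch.nn.NLLLoss()
    step = 0
    while step < steps:
        for x, y in loader:
            optim.zero_grad()
            probs = model(x)                         # softmax outputs (see above)
            loss = nll(torch.log(probs + 1e-9), y)   # NLL on log-probabilities
            loss.backward()
            optim.step()
            step += 1
            if step >= steps:
                break
    return model
```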

Results and Discussion
The main results are summarised in Table 1, while additional per-intent results are available in the Appendix. First, the results confirm LaBSE as the stronger multilingual encoder across the board, extending its superiority over mUSE from cross-lingual sentence matching tasks (Feng et al., 2020) to the multilingual ID task. More importantly, the results indicate very high absolute accuracy scores for all target languages, confirming the validity of MT-based approaches to multilingual ID, at least for major languages with well-developed MT. For instance, the scores for all languages are >95% (except for KO and PL) with LaBSE in the no-aux translate-to-EN setup. In other words, we empirically demonstrate the viability of the simple "ASR-then-translate" approach when dealing with voice-based input, at least for the MInDS-14 languages, all considered reasonably high-resource in NLP terms. Our findings suggest that even this simple, easy-to-build, and efficient sentence encoder-based approach may offer competitive ID from spoken data in different languages. Future work will investigate the extent of performance drops once the focus is shifted to lower-resource languages, where reasonably performing ASR and MT models cannot be guaranteed (Conneau et al., 2020; Pratap et al., 2020), as well as to finer-grained intent classes and other domains.
Different Setups. A comparison of the different setups reveals that even small in-domain training data (without any external data augmentation, the no-aux setup) are sufficient to learn strong intent detectors. In fact, the best overall results are achieved with the no-aux translate-to-EN setup with LaBSE.
The aux-only setup falls substantially behind in-domain trained models, validating the crucial importance of collecting additional in-domain examples: even small portions of fully in-domain training data boost performance. The limited usefulness of the AUX-EN data, beyond a slight domain and style mismatch, may also be attributed to the actual data content: it covers very specific cases with repetitive sentences, which may misguide classifiers trained on such repetitive data. The peak scores on average are achieved in the translate-to-EN scenario. However, the differences when using LaBSE are slight, and for some languages higher scores are achieved in the other two scenarios.

Impact of ASR. We also evaluate whether including additional ASR hypotheses might make intent detectors more robust: adding more transcriptions from the n-best list may be seen as a form of data augmentation, as sketched below. The results are provided in Figure 1. The scores suggest that relying on more transcriptions (n = 5 and n = 10) does yield slight gains on average, but the trend is not present in all the test languages (cf., Spanish). This might stem from the fact that the transcriptions are highly similar, and there is limited additional information available down the n-best ASR list.
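Concretely, the n-best augmentation amounts to the following sketch; the data structure and helper name are hypothetical.

```python
def augment_with_nbest(examples, n: int = 5):
    """Data-augmentation sketch: each of the n-best ASR transcriptions
    becomes its own training example, carrying the utterance's intent
    label. `examples`: (intent, list-of-hypotheses) pairs."""
    augmented = []
    for intent, hypotheses in examples:
        for transcript in hypotheses[:n]:  # 1-best, 5-best, or 10-best
            augmented.append((transcript, intent))
    return augmented
```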
Impact of Additional Translations. Another approach to improving ID robustness is generating more than one (machine) translation per transcription. We achieve this by passing each transcription through GT plus another translation service, DeepL (www.deepl.com/). The results are provided in Figure 2. They indicate that this "augmentation via translation" step indeed yields slightly improved ID: we observe 1-2% performance gains with both encoders (cf., Figure 2 and Table 1) compared to using only one translation per transcription.
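A sketch of this augmentation step, assuming the google-cloud-translate and deepl Python clients (the paper does not state how the two services were accessed; the auth key is a placeholder):

```python
# Sketch of "augmentation via translation": each transcription is
# translated by two MT services, and both renderings are kept as
# training examples with the original intent label.
import deepl
from google.cloud import translate_v2 as translate

gt_client = translate.Client()
deepl_client = deepl.Translator("DEEPL_AUTH_KEY")  # placeholder key

def two_translations(transcription: str, intent: str):
    gt_text = gt_client.translate(
        transcription, target_language="en")["translatedText"]
    dl_text = deepl_client.translate_text(
        transcription, target_lang="EN-US").text
    # Both English renderings share the original intent label.
    return [(gt_text, intent), (dl_text, intent)]
```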

Conclusion and Future Work
We have presented a first study focused on multilingual and cross-lingual intent detection (ID) from spoken data. To this end, we have introduced MInDS-14, a first training and evaluation resource for the task with spoken data, covering 14 intents extracted from a commercial system in the e-banking domain, with spoken examples available in 14 language varieties. Our key results have revealed that it is possible to build accurate ID models for all target languages relying on a simple yet efficient paradigm based on current state-of-the-art multilingual sentence encoders such as LaBSE and machine translation. In future work we plan to expand the MInDS-14 dataset and put more focus on similar evaluations for truly low-resource languages, where reliable ASR, MT, and even sentence encoders cannot be guaranteed. In the long run, we hope that our initiative will foster future development and evaluation of multilingual ID from spoken data, as one of the first steps towards truly multilingual voice-based dialogue systems.