Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling

Predicting user intent and detecting the corresponding slots from text are two key problems in Natural Language Understanding (NLU). Since annotated datasets are available for only a handful of languages, our work focuses on a zero-shot scenario where the target language is unseen during training. In this context, the task is typically approached using representations from pre-trained multilingual language models such as mBERT, or by fine-tuning on data automatically translated into the target language. We propose a novel method that augments monolingual source data using multilingual code-switching via random translations to enhance the generalizability of large multilingual language models when fine-tuning them for downstream tasks. Experiments on the MultiATIS++ benchmark show that our method leads to an average improvement of +4.2% in intent accuracy and +1.8% in slot F1 over the state of the art across 8 typologically diverse languages. We also study the impact of code-switching into different language families on downstream performance. Furthermore, we present an application of our method to crisis informatics using a new human-annotated tweet dataset for slot filling in English and Haitian Creole, collected during the Haiti earthquake.


Introduction
A cross-lingual setting is typically described as a scenario in which a model trained for a particular task in one language (e.g., English) should generalize well to a different language (e.g., Japanese). While a semi-supervised solution (Xiao and Guo, 2013; Muis et al., 2018) assumes some target language data is available, a zero-shot solution (Eriguchi et al., 2018; Srivastava et al., 2018; Xu et al., 2020) assumes none is available at training time. This is particularly significant in real-world problems such as extracting relevant information during a new disaster (Nguyen et al., 2017; Krishnan et al., 2020) and hate speech detection (Pamungkas and Patti, 2019; Stappen et al., 2020), where the target language might be low-resource or unknown. In such scenarios, it is crucial that models generalize well to unseen languages.
Intent prediction and slot filling are important NLU tasks with significant real-world applications. They are studied extensively for today's goal-oriented dialogue systems, such as Amazon's Alexa, Apple's Siri, Google Assistant, and Microsoft's Cortana. Finding the 'intent' behind the user's query and identifying the relevant 'slots' in the sentence are essential for effective conversational assistance. For example, users might want to 'play music' given the slot labels 'year' and 'artist' (Coucke et al., 2018), or they may want to 'book a flight' given the slot labels 'airport' and 'locations' (Price, 1990). A strong correlation between the two tasks has made jointly trained models successful (Goo et al., 2018; Haihong et al., 2019; Hardalov et al., 2020; Chen et al., 2019). In a cross-lingual setting, the model should be able to learn this joint task in one language and transfer that knowledge to another (Upadhyay et al., 2018; Schuster et al., 2019; Xu et al., 2020). This is the premise of our work.
Highly effective multilingual models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020a) have shown success across several multilingual tasks in recent years. In the zero-shot cross-lingual transfer setting with an unknown target language, a typical solution is
to use pre-trained transformer models and fine-tune them on the downstream task using the monolingual source data (Xu et al., 2020). However, previous work (Pires et al., 2019) has shown that existing transformer-based representations may exhibit systematic deficiencies for certain language pairs. Figure 1 shows that the representations across the 12 multi-head attention layers of mBERT are still clustered according to language. This leads to the fundamental challenge that we address in this work: enhancing language neutrality so that the fine-tuned model generalizes across languages for the downstream task. To this end, we introduce a data augmentation method via multilingual code-switching, where the original sentence in English is code-switched into randomly selected languages. For example, chunk-level code-switching creates sentences with phrases in multiple languages, as shown in Figure 2. We show that this can lead to better performance in the zero-shot setting, such that mBERT can be fine-tuned for all languages (not just one) with monolingual source data.
Further, we show how code-switching with different language families impacts the model's performance on individual target languages. The cross-lingual study of language families remains largely unexplored for NLU tasks. For instance, while it might be intuitive that the Sino-Tibetan language family can aid a task in Hindi, results indicating that the Turkic language family may help Japanese can reveal intriguing inter-family relationships and how they are aligned in the underlying language model's vector space.
Contributions: a) We present a data augmentation method via multilingual code-switching to enhance the language neutrality of mBERT when fine-tuning it for the downstream NLU tasks of intent prediction and slot filling. b) By code-switching into different language families, we show that potential relationships between a family and a target language can be identified and studied, which could help foster zero-shot cross-lingual research in low-resource languages. c) We release a new human-annotated tweet dataset, collected during the Haiti earthquake disaster, for intent prediction and slot filling in English and Haitian Creole.
Advantages: With enhanced generalizability, our model can be deployed with out-of-the-box functionality. Previous methods that first machine-translate the source data into a known target language and then fine-tune (referred to as 'translate-train') (Xu et al., 2020; Yarowsky et al., 2001; Shah et al., 2010; Ni et al., 2017) require a separate model to be trained for each language.
Related Work
Monolingual models for joint slot filling and intent prediction have used methods such as attention-based RNNs (Liu and Lane, 2016) and attention-based BiLSTMs with a slot gate (Goo et al., 2018) on benchmark datasets such as ATIS (Price, 1990) and SNIPS (Coucke et al., 2018). These methods have shown that a joint approach can enhance both tasks and that slot filling can be conditioned on the learned intent. An interrelated mechanism was introduced (Haihong et al., 2019) to iteratively learn the relationship between the two tasks. Recently, BERT-based approaches (Hardalov et al., 2020; Chen et al., 2019) have shown improved results. On the other hand, cross-lingual versions of this joint task include a low-supervision approach for Hindi and Turkish (Upadhyay et al., 2018), a new dataset for Spanish and Thai (Schuster et al., 2019), and the most recent work of MultiATIS++ (Xu et al., 2020), which created a comprehensive dataset in 9 languages and is used to benchmark our results.
The joint task described above in a pure zero-shot setting is the motivation for our work. Zero-shot learning is described as the setting where the model sees a new distribution of examples at test time (Xian et al., 2017; Srivastava et al., 2018; Romera-Paredes and Torr, 2015). It is common for machine-translation-based methods to translate the source data into the target language before training. We instead assume that the target language is unknown during training, so that our model is generalizable across languages.

Code-Switching
Linguistic code-switching is a phenomenon where multilingual speakers alternate between languages. Recently, monolingual models have been adapted to code-switched text in several tasks such as entity recognition (Aguilar and Solorio, 2019), part-of-speech tagging (Soto and Hirschberg, 2018; Ball and Garrette, 2018), sentiment analysis (Joshi et al., 2016), and language identification (Mave et al., 2018; Yirmibeşoglu and Eryigit, 2018; Mager et al., 2019). KhudaBukhsh et al. (2020) proposed a pipeline to sample code-mixed documents using minimal supervision. Qin et al. (2020) allow randomized code-switching to include the target language. In our context, if the target language is German, we ensure that there is no code-switching into German during training. We consider this distinction essential to evaluate a true zero-shot learning scenario and to prevent any bias. Another recent work by Yang et al. (2020) presents a non-zero-shot approach that performs code-switching into target languages. Jiang et al. (2020) present a code-switching-based method to improve the ability of multilingual language models to retrieve factual knowledge. Code-switching is usually done at the word level. However, our results favor chunk-level switching over word-level, as the latter may introduce more noise into the code-switched version relative to the original meaning of the sentence. Code-switching and other data augmentation techniques have been applied at the pre-training stage in recent works (Chaudhary et al., 2020; Dufter and Schütze, 2020); however, we do not address pre-training in this work. Pre-trained models such as XLM-R are also likely to have been exposed to code-switched data, as they are trained on Common Crawl. In this work, we specifically focus on mBERT, whose pre-training data largely remains monolingual at the sentence level, to identify the impact of code-switching during fine-tuning, in addition to studying the impact of language-family-based augmentations.

Methodology
This section first describes our problem in the zero-shot cross-lingual transfer setting, followed by a novel data augmentation method that uses multilingual code-switching of the monolingual source data to enhance language neutrality. We then describe language families, followed by the joint training setup.

Problem Definition
Given a source language (S) and a set of target languages (T), the goal is to train a classifier using data only in the source language and predict examples from the completely unseen target languages. We assume the target language is unknown at training time, which makes direct translation into the target infeasible. In this context, we use code-switching (cs) to augment the monolingual source data. Thus, the input and output of our problem can be defined as the source tuples (X_ut, y_i, y_sl) and their code-switched counterparts (X_ut^cs, y_i^cs, y_sl^cs), where X_ut represents the utterances, y_i their ground-truth intent classes, and y_sl the slot labels for the words in those utterances. An example sentence, its intent class, and slot labels are shown in Figure 2.

Multilingual Code-Switching
Multilingual masked language models, such as mBERT (Devlin et al., 2019), are trained on large publicly available unlabeled corpora such as Wikipedia. Such corpora largely remain monolingual at the sentence level because intra-sentence code-switched data is likely scarce in written text. The masked words to be predicted are usually in the same language as their surrounding words. We study how code-switching can enhance the language neutrality of such language models by augmenting the data with artificially code-switched sentences when fine-tuning on a downstream task. Algorithm 1 (Data Augmentation via Multilingual Code-Switching, Chunk-Level; Input: X_ut^en, y_i^en, y_sl^en; Output: X_ut^cs, y_i^cs, y_sl^cs) describes this code-switching process at the chunk level. In slot filling datasets, slot labels grouped by BIO (Ramshaw and Marcus, 1999) tags constitute natural chunks. To summarize the algorithm: we take a sentence, translate each of its chunks into a randomly chosen language using Google's NMT system (Wu et al., 2016), and align the slot labels to fit the translation. At the chunk level, we use a direct alignment, i.e., the BIO-tagged labels are recreated for the translated phrase based on its word tokens. More complex methods could be applied here to improve slot-label alignment, such as fast-align (Dyer et al., 2013) or soft-align (Xu et al., 2020). Code-switching at the word level translates every word into a randomly chosen language, while at the sentence level the entire sentence is translated. During experimental evaluation, to build a language-neutral model from the monolingual English source, all 8 target languages are excluded from the code-switching procedure to avoid unfair model comparisons, i.e., the target languages are removed from lset in Algorithm 1.
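The chunk-level procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' released code: `translate` is a placeholder for a call to a translation service (e.g., an NMT API), and `lset` is the list of candidate languages with the target languages removed.

```python
import random

def bio_chunks(tokens, labels):
    """Group tokens into chunks from BIO tags: a 'B-x' tag opens a chunk
    that absorbs the following 'I-x' tokens; 'O' tokens form singleton
    chunks. Returns a list of (token_list, slot_name) pairs."""
    chunks = []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("I-") and chunks and chunks[-1][1] == lab[2:]:
            chunks[-1][0].append(tok)          # continue the open chunk
        else:
            slot = lab[2:] if lab.startswith("B-") else "O"
            chunks.append(([tok], slot))
    return chunks

def code_switch(tokens, labels, lset, translate):
    """Chunk-level code-switching: translate each chunk into a language
    drawn at random from lset, then re-emit BIO labels for the translated
    tokens using a direct alignment."""
    out_toks, out_labs = [], []
    for chunk_toks, slot in bio_chunks(tokens, labels):
        lang = random.choice(lset)
        new_toks = translate(" ".join(chunk_toks), lang).split()
        out_toks.extend(new_toks)
        if slot == "O":
            out_labs.extend(["O"] * len(new_toks))
        else:
            out_labs.extend(["B-" + slot] + ["I-" + slot] * (len(new_toks) - 1))
    return out_toks, out_labs
```

With an identity `translate`, the function reproduces the input, which makes the label realignment easy to verify before plugging in a real translation backend.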

Complexity:
The augmentation process is repeated k times per sentence, producing a new augmented dataset of size k × n, where n is the size of the original dataset, i.e., a space complexity of O(k × n). Algorithm 1 has a runtime of O(k × n × t) steps, where t is the number of translations per sentence, assuming constant time for alignment. Word-level switching requires as many translations as there are words, while sentence-level switching requires only one. The increased dataset size also increases training time, but an advantage is that one model fits all languages.
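The k-fold repetition can be sketched as below; `augment` is a hypothetical stand-in for one pass of Algorithm 1, shown only to make the O(k × n) growth concrete.

```python
def augment_dataset(sentences, k=5, augment=lambda s: s):
    """Apply the augmentation k times per sentence: an input of n
    sentences yields k * n augmented sentences (space O(k * n))."""
    return [augment(s) for s in sentences for _ in range(k)]
```

For k = 5, the training set grows fivefold, which is the main driver of the increased training time noted above.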

Language Families
A language family is a group of related languages that descend from a common parent language. For example, Portuguese, Spanish, French, Italian, and Romanian are daughter languages derived from Latin (Rowe and Levine, 2017). We use language families to study their impact on the target languages, augmenting the source language with code-switching from a particular family. For instance, code-switching the English dataset with the Turkic language family and testing on Japanese can reveal how closely the two are aligned in the vector space of a pre-trained multilingual model. From a set of 5 distinct language families, we select a total of 6 groups of languages: Afro-Asiatic (Voegelin and Voegelin, 1976), Germanic (Harbert, 2006), Indo-Aryan (Masica, 1993), Romance (Elcock and Green, 1960), Sino-Tibetan and Japonic (Shafer, 1955; Miller, 1967), and Turkic (Johanson and Johanson, 2015). Germanic, Romance, and Indo-Aryan are branches of the Indo-European language family. The language groups and their selected daughter languages are shown in Table 1. Each group is selected based on a target language in the dataset, and the Afro-Asiatic family is added as an extra group. In these experiments, lset in Algorithm 1 is assigned the languages of a specific family.

Joint Training
Joint training is traditionally used for intent prediction and slot filling to exploit the correlation between the two tasks. This is done by feeding the feature vectors of one model into another, or by sharing layers of a neural network and training the tasks together. A standard joint model loss can thus be defined as a combination of the intent (L_i) and slot (L_sl) losses, i.e., L = αL_i + βL_sl, where α and β are the corresponding task weights. Prior works (Goo et al., 2018; Schuster et al., 2019; Liu and Lane, 2016; Haihong et al., 2019) that used BiLSTMs or RNNs have since given way to BERT-based implementations explored in more recent works (Chen et al., 2019; Hardalov et al., 2020; Xu et al., 2020). A standard joint model feeds the BERT outputs from the final hidden state (the classification (CLS) token for intent and the m word tokens for slots) into linear layers to obtain intent and slot predictions.

Table 3: Performance evaluation of code-switching with k = 5. CS: Code-Switching. Reported scores are averages of 5 independent runs (including separate code-switched data for each run). m = number of distinct models to be trained. *: modified BERT-based implementations (Chen et al., 2019; Xu et al., 2020). ♠: The difference is significant with p < 0.05 using Tukey HSD (conducted between Joint_en-only + CCS and the Joint_en-only baseline for each language).
Assuming h_cls represents the hidden state of the CLS token and h_m that of a remaining word-level token, the BERT model outputs are defined as (Chen et al., 2019; Xu et al., 2020) y_i = softmax(W_i h_cls + b_i) for the intent and y_sl^m = softmax(W_sl h_m + b_sl) for each slot, with a multi-class cross-entropy loss for both intent (L_i) and slots (L_sl). We use this model as our baseline for joint training. Our goal is to show that code-switching on top of joint training improves performance. The output of Algorithm 1 is the input used for joint training on BERT in the code-switched experiments.
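A sketch of this joint head in PyTorch is shown below. The hidden size and label counts are illustrative assumptions, not values from the paper, and the encoder (e.g., mBERT) producing `h` is omitted.

```python
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Joint intent + slot head over encoder outputs: a linear layer on
    the [CLS] state for the intent and a shared linear layer on the
    remaining token states for the slots (sizes are illustrative)."""
    def __init__(self, hidden=768, n_intents=18, n_slots=84):
        super().__init__()
        self.intent = nn.Linear(hidden, n_intents)
        self.slots = nn.Linear(hidden, n_slots)

    def forward(self, h):                      # h: (batch, seq_len, hidden)
        intent_logits = self.intent(h[:, 0])   # [CLS] token
        slot_logits = self.slots(h[:, 1:])     # remaining word tokens
        return intent_logits, slot_logits

def joint_loss(intent_logits, slot_logits, y_i, y_sl, alpha=1.0, beta=1.0):
    """L = alpha * L_i + beta * L_sl with cross-entropy for both tasks."""
    ce = nn.CrossEntropyLoss()
    l_i = ce(intent_logits, y_i)
    l_sl = ce(slot_logits.reshape(-1, slot_logits.size(-1)), y_sl.reshape(-1))
    return alpha * l_i + beta * l_sl
```

In practice both logits would come from the same fine-tuned encoder, so gradients from the two losses update the shared representation jointly.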

Benchmark Dataset
We use the latest multilingual benchmark dataset, MultiATIS++ (Xu et al., 2020), which was created by manually translating the original ATIS (Price, 1990) dataset from English (en) into 8 other languages: Spanish (es), Portuguese (pt), German (de), French (fr), Chinese (zh), Japanese (ja), Hindi (hi), and Turkish (tr). The dataset consists of utterances in each language, each with an 'intent' label (e.g., flight intent) and 'slot' labels for the word tokens in BIO (Ramshaw and Marcus, 1999) format. A sample data point in English is shown in Figure 2.

New Dataset for Disaster NLU
We construct a new intent prediction and slot filling dataset of tweets collected during natural disasters, in two languages: English and Haitian Creole. The tweets were originally released by Appen. For English, a language expert coded the tweets, and for Haitian Creole, we used Amazon Mechanical Turk with five annotators. Intent classes include, for example, 'request'.

Table 5: Runtime on Google Colab (K80 GPU for training joint models): CS 05:04:49, M_TT 1:31:32, Joint_en 00:11:50, Joint_cs 01:06:50, Joint_TT 00:11:04. M_TT: Machine Translation to Target. Note that M_TT and Joint_TT are for one target language (averaged).

Experimental Setup
We use the traditional cross-lingual task setting where each experiment consists of a source language and a target language. A model is trained on the source data (English) and evaluated on the target data (the 8 other languages). For code-switching experiments, the English text is augmented with multilingual code-switching before training. Our implementation is in PyTorch (Paszke et al., 2019), and we use the pre-trained bert-base-multilingual-uncased (Devlin et al., 2019) model with BertForSequenceClassification (Wolf et al., 2020) as the mBERT model. The maximum number of epochs is set to 25 with an early-stopping patience of 5, a batch size of 32, and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-5. We select the best model on the validation set. Consistent with the metrics reported for intent prediction and slot filling in the past, we use accuracy for intent and micro F1 for slot performance.

Baselines & Upper Bound
Since we assume that the target language is not known beforehand, the Translate-Train (TT) (Xu et al., 2020) method is not a suitable baseline. Rather, we treat it as an upper bound, i.e., translating into the target language and fine-tuning the model should intuitively outperform a generic model. Additionally, we add code-switching to this TT model to assess whether augmentation negatively impacts its performance. The zero-shot baselines for the code-switching experiments are an English-Only (Xu et al., 2020) model, which is fine-tuned over the pre-trained mBERT separately for each task, and an English-only Joint model (Chen et al., 2019).

Effect of Multilingual Code-Switching
Table 3 reports the performance evaluation on the MultiATIS++ dataset. Compared to the state-of-the-art jointly trained English-only baseline, we see a +4.2% boost in intent accuracy and a +1.8% boost in slot F1 on average by augmenting the dataset via multilingual code-switching, without requiring the target language. From the significance tests, all languages except Spanish and German were helped by code-switching for intent detection. For slot filling, the improvements on Portuguese and French were not significant. This suggests that code-switching primarily helped languages that are morphologically more different from the source language (English). For example, Hindi and Turkish have the highest intent performance improvements of +16.1% and +9.8%, respectively; for slots, Hindi and Chinese improved most, with +6.0% and +4.3%, respectively. Japanese showed a +4% improvement for intent and +3.4% for slots.
The running times in Table 5 show that code-switching is expensive and can take up to 5 hours for k = 5. Training is also more expensive because there is k times more data than in the monolingual source. Increasing the number of code-switchings per sentence (k) from 5 to 50 improved performance by +1% while increasing the runtime by a large margin, so the parameter k should be chosen appropriately. However, this time cost is incurred only at training, with benefits at the prediction stage for real-world problems.
In the translate-train (upper bound) scenario, it is not immediately clear whether augmentation can help, because data in the same language as the target is always preferred over other languages or code-switched data. However, Table 3 shows that augmentation did not hinder performance.
For both intent and slot performance, the chunk-level model remained robust across languages. For intent, the difference between word-level and sentence-level switching was insignificant. For slots, sentence-level was on par with chunk-level on average. We therefore consider code-switching at the chunk level safer for avoiding semantic discrepancies (as in the word level) while also capturing better intra-sentence language neutrality.

Evaluation on Disaster Dataset
We found that the disaster data is more challenging than the ATIS dataset for transfer learning, as shown in Table 4. Code-switching improved intent accuracy by +12.5% and slot F1 by +2.3%, which is promising considering that the data consists of tweets. Joint training added a +0.9% improvement to intent accuracy, but did not seem to help slot F1. This might imply a lack of strong correlation between the two tasks, i.e., a mention of 'food' or 'shelter' in a tweet may not always mean that it is a 'request', or vice versa. The translate-train upper bound did not perform any better than the randomly code-switched model, which seems counterintuitive. This might be due to the lack of strong representation for Haitian Creole in the pre-trained model, despite its similarity to French.

Impact of Language Families
Results of the language family analysis are shown in Figure 3. The English input is independently code-switched using the 6 different language families. Note that the target language is always excluded from its group when evaluating on that language, i.e., Hindi is excluded from the Indo-Aryan family when that family is evaluated on Hindi. The translate-train model is provided as a frame of reference and upper bound. We dropped French and Portuguese from the chart as they fall into the Romance family, like Spanish. Results show that language families helped their corresponding languages, i.e., Romance helped Spanish, Germanic helped German, and so on, with the exception of Chinese and Japanese; in both cases, the Turkic language family helped more than the others.

Error Analysis
Selecting intent classes with support > 10, Figure 4 shows how each class is positively or negatively impacted by code-switching. Improvement was primarily on 'airfare', 'distance', 'capacity', 'airline', and 'ground_service', which have longer sentences such as 'Please tell me which airline has the most departures from Atlanta', compared to the 'abbreviations' and 'airport' classes, which include very short phrases like 'What does EA mean?'. Note, however, that Spanish and German did not improve much, in line with our results in Table 3. For the slot labels in Figure 5, we selected those with support > 50 and with differing characteristics, e.g., 'name', 'code', etc. The overall trend in slot performance shows improvements for labels such as 'day_name', 'airport_code', and 'city_name', and slight variations for labels such as 'flight_number' and 'period_of_day', implying that textual slots benefit more than numeric ones.

Hyperparameter Tuning
For joint training, we tuned the task weights α and β using grid search to gauge the strength of the correlation between the tasks. For intent, the (α, β) combination of (1.0, 0.6) performed well, while (1.0, 1.0) worked best for slots. This suggests that intent might benefit slots slightly more than slots benefit intent. Additionally, during fine-tuning, freezing layers of the transformer affected model performance, as shown in Figure 6. Keeping the first 8 layers frozen gave the best performance. By freezing the earlier layers, the transformer retains the most fundamental features gained from massive pre-training, while the unfrozen top layers undergo fine-tuning.
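The freezing scheme can be sketched as follows, assuming the Hugging Face BERT layout (`.embeddings` and `.encoder.layer`); this is an illustrative helper, not the authors' code.

```python
import torch.nn as nn

def freeze_bottom_layers(model, n_frozen=8):
    """Freeze the embeddings and the first n_frozen encoder layers so
    that only the top layers are updated during fine-tuning."""
    for p in model.embeddings.parameters():
        p.requires_grad = False
    for layer in model.encoder.layer[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
```

For example, `freeze_bottom_layers(BertModel.from_pretrained("bert-base-multilingual-uncased"), 8)` would be called before constructing the optimizer, so that only parameters with `requires_grad=True` are updated.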

Conclusion & Future Work
This study shows that augmenting monolingual input data with multilingual code-switching via random translations helps a zero-shot model be more language-neutral when evaluated on unseen languages. This approach enhanced the generalizability of pre-trained mBERT when fine-tuning for the downstream tasks of intent detection and slot filling. We presented an application of this method using a new annotated dataset of disaster tweets. Further, we studied code-switching with language families and their impact on specific target languages, which can be used to enhance the zero-shot generalizability of models for low-resource languages. For future work, we plan to expand to XLM-R and similar models, to improve masked language model training by addressing code-switching during pre-training, and to release a larger dataset of annotated disaster tweets in more languages.

Acknowledgement
We thank the U.S. National Science Foundation (grants IIS-1815459 and IIS-1657379) for partially supporting this research. We also thank Ming Sun and Alexis Conneau for valuable insights on multilingual model training. We acknowledge the ARGO team, as the experiments were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University.

Figure 1
Figure 1: t-SNE plot of embeddings across the 12 multi-head attention layers of multilingual BERT. Parallel translations of MultiATIS++ sentences are still clustered according to language: English (black), Chinese (cyan), French (blue), German (green), and Japanese (red).

Figure 2 :
Figure 2: An original example in English from the MultiATIS++ dataset and its multilingually code-switched version. In this example, the chunks are in Chinese, Punjabi, Spanish, English, Arabic, and Russian. 'atis_airfare' represents an intent class where the user seeks the price of a ticket.

Figure 3 :
Figure 3: Impact of different language groups on the target languages.

Figure 4 :
Figure 4: Impact of code-switching on intent classes.

Figure 5 :
Figure 5: Impact of code-switching on slot labels.

Figure 6 :
Figure 6: Freezing the earlier layers and unfreezing a few at the top of the transformer appears to be optimal.

Table 1 :
Selected language families to evaluate their impact on a target language.

Table 2 :
Datasets and statistics.

Table 4 :
Performance on disaster data in Haitian Creole (ht). CS = Code-Switching. Reported scores are averages of 5 independent runs (*: modified BERT-based).