El Volumen Louder Por Favor: Code-switching in Task-oriented Semantic Parsing

Being able to parse code-switched (CS) utterances, such as Spanish+English or Hindi+English, is essential to democratize task-oriented semantic parsing systems for certain locales. In this work, we focus on Spanglish (Spanish+English) and release a dataset, CSTOP, containing 5800 CS utterances alongside their semantic parses. We examine the CS generalizability of various cross-lingual (XL) models and exhibit the advantage of pre-trained XL language models when data for only one language is present. As such, we focus on improving the pre-trained models for the case when only an English corpus, alongside either zero or a few CS training instances, is available. We propose two data augmentation methods for the zero-shot and few-shot settings: fine-tuning using translate-and-align, and augmenting using a generation model followed by match-and-filter. Combining the few-shot setting with the above improvements decreases the initial 30-point accuracy gap between the zero-shot and full-data settings by two thirds.


Introduction
Code-switching (CS) is the alternation of languages within an utterance or a conversation (Poplack, 2004). It occurs under certain linguistic constraints but can vary from one locale to another (Joshi, 1982). We envision two usages of CS for virtual assistants. First, CS is very common in locales where there is a heavy influence of a foreign language (usually English) on the native "substrate" language (e.g., Hindi or Latin-American Spanish). Second, for other native languages, the prevalence of English-related tech words (e.g., Internet, screen) or media vocabulary (e.g., movie names) is very common. While in the second case a model using contextual understanding should be able to parse the utterance, the first form of CS, which is our focus in this paper, needs cross-lingual (XL) capabilities in order to infer the meaning.
There are various challenges for CS semantic parsing. First, collecting CS data is hard because it requires bilingual annotators. This gets even worse considering that the number of CS language pairs grows quadratically with the number of languages. Moreover, CS is very dynamic and changes significantly by occasion and over time (Poplack, 2004). As such, we need extensible solutions that require little or no CS data while leveraging the more commonly available English data. In this paper, we first focus on the zero-shot setup, in which we only use EN data for the same task domains (we call this in-domain EN data). We show that by translating the utterances to ES and aligning the slot values, we can achieve high accuracy on the CS data. Moreover, we show that a limited amount of CS data, alongside augmentation with synthetically generated data, can significantly improve the performance.
Our contributions are as follows: 1) We release a code-switched task-oriented dialog dataset, CSTOP, containing 5800 Spanglish utterances and a corresponding parsing task. To the best of our knowledge, this is the first code-switched parsing dataset of such size that contains utterances for both training and testing. 2) We evaluate strong baselines under various resource constraints. 3) We introduce two data augmentation techniques that improve code-switching performance using monolingual data. We use the Task-Oriented Parsing dataset released by Schuster et al. (2019a) as our EN monolingual dataset. We release a similar dataset, CSTOP, of around 5800 Spanglish utterances over two domains, Weather and Device, collected and annotated by native Spanglish speakers. An example from CSTOP alongside its annotation is shown in Fig. 1. Note that the intent and slot labels start with IN: and SL:, respectively. Our task is to classify the sentence intent, here IN:GET_WEATHER, as well as the label and value of the slots, here SL:DATE_TIME corresponding to the span "para next Friday". All other words are classified as having no label, i.e., the O class. We discuss the details of this dataset in the next section.
One of the unique challenges of this task, compared with common NER and language identification CS tasks, is the constant evolution of CS data. Since the task is concerned with spoken language, the nature of CS is very dynamic and keeps evolving from domain to domain and from one community to another. Furthermore, cross-lingual data for this task is also very rare. Most of the existing techniques either combine monolingual representations (Winata et al., 2019a) or combine monolingual datasets to synthesize code-switched data. A lack of monolingual data for the substrate language (very realistic if ES is replaced with a less common language) would make those techniques inapplicable.
In order to evaluate the model in a task-oriented dialog setting, we use the exact-match accuracy (from now on, accuracy) as the primary metric. This is simply defined as the percentage of utterances for which the full parse, i.e., the intent and all the slots, has been correctly predicted.
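The metric above can be sketched in a few lines, assuming each parse is represented as an intent string plus a set of (slot label, slot value) pairs (the representation here is illustrative, not the authors' exact data format):

```python
def exact_match_accuracy(gold_parses, pred_parses):
    """Fraction of utterances whose full parse (intent + all slots) is correct.

    Each parse is a tuple: (intent, frozenset of (slot_label, slot_value)).
    """
    assert len(gold_parses) == len(pred_parses)
    correct = sum(1 for g, p in zip(gold_parses, pred_parses) if g == p)
    return correct / len(gold_parses)

# Hypothetical parses: the second prediction misses its slot, so it fails
# the exact match even though the intent is right.
gold = [("IN:GET_WEATHER", frozenset({("SL:DATE_TIME", "para next Friday")})),
        ("IN:SET_BRIGHTNESS", frozenset({("SL:LEVEL", "50 por ciento")}))]
pred = [("IN:GET_WEATHER", frozenset({("SL:DATE_TIME", "para next Friday")})),
        ("IN:SET_BRIGHTNESS", frozenset())]
print(exact_match_accuracy(gold, pred))  # 0.5
```

Note that this metric is strict: a single wrong slot label or span zeroes out the whole utterance.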

CSTOP Dataset
In this section, we provide details of the CSTOP dataset. We collected around 5800 CS utterances over two domains: Weather and Device. We picked these two domains as they exhibit complementary behavior: while Weather contains slot-heavy utterances (an average of 1.6 slots per utterance), Device is an intent-heavy domain with an average of only 0.8 slots per utterance. We split the data into 4077, 1167, and 559 utterances for training, testing, and validation, respectively.
CS data collection proceeded in the following steps: 1. One of the authors, who is a native speaker of Spanish and uses Spanglish on a daily basis, generated a small set of CS utterances for the Weather and Device domains. Additionally, we recruited bilingual EN/ES speakers who met our Spanglish-speaker criteria, established following Escobar and Potowski (2015).
2. We wrote Spanglish data creation instructions and asked participants to produce Spanish-English CS utterances for each intent (e.g., ask for the weather, set device brightness, etc.).
3. Next, we filtered out utterances from this pool, retaining only those that exhibited true intrasentential CS.
4. The collected utterances were labeled by two annotators, who identified the intent and slot spans. If the two annotators disagreed on the annotation for an utterance, a third annotator resolved the disagreement to provide the final annotation. Table 1 shows the number of distinct intents and slots for each domain and the number of utterances in CSTOP per domain. We also show the 15 most common intents in the training set, alongside a representative Spanglish example and its slot values for each intent, in Table 2. The first value in a slot tuple is the slot label and the second is the slot value. We can see that while most of the verbs and stop words are in Spanish, nouns and slot values are mostly in English. We further calculate the prevalence of Spanish and English words using a 20k-word vocabulary file for each language: each token in the CSTOP training set is assigned to the language in which that token has the lower (more frequent) rank. The ratio of Spanish to English tokens is around 1.34, which matches our anecdotal observation above. This ratio remained consistent when increasing the vocabulary size to 40k.
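The rank-based language assignment above can be sketched as follows; the vocabulary files are assumed to map each token to its frequency rank, and the toy ranks below are invented for illustration:

```python
def assign_language(token, es_rank, en_rank):
    """Assign a token to the language in which it is more frequent.

    es_rank / en_rank map tokens to their frequency rank in the 20k
    vocabulary files (rank 1 = most frequent); tokens missing from a
    vocabulary get an infinite rank in that language.
    """
    inf = float("inf")
    r_es = es_rank.get(token, inf)
    r_en = en_rank.get(token, inf)
    if r_es == r_en:
        return "tie"  # e.g., out-of-vocabulary in both lists
    return "es" if r_es < r_en else "en"

# Toy ranks (hypothetical): "para" is far more common in ES, "friday" in EN.
es_rank = {"para": 15, "clima": 120, "friday": 9000}
en_rank = {"friday": 200, "weather": 150, "para": 7000}
print(assign_language("para", es_rank, en_rank))    # es
print(assign_language("friday", es_rank, en_rank))  # en
```

Counting the "es" and "en" assignments over the training tokens then yields the ratio reported above.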

Model
Our base model is a bidirectional LSTM with separate projections for intent classification and slot tagging (Yao et al., 2013). We use the aligned MUSE word embeddings (Lample et al., 2018) with a vocabulary size of 25k for both EN and ES. Our experiments showed that for the best XL generalization, it is best to freeze the word embeddings when the training data contains only EN or only ES utterances. We refer to this model simply as MUSE.
We also evaluate the pre-trained cross-lingual Transformer models XLM and XLM-R. These models are pre-trained via Masked Language Modeling (MLM) (Devlin et al., 2019) on massive multilingual data. They share the word-piece token representations, BPE (Sennrich et al., 2016) and SentencePiece (Kudo and Richardson, 2018), as well as a common MLM Transformer across languages. Moreover, while XLM is pre-trained on Wikipedia, XLM-R is trained on crawled web data, which contains more non-English and possibly CS data. In order to adapt these models to the joint intent classification and slot tagging task, we use the method described in Chen et al. (2019). For classification, we add a linear classifier on top of the first hidden state of the Transformer. A typical slot tagging model feeds the hidden states corresponding to each token to a CRF layer (Mesnil et al., 2015). To make this compatible with XLM and XLM-R, we use the hidden states corresponding to the first sub-word of every token as the input to the CRF layer. Table 3 shows the accuracy of the above models on CSTOP. We also list the performance when the models are first fine-tuned on the EN data (CS+EN). We observe that in-domain fine-tuning can almost halve the gap between XLM and XLM-R; XLM is around 50% faster than XLM-R during inference. The training details for all our models and the validation results are listed in the Appendix.
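The first-sub-word selection step above is essentially an index-selection problem; a minimal sketch, assuming a SentencePiece-style segmentation where a leading "▁" marks the start of a word (the exact tokenizer convention and example segmentation are assumptions):

```python
def first_subword_indices(subword_tokens):
    """Return positions of sub-words that begin a new word.

    Assumes SentencePiece-style pieces where a leading "\u2581" marks a
    word boundary; all other pieces continue the previous word. The
    Transformer hidden states at these positions feed the CRF layer.
    """
    return [i for i, tok in enumerate(subword_tokens) if tok.startswith("\u2581")]

# "pon el clima" might tokenize as below (hypothetical segmentation):
pieces = ["\u2581pon", "\u2581el", "\u2581cli", "ma"]
print(first_subword_indices(pieces))  # [0, 1, 2]
```

Selecting one hidden state per word keeps the CRF's label sequence aligned with the token-level slot annotations, regardless of how finely the tokenizer splits each word.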

Zero-shot performance
The bottom part of Table 3 shows the CS test accuracy when using only in-domain monolingual data. Our EN dataset is the task-oriented parsing dataset (Schuster et al., 2019a) described in the previous section. Since the original TOP dataset does not include any utterances belonging to the Device domain, we also release a dataset of around a thousand EN Device utterances for the experiments using EN data. In order to showcase the effect of monolingual ES data, we also experiment with an in-domain ES dataset, i.e., ES Weather and Device queries.
We observe that having monolingual data in both languages yields very high accuracy, only a few points shy of training directly on the CS data. Moreover, in this setting, even simpler models such as MUSE can yield results competitive with XLM-R while being much faster. However, the advantage of XL pre-training becomes evident when only one of the languages is present: for XLM-R, having only the substrate language (i.e., ES) is almost as good as having both languages.
Note that we do not use ES data for other results in this paper. Obtaining a semantic parsing dataset in another language is expensive, and often only EN data is available. Our experiments show a huge performance gap when using only the EN data; thus, in the rest of this paper, we focus on using the EN data alongside zero or a few CS instances.

Effect of XL Embeddings
Here, we explore how much of the zero-shot performance can be attributed to the XL embeddings as opposed to the shared XL representation. As such, we experiment with replacing the MUSE embeddings with other embeddings in the LSTM model explained in the previous section. We experiment with the following strategies: (1) Random embedding: the ES and EN word embeddings are learned from scratch. (2) Randomly-initialized SentencePiece (Kudo and Richardson, 2018) (RSP): words are represented by word-piece tokens learned from a large unlabeled multilingual corpus, with randomly initialized embeddings. (3) Pre-trained XLM-R SentencePiece (XLSP): the 250k embedding vectors learned during the pre-training of XLM-R.
We show the effects of the aforementioned embeddings in the zero-shot setting in Table 4. We can see that when monolingual datasets from both languages are available, even the Random strategy yields competitive accuracy. We can also see that when ES data is available, RSP provides some code-switching generalizability compared with the Random strategy, but not when only EN data is available. We hypothesize that the shared sub-word tokens are more helpful for generalizing to the slot values (which in the code-switched data are mostly in EN) than to the non-slot parts of the queries, which are more commonly in ES. This is also supported by the observation that most of the gains of RSP over Random in the ES-only scenario come from the slot tagging accuracy rather than the intent detection.
As a final note, we observe that between 20-30% of the XLM-R gains can be captured by using the pre-trained sentence-piece embeddings, while the rest come from the shared XL representation pre-trained on massive unlabeled data. In the rest of the paper, we focus on the XLM-R model.

Data Augmentation Approaches
In this section, we discuss two data augmentation approaches. The first addresses the zero-shot setting and only uses EN data to improve performance on the Spanglish test set. The second assumes a limited amount of Spanglish data and uses the EN data to augment the few-shot setting.

Translate and Align
We explore creating synthetic ES data from the EN dataset using machine translation. Since our task is a joint intent and slot tagging task, creating a synthetic ES corpus consists of two parts: a) obtaining a parallel EN-ES corpus by machine-translating utterances from EN to ES, and b) projecting gold annotations from EN utterances onto their ES counterparts via word alignment (Tiedemann et al., 2014; Lee et al., 2019b). Once the words in both languages are aligned, the slot annotations are simply copied over from EN to ES. (Figure 3, right, shows fast-align, which allows a many-to-many alignment; hence "percent" is correctly aligned with "por ciento".)
For word alignment, we explore the two methods explained below. In some cases, word alignment may produce discontinuous slot tokens in ES; we handle this by introducing new slots of the same type for all discontinuous slot fragments.
Our first method leverages the attention scores (Bahdanau et al., 2015) obtained from an existing EN to ES NMT model. We adopt a simplifying assumption that each source word is aligned to one target-language word (Brown et al., 1993). For every slot token in the source language, we align it with the target word receiving the highest attention score.
Our next approach to annotation projection makes use of unsupervised word alignment from statistical machine translation. Specifically, we use the fast-align toolkit (Dyer et al., 2013) to obtain alignments between EN and ES tokens. Since fast-align generates asymmetric alignments, we generate two sets of alignments, EN to ES and ES to EN, and symmetrize them using the grow-diag-final-and heuristic (Koehn et al., 2003) to obtain the final alignments.
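The attention-based projection, including the one-target-per-source simplification and the discontinuous-slot split, can be sketched as follows (the data structures and toy attention matrix are illustrative assumptions, not the authors' implementation):

```python
def project_slots_by_attention(src_slots, attention, tgt_len):
    """Project slot labels from EN to ES via attention argmax.

    src_slots: dict of source token index -> slot label (e.g. "SL:LEVEL").
    attention[i][j]: score of source token i attending to target token j.
    Each source token is aligned to its single highest-scoring target
    token (the one-to-one simplification). Returns (label, target index
    run) slots; discontinuous alignments of one label are split into
    separate slots of the same type. For simplicity, tokens sharing a
    label are treated as one source slot.
    """
    aligned = {}  # label -> set of aligned target indices
    for i, label in src_slots.items():
        j = max(range(tgt_len), key=lambda t: attention[i][t])
        aligned.setdefault(label, set()).add(j)

    slots = []
    for label, idxs in aligned.items():
        run = []
        for j in sorted(idxs):
            if run and j != run[-1] + 1:  # gap -> start a new slot fragment
                slots.append((label, run))
                run = []
            run.append(j)
        slots.append((label, run))
    return slots

# Toy case: source "fifty percent" -> target "cincuenta por ciento".
# The argmax for "percent" lands on "ciento" and misses "por", so the
# projected slot becomes discontinuous: [('SL:LEVEL', [0]), ('SL:LEVEL', [2])].
attn = [[0.9, 0.05, 0.05],   # "fifty"   -> "cincuenta"
        [0.1, 0.2, 0.7]]     # "percent" -> "ciento"
print(project_slots_by_attention({0: "SL:LEVEL", 1: "SL:LEVEL"}, attn, 3))
```

The toy case illustrates exactly the failure mode discussed next: when a translated slot changes length, the one-to-one assumption drops interior words, whereas many-to-many alignments can keep the span contiguous.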
In Table 5, we show the CS zero-shot accuracy when fine-tuning on the newly generated ES data (called ES*) alongside the original EN data. We can see that unsupervised alignment results in around 2.5 absolute points of accuracy improvement. On the other hand, using attention alignment ends up hurting the accuracy, perhaps due to the slot noise that it introduces. The assumption that a single source token aligns with a single target token leads to incorrect annotations when a translated slot has different lengths in EN and ES. Figure 3 shows an example utterance where attention alignment produces an incorrect annotation compared to unsupervised alignment.

Generate by Match-and-Filter in the Few-shot Setting
Here, we assume access to a limited amount of high-quality in-domain CS data and, as such, construct the CSTOP 100 dataset of around 100 utterances sampled from the original CSTOP training set. We make sure that every individual slot and intent (but not necessarily every combination) is present in CSTOP 100 and randomly sample the rest. We perform this sampling three times and report few-shot results as the average performance. This setting is of paramount importance for bringing up a domain in a new locale when EN data is already available. The first column in Table 6 shows the CS Few-Shot (FS) performance alongside fine-tuning on the EN data and on the aligned translated data, averaged over the three samplings of CSTOP 100. In order to improve the FS performance, we perform data augmentation on the CSTOP 100 dataset. Unlike methods such as Pratapa et al. (2018), we seek generic methods that do not need extra resources such as constituency parsers. Instead, we explore using pre-trained generative models while taking advantage of the EN data.
We use BART (Lewis et al., 2020), a denoising autoencoder trained on a massive amount of web data, as the generative model. Our goal is to generate diverse Spanglish data from the EN data. Even though BART was trained for English, we found it very effective for this task. We hypothesize this is due to the abundance of Spanish text within EN web data and the overlap of word-piece tokens between the two languages. We also experimented with multilingual BART (Liu et al., 2020a) but found it very challenging to fine-tune for this task.
First, we convert the data to a bracketed format (Vinyals et al., 2015), called the seqlogical form. Examples of this format are shown in Fig. 3. In the seqlogical form, the intent (i.e., the sentence label) appears at the beginning, and each slot's label and text appear together in brackets.
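A minimal serializer for this bracketed format might look as follows; the function and span representation are a sketch of the format described above, not the authors' exact serialization:

```python
def to_seqlogical(intent, tokens, slots):
    """Render an annotated utterance in a bracketed seqlogical form.

    slots: list of (label, start, end) token spans, end exclusive.
    The intent opens the sequence; each slot's label and text are
    wrapped together in brackets; unlabeled tokens pass through.
    """
    slot_at = {start: (label, end) for label, start, end in slots}
    out = ["[" + intent]
    i = 0
    while i < len(tokens):
        if i in slot_at:
            label, end = slot_at[i]
            out.append("[" + label + " " + " ".join(tokens[i:end]) + "]")
            i = end
        else:
            out.append(tokens[i])
            i += 1
    out.append("]")
    return " ".join(out)

tokens = "que clima hay para next friday".split()
print(to_seqlogical("IN:GET_WEATHER", tokens, [("SL:DATE_TIME", 3, 6)]))
# [IN:GET_WEATHER que clima hay [SL:DATE_TIME para next friday] ]
```

Because the slot values appear inline, a sequence-to-sequence model that emits this format produces utterance and annotation in a single pass, which is what makes the generation step below possible.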
We perform our data augmentation technique in the following steps: 1. Find the top-K closest EN neighbors to every CS query in CSTOP 100. We require the neighbors to have the same parse as the CS utterance, i.e., the same intent and slot labels, and use the Levenshtein distance to rank the EN sequences.
2. Having this parallel corpus, i.e., the top-K EN neighbors as the source and the original CS query as the target, fine-tune the BART model. We use K=10 in our experiments to increase the parallel data size to around 650.
3. During inference, use a beam size of 5 to decode CS utterances from the same EN source data. Since both the source and target sequences are in the seqlogical form, the generated CS sequences are already annotated.
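Step 1 hinges on retrieving parse-matched EN neighbors ranked by edit distance; a minimal sketch with a textbook Levenshtein implementation (the pool contents and parse representation here are hypothetical):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def top_k_neighbors(cs_utt, cs_parse, en_pool, k=10):
    """EN utterances sharing the parse skeleton, ranked by edit distance.

    en_pool: list of (utterance, parse) pairs, where a parse is
    (intent, tuple of slot labels). Only parse-identical candidates
    are considered, as required in step 1.
    """
    cands = [u for u, p in en_pool if p == cs_parse]
    return sorted(cands, key=lambda u: levenshtein(cs_utt, u))[:k]

pool = [("is it cold next friday", ("IN:GET_WEATHER", ("SL:DATE_TIME",))),
        ("turn on the lights", ("IN:TURN_ON", ()))]
print(top_k_neighbors("hace frio next friday",
                      ("IN:GET_WEATHER", ("SL:DATE_TIME",)), pool, k=1))
```

Requiring an exact parse match keeps the synthetic source-target pairs semantically parallel, so the fine-tuned model learns only the surface code-switching, not new meanings.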
In Fig. 3, we have shown the closest EN neighbor corresponding to the original CS example in Fig. 1. The CS utterance can be seen as a rough translation of the EN sentence. We have also shown the top three generated CS utterances from the EN example.
In order to reduce noise, we filter out generated sequences that already exist in CSTOP 100, are not valid trees, or have a semantic parse different from that of the original utterance. We augment CSTOP 100 with the remaining data and fine-tune the XLM-R baseline.
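The filter can be sketched as a bracket-validity check plus a parse-skeleton comparison over the seqlogical strings described earlier (the regular expression and helper names are illustrative assumptions):

```python
import re

def parse_skeleton(seq):
    """Extract (intent, sorted slot labels) from a seqlogical string.

    Returns None when the bracket tree is malformed or there is no
    leading intent label. Assumes labels like IN:GET_WEATHER / SL:DATE_TIME.
    """
    depth = 0
    for ch in seq:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:      # closing bracket with no open partner
                return None
    if depth != 0:             # unclosed bracket
        return None
    labels = re.findall(r"\[(IN:\w+|SL:\w+)", seq)
    if not labels or not labels[0].startswith("IN:"):
        return None
    return labels[0], tuple(sorted(labels[1:]))

def keep(generated, source_skeleton, existing):
    """Match-and-filter: drop duplicates, invalid trees, parse mismatches."""
    if generated in existing:
        return False
    skel = parse_skeleton(generated)
    return skel is not None and skel == source_skeleton

print(parse_skeleton("[IN:GET_WEATHER que [SL:DATE_TIME hoy] ]"))
print(parse_skeleton("[IN:GET_WEATHER que [SL:DATE_TIME hoy]"))  # None
```

Comparing only the skeleton (intent plus slot labels) rather than the slot text lets the generator vary the surface realization while still rejecting outputs whose meaning drifted from the source.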
In the second column of Table 6, we show the average data augmentation improvement over the three CSTOP 100 samples in the few-shot setting. We can see that even after fine-tuning on the EN monolingual data (second row), the augmentation technique improves this strong baseline. In the last row, we first use the translation alignment of the previous section to obtain ES*. After fine-tuning on this set combined with the EN data, we further fine-tune on CSTOP 100. We can see that the best model enjoys improvements from both the zero-shot (translate-and-align) and the few-shot (generate by match-and-filter) augmentation techniques. We also note that the p-values corresponding to the second- and third-row gains are 0.018 and 0.055, respectively.
Related Work
Pre-trained language models (Devlin et al., 2019) have demonstrated effectiveness on sequence labeling tasks. Separately, Liu et al. (2020a) introduce mBART, a sequence-to-sequence model pretrained on monolingual corpora in many languages using the denoising autoencoder objective of Lewis et al. (2020).

Code-Switching
Following the ACL shared tasks, CS is mostly discussed in the context of word-level language identification (Molina et al., 2016) and NER (Aguilar et al., 2018). Techniques such as curriculum learning (Choudhury et al., 2017) and attention over different embeddings (Wang et al., 2018; Winata et al., 2019a) have been among the successful approaches. CS parsing and the use of monolingual parsers are discussed in Sharma et al. (2016) and Bhat et al. (2017). Sharma et al. (2016) introduce a Hinglish test set for a shallow parsing pipeline. In Bhat et al. (2017), the outputs of two monolingual dependency parsers are combined to achieve a CS parse. Later work extends this test set by including training data and transferring knowledge from monolingual treebanks. Duong et al. (2017) introduce a CS test set for semantic parsing, curated by combining utterances from two monolingual datasets. In contrast, CSTOP is procured independently of the monolingual data and exhibits much more linguistic diversity. In Pratapa et al. (2018), linguistic rules are used to generate CS data, which has been shown to be effective in reducing the perplexity of a CS language model. In contrast, our augmentation techniques are generic and do not require rules or constituency parsers.

XL Data Augmentation
Most approaches to cross-lingual data augmentation use machine translation and slot projection for sequence labeling tasks (Jain et al., 2019). Wei and Zou (2019) use simple operations such as synonym replacement, and Lee et al. (2019a) use phrase replacement from a parallel corpus to augment the training data. Singh et al. (2019) present XLDA, which augments data by replacing segments of the input text with their translations in other languages. Some recent approaches (Winata et al., 2019b) also train generative models to artificially generate CS data. More recently, Kumar et al. (2020) study data augmentation using pre-trained transformer models by incorporating label information during fine-tuning. Concurrent to our work, Bari et al. (2020) introduce Multimix, where data augmentation from pre-trained multilingual language models and self-learning are used for semi-supervised learning. CS data has also been generated by translating keywords picked based on attention scores from a monolingual model, and generating CS data has recently been studied in Liu et al. (2020b).

Task-oriented Dialog
The intent/slot framework is the most common way of performing language understanding for task-oriented dialog. A bidirectional LSTM sentence representation alongside separate projection layers for intent and slot tagging is the typical architecture for the joint task (Yao et al., 2013; Mesnil et al., 2015; Hakkani-Tür et al., 2016). Such representations can accommodate trees of depth up to two, as is the case in CSTOP. More recently, extensions of this framework have been introduced to fit deeper trees (Rongali et al., 2020).

Conclusion
In this paper, we propose a new task for code-switched semantic parsing and release a dataset, CSTOP, containing 5800 Spanglish utterances over two domains. We hope this fosters further research on the code-switching phenomenon, which has been held back by the paucity of sizeable curated datasets. We show that cross-lingual pre-trained models generalize better than traditional models to the code-switched setting when monolingual data from only one language is available. In the presence of only EN data, we introduce generic augmentation techniques based on translation and generation: we show that translating and aligning the EN data can significantly improve zero-shot performance, and that generating code-switched data using a generation model with a match-and-filter approach leads to improvements in the few-shot setting. We leave exploring and combining other augmentation techniques to future work.