Adapting Multilingual Models for Code-Mixed Translation

The scarcity of gold-standard code-mixed to pure-language parallel data makes it difficult to train translation models reliably. Prior work has addressed this paucity of parallel data with data augmentation techniques. Such methods rely heavily on external resources, making systems difficult to train and scale effectively to multiple languages. We present a simple yet highly effective two-stage back-translation based training scheme for adapting multilingual models to the task of code-mixed translation, which eliminates dependence on external resources. We show a substantial improvement in translation quality (measured through BLEU), beating prior work by up to +3.8 BLEU on code-mixed Hi → En, Mr → En, and Bn → En tasks. On the LinCE Machine Translation leaderboard, we achieve the highest score for code-mixed Es → En, beating the existing best baseline by +6.5 BLEU and our own stronger baseline by +1.1 BLEU.


Introduction
As code-mixing (Diab et al., 2014; Winata et al., 2019; Khanuja et al., 2020; Aguilar et al., 2020) becomes widespread in an increasingly digitized bilingual community, it becomes important to extend translation systems to handle code-mixed input. A major challenge for training code-mixed translation models is the lack of parallel data. Recent work on generating synthetic parallel data from available non-code-mixed parallel data depends on language-specific tools for transliteration, word alignment, and language identification (Gupta et al., 2021). This makes the approach difficult to scale to new languages and increases software complexity. Back-translation (BT) is another effective and popular strategy for handling the non-availability of parallel data (Sennrich et al., 2016; Edunov et al., 2018). However, for the code-mixed to English translation task, simple BT is not an option, since we cannot assume the presence of an English to code-mixed translation model.
Meanwhile, the mainstream translation community is converging on frameworks based on multilingual models for translation between multiple language pairs (Johnson et al., 2017; Aharoni et al., 2019; Arivazhagan et al., 2019; Zhang et al., 2020; Fan et al., 2021). Going forward, code-mixed translation needs to be integrated within these frameworks to impact practical systems.
We propose a novel two-stage back-translation methodology called Back-to-Back Translation (B2BT) for adapting multilingual models to code-mixed translation. Our approach is simple and integrates easily with existing multilingual translation models, without any need for special models or language-specific tools. We compare B2BT with six other baselines on both standalone and mBART-based models across four benchmarks and show significant gains. For example, on code-mixed Hindi-to-English translation, B2BT improves over the state of the art by +3.8 BLEU and over default back-translation by +6.3 BLEU. We analyze the reasons for the gains via both human evaluation and impact on downstream models. We release a new dataset and will publicly release our code.

Our Approach
Our objective is to train a model that can translate a sentence from the code-mixed language C, which contains words from English and an additional language S, into monolingual English E. Following Myers-Scotton (1997), we refer to S as the matrix language, as it lends its grammar to a code-mixed utterance, and to English as the embedded language, since it lends only its words. We are given a parallel S-to-English corpus (S, E) and a non-parallel code-mixed corpus C. Since code-mixing appears more in domains like social media, which differ from formal domains like news in which the parallel data (S, E) is available, we additionally use an in-domain monolingual English corpus E_MD.

Training Base Multilingual Model The first step is to train a multilingual model M on the parallel matrix-language-to-English corpus (S, E) in both directions, and on non-parallel data in English (E_M), the matrix language (S_M), and code-mixed text (C). Following Johnson et al. (2017), we prefix source sentences with one of <2en>, <2cm>, and <2xx>, directing the target to be English, CM, or S respectively. For the non-parallel corpora, we train the model to copy the source to the target after masking out 20% of the source tokens, as in Song et al. (2019b).
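To make the data construction concrete, the following is a minimal Python sketch (not the authors' released code) of how training examples for M could be assembled; the direction tags and the 20% masking rate follow the description above, while the helper names and corpus variables are illustrative.

```python
import random

MASK = "<M>"                                          # mask token added to the vocabulary
TAGS = {"en": "<2en>", "cm": "<2cm>", "xx": "<2xx>"}  # target-language tags

def mask_tokens(sentence, p=0.2):
    """Randomly replace roughly 20% of the tokens with the mask token."""
    return " ".join(MASK if random.random() < p else tok for tok in sentence.split())

def parallel_examples(pairs_s_e):
    """Bidirectional S <-> E examples from the parallel corpus (S, E)."""
    for s, e in pairs_s_e:
        yield TAGS["en"] + " " + s, e   # S -> E
        yield TAGS["xx"] + " " + e, s   # E -> S

def copy_examples(monolingual_corpus, tag):
    """Masked copy task for the non-parallel corpora E_M, S_M and C."""
    for sent in monolingual_corpus:
        yield tag + " " + mask_tokens(sent), sent
```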
The above training exposes M to all three languages in both the encoder and the decoder, and one baseline is to use this bidirectional model directly for our task. We will show that such a model provides only marginal gains over a simple S → E model. We therefore adapt M further using synthetic parallel data for the C → E task. Back-translation (BT) of E to C using M to generate synthetic parallel data yields very poor quality, as we show in Section 4. This motivates our two-stage BT approach. A key insight of the B2BT method is that M, trained with parallel S → E data, gives better-quality outputs when translating C to E than in the reverse direction. The reason is that C shares the grammatical structure of S, and M is trained to handle noise in the input. We describe the two BT steps next.
Fine-tune for E → C Here we prepare M to back-translate pure English sentences into code-mixed sentences, so that the resulting synthetic parallel data can be used to train a better code-mixed to English translation model. We first back-translate the monolingual code-mixed corpus C to English E_B using M. The back-translation is done by prefixing <2en> to the code-mixed input and sampling English output from M. This provides us with a synthetic English-to-code-mixed parallel corpus (E_B, C). We fine-tune M on (E_B, C), with source sentences prefixed with <2cm>, to produce a model M′. Since the target distribution C is preserved during training, we can now generate high-quality in-domain code-mixed sentences using M′.
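A minimal sketch of this first stage, assuming a generic `sample_translation(model, source)` decoding helper (a stand-in for whatever sampling API the underlying toolkit provides; it is not part of the paper):

```python
def back_translate_cm_to_en(model_M, cm_corpus, sample_translation):
    """Back-translate the monolingual code-mixed corpus C into English E_B."""
    return [sample_translation(model_M, "<2en> " + cm) for cm in cm_corpus]

def e2c_finetune_pairs(e_b, cm_corpus):
    """Synthetic (E_B, C) pairs used to fine-tune M into M' (English -> code-mixed)."""
    return [("<2cm> " + en, cm) for en, cm in zip(e_b, cm_corpus)]
```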
Fine-tune for C → E In the final step we realise our objective of C → E translation. We start by back-translating the in-domain monolingual English corpus E_MD to code-mixed C_B using M′. This is done by prefixing English sentences with the <2cm> tag and sampling code-mixed outputs from M′. We now have a synthetic code-mixed-to-English parallel corpus (C_B, E_MD). We fine-tune M on this synthetic parallel corpus, with all source sentences in C_B prefixed with the <2en> token, to obtain our final model M*.
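The second stage can be sketched in the same style, again using the hypothetical `sample_translation` helper:

```python
def back_translate_en_to_cm(model_M_prime, en_domain_corpus, sample_translation):
    """Back-translate the in-domain English corpus E_MD into code-mixed C_B using M'."""
    return [sample_translation(model_M_prime, "<2cm> " + en) for en in en_domain_corpus]

def c2e_finetune_pairs(c_b, en_domain_corpus):
    """Synthetic (C_B, E_MD) pairs used to fine-tune M into the final model M*."""
    return [("<2en> " + cm, en) for cm, en in zip(c_b, en_domain_corpus)]
```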

Related Work
The biggest challenge in translating code-mixed sentences is the lack of large parallel training data (Mahesh et al., 2005; Menacer et al., 2019; Nakayama et al., 2019; Srivastava and Singh, 2020). Gupta et al. (2021) propose to create synthetic parallel CM data via two steps: (1) train an mBERT model to identify the word set W to switch in a sentence from S to E, effectively creating a sentence in C; (2) align parallel sentences from (S, E) and replace the words in W with their aligned English words. We call this the mBertAln method in this paper. Applying this pipeline to a new language S requires four external tools: (1) mBERT pre-trained on S, (2) a language identification tool to spot English tokens in a CM sentence, (3) a word alignment model, and (4) an E → S translator for BT. For low-resource languages such tools may not exist. In contrast, B2BT is completely standalone. Even when external tools exist, we show empirically that the synthetic sentences thus generated tend to be of lower quality than ours because of errors in either of the two steps. The CALCS 2021 workshop (Solorio et al., 2021) also released a shared task for CM translation, but the submissions so far are straightforward applications of multilingual BART models, with which we also compare our method.

B2BT is reminiscent of dual-learning NMT methods (He et al., 2016; Artetxe et al., 2018; Hoang et al., 2018; Cheng et al., 2016), but those methods were designed for two generic languages, whereas B2BT handles three languages related in specific asymmetric ways. We exploit that asymmetry to design our training schedule: for example, since C → E translations are more accurate than the reverse, we insert the intermediate BT stage.

Experiments
We use the notation SoEn→En to indicate translation from a code-mixed matrix language with language code 'So' to English. We evaluate on four code-mixed datasets: Hindi (HiEn→En) from Gupta et al. (2021); Spanish (EsEn→En) from the LinCE leaderboard; Bengali (BnEn→En) from Gupta et al. (2021), but augmented with the newly released Samanantar data to create a stronger baseline (evaluation is done on the splits released by the authors); and a new Marathi (MrEn→En) dataset that we introduce. A summary of the training data used and our model setup is given in Appendices A and B.

Baselines We compare our method B2BT against the mBertAln model (Gupta et al., 2021) and these baselines: (1) the base bilingual S → E model, (2) the base model fine-tuned with E → S BT on the domain data E_MD, (3) the base multilingual model M obtained after the first stage of B2BT, (4) M fine-tuned with E → S BT on the domain data E_MD, and (5) M fine-tuned with E → C BT on E_MD.

Results Table 1 compares the B2BT approach against these baselines on HiEn→En, BnEn→En, and MrEn→En. Observe how B2BT significantly outperforms mBertAln and the multilingual model adapted with existing single-step back-translation across all language pairs. We also see substantial improvements on the two adversarial subsets ST-OOV and ST-Hard. This establishes the importance of our two-stage back-translation approach. Note in particular that when we fine-tuned with

Our approach can also complement existing multilingual pre-trained models such as mBART. Table 2 presents results with the base multilingual model M trained by fine-tuning an mBART checkpoint.
Here again we observe gains beyond simple BT-based fine-tuning of the multilingual model.
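All scores are corpus-level BLEU. As a reference point, one way such scores could be computed is with the sacrebleu library, as in the sketch below; the file names are placeholders and the paper's exact evaluation settings may differ.

```python
import sacrebleu

# Placeholder file names: one detokenized sentence per line.
with open("hypotheses.en") as f:
    hyps = [line.strip() for line in f]
with open("references.en") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.1f}")
```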

Why does B2BT outperform mBertAln?
We hypothesize that the reason our model performs substantially better is that the synthetic data it generates is of higher quality. To test this hypothesis, we replace the synthetic code-mixed parallel data of B2BT with synthetic data from mBertAln (Gupta et al., 2021) while keeping the rest of the training of M* unchanged. Table 3 presents this result. It is important to note that all the fine-tuning sets have exactly the same size and all fine-tuning is performed on the same multilingual base model M; the only difference is the method used to create the synthetic side of the fine-tuning dataset. The improvement of almost +4.9 BLEU points on ST-Test over using mBertAln data clearly shows that the synthetic data from our model has better quality.
To quantify this directly, we performed a human evaluation of data quality. Human raters were asked to rate fluency and intent preservation for source-target pairs (similar to Wu et al. (2016)) on a scale of 0 (irrelevant) to 6 (perfect). Across 500 examples, we observe that synthetic data from B2BT is rated 4.27 out of 6 on average, compared to 3.74 for mBertAln. In 39% of examples B2BT is rated higher than mBertAln, in 45% of examples the two receive the same score, and only in 17% of examples is mBertAln rated higher (Table 4). In mBertAln, the quality of synthetic data can suffer because of poor back-translation, mBERT failing to capture the code-switching pattern, or the alignment model failing to predict the aligned English token. Figure 2 presents examples of synthetic sentences generated by B2BT vs. mBertAln. The mBertAln output has word repetitions like "open" in row 2, which could be an alignment mistake, and word omissions like "box" in row 1, which could be caused by poor back-translation or alignment.
Finally, we compare code-mixing statistics between the synthetic data generated by B2BT and by mBertAln in Table 4. The data generated by B2BT is closer to the test data in terms of the Code-Mixing Index, the fraction of English tokens common to the source and target, and the average probability of switching at a given word.
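For reference, the sketch below shows one way these statistics could be computed, assuming per-token language labels are available ("en", "xx" for the matrix language, or "other" for language-independent tokens); the CMI formula follows Gambäck and Das (2016), and the exact computation used in the paper may differ.

```python
from collections import Counter

def cmi(token_langs):
    """Code-Mixing Index of one utterance: 100 * (1 - max_lang / (n - u))."""
    n = len(token_langs)
    u = sum(1 for lang in token_langs if lang == "other")
    if n == 0 or n == u:
        return 0.0
    counts = Counter(lang for lang in token_langs if lang != "other")
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

def switch_probability(token_langs):
    """Fraction of adjacent (language-tagged) token pairs whose labels differ."""
    langs = [lang for lang in token_langs if lang != "other"]
    if len(langs) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(langs, langs[1:]))
    return switches / (len(langs) - 1)
```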
Varying degree of code-mixing Following Gupta et al. (2021), we also evaluate the effectiveness of our model across splits of the test set with varying Code-Mixing Index (CMI) (Gambäck and Das, 2016). Figure 3 presents the improvements from our model on the three splits of the test set. We see improvements across all splits, but the largest improvements are on the split with the highest degree of code-mixing. On the high-CMI split, we see about a +8.7 BLEU point improvement over the mBERT approach and a +14.5 BLEU point improvement over the baseline.

Masking during fine-tuning in B2BT A distinctive property of code-mixed translation is the word overlap between source and target sentences. Such overlap makes the fine-tuned model overly biased towards the easier copy action. We alleviate this bias by randomly masking words in the source sentence (with masking probability 0.2). Unlike prior work (Song et al., 2019b), which applies such masking only during pre-training with monolingual corpora, we mask tokens even when training with parallel data. We evaluate the impact of this source-side masking in B2BT's fine-tuning stages. Table 5 compares model performance with and without source-side masking during fine-tuning. We observe noticeable gains, with the highest for BnEn→En at +1.5 BLEU.
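A sketch of how this source-side masking could be applied to the synthetic parallel pairs, reusing a `mask_tokens` helper like the one sketched earlier (illustrative, not the exact implementation):

```python
def masked_finetune_pairs(pairs, mask_tokens, p=0.2):
    """Mask roughly 20% of source tokens in each synthetic (source, target) pair."""
    return [(mask_tokens(src, p), tgt) for src, tgt in pairs]
```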

Conclusion
We present a simple two-stage back-translation approach (B2BT) for adapting multilingual models to code-switched translation. B2BT shows remarkable improvements on four datasets compared to recent methods and to default back-translation baselines. Our approach fits naturally within existing multilingual translation frameworks, which is crucial for expanding coverage to low-resource languages without building per-language-pair models. We demonstrate through ablation studies and human evaluations that the synthetic data created by the two-step process in B2BT is of higher quality than the data used by existing work.

Limitations
Our method depends on monolingual code-mixed data, which may not always be available. Additionally, for low-resource languages we may not have access to enough non-code-mixed parallel data, which also forms a crucial component of our approach.
Standalone Multilingual Models

For training all non-mBART models, we use the standard Transformer architecture from Vaswani et al. (2017) with six encoder and six decoder layers. In the data pre-processing step, we first tokenize with the Indic-NLP tokenizer (Kunchukuttan, 2020) for Indic-language and code-mixed sentences, and with the Moses tokenizer for pure English sentences. Next, we apply BPE with codes learned jointly on the monolingual English and monolingual non-code-mixed datasets, using 20,000 merge operations (the resulting dictionary is manually appended with the special tokens <2en>, <2xx>, <2cm>, and <M>). We use the Adam optimizer with a learning rate of 5e-4 and 4,000 warmup steps. We train all models for up to 100 epochs and select the best checkpoint based on loss on the validation split. For the two BT-based fine-tuning stages in B2BT, we use a constant learning rate of 1e-4 and a random 2K subset of the BT data as the validation split.

Pre-trained mBART-based Multilingual Models
The mBART models are trained by fine-tuning the CC25 mBART checkpoint. The model has 12 encoder and 12 decoder layers, with a model dimension of 1024 and 16 attention heads (∼610M parameters). We modify the existing SentencePiece model by adding the three special tokens <2en>, <2xx>, and <2cm> so that they are not split during tokenization, and we add them to the dictionary by replacing three tokens of a language we are not currently experimenting with. The multilingual model is trained for 100K steps, while the fine-tuning stages of B2BT are trained for up to 25K steps.

Figure 1: B2BT training pipeline, showing the two-stage back-translation-based adaptation of an initial multilingual model. (•) indicates source-side masking during training.

Figure 2: Examples of synthetic sentences from mBertAln vs. B2BT. English translations of Devanagari words are provided.

Figure 3: Improvements in BLEU with B2BT over the mBERT-based model and the domain-adapted bilingual baseline, across three splits of the test set with varying degrees of code-mixing in the source.

Table 5: Comparing BLEU on ST-Test between masked and un-masked fine-tuning to train M* in the B2BT approach.