Code-Switched Text Synthesis in Unseen Language Pairs



Introduction
Code-switching, the linguistic phenomenon of using more than one language within a single utterance or conversation,¹ is a common expression of multilingualism in informal text and speech (Auer and Wei, 2008; Gumperz, 1982). To accommodate the needs of multicultural and multilingual societies and individuals, there is a growing interest in investigating models dedicated to code-switching

* Work was done when the author interned at Amazon.
¹ In this paper, we mainly focus on sentence-level code-switching involving only two languages.

[Figure 1: An illustration of our problem setting. Given a sentence in an arbitrary language (English (En) in the figure) and a designated language (Chinese (Zh) and German (De) in the figure), the model needs to synthesize a corresponding code-switched sentence that mixes the original language and the designated language. Additionally, we allow the designated language selections to differ from examples seen during training.]
within the realm of conversational AI (FitzGerald et al., 2022; Khanuja et al., 2020; Winata et al., 2022; Sitaram et al., 2019). However, a notable obstacle in code-switching modeling is the scarcity of large-scale code-switched text datasets for different applications in diverse language pairs (Gupta et al., 2020; Tarunesh et al., 2021). This necessitates generative models capable of synthesizing code-switched texts, facilitating subsequent studies on code-switching. Most prior work on text synthesis for code-switching assumes the availability of training data for all language pairs being tested. Early efforts concentrate on individual language pairs (Samanta et al., 2019; Chang et al., 2019; Tarunesh et al., 2021). For example, Bhat et al. (2016) develop a code-switched text synthesizer for Hindi-English based on linguistic rules (Poplack, 1980; Belazi et al., 1994; Myers-Scotton, 1997), while Winata et al. (2019) and Garg et al. (2018) explore neural generative models for Chinese-English code-switched text synthesis. More recently, Gupta et al. (2020) present pioneering efforts in developing a generic method for producing high-quality and fluent code-switched sentences across diverse language pairs, achieved through the collection of code-switched texts in multiple languages.
However, the requirement of training on code-switched texts for target language pairs hinders the scalability of existing models to cover a broader range of language pairs. Many real-world code-switching scenarios, such as Swahili-English in Tanzania (Kanijo, 2018) and Shona-English in Zimbabwe (Mashiri, 2002), suffer from limited or nonexistent curated datasets. Recognizing this resource limitation, our study focuses on synthesizing code-switched text in multiple language pairs, including language pairs that are unseen during training (the zero-shot transfer setting (Huang et al., 2021, 2022)). In this setting, models must learn code-switched patterns from limited code-switched training data in some language pairs and generalize to other language pairs, as shown in Fig. 1. The setting enables a more flexible process of code-switched text synthesis by using existing resources to assist resource-limited language pairs. Yet, it also introduces new challenges: (1) models must possess the ability to generate tokens in multiple languages; (2) models need to acquire a transferable code-switching ability such that they can generate code-switched text in unseen language pairs. To overcome these challenges, we propose GLOSS, a GeneraLized cOde-Switched text Synthesizer that introduces an additional code-switching module to a pre-trained multilingual machine translation model (PMMTM). The code-switching module, implemented either through an adapter (Houlsby et al., 2019) or extra prefixes (Li and Liang, 2021), offers a parameter-efficient approach to transfer learning from machine translation to code-switched text synthesis. Inheriting the abilities of the PMMTM, GLOSS can generate text across multiple languages. The incorporation of an additional code-switching module, instead of directly fine-tuning the PMMTM, serves as an effective method to prevent the model from overfitting to the specific code-switched language pairs seen in training.
Furthermore, we develop a self-training algorithm on the target language pairs to improve GLOSS further. Specifically, our preliminary study shows that although GLOSS can successfully generate reasonable code-switched sentences, when performing zero-shot transfer to unseen language pairs it may still generate non-code-switched sentences (around 11% to 13% of cases). The proposed self-training framework introduces weakly-supervised signals to help GLOSS generate target-domain cases more stably when the target language pair is known.² To achieve this, we iteratively fine-tune GLOSS on a filtered dataset that is generated by GLOSS itself in the target domain. The filter incorporates a language identification model to remove low-quality instances.³ Being fine-tuned on the filtered data, GLOSS learns to generate texts that satisfy the filtering rules and becomes more stable.
Our contribution is three-fold. First, we present GLOSS, a code-switched text synthesizer that can generate code-switched sentences across multiple language pairs, even those not in the training data. To the best of our knowledge, we are the first to study this setting. Second, we introduce a self-training framework to further improve GLOSS in the setting where the target language pair is known. Third, extensive experiments, including automatic evaluations on four languages and human evaluations on two languages, showcase GLOSS's strong performance: GLOSS achieves at least 55% relative BLEU and METEOR score improvements compared to strong baselines.

Problem Formulation
Our goal is to synthesize code-switched (CS) texts for language pairs whose CS examples are never provided during training.
Given a monolingual input sentence x_e in language l_e and an assigned language l_m (l_m ≠ l_e), we aim to generate a sentence x_{m,e} that mixes l_m and l_e while preserving the semantic meaning of x_e.⁴ We consider the setting where the assigned languages l_m at testing time differ from those at training time. More formally, as illustrated in Figure 1, the training set consists of N language pairs (l_{e_n}, l_{m_n}), n ∈ {1, 2, ..., N}, while the testing set includes target language pairs where l_{m_t} ∉ {l_{m_1}, ..., l_{m_N}} for all t. This scenario reflects real-world situations where code-switched data is more readily available for certain language pairs, such as Spanish-English and Hindi-English, while it is less accessible for others, such as Bengali-English and Swahili-English.

[Figure 2: An overview of our GLOSS model. GLOSS is built on top of a pre-trained multilingual machine translation model (PMMTM), which is trained using machine translation data in many different language pairs. After the PMMTM is prepared, we augment the model with an adapter or extra prefixes. The adapter or prefixes are trained using code-switched data while the PMMTM's parameters are frozen during fine-tuning.]

Method
We introduce GLOSS, a GeneraLized cOde-Switched text Synthesizer that tackles the two specific challenges raised by our problem setting: (1) the model needs to generate texts across many languages, some of which do not even appear in the CS training data; (2) the model needs to learn a transferable CS ability such that it generates reasonable CS sentences in unseen language pairs. Fig. 2 provides an overview.
To address the first challenge, we begin by obtaining a Pre-trained Multilingual Machine Translation Model (PMMTM) using multilingual machine translation data, which covers all languages that will be used for final CS text synthesis (§3.1).⁵ The remaining challenge is how to turn the PMMTM into a code-switched text synthesizer given CS training data with only limited language coverage.
We propose to augment an additional code-switching module onto the PMMTM, thereby creating GLOSS (§3.2). This additional code-switching module is trained on our limited CS data while keeping the PMMTM parameters fixed. Instead of fine-tuning the entire PMMTM, this modularized design improves systematic generalization (Bahdanau et al., 2019; Ruis and Lake, 2022): the PMMTM focuses on generating translated sentences, while the code-switching module concentrates on "mixing" languages. This approach allows GLOSS to be more adaptable and less prone to overfitting during fine-tuning on CS data. Finally, we present a self-training framework that enables GLOSS to generate CS texts in target language pairs more stably (§3.3).

PMMTM
Multilingual machine translation models (Ha et al., 2016; Johnson et al., 2017; Baziotis et al., 2022; Tang et al., 2020) enable simple deployment and parameter-efficient support of machine translation for a large number of language pairs by using a shared representation space. To train a PMMTM, we follow the strategy of mBART-50 (Tang et al., 2020) to notify the model of the source language and the target language to translate into. Specifically, a language-specific special token is prepended to both the source and target sentences. Hence, during decoding, the first token fed to the decoder is the target language's special token that guides the translation. This is illustrated in Fig. 2.
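As a minimal, framework-free sketch of this input/output convention (the token strings follow mBART-50's language codes such as en_XX and hi_IN, but the formatting function below is illustrative, not the actual model code):

```python
# Sketch of the mBART-50-style language-token convention used by the PMMTM:
# a language-specific special token is prepended to both the source and the
# target sentence, and decoding starts from the target language's token.

def format_translation_pair(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Prepend language tokens to a (source, target) training pair."""
    encoder_input = [src_lang] + src_tokens + ["</s>"]
    # During decoding, the first token fed to the decoder is the target
    # language's special token, which guides the translation direction.
    decoder_input = [tgt_lang] + tgt_tokens
    return encoder_input, decoder_input

enc, dec = format_translation_pair(
    ["I", "like", "tea"], ["mujhe", "chai", "pasand", "hai"],
    src_lang="en_XX", tgt_lang="hi_IN",
)
print(enc)  # ['en_XX', 'I', 'like', 'tea', '</s>']
print(dec)  # ['hi_IN', 'mujhe', 'chai', 'pasand', 'hai']
```

At inference time, swapping the target language token (e.g., hi_IN for de_DE) is what redirects the same shared model to a different output language.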

The GLOSS Model
After obtaining a PMMTM, which can comprehend and generate phrases across multiple languages, our next step is to transform it into a CS text synthesizer. A common approach is to directly fine-tune the PMMTM on CS training data (Tarunesh et al., 2021; Gupta et al., 2020). However, models directly fine-tuned on new data can easily overfit to the fine-tuning scenario, making it hard to transfer the code-switching ability to unseen language pairs. Therefore, instead of directly fine-tuning the whole PMMTM, we propose to use an additional code-switching module paired with the PMMTM. The module is specifically learned to mix languages for a given translation pair generated by the PMMTM.
To implement this design and enable end-to-end training, we employ either an adapter (Houlsby et al., 2019) or extra prefixes (Li and Liang, 2021) as the code-switching module. Both are parameter-efficient methods to introduce control into pre-trained models and guide the final generation (He et al., 2022).
Adapter. An adapter is an additional layer (and parameters) introduced inside each Transformer block (Vaswani et al., 2017); it was shown to be an effective way to conduct transfer learning for NLP tasks (Houlsby et al., 2019). This layer is appended after each feed-forward layer (in a Transformer block). It projects the original features to a smaller dimension and then projects them back to the original size, ensuring that the number of added parameters stays substantially small.
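To make the adapter design concrete, here is a minimal NumPy sketch of a bottleneck adapter; the dimensions, initialization, and nonlinearity are illustrative, not our actual configuration:

```python
import numpy as np

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.

    hidden: (seq_len, d_model) activations after a feed-forward layer.
    w_down: (d_model, d_bottleneck); w_up: (d_bottleneck, d_model).
    Only these two small matrices are trained; the backbone stays frozen.
    """
    down = hidden @ w_down            # project to a small bottleneck dimension
    down = np.maximum(down, 0.0)      # ReLU (the actual nonlinearity may differ)
    up = down @ w_up                  # project back to the model dimension
    return hidden + up                # residual connection preserves the backbone

rng = np.random.default_rng(0)
d_model, d_bottleneck = 8, 2          # toy sizes; real models use e.g. 1024 -> 64
h = rng.normal(size=(5, d_model))
out = adapter(h, rng.normal(size=(d_model, d_bottleneck)) * 0.01,
              rng.normal(size=(d_bottleneck, d_model)) * 0.01)
assert out.shape == h.shape           # adapter keeps the original feature size
```

Because the added parameters scale with d_model × d_bottleneck rather than the backbone size, the module stays small relative to the frozen PMMTM.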
Prefix. Prefixes are another parameter-efficient way to conduct transfer learning for NLP tasks (Li and Liang, 2021). Prefixes are new key and value matrices used when calculating attention in the Transformer. More specifically, trainable prefixes are a set of vectors that are concatenated with the original key and value matrices when calculating dot-product attention. Hence, in each layer, inputs are influenced by these additional keys and values after attention is applied.
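A minimal sketch of how trainable prefixes enter the attention computation (shapes and values are illustrative; real models use multi-head attention):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(q, k, v, prefix_k, prefix_v):
    """Dot-product attention where trainable prefix vectors are concatenated
    with the original key/value matrices; only the prefixes are trained."""
    k_full = np.concatenate([prefix_k, k], axis=0)   # (p + seq, d)
    v_full = np.concatenate([prefix_v, v], axis=0)
    scores = q @ k_full.T / np.sqrt(q.shape[-1])     # (seq, p + seq)
    return softmax(scores) @ v_full                  # inputs now attend to prefixes too

rng = np.random.default_rng(0)
seq, p, d = 4, 30, 8                                 # prefix length 30, as in our setup
q, k, v = (rng.normal(size=(seq, d)) for _ in range(3))
out = prefix_attention(q, k, v, rng.normal(size=(p, d)), rng.normal(size=(p, d)))
assert out.shape == (seq, d)                         # output length is unchanged
```

The prefixes thus steer every attention layer's output without modifying any frozen weight of the backbone.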
During fine-tuning on CS training data, we keep the parameters of the PMMTM frozen and train only the adapter or prefixes. This lets the code-switching module learn how to blend a translated distribution with the input sentence. When GLOSS is tested and tasked with generating a code-switched sentence in an unseen target language pair, the frozen PMMTM, having been trained to produce translations for this specific pair, can still generate reliable translations. Given reliable translations, the code-switching module continues to perform the same function it learned during training: blending languages. As a result, GLOSS exhibits improved generalization capabilities.

GLOSS with Self-Training
Although GLOSS can generalize to synthesize CS text in any language the PMMTM supports, the generation can still be unstable. As we will show in §5, GLOSS still generates non-CS sentences in around 11% to 13% of cases when performing zero-shot transfer to unseen language pairs. Hence, we aim to improve this stability when more information about the test case is provided. We assume a common scenario in real practice: the target language pair (l_m, l_e) is known, and we can update GLOSS to fit this specific target language pair.
We design a self-training procedure that incorporates off-the-shelf language identification models to help GLOSS synthesize target CS sentences more stably. The procedure is illustrated in Fig. 3. More specifically, we first use the input sentences written in l_e in the CS training data as input queries and ask GLOSS to make predictions on the target language l_m, forming potential CS sentences x_{m,e}. Then, we use language identification models to filter sentences based on the following constraints:
• The synthesized sentence should cover at least one token from l_m.
• The synthesized sentence should also cover tokens from l_e.
• The synthesized sentence cannot cover tokens from languages other than l_m and l_e.
We use CLD3⁶ as the language identification model; it extracts character n-grams from the input text and computes an embedding based on the fraction of times each n-gram character appears. Notably, CLD3's training does not rely on code-switched text. We leverage CLD3's predicted language distribution for each token to determine whether each generated sentence meets the aforementioned constraints. We filter out low-quality instances and collect the remaining sentences as a synthetic code-switching corpus specific to the target domain. This corpus is subsequently used for further fine-tuning of GLOSS. The procedure can be executed repeatedly for R rounds, where R is a hyper-parameter. Note that other advanced filtering can easily be included in our proposed procedure, and we leave that exploration for future work.
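A sketch of the filtering step, with a hypothetical per-token identify_language function standing in for CLD3's per-token language predictions (the toy identifier below is for illustration only):

```python
def passes_filter(tokens, identify_language, l_m, l_e):
    """Keep a synthesized sentence only if it satisfies all three constraints.

    identify_language is a stand-in for a per-token language identifier;
    in practice we use CLD3's predicted language distribution per token.
    """
    langs = {identify_language(tok) for tok in tokens}
    has_lm = l_m in langs                     # (1) at least one token from l_m
    has_le = l_e in langs                     # (2) at least one token from l_e
    only_pair = langs <= {l_m, l_e}           # (3) no tokens from other languages
    return has_lm and has_le and only_pair

# Toy identifier: a tiny hard-coded Hindi vocabulary, everything else "en".
toy_id = lambda tok: "hi" if tok in {"mujhe", "pasand", "hai"} else "en"
assert passes_filter(["mujhe", "coffee", "pasand", "hai"], toy_id, "hi", "en")
assert not passes_filter(["I", "like", "coffee"], toy_id, "hi", "en")  # no Hindi token
```

Sentences that pass the filter form the synthetic corpus used for the next round of fine-tuning.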
Different from the classic self-training algorithm in semi-supervised learning (Fei et al., 2023), in our procedure the initial model is a zero-shot transfer model. Additionally, we apply a filtering process to further improve the quality of the synthetic code-switching corpus.

Discussion
Utilizing pre-trained models that are initially trained on machine translation data as a foundation for constructing code-switched (CS) text synthesizers has gained significant attention recently due to the resemblance between machine translation and CS text synthesis (Tarunesh et al., 2021; Gupta et al., 2020). However, our work differs from theirs in that we train a single model capable of consuming all the machine translation data, thereby supporting translation across multiple language pairs. In contrast, prior works rely on selecting data based on the target language pair (l_m and l_e) known a priori.
Our approach enables a unified model that possesses the ability to generate phrases in multiple languages, thereby facilitating CS text synthesis across various language pairs. Conversely, constraining the training of the PMMTM to a limited number of languages, such as a few specific pairs, would result in GLOSS losing its ability to generalize to a broader range of CS language pairs.

Automatic Evaluation

Experimental Settings
Dataset and Evaluation Metrics. We use the data provided by Gupta et al. (2020), which covers eight language pairs: Bengali-English (Bn-En), German-English (De-En), Spanish-English (Es-En), French-English (Fr-En), Hindi-English (Hi-En), Malayalam-English (Ml-En), Tamil-English (Ta-En), and Telugu-English (Te-En). Note that in this dataset, the input sentence is always English; hence, the target code-switched (CS) language pair is X-English, where X is one of the languages the dataset covers. The original paper refers to the language pairs as English-X, but we reverse the naming to present the dominant language first. The dataset statistics are listed in Appendix §A.
In our setting, we conduct leave-one-out experiments: seven CS language pairs are selected as the CS training data, and the remaining pair is the test language pair. We select Bn-En, De-En, Es-En, and Hi-En as the four test scenarios based on the language resource levels defined in Tang et al. (2020), such that our selection covers high-resource (German, Spanish), medium-resource (Hindi), and low-resource (Bengali) languages. We evaluate the synthesized text using BLEU (Papineni et al., 2002) and METEOR.

Implementation Details. We use two different PMMTMs for GLOSS. The first directly adapts the pre-trained mBART50-many-to-many-MMT model (mBART50-MMT) from Tang et al. (2020), a machine translation model trained on 50 language pairs using the ML50 benchmark. The second further fine-tunes mBART50-MMT on the machine translation data collected by Gupta et al. (2020) to make an "augmented mBART50-MMT" (augment-MMT). The second setting is considered because the machine translation data in the ML50 benchmark is limited for Indic languages. Hence, we further fine-tune mBART50-MMT on the machine translation data provided in Gupta et al. (2020) for three epochs. Note that the machine translation data in Gupta et al. (2020) covers only eight language pairs, making augment-MMT a more restricted machine translation model in terms of supported languages. All GLOSS variants (mBART50-MMT/augment-MMT paired with adapter/prefix) are implemented using the Huggingface package (Wolf et al., 2020) as the backbone. To implement the adapter and prefix, we leverage AdapterHub (Pfeiffer et al., 2020). We use its default setting of prefix length 30 and apply prefixes to the self-attention blocks in the Transformer encoder, and to both the self-attention and cross-attention blocks in the Transformer decoder. We train GLOSS on a machine equipped with 4 NVIDIA Tesla V100 GPUs, using 1 GPU at a time with around 30 hours of training.
We use the AdamW optimizer (Loshchilov and Hutter, 2019) with the learning rate set to 10^-5 and the weight decay set to 10^-5. We set the batch size to 12 and the number of training epochs to 15. For GLOSS with self-training, we experiment with R ∈ {1, 2, 5} rounds chosen heuristically. Hyper-parameters, except for R, are determined based on the available CS data in the development set without considering the left-out language pair. Due to computational resource restrictions, our experimental results are from a single seed. We note the gradual performance improvement as R increases in §4.3. However, determining the optimal stopping point for R is challenging since no development data exists under the zero-shot scenario. As a result, we decided not to increase R further in our experiments.
Compared baselines. Three types of baselines are considered:
• Unsupervised baselines - (1) Copy Input: directly copy the input sentence as the prediction; (2) Translate, Align, then Swap.

[Table 2: Automatic evaluation results for GLOSS paired with our self-training procedure. We evaluate the results in BLEU (B) and METEOR (M). Numbers in bold are the best performance among models using the same architecture. We can observe gradual improvement as more rounds of self-training are applied to GLOSS.]
• Supervised baselines - (1) Gupta et al. (2020): a sequence-to-sequence model that leverages XLM (Conneau and Lample, 2019) features and utilizes the transfer learning signal from machine translation to warm up the model; (2) Fine-tuned PMMTM on all language pairs: we fine-tune mBART50-MMT on CS data in all eight language pairs.
• Zero-shot transfer baselines - (1) Fine-tuned PMMTM on available language pairs: fine-tune the whole mBART50-MMT on the available CS training data only (excluding the test language pair).
Note that the training of supervised baselines includes CS data in target language pairs; hence, they can be viewed as an upper bound for GLOSS. Zero-shot transfer baselines are trained only on CS data from language pairs other than the target. Unsupervised baselines do not use any CS training data.

Main Results
Tab. 1 shows the results. From the table, we can observe that the unsupervised baselines generate very unreliable CS sentences in general. Additionally, naively fine-tuning the whole PMMTM can perform even worse than the unsupervised methods. GLOSS improves over the unsupervised and zero-shot transfer baselines by at least 55% relative score across the board, and every variation of GLOSS outperforms these baselines. Comparing different variations of GLOSS, we observe that GLOSS with prefixes is more robust than GLOSS with an adapter, especially in cases where the PMMTM performs worse (Bengali and Hindi, due to the limited machine translation training data used in mBART50-MMT). Furthermore, comparing GLOSS equipped with augment-MMT against GLOSS equipped with mBART50-MMT highlights the PMMTM's impact on our model.

Results Given Known Target Language
When the target language pair is known, we can apply our self-training procedure to GLOSS. We experiment on GLOSS using prefixes and present results in Tab. 2. From the table, we observe consistent improvement when adopting self-training, and the improvement is especially significant for Hindi-English. Additionally, conducting self-training with more rounds yields gradual improvements for both GLOSS with mBART50-MMT and GLOSS with augment-MMT.

Human Evaluation
To further verify the quality of our method, we conduct human evaluation for Hindi-English and Chinese-English code-switched (CS) text, using English sentences as the source.

Evaluator Selection
Considering that the annotation task requires people familiar with both English and Chinese (or English and Hindi), we use a high-standard selection process to recruit three professionals for the human evaluation. For Hindi-English annotation, we engaged a team of expert professionals who were contracted to provide labels for various Hindi- and English-related tasks. They are all native Hindi speakers and highly skilled in Hindi-English code-switching. Our Chinese-English annotators are native Chinese NLP researchers with over three years of experience, residing in the US for at least four years, and proficient in Chinese, English, and Chinese-English code-switching. We offer a competitive hourly payment that meets regional legal standards, though it is difficult to determine the average payment for this single task.

[Table 3: Human evaluation results for GLOSS in Hindi-English (Hi-En) and Chinese-English (Zh-En). Code-switching correctness rate (CS Rate) measures the percentage of predictions that are correct CS sentences. F abbreviates the Fluency score; S abbreviates the Semantic Correctness score. Geometric Mean (Geo. Mean) is the average over samples of the geometric mean of each sample's code-switching correctness, fluency, and semantic scores. Results for the supervised baseline and the ground truth are presented as an upper bound (UB) for reference.]

Experimental Settings
Dataset. To avoid the evaluation being biased toward the domain we trained on, we collect testing English instances by selecting sentences from the following CS datasets. We sample 50 sentences for each language pair.
• Hindi-English: We use the data released from Tarunesh et al. (2021), who collected the dataset via crowd-sourcing in India. Every data point in this dataset is a pair of an English sentence and its corresponding Hindi-English CS sentence. • Chinese-English: We use the transcript data of the SEAME dataset (Lyu et al., 2010), which is a Chinese-English CS speech recognition dataset.
To get the English counterpart of the Chinese-English sentences, we ask experts to translate the CS sentence back to their English version.
Compared models. We compare six methods: (1) Translate, Align, then Swap, which serves as a representative of unsupervised methods; (2) Fine-tuned PMMTM on available language pairs, which serves as a baseline for zero-shot transfer; (3) GLOSS + prefix, using augment-MMT as the backbone for Hindi-English and mBART50-MMT as the base model for Chinese-English; (4) GLOSS + prefix + self-training, which applies self-training (R = 5) to GLOSS + prefix; (5) Fine-tuned PMMTM on all language pairs, which serves as a strong supervised baseline. Note that since the training dataset in Gupta et al. (2020) does not contain the Chinese-English pair, this baseline is not applicable when evaluating on Chinese-English; (6) Ground truth, the original CS sentences we sampled from the dataset.
Evaluation Procedure. We ask each expert annotator to evaluate the outputs for all 50 testing instances from all models (i.e., 300 sentences for Hindi-English and 250 for Chinese-English). Our questionnaire covers the following three questions, using Hindi-English as an example.
• Code-switching Correctness: whether the presented sentence is correct CS (binary score). Specifically, we define a sentence as correct CS if it satisfies the constraints: (a) it is not fully Hindi or English; (b) it is mainly in Hindi; and (c) it contains no language other than English and Hindi.
• Fluency: the fluency of the prediction presented to annotators, scored from 1 to 5, with 5 as the best.
• Semantic Correctness: whether the predicted sentence correctly conveys the meaning of the corresponding input sentence, scored from 1 to 5, with 5 as a fully correct translation.
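Under our reading of the aggregation (the average over samples of each sample's geometric mean of the three scores), the reported Geometric Mean can be computed as follows; the sample annotations below are made up for illustration:

```python
def geo_mean_score(cs_correct, fluency, semantic):
    """Per-sample geometric mean of the three human-evaluation scores.
    cs_correct is binary (0/1), so any non-CS prediction scores 0 overall."""
    return (cs_correct * fluency * semantic) ** (1.0 / 3.0)

# Hypothetical annotations for three predictions: (CS correct?, fluency, semantic)
samples = [(1, 4, 5), (1, 3, 3), (0, 5, 5)]
per_sample = [geo_mean_score(c, f, s) for c, f, s in samples]
aggregate = sum(per_sample) / len(per_sample)
assert per_sample[2] == 0.0            # an incorrect CS prediction zeroes out the sample
assert 0.0 < aggregate < 5.0
```

This choice penalizes predictions that fail any single dimension: a fluent, semantically faithful sentence that is not actually code-switched contributes nothing to the aggregate.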

Results
Tab. 3 presents the results. First, we can observe that the code-switching correctness rate is extremely low for the zero-shot baseline (Fine-tuned PMMTM on available language pairs). Second, although the unsupervised baseline (Translate, Align, then Swap) achieves a high code-switching success rate, its low fluency reveals that deciding a suitable position to switch languages is a task beyond random. Third, we can observe that self-training successfully improves the code-switching quality across all metrics in both languages, indicating the method's effectiveness.

[Figure 4: Real examples generated by the models for German-English and Bengali-English cases. Fine-tuned PMMTM (Zst.) refers to the Fine-tuned PMMTM on available language pairs method. The explanation for each prediction is presented in the bottom boxes of the figure.]

Output Examples
Lastly, we present real examples generated by our models in Fig. 4. From these examples, we can see that directly fine-tuning the whole PMMTM on CS training data generates unnatural predictions, or even predictions containing tokens in other languages. In contrast, GLOSS generates more stable results, and our self-training algorithm further helps GLOSS generate high-quality CS sentences.

Related Work
Early approaches (Pratapa et al., 2018; Bhat et al., 2016; Pratapa and Choudhury, 2021; Li and Fung, 2014) to code-switched (CS) text synthesis were built on various linguistic theories, such as functional head constraints (Belazi et al., 1994), Matrix-Language theory (Myers-Scotton, 1997; Joshi, 1982), and Equivalence-Constraint theory (Poplack, 1980; Sankoff, 1998; Sitaram et al., 2019). Although many of these efforts had some success, the above-mentioned methods can only generate CS text in the same set of language pairs used in training. Given the difficulties of acquiring CS data, this requirement hinders the scalability of these models to support more language pairs. Hence, in this paper, we take a step forward to explore the possibility of zero-shot transfer generalization in CS text synthesis and present GLOSS, which can generate reasonable outputs.

Conclusion
In this paper, we develop a novel generalized code-switched text synthesizer that can generate code-switched sentences even when the corresponding code-switched training data is unavailable. We introduce GLOSS, which is built on top of a pre-trained multilingual machine translation model and augmented with an adapter or prefixes. The modularized design, learning dedicated parameters for mixing languages given a translated distribution, helps the overall system generalize, fulfilling our goal. Extensive experiments verify our method's effectiveness qualitatively and quantitatively. In the future, we plan to investigate how our synthesizer performs on downstream tasks such as conversational understanding in a code-switched scenario.

Limitation
Our paper presents a pilot exploration of a new setting in code-switched text synthesis: the target language pair selection is not limited to pairs for which we already have training data. Although we have shown the strengths of GLOSS qualitatively and quantitatively, our experimental setting is still confined by a dataset restriction: all the input text is in English. The challenge would be even harder if the source languages were more diverse, and we leave such exploration for future work.
Additionally, due to computational restrictions, we only explore mBART50-MMT and augment-MMT as PMMTMs in GLOSS. From the experimental results, we do observe the benefit of having a stronger PMMTM in GLOSS. We anticipate that the model's performance can be further improved by leveraging stronger PMMTMs, and we leave this exploration for the future.

Broader Impacts
Our proposed models are based on a model pre-trained on large-scale multilingual machine translation data. It is known that machine translation models can capture biases reflected in their training data (Wang et al., 2022). Therefore, our models can potentially generate code-switched text containing offensive or biased content. We suggest that careful examination of potential bias is an essential step before deploying our model in any real-world application.