From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enable the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.


Introduction
Code-switching (CS) refers to the linguistic phenomenon of using more than one language within a single sentence or conversation. CS appears naturally in conversational speech among multilingual speakers. The main challenge with building models for conversational CS text is that we do not have access to large amounts of CS text that is conversational in style. One might consider using social media text that contains CS and is more readily available. However, the latter is quite different from conversational CS text in its vocabulary (e.g., due to the frequent use of abbreviated slang terms, hashtags and mentions), in its sentence structure (e.g., due to character limits in tweets) and in its word forms (e.g., due to transliteration being commonly employed in social media posts). This motivates the need for a generative model of realistic CS text that can be sampled to subsequently train models for CS text.
In this work, we tackle the problem of generating high-quality CS text using only limited amounts of real CS text during training. We also assume access to large amounts of monolingual text in the component languages and parallel text in both languages, which is a reasonable assumption to make for many of the world's languages. We focus on Hindi-English CS text where the matrix (dominant) language is Hindi and the embedded language is English. Rather than train a generative model, we treat this problem as a translation task where the source and target languages are monolingual Hindi text and Hindi-English CS text, respectively. We also use the monolingual Hindi text to construct synthetic CS sentences using simple techniques. We show that synthetic CS text, albeit naive in its construction, plays an important role in improving our model's ability to capture CS patterns.
We draw inspiration from the large body of recent work on unsupervised machine translation (Lample et al., 2018a,b) to design our model, which will henceforth be referred to as Translation for Code-Switching, or TCS. TCS, once trained, will convert a monolingual Hindi sentence into a Hindi-English CS sentence. TCS makes effective use of parallel text when it is available and uses backtranslation-based objective functions with monolingual text.
Below, we summarize our main contributions:
1. We propose a state-of-the-art translation model that generates Hindi-English CS text starting from monolingual Hindi text. This model requires very small amounts of real CS text, uses both supervised and unsupervised training objectives and considerably benefits from a carefully designed training curriculum that includes pretraining with synthetically constructed CS sentences.
2. We introduce a new Hindi-English CS text corpus in this work. Each CS sentence is accompanied by its monolingual Hindi translation. We also designed a crowdsourcing task to collect CS variants of monolingual Hindi sentences. The crowdsourced CS sentences were manually verified and form a part of our new dataset.
3. We use sentences generated from our model to train language models for Hindi-English CS text and show significant improvements in perplexity compared to other approaches.
4. We present a rigorous evaluation of the quality of our generated text using multiple objective metrics and a human evaluation study, and they clearly show that the sentences generated by our model are superior in quality and successfully capture naturally occurring CS patterns.

Related Work
Early approaches to language modeling for code-switched text included class-based n-gram models (Yeh et al.), factored language models that exploited a large number of syntactic and semantic features (Adel et al., 2015), and recurrent neural language models (Adel et al., 2013) for CS text. All these approaches relied on access to real CS text to train the language models. Towards alleviating this dependence on real CS text, there has been prior work on learning code-switched language models from bilingual data (Li and Fung, 2014b,a; Garg et al., 2018b) and a more recent direction that explores the possibility of generating synthetic CS sentences. Pratapa et al. (2018) present a technique to generate synthetic CS text that grammatically adheres to a linguistic theory of code-switching known as the equivalence constraint (EC) theory (Poplack, 1979; Sankoff, 1998). Lee and Li (2020) proposed a bilingual attention language model for CS text trained solely using a parallel corpus.
Another recent line of work has explored neural generative models for CS text. Garg et al. (2018a) use a sequence generative adversarial network (SeqGAN (Yu et al., 2017)) trained on real CS text to generate sentences that are used to aid language model training. Another GAN-based method proposed by Chang et al. (2019) aims to predict the probability of switching at each token. Winata et al. (2018) and Winata et al. (2019) use a sequence-to-sequence model enabled with a copy mechanism (Pointer Network (Vinyals et al., 2015)) to generate CS data by leveraging parallel monolingual translations from a limited source of CS data. Samanta et al. (2019) proposed a hierarchical variational autoencoder-based model tailored for code-switching that takes into account both syntactic information and language switching signals via the use of language tags. (We present a comparison of TCS with both Samanta et al. (2019) and Garg et al. (2018a) in Section 5.2.1.) In a departure from using generative models for CS text, we view this problem as one of sequence transduction where we train a model to convert a monolingual sentence into its CS counterpart. Chang et al. (2019) and Gao et al. (2019) use GAN-based models to modify monolingual sentences into CS sentences, while we treat this problem of CS generation as a translation task and draw inspiration from the growing body of recent work on neural unsupervised machine translation models (Lample et al., 2018a,b) to build an effective model of CS text.
The idea of using translation models for code-switching has been explored in early work (Vu et al., 2012; Li and Fung, 2013; Dhar et al., 2018). Concurrent with our work, there have been efforts towards building translation models from English to CS text (Solorio et al., 2021) and CS text to English (Gupta et al., 2021). While these works focus on translating from the embedded language (English) to the CS text or vice-versa, our approach starts with sentences in the matrix language (Hindi) which is the more dominant language in the CS text. Also, ours is the first work, to our knowledge, to repurpose an unsupervised neural machine translation model to translate monolingual sentences into CS text. Powerful pretrained models like mBART (Liu et al., 2020) have been used for code-mixed translation tasks in concurrent work (Gautam et al., 2021). We will further explore the use of synthetic text with such models as part of future work.

Our Approach
Figure 1 shows the overall architecture of our model. This is largely motivated by prior work on unsupervised neural machine translation (Lample et al., 2018a,b). The model comprises three stacked Transformer (Vaswani et al., 2017) encoder and decoder layers, two of which are shared while the remaining layer is private to each language. Monolingual Hindi (i.e., the source language) has its own private encoder and decoder layers (denoted by $\mathrm{Enc}_{p_0}$ and $\mathrm{Dec}_{p_0}$, respectively) while English and Hindi-English CS text jointly make use of the remaining private encoder and decoder layers (denoted by $\mathrm{Enc}_{p_1}$ and $\mathrm{Dec}_{p_1}$, respectively). In our model, the target language is either English or CS text. Ideally, we would like $\mathrm{Enc}_{p_1}$ and $\mathrm{Dec}_{p_1}$ to be trained only using CS text. However, due to the paucity of CS text, we also use text in the embedded language (i.e., English) to train these layers. Next, we outline the three main training steps of TCS.
(I) Denoising autoencoding (DAE): We use monolingual text in each language to estimate language models. In Lample et al. (2018b), this is achieved via denoising autoencoding, where an autoencoder is used to reconstruct a sentence from a noisy version of it whose structure is altered by dropping and swapping words arbitrarily (Lample et al., 2018a). The loss incurred in this step is denoted by $\mathcal{L}_{DAE}$ and is composed of two terms based on the reconstruction of the source and target language sentences, respectively.
(II) Backtranslation (BT): Once the layers are initialized, one can use non-parallel text in both languages to generate a pseudo-parallel corpus of backtranslated pairs (Sennrich et al., 2015). That is, a corpus of parallel text is constructed by translating sentences in the source language via the pipeline $\mathrm{Enc}_{p_0}$, $\mathrm{Enc}_{sh}$, $\mathrm{Dec}_{sh}$ and $\mathrm{Dec}_{p_1}$, and translating target sentences back to the source language via $\mathrm{Enc}_{p_1}$, $\mathrm{Enc}_{sh}$, $\mathrm{Dec}_{sh}$ and $\mathrm{Dec}_{p_0}$. The backtranslation loss $\mathcal{L}_{BT}$ is composed of cross-entropy losses from using these pseudo-parallel sentences in both directions.
[Figure 1: Overall architecture of TCS, mapping between Hi and En/CS. Pipelines used by each loss:
$\mathcal{L}_{DAE}$: $\mathrm{Enc}_{p_0} \to \mathrm{Enc}_{sh} \to \mathrm{Dec}_{sh} \to \mathrm{Dec}_{p_0}$; $\mathrm{Enc}_{p_1} \to \mathrm{Enc}_{sh} \to \mathrm{Dec}_{sh} \to \mathrm{Dec}_{p_1}$
$\mathcal{L}_{BT}$: $\mathrm{Enc}_{p_1} \to \mathrm{Enc}_{sh} \to \mathrm{Dec}_{sh} \to \mathrm{Dec}_{p_0}$; $\mathrm{Enc}_{p_0} \to \mathrm{Enc}_{sh} \to \mathrm{Dec}_{sh} \to \mathrm{Dec}_{p_1}$
$\mathcal{L}_{CE}$: $\mathrm{Enc}_{p_0} \to \mathrm{Enc}_{sh} \to \mathrm{Dec}_{sh} \to \mathrm{Dec}_{p_1}$; $\mathrm{Enc}_{p_1} \to \mathrm{Enc}_{sh} \to \mathrm{Dec}_{sh} \to \mathrm{Dec}_{p_0}$]

(III) Cross-entropy loss (CE): Both the previous steps used unsupervised training objectives and make use of non-parallel text. With access to parallel text, one can use the standard supervised cross-entropy loss (denoted by $\mathcal{L}_{CE}$) to train the translation models (i.e., going from $\mathrm{Enc}_{p_0}$ to $\mathrm{Dec}_{p_1}$ and $\mathrm{Enc}_{p_1}$ to $\mathrm{Dec}_{p_0}$ via the common shared layers).
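To make the noise model in step (I) concrete, the following is a minimal sketch of a word-dropping and local-swapping function of the kind used in denoising autoencoding. The text only states that words are dropped and swapped arbitrarily; the function name `add_noise`, the drop probability and the shuffle window are illustrative assumptions, not values reported for TCS.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Return a noisy copy of `tokens` using word dropout and local shuffling.

    drop_prob and max_shift are assumed, illustrative hyperparameters.
    """
    rng = rng or random.Random()

    # Word dropout: keep each token with probability (1 - drop_prob).
    kept = [tok for tok in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        # Never hand the autoencoder an empty input.
        kept = [rng.choice(tokens)]

    # Local shuffle: perturb each position by a random offset in
    # [0, max_shift] and re-sort, so tokens only move a few slots.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


if __name__ == "__main__":
    sentence = "main kal office jaa raha hoon".split()
    print(add_noise(sentence, rng=random.Random(0)))
```

The autoencoder would then be trained to reconstruct the original sentence from the output of such a function, once per language, which gives the two terms of the DAE loss.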

Synthetic CS text
Apart from the use of parallel text and monolingual text employed in training TCS, we also construct large volumes of synthetic CS text using two simple techniques. This synthetic CS text is non-parallel and is used to optimize both $\mathcal{L}_{DAE}$ and $\mathcal{L}_{BT}$. The role of the synthetic CS text is to expose TCS to various CS patterns (even if noisy), thereby encouraging the model to code-switch. The final step of finetuning using All-CS enables the model to mimic the switching patterns of real CS text.
The first technique (named LEX) is a simple heuristic-based technique that constructs a CS sentence by traversing a Hindi sentence and randomly replacing a word by its English translation using a bilingual lexicon (Conneau et al., 2017). The probability of replacing a word is chosen to match the switching distribution in real CS text.
The second technique (named EMT) is more linguistically aware. Following the methodology proposed by Bhat et al. (2016) that is based on the embedded matrix theory (EMT) for code-switching, we apply clause substitution methods to monolingual text to construct synthetic CS text. From inspecting English parse trees, we found that replacing embedded sentence clauses or subordinate clauses with their Hindi translations would likely produce CS text that appears somewhat natural.

Description of Datasets
We introduce a new Hindi-English CS dataset that we will refer to as All-CS. It is partitioned into two subsets, Movie-CS and Treebank-CS, based on their respective sources. Movie-CS consists of conversational Hindi-English CS text extracted from 30 contemporary Bollywood scripts that were publicly available (footnote 3). The Hindi words in these sentences were all Romanized, with potentially multiple non-canonical forms existing for the same Hindi token. We employed a professional annotation company to convert the Romanized Hindi words into their respective back-transliterated forms rendered in Devanagari script. We also asked the annotators to provide monolingual Hindi translations for all these sentences. Using these monolingual Hindi sentences as a starting point, we additionally crowdsourced for CS sentences via Amazon's Mechanical Turk (MTurk) (Amazon, 2005). Table 1 shows two Hindi sentences from Movie-CS and Treebank-CS, along with the different variants of CS sentences.
Turkers were asked to convert a monolingual Hindi sentence into a natural-sounding CS variant that was semantically identical. Each Turker had to work on five Hindi sentences. We developed a web interface through which Turkers could easily copy parts of the Hindi sentence they wanted to retain and splice in English segments. More details about this interface, the crowdsourcing task and worker statistics are available in Appendix A.
All-CS comprises a second subset of CS sentences, Treebank-CS, that was crowdsourced using MTurk. We extracted 5292 monolingual Hindi sentences (with sentence lengths less than or equal to 15 words) from the publicly available Hindi Dependency Treebank that contains dependency parses (footnote 4). These annotations parse each Hindi sentence into chunks, where a chunk is defined as a minimal, non-recursive phrase. Turkers were asked to convert at least one Hindi chunk into English. This was done in an attempt to elicit longer spans of English segments within each sentence. Figure 2 shows the sentence length distributions for Movie-CS and Treebank-CS, along with histograms accumulating English segments of different lengths in both subsets. We clearly see a larger fraction of English segments with lengths within the range [2-6] in Treebank-CS compared to Movie-CS.
Footnote 3: https://www.filmcompanion.in/category/fc-pro/scripts/ and https://moifightclub.com/category/scripts/
Footnote 4: http://ltrc.iiit.ac.in/treebank_H2014/
Table 2 provides detailed statistics of the new CS dataset. We also report two metrics proposed by Guzmán et al. (2017) to measure the amount of code-switching present in this new corpus. Monolingual Index (M-Index) is a value between 0 and 1 that quantifies the amount of mixing between languages (0 denotes a purely monolingual corpus and 1 denotes equal mixing from both languages) and I-Index measures the fraction of switching points in the corpus. We observe that Treebank-CS exhibits higher M-Index and I-Index values compared to Movie-CS, indicating more code-switching overall.
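The two corpus-level metrics can be illustrated with a short sketch over language-tagged tokens. The formulas below are the commonly cited definitions and are stated here as assumptions rather than taken from this paper: M-Index is computed as $(1 - \sum_j p_j^2) / ((k - 1) \sum_j p_j^2)$, where $p_j$ is the fraction of tokens in language $j$ and $k$ is the number of languages, and I-Index as the number of adjacent token pairs whose language differs, divided by the number of adjacent pairs. The function names and the tag values `"hi"`/`"en"` are hypothetical.

```python
from collections import Counter


def m_index(tags, num_languages=2):
    """M-Index over a flat list of per-token language tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1.0 - sum_sq) / ((num_languages - 1) * sum_sq)


def i_index(tagged_sentences):
    """Fraction of adjacent token pairs (within a sentence) that switch language."""
    switches = 0
    pairs = 0
    for tags in tagged_sentences:
        for prev, cur in zip(tags, tags[1:]):
            pairs += 1
            if prev != cur:
                switches += 1
    return switches / pairs if pairs else 0.0


if __name__ == "__main__":
    corpus = [["hi", "hi", "en", "en", "hi"], ["hi", "en", "hi"]]
    flat = [tag for sent in corpus for tag in sent]
    print("M-Index:", m_index(flat))
    print("I-Index:", i_index(corpus))
```

Under these definitions a purely monolingual corpus gives 0 for both metrics, and an evenly mixed two-language corpus gives an M-Index of 1, which matches the description in the text.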
All-CS also contains a non-trivial number of named entities (NEs) which are replaced by an NE tag in all our language modeling experiments.

Other Datasets
Parallel Hindi-English Text. As described in Section 5, TCS uses parallel text for supervised training. For this purpose, we use the IIT Bombay English-Hindi Corpus (Kunchukuttan et al., 2017) containing parallel Hindi-English text. We also construct a larger parallel corpus using text from the OpenSubtitles (OpSub) corpus (Lison and Tiedemann, 2016) that is more conversational and hence more similar in style to Movie-CS. We chose ~1 million English sentences (OpSub-EN), where each sentence contained an embedded clause or a subordinate clause to support the construction of EMT lines. We used the Google Translate API to obtain Hindi translations for all these sentences (OpSub-HI). Henceforth, we use OpSub to refer to this parallel corpus of OpSub-EN paired with OpSub-HI. We extracted 318K sentences from the IITB corpus after thresholding on length (5-15) and considering overlap in vocabulary with OpSub.
(One could avoid the use of an external service like Google Translate and use existing parallel text (Zhang et al., 2020) in conjunction with a word aligner to construct EMT lines. OpSub, being more conversational in style, turns out to be a better pretraining corpus. A detailed comparison of these choices is described in Appendix H.)
Datasets from existing approaches.
(I) VACS (Samanta et al., 2019) is a hierarchical variational autoencoder-based model designed to generate CS text. We train two VACS models, one on All-CS (VACSv1) and the other on OpSub-EMT followed by All-CS (VACSv2). (II) Garg et al. (2018a) use SeqGAN (Yu et al., 2017), a GAN-based sequence generation model, to generate CS sentences by providing an RNNLM as the generator. As with VACS, we train two SeqGAN models, one on All-CS (SeqGANv1) and one on OpSub-EMT followed by All-CS (SeqGANv2). Samples are drawn from both SeqGAN and VACS by first drawing a random sample from the standard normal distribution in the learned latent space and then decoding via an RNN-based generator for SeqGAN and a VAE-based decoder for VACS. We sample ~2M lines for each dataset to match the size of the other synthetic datasets.

Experiments and Results
First, we investigate various training curricula to train TCS and identify the best training strategy by evaluating BLEU scores on the test set of All-CS (§5.1). Next, we compare the output from TCS with synthetic CS text generated by other methods (§5.2). We approach this via language modeling (§5.2.1), human evaluations (§5.2.2) and two downstream tasks, Natural Language Inference and Sentiment Analysis, involving real CS text (§5.2.3). Apart from these tasks, we also present four different objective evaluation metrics to evaluate synthetic CS text: BERTScore, accuracy of a BERT-based classifier and two diversity scores (§5.3).

Improving Quality of TCS Outputs
Table 3 shows the importance of various training curricula in training TCS; these models are evaluated using BLEU (Papineni et al., 2002) scores computed with the ground-truth CS sentences for the test set of All-CS. We start with supervised pretraining of TCS using the two parallel datasets we have in hand, IITB and OpSub (System A). System A is then further finetuned with real CS text in All-CS to give System B. The improvements in BLEU scores moving from System O (trained only on All-CS) to System B illustrate the benefits of pretraining TCS using Hindi-English parallel text. Systems C and D in Table 3 use our synthetic CS datasets OpSub-LEX and OpSub-EMT, respectively. These systems are further finetuned on All-CS using both unsupervised and supervised training objectives to give $C_1$, $C_2$, $D_1$ and $D_2$, respectively. Comparing these four systems with System B shows the importance of using synthetic CS for pretraining. Further, comparing $C_1$ against $D_1$ and $C_2$ against $D_2$, we observe that OpSub-EMT is indeed a better choice for pretraining compared to OpSub-LEX. Also, supervised finetuning with All-CS is clearly superior to unsupervised finetuning. Henceforth, Systems $D_1$ and $D_2$ will be referred to as TCS (U) and TCS (S), respectively.
While having access to parallel CS data is an advantage, we argue that the benefits of having parallel data only marginally increase after a threshold. Figure 3 shows how BLEU scores vary when changing the amount of parallel CS text used to train $D_2$. We observe that BLEU increases substantially when we increase CS data from 1000 lines to 5000 lines, after which there is a trend of diminishing returns. We also find that $D_1$ (that uses the data in All-CS as non-parallel text) is as good as the model trained using 4000 lines of parallel text.

Language Modeling
Table 4 shows test perplexities using different training curricula and data generated using two prior approaches, VACS and SeqGAN. Sentences generated using TCS yield the largest reductions in test perplexities, compared to all other approaches.

Human Evaluation
We evaluated the quality of sentences generated by TCS using a human evaluation study. We sampled 150 sentences each, using both TCS (U) and TCS (S), starting from monolingual Hindi sentences in the evaluation sets of All-CS. The sentences were chosen such that they were consistent with the length distribution of All-CS. For the sake of comparison, corresponding to the above-mentioned 150 monolingual Hindi samples, we also chose 150 CS sentences each from All-CS-LEX and All-CS-EMT.
Along with the ground-truth CS sentences from All-CS, this resulted in a total of 750 sentences. These sentences were given to three linguistic experts in Hindi and they were asked to provide scores ranging between 1 and 5 (1 for worst, 5 for best) under three heads: "Syntactic correctness", "Semantic correctness" and "Naturalness". Table 5 shows that the sentences generated using TCS (S) and TCS (U) are far superior to the EMT and LEX sentences on all three criteria. TCS (S) is quite close in overall quality to the real sentences and TCS (U) fares worse, but only by a small margin.
Table 6 shows some illustrative examples of code-switching using TCS (U) on test samples. We also show some examples of code-switching within monolingual sentences from OpSub. We observe that the model is able to introduce long contiguous spans of English words (e.g., "meeting next week", "but it is clear", etc.). The model also displays the ability to meaningfully switch multiple times within the same sentence (e.g., "i love you very much", "but", "friend"). There are also interesting cases of English segments that appear to be ungrammatical but make sense in the CS context (e.g., "because i know main dish", etc.).

GLUECoS Benchmark
GLUECoS (Khanuja et al., 2020) is an evaluation benchmark spanning six natural language tasks for code-switched English-Hindi and English-Spanish data. The authors observe that M-BERT (Pires et al., 2019) consistently outperforms cross-lingual embedding techniques. Furthermore, pretraining M-BERT on small amounts of code-switched text improves its performance in most cases. For our evaluation, we select two tasks that require semantic understanding: Natural Language Inference (NLI) and Sentiment Analysis (SA). We sample 100K monolingual sentences from OpSub-HI and select corresponding LEX, EMT and TCS (S) sentences. M-BERT is then trained using the masked language modelling (MLM) objective on text from all 4 systems (including OpSub-HI) for 2 epochs. We also train M-BERT on 21K sentences from All-CS (real CS). Finally, these pretrained models are fine-tuned on the selected GLUECoS tasks. (More details are in Appendix G.) Table 7 lists the accuracies and F1 scores using different pretraining schemes for both NLI and sentiment analysis, respectively. Plain monolingual pretraining by itself leads to performance improvements on both tasks, presumably due to domain similarity between GLUECoS (movie scripts, social media, etc.) and OpSub. As mentioned in Khanuja et al. (2020), pretraining on CS text further improves performance for both NLI and SA. Among the synthetic methods, TCS (S) has consistently better scores than LEX and EMT. For SA, TCS (S) even outperforms pretraining on real CS text from All-CS.

Other Objective Evaluation Metrics
BERTScore. BERTScore (Zhang et al., 2020) is a recently proposed evaluation metric for text generation. Similarity scores are computed between each token in the candidate sentence and each token in the reference sentence, using contextual BERT embeddings (Devlin et al., 2018) of the tokens. We use this as an additional objective metric to evaluate the quality of the sentences generated using TCS. We use the real monolingual sentence as the reference and the generated CS sentence as the candidate, excluding sentences from TCS (S) and TCS (U) that exactly match the real sentence. Since our data is Hindi-English CS text, we use Multilingual BERT (M-BERT) (Pires et al., 2019) for high-quality multilingual representations.
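The token-level matching behind such a score can be sketched as follows. This is an illustrative simplification, not the evaluation code used here: it assumes each token already has an embedding vector (toy lists of floats stand in for M-BERT outputs), uses cosine similarity with greedy best-match, and omits refinements such as importance weighting. The function names are hypothetical.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(cand_emb, ref_emb):
    """Return (precision, recall, f1) from greedy token matching.

    cand_emb / ref_emb: lists of per-token embedding vectors for the
    candidate and reference sentences.
    """
    sim = [[_cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_emb))) for j in range(len(ref_emb))
    ) / len(ref_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]
    candidate = [[1.0, 0.0]]
    print(greedy_match_scores(candidate, reference))
```

In this setup the reference would be the monolingual Hindi sentence and the candidate the generated CS sentence, so a candidate that is itself monolingual trivially scores high, which is the bias that the filtering of purely monolingual outputs is meant to remove.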
Table 8 outlines our main results on the test set of All-CS. TCS sometimes generates purely monolingual sentences. This might unfairly tilt the scores in favour of TCS since the reference sentences are also monolingual. To discount for such biases, we remove sentences generated by TCS (U) and TCS (S) that are purely monolingual (Row label "Mono" in BERTScore). Sentences having <UNK> tokens (labeled "UNK") are also filtered out since these tokens are only generated by TCS for out-of-vocabulary words. "UNK & Mono" refers to applying both these filters.
EMT lines consistently show the worst performance, which is primarily due to the somewhat poor quality of translations involved in generating these lines (refer to Appendix B). After removing both purely monolingual sentences and sentences with <UNK> tokens, we observe that TCS (U) and TCS (S) yield the highest BERTScores, even outperforming the BERTScore on real data obtained from the Turkers.
BERT-based Classifier. In this evaluation, we use M-BERT (Pires et al., 2019) to build a classifier that distinguishes real CS sentences from synthetically generated ones (fake). When subject to examples from high-quality generators, the classifier should find it hard to tell apart real from fake samples. We add a fully connected layer over the M-BERT base architecture that takes the [CLS] token as its input to predict the probability of the sentence being real or fake. Fake sentences are drawn from the union of TCS (U), TCS (S), All-CS-LEX and All-CS-EMT. In order to alleviate the class imbalance problem, we oversample the real sentences by a factor of 5 and shuffle the data. The model converges after training for 5 epochs. We see in Table 8 that the classification accuracy of whether a sample is fake or not is lowest for the outputs from TCS among the different generation techniques.
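The data preparation for this classifier can be sketched in a few lines. The sketch only covers the oversampling and shuffling described above; the label convention (1 for real, 0 for fake), the fixed seed and the function name are assumptions made for illustration, and the M-BERT model itself is out of scope.

```python
import random


def build_classifier_data(real, fake, oversample=5, seed=0):
    """Label, oversample and shuffle sentences for real-vs-fake classification.

    real / fake: lists of sentences. Real sentences are repeated
    `oversample` times to reduce class imbalance.
    """
    data = [(sent, 1) for sent in real] * oversample  # 1 = real (assumed convention)
    data += [(sent, 0) for sent in fake]              # 0 = fake
    random.Random(seed).shuffle(data)
    return data


if __name__ == "__main__":
    real = ["real cs sentence 1", "real cs sentence 2"]
    fake = ["tcs-u out", "tcs-s out", "lex out", "emt out"]
    for sentence, label in build_classifier_data(real, fake):
        print(label, sentence)
```

Since the fake pool is the union of four generators, it is roughly four times the size of the real set, so a factor of 5 brings the two classes close to balance.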
Measuring Diversity. We are interested in finding out how diverse the predictions from TCS are. We propose a simple measure of diversity in the CS variants that is based on how effectively sentences can be compressed using the gzip utility. We considered using Byte Pair Encoding (BPE) (Gage, 1994) as a measure of data compression. However, BPE operates at the level of individual words. Two word sequences "w1 w2 w3" and "w3 w2 w1" would be identically compressed by a BPE tokenizer. We would ideally like to account for such diversity and not discard this information. gzip uses Lempel-Ziv coding (Ziv and Lempel, 1977) that considers substrings of characters during compression, thus allowing for diversity in word ordering to be captured.
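A compression-based score of this kind can be sketched as follows. This is an illustration rather than the script used for Table 8: it assumes Python's `gzip.compress` as a stand-in for the gzip command-line utility (byte counts may differ slightly from on-disk file sizes), and the function name `diversity_score` is hypothetical. It compresses every sentence on its own and sums the sizes, compresses all sentences joined into one buffer, and returns the difference of the two.

```python
import gzip


def diversity_score(sentences):
    """Return S1 - S2, where S1 is the summed size of individually
    gzip-compressed sentences and S2 is the size of the gzip-compressed
    concatenation. Smaller values indicate more diverse sentences."""
    s1 = sum(len(gzip.compress(s.encode("utf-8"))) for s in sentences)
    s2 = len(gzip.compress("\n".join(sentences).encode("utf-8")))
    return s1 - s2


if __name__ == "__main__":
    similar = ["main kal office jaa raha hoon"] * 4
    varied = [
        "main kal office jaa raha hoon",
        "she bought fresh mangoes today",
        "voh film bahut lambi thi yaar",
        "please call me after the meeting",
    ]
    print("similar:", diversity_score(similar))
    print("varied: ", diversity_score(varied))
```

Near-duplicate variants compress very well jointly, so the joint size is small and the score is large; dissimilar variants share few substrings, which keeps the joint size close to the summed size and the score small.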
Our diversity measure D is simply the following: For a given set of CS sentences, run gzip on each sentence individually and sum the resulting file sizes ($S_1$). Next, paste all the CS sentences into a single file and run gzip on it to get a file of size $S_2$. Then, $D = S_1 - S_2$. Smaller D scores indicate larger diversity. If the variants of a sentence are dissimilar to one another and hence very diverse, then $S_2$ would be large, thus leading to smaller values of D. Table 8 shows the diversity scores for different techniques. Both TCS (S) and TCS (U) have a higher diversity score compared to LEX and EMT. TCS (U) exceeds even the responses received via MTurk (Real) in diversity. We note here that diversity, by itself, is not necessarily a desirable trait. Our goal is to generate sentences that are diverse while being natural and semantically meaningful. The latter properties for text from TCS (S) and TCS (U) have already been verified in our human evaluation study.
Zhu et al. (2018) propose self-BLEU score as a metric to evaluate the diversity of generated data. However, using self-BLEU is slightly problematic in our setting as systems like LEX that switch words at random positions would result in low self-BLEU (indicating high diversity). This is indeed the case, as shown in Table 8: LEX and EMT give lower self-BLEU scores compared to TCS. However, note that the scores of the TCS models are comparable to that of real CS data.

A MTurk Task Details
Figure 4 depicts the portal used to collect data using Amazon's Mechanical Turk platform. The collection was done in two rounds, first for Movie-CS and then for Treebank-CS. With Treebank-CS, the sentences were first divided into chunks and the Turkers were provided with a sentence grouped into chunks as shown in Figure 4. They were required to switch at least one chunk in the sentence entirely to English so as to ensure a longer span of English words in the resulting CS sentence. A suggestion box converted transliterated Hindi words into Devanagari and also provided English suggestions to aid the workers in completing their task. With Movie-CS, since there were no chunk labels associated with the sentences, they were tokenized into words.
On MTurk, we selected workers with a HIT approval rate of 90% and location restricted to countries with significant numbers of Hindi speakers: Australia, Bahrain, Canada, India, Kuwait, Malaysia, Mauritius, Myanmar, Nepal, Netherlands, New Zealand, Oman, Pakistan, Qatar, Saudi Arabia, Singapore, South Africa, Sri Lanka, Thailand, United Arab Emirates, United Kingdom, United States of America. It was clearly specified in the guidelines that the task must be attempted by native Hindi speakers. Each response was manually checked before approving. Turkers were paid $0.15 for working on 5 sentences (which takes roughly 3-4 minutes). This amounts to $2.25-$3/hr, which is in the ballpark of a median hourly wage on MTurk of ~$2/hr (Hara et al., 2018).

B EMT lines generation
Following the methodology described by Bhat et al. (2016), we apply clause substitution to produce EMT sentences. To create OpSub-EMT, we start with the gold English sentence that contains either embedded sentence clauses (S) or subordinate clauses (SBAR) and swap one or more of them with their Hindi translations to produce an EMT synthetic CS sentence. Due to the lack of gold English translations available for All-CS sentences, we used the Google Translate API to first acquire their English translation. Many of the sentences in All-CS are shorter in length and do not contain the above-mentioned clauses. So, we also considered inverted declarative sentence clauses (SINV), inverted question clauses (SQ) and direct question clauses (SBARQ) in addition to S and SBAR. In case none of the clause-level tags were present, we considered the following phrase-level tags as switching candidates: Noun Phrase (NP), Verb Phrase (VP), Adjective Phrase (ADJP) and Adverb Phrase (ADVP). Owing to the shorter length and lack of clause-level tags, we switch only one tag per sentence for All-CS-EMT. The choice of which clause to switch was made empirically by observing what switches caused the resulting sentence to resemble a naturally occurring CS sentence. One can also use the toolkit provided by Rizvi et al. (2021) for generating EMT lines.
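The tag-priority rule for All-CS-EMT can be illustrated with a small sketch that lists switch candidates from a constituency parse. Several things here are assumptions made for illustration: the parse is a toy nested tuple `(label, child, ...)` with bare words as leaves (no part-of-speech layer), the root node is skipped so that the whole sentence is never a candidate, and the function names are hypothetical. The sketch only enumerates candidate spans; translating the chosen span and the empirical choice of which span to switch are not modelled.

```python
CLAUSE_TAGS = ("S", "SBAR", "SINV", "SQ", "SBARQ")
PHRASE_TAGS = ("NP", "VP", "ADJP", "ADVP")


def subtrees(tree):
    """Yield every labelled node of a (label, child, ...) tuple tree, pre-order."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)


def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


def switch_candidates(tree):
    """Prefer clause-level spans; fall back to phrase-level spans."""
    # Assumption: skip the root so the full sentence is not a candidate.
    nodes = [node for child in tree[1:] for node in subtrees(child)]
    chosen = [n for n in nodes if n[0] in CLAUSE_TAGS]
    if not chosen:
        chosen = [n for n in nodes if n[0] in PHRASE_TAGS]
    return [" ".join(leaves(n)) for n in chosen]


if __name__ == "__main__":
    parse = ("S", ("NP", "I"),
             ("VP", "know", ("SBAR", "that", ("S", ("NP", "she"), ("VP", "left")))))
    print(switch_candidates(parse))
```

For a short sentence with no embedded clause, the function falls back to the NP, VP, ADJP and ADVP spans, mirroring the fallback described above.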

C Implementation Details: TCS
As an initialisation step, we learn the token embeddings (Mikolov et al., 2013) on the same corpus using skipgram. The embedding dimension was set to be 256 and the encoder-decoder layers share these lookup tables. Adam optimiser with a learning rate of 0.0001 was used to train the model. Validation BLEU scores on (HI → ENG/CS) translations and (EN → HI → EN) reconstructions were used as metrics to save the best model for TCS (S) and TCS (U), respectively.

D Human Evaluation
The 150 samples evaluated in Table 5 were taken entirely from the test/validation splits. We undertook an alternate human evaluation experiment involving 100 real CS sentences and their corresponding CS sentences generated using LEX, EMT, TCS (U) and TCS (S). Of these 100 sentences, 40 came entirely from the test and validation splits, and the remaining 60 are training sentences that we filtered to make sure that sentences generated by TCS (S) and TCS (U) never exactly matched the real CS sentence. Table 9 reports the evaluations on the complete set of 100 sentences from 5 datasets. We observe that the trend remains exactly the same as in Table 5, with TCS (S) being very close to real CS sentences in its evaluation and TCS (U) trailing behind TCS (S).

E Language Model Training
The AWD-LSTM language model was trained for 100 epochs with a batch size of 80 and a sequence length of 70 in each batch. The learning rate was set to 30. The model uses NT-ASGD, a variant of the averaged stochastic gradient method, to update the weights. The mix-review decay parameter was set to 0.9. This implies that the fraction of pretraining batches being considered at the end of n epochs is 0.9^n, starting from all batches initially. Two decay coefficients, {0.8, 0.9}, were tested, and 0.9 was chosen based on validation perplexities.
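The decay schedule can be made concrete with a short sketch. It assumes batches are held in plain Python lists and that the retained pretraining batches are drawn uniformly at random each epoch; the function names and the sampling and shuffling details are illustrative assumptions, not the released training code.

```python
import random

def pretraining_fraction(epoch, decay=0.9):
    """Fraction of pretraining batches kept after `epoch` epochs (1.0 at epoch 0)."""
    return decay ** epoch

def batches_for_epoch(pretrain_batches, finetune_batches, epoch, decay=0.9, seed=0):
    """Mix a decayed sample of pretraining batches with all fine-tuning batches."""
    rng = random.Random(seed + epoch)
    keep = int(len(pretrain_batches) * pretraining_fraction(epoch, decay))
    mixed = rng.sample(pretrain_batches, keep) + list(finetune_batches)
    rng.shuffle(mixed)
    return mixed

pretrain = list(range(100))          # stand-ins for pretraining batches
finetune = ["cs-0", "cs-1", "cs-2"]  # stand-ins for All-CS batches

epoch0 = batches_for_epoch(pretrain, finetune, epoch=0)
epoch10 = batches_for_epoch(pretrain, finetune, epoch=10)
```

The fine-tuning batches are present in every epoch, while the pretraining share shrinks geometrically, which matches the description that the All-CS sentences remain constant and the pretraining data is decayed.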

F Code-switching examples
The sentences in Table 10 have been generated on the test and validation splits of All-CS as well as the OpSub dataset. Overall, they show how the model is able to retain context over long sentences (e.g., "and social sectors") and perform meaningful switching over large spans of words (e.g., "old conversation writer media", "regularly security practices"). We also note that, at times, the model uses words that differ from the natural English translations of the sentence but are appropriate within the context of a CS sentence (e.g., the use of "manage" instead of "manageable").

G Details of GLUECoS Experiments
For masked language modeling (MLM), we select the default parameters for the learning rate (5e-5), batch masking probability (0.15) and sequence length (512). The models are trained for 2 epochs with a batch size of 4 and 10 gradient accumulation steps. For task-specific fine-tuning, we rely on the official training scripts provided by the GLUECoS repository. We train the models with 5 seeds (0, 1, 2, 3 and 4) and report the mean and standard deviation of Accuracy and F1 for NLI and Sentiment Analysis, respectively.
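Two small computations are implicit in this setup: the effective batch size under gradient accumulation and the aggregation of scores over seeds. The sketch below assumes a single device and uses the sample standard deviation; neither choice is stated in the text, and the helper names are illustrative.

```python
from statistics import mean, stdev

BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 10
SEEDS = [0, 1, 2, 3, 4]

def effective_batch_size(batch_size, accum_steps):
    """Number of examples contributing to each optimizer step (single device assumed)."""
    return batch_size * accum_steps

def summarize(scores):
    """Mean and sample standard deviation of one score per seed."""
    return mean(scores), stdev(scores)

examples_per_update = effective_batch_size(BATCH_SIZE, GRAD_ACCUM_STEPS)
```

`summarize` expects one score per entry in `SEEDS`; swapping `stdev` for `pstdev` would give the population standard deviation instead.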

H Additional Dataset and Experiments
Dataset. The additional corpus on which experiments were performed is OPUS-100 (Zhang et al., 2020), which was sampled from the original OPUS corpus (Tiedemann, 2012). The primary difference between OpSub and OPUS-100 is that OpSub does not have manual Hindi translations of its sentences and requires the use of an external API such as Google Translate for translation, whereas OPUS-100 has manually annotated sentences as part of the corpus. The sources of OPUS-100 range from movie subtitles to GNOME documentation to the Bible. We extract 340K sentences from the OPUS-100 corpus after thresholding on length (5-15). We offer this comparison of systems trained on OpSub and OPUS-100 to show how our models fare when using two datasets that are very different in their composition.
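The length thresholding step can be sketched as a simple filter. The sketch assumes that length is counted in whitespace-separated tokens, that the bounds are inclusive, and that the threshold is applied to the Hindi side of each pair; the text does not specify these details.

```python
def within_length(sentence, min_len=5, max_len=15):
    """True if the sentence has between min_len and max_len whitespace tokens."""
    n_tokens = len(sentence.split())
    return min_len <= n_tokens <= max_len

def filter_pairs(pairs, min_len=5, max_len=15):
    """Keep (hindi, english) pairs whose Hindi side passes the length threshold."""
    return [(hi, en) for hi, en in pairs if within_length(hi, min_len, max_len)]

toy_pairs = [
    ("w1 w2 w3 w4 w5", "a five token sentence here"),
    ("w1 w2", "too short"),
]
kept = filter_pairs(toy_pairs)
```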
LEX lines generation. Generation of LEX lines is straightforward and requires only a bilingual lexicon. For each monolingual Hindi sentence in OPUS-100, we generate ~5 LEX sentences, resulting in OPUS-100-LEX (to roughly match the size of OpSub-LEX).
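A minimal sketch of lexicon-based substitution is given below. It assumes that each Hindi token found in the lexicon is swapped independently with a fixed probability; the parameter `switch_prob`, the toy lexicon and the function name are illustrative assumptions rather than the procedure used to build OPUS-100-LEX.

```python
import random

def lex_variants(sentence, lexicon, n_variants=5, switch_prob=0.5, seed=0):
    """Generate synthetic CS variants by swapping Hindi tokens for lexicon entries."""
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for _ in range(n_variants):
        switched = [
            lexicon[tok] if tok in lexicon and rng.random() < switch_prob else tok
            for tok in tokens
        ]
        variants.append(" ".join(switched))
    return variants

# Toy bilingual lexicon (Hindi -> English), for illustration only.
toy_lexicon = {"किताब": "book", "अच्छी": "good"}
hindi = "यह किताब अच्छी है"

sampled = lex_variants(hindi, toy_lexicon)                            # 5 random variants
all_switched = lex_variants(hindi, toy_lexicon, 1, switch_prob=1.0)   # swap every entry
```

Tokens that are missing from the lexicon are always left in Hindi, so every variant keeps the token order and length of the source sentence.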
EMT lines generation. For the generation of EMT lines, we have two strategies depending on the availability of tools (parsers, translation services, aligners, etc.). The first strategy requires a translation service (either in-house or publicly available). We substitute the embedded clauses from parse trees of English sentences with their Hindi translations. This strategy does not require a parallel Hindi corpus and has been previously used for generating OpSub-EMT and All-CS-EMT (described in detail in Appendix B).
The second strategy, which is used to generate OPUS-100-EMT, requires a parallel corpus, a constituency parser for English and a word aligner between parallel sentences. OPUS-100 sentences are aligned using SimAlign (Jalili Sabet et al., 2020), and embedded clauses from parse trees of English sentences are replaced by the corresponding Hindi clauses identified through the word alignments. Here again, for each monolingual Hindi sentence in OPUS-100, we generate ~5 EMT sentences (using the second strategy), resulting in OPUS-100-EMT.
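The alignment-based replacement in the second strategy can be sketched as follows. The sketch assumes that word alignments are available as (English index, Hindi index) pairs, as a word aligner would produce, and that the clause to switch is given as a token span from the parse. The function `substitute_span` and the hand-written alignment are illustrative; no aligner is actually called.

```python
def substitute_span(en_tokens, hi_tokens, alignment, start, end):
    """Replace English tokens in [start, end) with the Hindi tokens aligned to them."""
    hi_indices = sorted({j for i, j in alignment if start <= i < end})
    if not hi_indices:
        return list(en_tokens)  # nothing aligned: leave the sentence unchanged
    hi_clause = [hi_tokens[j] for j in hi_indices]
    return list(en_tokens[:start]) + hi_clause + list(en_tokens[end:])

en = ["I", "know", "that", "he", "left"]
hi = ["मुझे", "पता", "है", "कि", "वह", "चला", "गया"]
# Hand-written toy alignment of (English index, Hindi index) pairs.
toy_alignment = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5), (4, 6)]

# Switch the subordinate clause "that he left" (English tokens 2..4).
emt_tokens = substitute_span(en, hi, toy_alignment, 2, 5)
```

Keeping the aligned Hindi tokens in their Hindi order is a design choice of this sketch; taking the full contiguous Hindi range between the smallest and largest aligned index would be an alternative.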
Curriculum Training Experiments. Table 11 provides a walkthrough of systems using various training curricula that are evaluated for two different choices of datasets, OpSub and OPUS-100, which differ in how the EMT lines are generated. The models are evaluated using BLEU (Papineni et al., 2002) scores computed on the test set of All-CS. The vocabulary is generated by combining the train sets of all datasets to be used in the curricula. Its size is 126,576 when X = OpSub and 164,350 when X = OPUS-100 (OpSub shows a higher overlap in vocabulary with All-CS compared to OPUS-100).
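Building a combined vocabulary and measuring its overlap with another corpus can be sketched in a few lines. The sketch assumes whitespace tokenization and toy data; the actual tokenization behind the reported vocabulary sizes is not specified here.

```python
def build_vocab(*train_sets):
    """Union of all whitespace tokens across the given training sets."""
    vocab = set()
    for sentences in train_sets:
        for sentence in sentences:
            vocab.update(sentence.split())
    return vocab

def overlap(vocab_a, vocab_b):
    """Number of token types shared by two vocabularies."""
    return len(vocab_a & vocab_b)

toy_pretrain = ["the movie was good", "the book was long"]
toy_cs = ["movie बहुत good थी"]

combined = build_vocab(toy_pretrain, toy_cs)
shared = overlap(build_vocab(toy_pretrain), build_vocab(toy_cs))
```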
The marginal difference in System O for OpSub and OPUS-100 is attributed to differences in the size of the vocabulary. OpSub, being conversational in nature, is a better pretraining corpus than OPUS-100, as seen from System A; the sources of the latter include GNOME documentation and the Bible, apart from movie subtitles.
Language Modelling Experiments. Table 13 shows results from LM experiments (using the same setup as in Section 5.2.1). The values for TCS (S) and TCS (U) have been reproduced here for ease of comparison. (Note that TCS (SIMALIGN) does not perform as well as the other models since the sentences for training the language model are generated on OpSub for all the models here, but TCS (SIMALIGN) has been trained on OPUS-100.)
Evaluation Metrics. Table 14 shows the results of the three objective evaluation metrics on the additional TCS models. In comparison with the results in Table 8, we observe that TCS (LEX) and TCS (SIMALIGN) perform comparably to TCS (S) and TCS (U) on all metrics.

Figure 1: Model architecture. Each loss term, along with all the network components it modifies, is shown. During unsupervised training with non-parallel text, L_DAE and L_BT are optimized, while for supervised training with parallel text, L_DAE and L_CE are optimized.

Figure 2: Distribution across overall sentence lengths and distribution across lengths of continuous English spans in Movie-CS and Treebank-CS.

Figure 3: Variation of BLEU score with amount of All-CS parallel training data.
We use text generated by our model to train a language model (LM) and evaluate perplexities on the test set of All-CS to show how closely sentences from TCS mimic real CS text. We use a state-of-the-art RNNLM, AWD-LSTM-LM (Merity et al., 2018), as a black-box LM and only experiment with different training datasets. The model uses three LSTM layers of 1200 hidden units with weight tying and 300-dimensional word embeddings. In initial runs, we trained our language model on the large parallel/synthetic CS datasets and fine-tuned it on the All-CS data. However, this training strategy was prone to overfitting on the All-CS data. To counter this problem of forgetting during the pretraining and fine-tuning steps, we adopted the Mix-review strategy proposed by He et al. (2021). The training sentences from All-CS remain constant through the epochs, and the amount of pretraining data is exponentially decayed with each epoch. This greatly alleviates the forgetting problem in our model and leads to better overall perplexities. Additional details about these LMs are provided in Appendix E.
(a) BERTScores on the test split of All-CS. Each row corresponds to a different data filter. The numbers in parentheses denote the number of sentences in the data after filtering.
(b) Accuracies from the classifier for samples generated by various methods as being fake. |Sentences| refers to the size of the dataset for each system. TCS models have the lowest accuracy among synthetic methods.
(c) Diversity scores for different techniques using Gzip and Self-BLEU based diversity measures.

Figure 4: A snapshot of the web interface used to collect Movie-CS and Treebank-CS data via Amazon Mechanical Turk.
Requires a parser along with parallel data; alignments can be generated using SimAlign.

Table 2: Key statistics of CS datasets.
CS Datasets. As mentioned in Section 3.1, we use two simple techniques, LEX and EMT, to generate synthetic CS text, which in turn is used to train TCS in an unsupervised training phase. For each Hindi monolingual sentence in OpSub, we generate two LEX and two EMT synthetic CS sentences, giving us OpSub-LEX and OpSub-EMT, respectively. We also generate five LEX and five EMT lines for each monolingual sentence in All-CS. In order to generate EMT lines, we first translate the monolingual Hindi sentences in All-CS to English using Google Translate and then follow the EMT generation scheme. This results in two datasets, All-CS-LEX and All-CS-EMT, which appear in later evaluations. (Appendix B contains more details about EMT applied to OPUS and All-CS.)

Table 5: Mean and standard deviation of scores (between 1 and 5) from 3 annotators for 150 samples from 5 datasets.

Table 7: GLUECoS Evaluation: Mean and standard deviation of scores after evaluating on 5 seeds. Baseline denotes the M-BERT model without any MLM pretraining.

Table 9: Mean and standard deviation of scores (between 1 and 5) from 3 annotators for 100 samples from 5 datasets.

Table 11: BLEU score on (HI → CS) for different curricula measured on All-CS (test). X - Y represents starting with model X and further training using dataset Y. Values from Table 3 are replicated here for ease of comparison.

Table 12: Use cases for different TCS models.

Table 13: Test perplexities on OpSub and All-CS using different pretraining datasets.

Table 14: Evaluation metrics for the additional TCS models. Please see Table 8 for a comparison with other models.