CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics

Code-mixing is ubiquitous in multilingual societies, which makes it vital to build models for code-mixed data to power human language interfaces. Existing multilingual transformer models trained on pure corpora lack the ability to intermix words of one language into the structure of another. These models are also not robust to orthographic variations. We propose CoMix, a pretraining approach to improve the representation of code-mixed data in transformer models by incorporating phonetic signals, a modified attention mechanism, and weak supervision guided generation by parts-of-speech constraints. We show that CoMix improves performance across four code-mixed tasks: machine translation, sequence classification, named entity recognition (NER), and abstractive summarization. It also achieves new SOTA performance for English-Hinglish translation and NER on the LINCE Leaderboard and provides better generalization on out-of-domain translation. Motivated by variations in human annotations, we also propose a new family of metrics based on phonetics and demonstrate that the phonetic variant of BLEU correlates better with human judgement than BLEU on code-mixed text.


Introduction
Code-mixing, i.e., embedding linguistic units of one language (embedded language L_E) into a sentence grammatically structured as per another language (matrix language L_M), is common in multilingual communities. Growing mobile penetration coupled with the increased adoption of informal conversational interfaces is leading to a further rise in such communication. Currently, over 20% of user generated content from South Asia and parts of Europe is code-mixed (Choudhury et al., 2019). Hinglish (code-mixed Hindi-English) has nearly 350 million speakers (GTS, 2019), making it one of the most widely spoken languages. Recent literature suggests that multilingual users associate code-mixing with cultural affinity and prefer chatbots that can code-mix (Bawa et al., 2020). Code-mixed modeling is, thus, a foundational prerequisite for linguistic systems targeted towards such users.
Transformer models such as BART and BERT (Devlin et al., 2018) have been successful across various NLP tasks. These models could readily capture code-mixing semantics if a large corpus were available for training. Unfortunately, that is not true for most code-mixed languages. Existing approaches rely on learning from a parallel corpus of embedded and matrix languages (e.g., English and Hindi for Hinglish). Recent work (Chen et al., 2022), however, shows that multilingual models such as mBERT trained on monolingual sources fail to effectively interleave words from typologically diverse languages.
Adapting transformers to code-mixed data requires addressing the following challenges:

1. Divergent grammatical structure. For code-mixed languages such as Hinglish, where L_E and L_M have different Parts-of-Speech (POS) patterns, models trained on monolingual corpora do not yield similar representations for equivalent words across languages, which is needed to facilitate interleaving of L_E and L_M words. Linguistic theories propose certain syntactic constraints for code-mixed generation (Poplack, 1980), but these are not usually incorporated into the modeling.

2. Code-mixing diversity. Code-mixed languages also exhibit wide diversity in the degree of code-mixing (e.g., the ratio of L_E to L_M words). Fig 1 shows multiple Hinglish constructions for a given sentence in English. Accounting for this variation in code-mixing is necessary for high-fidelity modeling.

3. Orthographic variations. The informal nature of code-mixed interactions and the lack of standardized transliteration rules lead to users employing ad hoc phonological rules while writing code-mixed content. Fig 1 shows Hinglish sentences with similar sounding words and their variations ("kis", "kys").

Contributions. In this paper, we adapt transformer models for code-mixed data by addressing the above challenges. To ensure applicability to multiple downstream tasks, we focus on pretraining.

1. We propose CoMix, a set of generic pretraining methods to improve code-mixed data representations that can be applied to any transformer model, assuming the availability of POS-tagger and phonetic transcription tools.
These include: (a) a Domain Knowledge-based Guided Attention (DKGA) mechanism that facilitates intermixing of linguistic units of L_E into the structure of L_M through a modified attention function; (b) Weakly Supervised Generation (WSG) that generates code-mixed data for training in a controllable fashion driven by linguistic constraints; and (c) inclusion of phonetic signals to align embeddings of similar-sounding words with different orthographic representations.

2. We instantiate CoMix pretraining for BART and BERT and demonstrate efficacy on multiple downstream NLP tasks, namely Machine Translation, NER, Sequence Classification, and Abstractive Summarization, with relative improvements of up to 22%. CoMixBART and CoMixBERT achieve new state-of-the-art (SOTA) results for English-Hinglish translation and Hinglish NER tasks on the LINCE Leaderboard (Aguilar et al., 2020), beating the previous best mT5 (Jawahar et al., 2021) and XLM-R (Winata et al., 2021) models, despite having less than 0.5x and 0.1x of their model sizes respectively.

3. We evaluate out-of-domain code-mixed translation performance on two test sets, one created in-house and the other adapted from the GupShup corpus (Mehnaz et al., 2021), and show that CoMix generalizes better than other models. To the best of our knowledge, this is the first such evaluation for English-Hinglish translation. We hope our benchmark will assist the community in improving out-of-domain generalization of code-mixed translation, a critical need for low-resource regimes.

4. To address the limitations of existing metrics in handling orthographic variations in code-mixed data, we propose a new family of natural language generation (NLG) metrics based on phonetic adaptation of existing metrics. We observe that PhoBLEU, the phonetic variant of BLEU, is better aligned with human judgement (+0.10 to +0.15 Pearson correlation) than BLEU on Hinglish.

Related Work
Multilingual and Code-Mixed NLP. Recent advances in large multilingual pre-trained models such as mBERT (Devlin et al., 2018) and mBART have led to significant gains on many multilingual NLP tasks. However, evaluation of these models on code-mixed content for machine translation (Chen et al., 2022), sequence classification (Patwa et al., 2020), summarization (Mehnaz et al., 2021) and other tasks (Aguilar et al., 2020) points to their inability to intermix words from two languages, since these are pretrained on monolingual text without any language alternation. Our CoMix approach encourages the model to learn representations that allow appropriate embedding of words from one language into the structure of another via domain-knowledge guided attention and through weakly supervised code-mixed generation. Prior work (Sanad Zaki Rizvi et al., 2021) focuses on generating synthetic code-mixed data using constraints from linguistic theories, followed by learning. We perform joint generation and learning using pretrained models, which has the dual benefit of data generation and improving model representations, and has been shown to be effective for anomaly detection in medical images (Li et al., 2019).

Incorporating Phonetics in Language Modeling. Combined modeling of phonemes and text has been a topic of recent interest and has contributed to improving robustness to ASR errors (Sundararaman et al., 2021). In the code-mixed domain, Soto and Hirschberg (2019) engineered spelling and pronunciation features by calculating the distance between pairs of cognate words to improve the perplexity of English-Spanish models. We also incorporate phonetic signals to learn robust representations.

Sentence Evaluation Metrics. Automated sentence evaluation metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) for comparison of unstructured sentences led to rapid innovation in NLP by facilitating ready evaluation of NLP systems against ground truth without additional human annotations.
However, these metrics are unreliable for code-mixed content as they disregard widely prevalent orthographic variations. We propose a new family of metrics to address this gap.

CoMix Approach
Given a corpus of sentence pairs from L_M and L_E, our goal is to adapt transformer models such as BART and BERT to overcome the key challenges in modeling code-mixed data. To ensure applicability to multiple downstream tasks, we focus on the pretraining phase. We assume access to POS tagging and phonetic transcription tools, which is true for many languages (see Section 7). Below we summarize our approach for each of the challenges.

P1 - Divergence in POS structure of L_E and L_M: To enable transformer models to extrapolate from L_E and L_M to code-mixed data, we rely on linguistic constraints. We observe that coarse groups of POS labels of concepts are preserved across translation (see Section 7) and that code-mixed sequences often retain the POS structure of the L_M sequence. Assuming access to POS labels, the above constraints provide token-level correspondence for parallel training sentences, which can be used to augment the transformer attention mechanism and lead to representations that facilitate accurate interleaving of L_E and L_M words. [Section 3.1]

P2 - Variations in the level of code-mixing: To accurately model variations in code-mixed data such as the mixing propensity, we propose a weakly supervised approach that combines POS constraints with a control on the code-mixing probability to generate code-mixed sequences from parallel monolingual corpora for training. [Section 3.2]

P3 - Orthographic variations: To align similar-sounding words with orthographic variations, we incorporate the phonetic signal as an additional input channel. We modify the transformer architecture to include two multi-head self-attention layers, one each for the text and phoneme channels. [Section 3.3]

Domain Knowledge Guided Attention
Attention (Vaswani et al., 2017) is an essential mechanism of the transformer architecture that converts an input sequence into a latent encoding using representational vectors formed from the input, i.e., queries, keys and values, to determine the importance of each portion of the input while decoding. Let X and Z denote the sequence of input tokens and the associated representations. Further, let Q, K, V denote the sequences of query, key and value vectors derived from appropriate projections of Z. In this case, attention is typically defined in terms of the scaled dot-product of Q and K.
To incorporate domain knowledge, we propose augmenting attention with an additional independent term f_DKGA(X) defined on the input:

Attention(Q, K, V) = softmax(Q K^T / √d_k + f_DKGA(X)) V,   (1)

where d_k is the dimension of the query and key vectors. While the notion of DKGA is general, to aid with code-mixing, we focus on linguistic constraints. We construct three groups of POS labels (see A.1) that are preserved during translation (see Section 7). Let X denote the concatenation of the parallel monolingual sentences, i.e., X = X_M ∥ X_E, where X_M and X_E are sentences in L_M and L_E respectively. Let POS_GP(x) denote the group of the POS label of a token x. The linguistic constraints require that aligned token pairs from X_M and X_E belong to the same POS label group. Hence, for matrix tokens, we restrict attention to compatible embedded tokens:

f_DKGA(x_i, x_j) = 0 if POS_GP(x_i) = POS_GP(x_j), and −∞ otherwise, for x_i ∈ X_M, x_j ∈ X_E.   (2)

Note that the above asymmetric choice is motivated by the fact that code-mixed sentences retain the POS structure of L_M. Fig 2 shows how tokens from X_E are selected using the above strategy, which, coupled with self-attention, ensures learning of representations that facilitate better intermixing of L_E tokens into the L_M structure. See A.4.9 for an example. Instead of a hard constraint on POS-label preservation, the DKGA function can also be modified to incorporate soft transition probabilities of POS labels during an L_M to L_E translation, which could be learned from parallel sentence pairs with token-level alignment. We can also extend DKGA to include other sources for attention guidance, e.g., domain ontologies and word alignment, and also for cross-attention.

Pretraining with DKGA. We modify all self-attention blocks in the encoder with DKGA and pretrain CoMixBERT with the masked language modeling (MLM) objective (Devlin et al., 2018) and CoMixBART with the denoising objective (text infilling with span=1). We mask the tokens of X_M for which we want DKGA to guide attention to embedded words (e.g., in Fig 2, "kapde" will be masked).
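The hard-constraint variant of DKGA amounts to scaled dot-product attention with an additive pre-softmax term. The NumPy sketch below is illustrative only: the function names, integer POS-group ids, and the use of -1e9 as a stand-in for −∞ are our assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dkga_attention(Q, K, V, pos_groups, n_matrix):
    """Scaled dot-product attention with a DKGA term added before the
    softmax (Eq. 1).

    pos_groups: coarse POS group id per token of X = X_M || X_E
    n_matrix:   number of matrix-language (X_M) tokens at the front of X
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)

    # f_DKGA (Eq. 2, hard-constraint variant): a matrix token may attend
    # to an embedded token only if both share the same coarse POS group.
    n = len(pos_groups)
    f = np.zeros((n, n))
    for i in range(n_matrix):            # queries from X_M
        for j in range(n_matrix, n):     # keys from X_E
            if pos_groups[i] != pos_groups[j]:
                f[i, j] = -1e9           # effectively masks the pair
    return softmax(scores + f) @ V
```

Note the asymmetry: only matrix-token queries are constrained, so embedded tokens still attend freely, matching the observation that code-mixed output retains the L_M structure.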

Weakly Supervised Generation (WSG)
Lack of large code-mixed corpora poses a big challenge for code-mixed modeling. Hence, to facilitate direct training and allow control over desired properties such as the level of code-mixing, we propose a weakly supervised mechanism for code-mixed generation using any transformer-based encoder-decoder model. The key idea is to nudge a pretrained multilingual model to code-mix by restricting the search space in the autoregressive step to a small suitable set of L_E tokens, exploiting the fact that tokens with similar meaning and POS labels in L_E are likely to replace the L_M token.
Fig 3 shows the generative mechanism and Equation 3 the corresponding greedy step. At each auto-regressive step, we first determine the choice to code-mix, denoted by M_i, sampled based on the mixing probability p_Mix of the POS label of the token x_i ∈ X_M and an overall code-mixing level τ. The vocabulary search space, denoted by V_i, is chosen as the POS-compatible words (same POS group as that of x_i) from X_E when code-mixing, and the entire vocabulary V_all otherwise. The next token is then generated via greedy search with teacher forcing:

y_i = argmax_{y ∈ V_i} P(y | y_1, y_2, ..., y_{i−1}, X_M).   (3)

In case of code-mixing, the target y_i is set to the predicted value ŷ_i, and to x_i otherwise. We train the model with negative log-likelihood loss using X_M as the input and the constructed sequence Y as the prediction. Due to the self-dependency in WSG, the efficacy depends on whether the underlying model can correctly order the tokens in V_i, which is a reasonable expectation from SOTA pretrained multilingual models. In our experiments, we set τ = 1 and p_Mix to 1 for the POS groups {NOUN, PROPN, ADJ, ADV, VERB}, where code-mixing is frequent, and 0 for the rest; in future work, p_Mix could be learned from a small code-mixed corpus. The proposed WSG mechanism can also be applied to encoder-only models such as BERT by considering a similar restriction of the vocabulary set V_i at the last layer.
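One WSG target-construction pass can be sketched as follows. All names here (wsg_targets, step_logits, p_mix) are hypothetical, and tokens are represented as vocabulary ids; a real implementation would query the pretrained decoder for the logits at each autoregressive step.

```python
import numpy as np

rng = np.random.default_rng(0)

def wsg_targets(x_m, pos_group, x_e, x_e_groups, step_logits,
                p_mix, tau=1.0):
    """Construct WSG training targets in one greedy pass (teacher forcing).

    x_m:         matrix-language token ids [x_1 .. x_n]
    pos_group:   coarse POS group of each x_i
    x_e:         embedded-language token ids available for mixing
    step_logits: callable i -> logits over the full vocabulary
                 (the pretrained model's prediction for position i)
    p_mix:       dict mapping POS group -> mixing probability
    """
    targets = []
    for i, (tok, grp) in enumerate(zip(x_m, pos_group)):
        logits = step_logits(i)
        mix = rng.random() < tau * p_mix.get(grp, 0.0)   # sample M_i
        cand = [e for e, g in zip(x_e, x_e_groups) if g == grp]
        if mix and cand:
            # restrict the argmax to POS-compatible embedded tokens (V_i)
            y_i = max(cand, key=lambda t: logits[t])
        else:
            y_i = tok          # keep the matrix-language token
        targets.append(y_i)
    return targets
```

The constructed target sequence then serves as the supervision signal for the negative log-likelihood loss with X_M as input.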

Mixing Phonetic Signal
Given a text sequence X, let X_Ph denote the corresponding phonetic sequence. To incorporate both signals, we replace the multi-head self-attention layer in the transformer encoder layer with two multi-head self-attention layers, one each for the text and phoneme channels. The text sequence shares feed-forward layers with the phonetic sequence, as shown in Fig 4, since we want phonetic representations to be in the same space as text representations. To keep the number of parameters in check, we add the phonetic part of the encoder to only alternate encoder layers. Our decoder uses the concatenated sequence of contextual embeddings from X and X_Ph as keys and values for cross-attention.

Pretraining with Phonetics. We pretrain CoMixBERT for phonetics with the MLM objective (as in BERT) and CoMixBART with the denoising objective (text infilling with span length 1).
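The dual-channel encoder layer can be sketched as below. This is a simplified NumPy illustration under our own assumptions: residual connections, LayerNorm, multi-head splitting, and per-projection biases are omitted, and all parameter names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attn(Z, W_q, W_k, W_v):
    """Single-head self-attention over a channel's representations Z."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def dual_channel_layer(Z_text, Z_ph, attn_text, attn_ph, W1, b1, W2, b2):
    """One CoMix-style encoder layer: a separate self-attention per
    channel, with a single feed-forward block shared by both channels so
    their representations land in the same space."""
    ffn = lambda H: np.maximum(H @ W1 + b1, 0) @ W2 + b2   # shared weights
    H_text = ffn(self_attn(Z_text, *attn_text))
    H_ph = ffn(self_attn(Z_ph, *attn_ph))
    # the decoder cross-attends to the concatenated memory of both channels
    memory = np.concatenate([H_text, H_ph], axis=0)
    return H_text, H_ph, memory
```

Sharing W1, W2 across channels is the key design point: only the attention parameters are channel-specific, which also limits parameter growth.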

Phonetic Sentence Comparison Metrics
Lack of standardized transliteration rules is a key challenge not only for modeling but also for evaluating code-mixed or multilingual NLG tasks, e.g., English to Hinglish translation. Human annotators employ orthographic variations and are also inconsistent in the use of punctuation and upper-lower casing, as shown in Fig 1. Most NLG evaluation metrics, such as BLEU, do not account for these variations, which leads to pessimistic and inaccurate estimates of the performance of NLG systems. To address this gap, we propose a new family of metrics based on the phonetic representation. Let s(·, ·) be any metric such as BLEU and ROUGE (Banerjee and Lavie, 2005) that facilitates comparison of a word sequence against a reference one. Given a pair of sentences (X, Y), we define the phonetic metric as Pho-s(X, Y) = s(X_Ph, Y_Ph), where X_Ph, Y_Ph are the phonetic sequences. In this paper, we limit our focus to PhoBLEU and present observations in Sec 6.6.

Table 1 lists the four downstream tasks, baselines, SOTA models and metrics used in the evaluation. For translation on the HooD dataset (Sec 5.2), we also include an Echo baseline, which just passes the input sequence as output and helps measure the contribution of input-output overlap to the final performance.
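The Pho-s construction lifts any sequence metric into phonetic space. The sketch below uses a toy transcriber and a stand-in unigram-precision metric, both of which are our simplifications; a real setup would use an actual phonetic transcription tool and the real BLEU/ROUGE implementations.

```python
# Toy phonetic transcriber that collapses a couple of Hinglish spelling
# variants; the rules here are illustrative only, not a real phonetic model.
def phonetic(tokens):
    table = str.maketrans({'y': 'i', 'z': 'j'})
    return [t.lower().translate(table) for t in tokens]

def unigram_precision(hyp, ref):
    """Stand-in for a metric s(., .) such as BLEU or ROUGE."""
    return sum(t in ref for t in hyp) / max(len(hyp), 1)

def pho_metric(s, hyp, ref):
    """Pho-s(X, Y) = s(X_Ph, Y_Ph): apply any sequence metric s in
    phonetic space so orthographic variants count as matches."""
    return s(phonetic(hyp), phonetic(ref))
```

For the paper's example pair ("kys", "kis"), the surface-form metric scores 0 while the phonetic variant scores a full match, which is exactly the behavior PhoBLEU is designed to capture.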

Downstream Tasks, Baselines and Metrics
Previous studies indicate that IndicBART (Dabre et al., 2021) and IndicBERT (Kakwani et al., 2020) are competitive with mBART and mBERT respectively on Indic languages. Further, since we initialize our models CoMixBART and CoMixBERT with weights from IndicBART and IndicBERT, we consider these as strong baselines for our evaluation of generative and classification tasks respectively.

Datasets for Downstream Tasks
We evaluate on the LINCE Eng-Hinglish dataset for translation (Chen et al., 2022), the SemEval-2020 Hinglish Sentimix dataset for sequence classification (Patwa et al., 2020), the GupShup Hinglish chats-to-summaries (GupShup H2H) dataset (Mehnaz et al., 2021) for summarization, and the LINCE Hinglish (Singh et al., 2018) dataset for the NER task. Table 9 in Appendix A.3.2 lists data statistics.

Hinglish Out-of-Domain Translation Dataset (HooD). We introduce two out-of-domain translation test sets for Hinglish. The first, from the shopping domain, was prepared by in-house human experts who translated English sentences generated by humans and models like GPT-3, following the guidelines in Appendix A.3.1. The second test set was prepared from the GupShup corpus (Mehnaz et al., 2021), from parallel English-Hinglish summaries of conversations created by linguists (Gliwa et al., 2019). These datasets help assess the zero-shot transfer capabilities, from movies to the shopping and open domains, of models trained on the LINCE English-Hinglish dataset. Table 2 shows statistics of the HooD dataset.

Pretraining
Initialization. Training large transformer models from scratch requires thousands of GPU hours, which can be prohibitive. To ensure broader accessibility and best utilize existing models, we initialize the CoMixBART decoder and encoder's non-phonetic weights (NPW) from IndicBART and CoMixBERT's NPW from IndicBERT. These are pretrained using the Samanantar English-Hindi parallel corpus (Ramesh et al., 2021).
CoMixBART. We pretrain CoMixBART with DKGA on 1M sentences from Samanantar for 36k steps (∼2 epochs) on three 24GB GPUs with a batch size of 2816 tokens, linear learning rate warmup and decay with 16k warmup steps. We use the Adam optimizer with a max learning rate of 1e-3, label smoothing of 0.1, dropout of 0.1 and token masking probability of 0.4. For WSG, we pretrain the DKGA model for an additional 2k steps with the same setup, except with label smoothing and masking probability of 0. The learning curve for pretraining with DKGA and WSG is shown in Appendix A.4.2.

Since pretraining CoMixBART for phonetics from scratch is computationally prohibitive because of its size, we devise a way to obtain reasonable weights for downstream training. We initialize the embeddings of phonetic tokens with the mean of the embeddings of the text tokens that map to them. We also initialize the phonetic self-attention layer parameters with the same weights as those of the corresponding text channel's self-attention layer.
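The phonetic-embedding initialization described above can be sketched as follows. This is a hypothetical helper; in practice, the text-to-phonetic token mapping would come from the two tokenizers.

```python
import numpy as np

def init_phonetic_embeddings(text_emb, text_to_ph):
    """Initialize each phonetic-token embedding as the mean of the
    text-token embeddings that map to it.

    text_emb:   (n_text, d) text embedding matrix
    text_to_ph: dict mapping text-token id -> phonetic-token id
    """
    n_ph = max(text_to_ph.values()) + 1
    ph_emb = np.zeros((n_ph, text_emb.shape[1]))
    counts = np.zeros(n_ph)
    for t, p in text_to_ph.items():
        ph_emb[p] += text_emb[t]
        counts[p] += 1
    # avoid division by zero for phonetic tokens with no mapped text token
    return ph_emb / np.maximum(counts, 1)[:, None]
```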
CoMixBERT. We pretrain CoMixBERT with DKGA, WSG and Phonetics on 100k sentences from Samanantar on six 32GB GPUs with a batch size of 20 per GPU, starting with a learning rate of 5e-5, linear learning rate warmup and the AdamW optimizer. We pretrain DKGA and WSG for 1k steps and Phonetics for 3k steps. We are able to pretrain CoMixBERT with Phonetics because it has 7x fewer parameters than CoMixBART.

Downstream Fine-tuning
CoMixBART. The pretrained model is fine-tuned for downstream tasks in two stages. First, we attach a custom task-specific head to the decoder and train its weights together with CoMixBART's NPW (encoder and decoder) for 5k steps on three 24GB GPUs with a batch size of 2048 tokens, linear learning rate warmup and decay with 2k warmup steps, and a max learning rate of 5e-4 using the Adam optimizer. In the second stage, the phonetic weights of the CoMixTransformer encoder are initialized as per Section 5.3.1. Then, in downstream training of the complete model, the weights from the previous stage are optimized with a smaller learning rate than the CoMixBART encoder's phonetic weights for an additional 5k steps. We use beam search (size 4) for decoding. We train the baseline IndicBART model for all tasks using YANMTT (Dabre, 2022), as prescribed in the IndicBART repository (IndicBART, 2022), with the same setup as the CoMix models. In all cases, we pick the model with the best validation score after 5k training steps.
CoMixBERT. We attach a custom task-specific head to the model and train using the standard fine-tuning procedure. For NER, we also attach a CRF layer after all models, including the baseline. Since it is possible to combine encoder-only models without sequential training, we report ensemble results obtained by averaging logits for the DKGA+WSG and DKGA+WSG+Phonetic variants, as they were better than sequential training. We use grid search to find the right set of hyperparameters for all models, including the baseline, and pick the model with the best validation score. We custom-build our CoMixBART and CoMixBERT implementations using transformers (Wolf et al., 2020), YANMTT (Dabre, 2022), and PyTorch (Paszke et al., 2019).
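The logit-averaging ensemble is simple to sketch (illustrative only; the variants here stand in for independently fine-tuned CoMixBERT models):

```python
import numpy as np

def ensemble_predict(logit_sets):
    """Average per-token logits from independently trained encoder
    variants (e.g., DKGA+WSG and DKGA+WSG+Phonetics), then argmax."""
    avg = np.mean(np.stack(logit_sets), axis=0)
    return avg.argmax(axis=-1)
```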
Results and Analysis

Machine Translation

To test the generalization capabilities, we also evaluate the above models on out-of-domain HooD data.

Sequence Classification

Unlike other solutions to the SemEval-2020 task (Patwa et al., 2020), we train our model in a minimalistic fashion without any data augmentation, weighted adversarial loss, or token ids that can improve performance. Hence, we do not compare our results against other solutions in the SemEval-2020 task and only compare against mBERT and IndicBERT.

Abstractive Summarization
On the GupShup H2H summarization dataset, CoMix beats IndicBART on all metrics (BLEU, PhoBLEU, R1, R2 and RL) by a margin of 0.8 to 2 points, as shown in Table 7. CoMix even beats the previously published best BLEU results obtained from the PEGASUS model (Zhang et al., 2019) but is worse on the R1 and R2 metrics. CoMix is worse on recall-based metrics (R1, R2) and better on precision-based metrics (BLEU) than PEGASUS and BART, likely because of their ability to recall English-based words in the Hinglish summaries: they were pretrained only on English, and the GupShup dataset has been adapted from an English conversational summarization corpus, due to which it contains a lot of English named entities and words. We believe CoMixBART performance can further improve if we do pretraining with phonetics in the future.

Qualitative Analysis
We examine how well the models in Section 6.1 separate 3655 pairs of words (178 similar, 3477 dissimilar) from the 20 sentences in Appendix A.4.6. Figure 5 shows the distribution of cosine similarity of contextual embeddings (phonetic and textual for CoMix, textual for IndicBART) for similar (green) and dissimilar (red) pairs. We note that CoMix text embeddings separate the similar and dissimilar pairs better relative to IndicBART. Note that the scores for phonetic embeddings are on the higher side, most likely due to the initialization choice (mean of all text tokens mapped to a phonetic token) and the smaller (0.25x of text) vocab size for phonetics.

Efficacy of PhoBLEU on code-mixed data
On English-Hinglish translation for the LINCE dataset, we observe that annotations from human experts fluent in both Hindi and English achieve a BLEU score of 10.43, which is lower than most MT models. Further analysis revealed that BLEU is unable to account for valid variations in spellings, pronouns, and language switching (L_E vs. L_M), as shown in Fig 1. To address these gaps, we consider PhoBLEU (as defined in Section 4) and evaluate its correlation with human judgements. We randomly selected 200 English-Hinglish sentence pairs and their system-generated translations to be rated by professional annotators on a scale of 1 to 5, with 1 as poor and 5 as perfect. Completeness (no information lost) and fluency (grammatical correctness) were the rating criteria. Results in Table 8 and Figure 6 (mean automated scores corresponding to human rating levels) show that PhoBLEU is significantly more correlated with human judgement and that its distribution is better aligned with human ratings than other BLEU variants.

Extensibility of CoMix
The proposed ideas of domain-knowledge guided attention, weak supervision, and phonetic representations are not specific to Hinglish and readily generalize to any language pair for which we have a parallel corpus of embedded and matrix language content and tools for POS tagging and phonetic transcription. Below we discuss these requirements, along with other assumptions on the POS structure, that permit extensions of our methodology to most common code-mixed language pairs.
Assumption 1: Availability of parallel corpora.
Most common code-mixed languages include English among the language pair, which is typically transcribed in Latin script and permits easy phonetic transcription through tools such as Pyphonetics. Currently, there also exist multiple large parallel corpora (e.g., Flores, CCMatrix, Samanantar) where sentences in English are paired with those of multiple other languages. There are also many ongoing initiatives for creating such parallel corpora even for low-resource languages. Hence, the requirement of a large parallel corpus of matrix and embedded language content is satisfied by most common code-mixed pairs.
Assumption 2: Availability of pretrained multilingual models. With the proliferation of massively multilingual foundational models (e.g., mBART (50 languages), mBERT (104 languages), T5 (101 languages)) including advances in synthetic data augmentation, our assumption on the availability of pretrained LLMs or datasets to pretrain those models is also a reasonable one. We choose to work with IndicBART and IndicBERT which support 11 and 12 Indic languages respectively because they provide stronger baselines for Indic Languages and are faster to experiment with because of their smaller size, but the proposed ideas can be readily applied with any pre-trained transformer model.
Assumption 3: Languages of L_M and L_E share the same POS set, and access to POS tagging utilities. Petrov et al. (2012) proposed the Universal POS tagset comprising 12 categories that exist across languages and developed a mapping from 25 language-specific tagsets to this universal set. They demonstrated empirically that the universal POS categories generalize well across language boundaries, which led to an open community initiative by universaldependencies.org on creating Universal POS tags (Nivre et al., 2020) for 90 languages. In our work, we use these universal POS tags to build three coarse groups (nouns-pronouns, adjectives-verbs-adverbs, rest) of POS tags (see Fig 7). Note that even though we utilize POS tagging, the structural constraints are imposed with respect to these three coarse groups. Fig 7 in A.1 lists the POS tags from the universal POS tags website, which we use in our work. Further, Stanza provides Universal POS-tagging utilities for around 66 languages.
Assumption 4: Equivalent word pairs from L_M and L_E share the same coarse POS group. We assume that equivalent words in an L_M and L_E pair share the same coarse POS group (from Fig 7) and not necessarily the same POS tag. A small-scale empirical analysis of 50 Hindi-English-Hinglish sentences from the HooD dataset (Sec 5.2) indicates this assumption is true in 88.6% of the cases. POS tags provide complementary (weak) supervision for intermixing (in DKGA) and generation (in WSG) in addition to word semantics already captured in embeddings. Further, even though our current guiding function f_DKGA assumes a hard constraint on the word pairs to be in the same coarse POS group, our methodology is general and can be extended to the case where the two languages have different POS tag sets. In particular, given empirical probabilities that a matrix token with POS tag A maps to an embedded token with POS tag B, for all possible pairs of POS tags (A, B), we can define the guiding function's value f_DKGA(x_i, x_j) associated with matrix token x_i and embedded token x_j as the log of the empirical transition probability of the POS tags of x_i and x_j. The current choice is the special case where the transition probability is uniform for all POS tag pairs within a coarse group and 0 for the rest.
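Under this relaxation, the guiding function reduces to a table lookup of log transition probabilities. A hypothetical sketch: trans_prob would be estimated from aligned parallel sentence pairs, and the floor value is our assumption to keep the log finite for unseen tag pairs.

```python
import numpy as np

def soft_dkga(pos_m, pos_e, trans_prob, floor=1e-9):
    """Soft guiding-function values f_DKGA(x_i, x_j) = log P(tag_j | tag_i),
    estimated from aligned parallel pairs. The hard-constraint variant is
    the special case of a uniform distribution within each coarse POS
    group and (near-)zero probability outside it.

    pos_m:      POS tags of the matrix-language tokens
    pos_e:      POS tags of the embedded-language tokens
    trans_prob: dict (tag_A, tag_B) -> empirical transition probability
    """
    f = np.empty((len(pos_m), len(pos_e)))
    for i, a in enumerate(pos_m):
        for j, b in enumerate(pos_e):
            f[i, j] = np.log(trans_prob.get((a, b), floor))
    return f
```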

Conclusion
We presented CoMix, a pretraining approach for code-mixed data that combines (a) domain knowledge guided attention (DKGA), (b) weakly supervised code-mixed generation based on POS-structure constraints, and (c) a transformer encoder modification to include the phonetic signal. We showed that CoMix yields improvements across multiple code-mixed tasks, achieving new SOTA results for Eng-Hinglish translation and Hinglish NER on the LINCE Leaderboard with superior performance on out-of-domain translation. Our approach is applicable to code-mixing with all languages where POS tagging and phonetic transcription are possible. Motivated by gaps in current NLG evaluation metrics for code-mixed data, we proposed a new family of metrics based on the phonetic representation and showed that PhoBLEU is better correlated with human judgement than BLEU on Hinglish. In the future, we plan to extend the applicability of DKGA and WSG to other settings that can benefit from domain knowledge, and to explore new metrics for code-mixed NLG with a large-scale evaluation.

Limitations
Our CoMix approach assumes the availability of parallel bilingual (embedded and matrix language) corpora and mature tools for POS tagging and phonetic transcription for both the embedded and matrix languages, which does not hold true for every language. However, these assumptions are reasonable for a large number of languages, as shown in Section 7. Second, our current choices of the guiding function for attention f_DKGA and the mixing probability p_Mix are based on limited knowledge of the linguistic structure specific to English and Indic languages, and might need to be adapted for other language families. Additionally, as discussed in Section 4, due to multiple variations in code-mixed generation, current automated metrics that compare system-generated text with reference text do not provide a true reflection of a system's ability to generate code-mixed text. Lastly, as with large language models, our CoMix models are also vulnerable to biases inherent in the training corpus.

Ethics Statement
Our research motivation is to address the inequities in language resources and AI systems for multilingual societies such as India. The primary contribution of our work is a new modeling approach, CoMix, which is especially designed to leverage existing pretrained models with moderate computation so that it is accessible to a wider community and does not create an adverse environmental impact. We also created two new Hinglish datasets for out-of-domain evaluation (HooD), which we described in detail in Section 5.2. There are no privacy or intellectual property rights associated with either of these datasets. We will open-source HooD, our models and code in future post organizational approval. Human translations and evaluations reported in the paper have been done by professional annotation teams and are reflective of typical performance. Similar to other large language models, our CoMix model also encodes biases in the original training corpus and the domain constraints used as supervision. While the performance might be acceptable for natural language understanding, it is important to have guardrails while using the models directly for natural language generation.

A.1 POS Tag Groups

Figure 7 shows the POS tag groups used by DKGA. We built these groups using information from Universal Dependencies (Dependencies, 2014).

A.2.2 Sequence Classification
In the sequence classification task, we are given a sequence $X = [x_1, x_2, \ldots, x_S]$ and a corresponding label $y \in \{y_1, y_2, \ldots, y_k\}$ from a fixed set of $k$ classes. Given a training set with $M$ data points, we aim to maximize
$$\mathcal{L}(\theta) = \sum_{i=1}^{M} \log P(y^{(i)} \mid X^{(i)}; \theta)$$
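As a concrete illustration, the objective above is simply a sum of per-example log-probabilities of the gold label. The sketch below is illustrative only (toy probabilities, not CoMix code); maximizing this quantity is equivalent to minimizing cross-entropy loss.

```python
import math

def sequence_classification_log_likelihood(probs, labels):
    """Sum of log P(y^(i) | X^(i)) over M training examples.

    probs:  list of dicts mapping each of the k class labels to the
            model's predicted probability for one input sequence X^(i).
    labels: parallel list of gold labels y^(i).
    """
    return sum(math.log(p[y]) for p, y in zip(probs, labels))

# Two toy examples over k = 3 classes.
probs = [
    {"pos": 0.7, "neg": 0.2, "neu": 0.1},
    {"pos": 0.1, "neg": 0.8, "neu": 0.1},
]
labels = ["pos", "neg"]
ll = sequence_classification_log_likelihood(probs, labels)
```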

A.2.3 Abstractive Summarization
The mathematical formulation for summarization is the same as for translation, so we omit it for brevity. In abstractive summarization, unlike translation, the target sequence Y is a concise summary of the source sequence X, usually much shorter than X.

A.2.4 Token Classification
In the token classification task, we are given a sequence $X = [x_1, x_2, \ldots, x_S]$ and a corresponding label $y_s \in \{y_1, y_2, \ldots, y_k\}$ for every input token $s \in \{1, \ldots, S\}$, where $\{y_1, y_2, \ldots, y_k\}$ is the fixed set of $k$ classes. Given a training set $D$ with $M$ data points, we aim to maximize
$$\mathcal{L}(\theta) = \sum_{i=1}^{M} \sum_{s=1}^{S} \log P(y_s^{(i)} \mid X^{(i)}; \theta)$$

A.3 More details about datasets
A.3.1 Guidelines for preparing HooD Shopping dataset
Figure 8 shows the guidelines given to human annotators for translating English sentences to Hinglish for the HooD Shopping dataset.
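The token classification objective of A.2.4 extends the sequence-level likelihood with an inner sum over token positions. A minimal sketch with toy probabilities (NER-style class names are illustrative, not the actual LINCE tag set):

```python
import math

def token_classification_log_likelihood(token_probs, token_labels):
    """Sum over examples i and token positions s of log P(y_s | X).

    token_probs:  list (one per sequence) of lists (one per token) of
                  dicts mapping each class label to its probability.
    token_labels: matching nested list of gold labels y_s.
    """
    total = 0.0
    for seq_probs, seq_labels in zip(token_probs, token_labels):
        for p, y in zip(seq_probs, seq_labels):
            total += math.log(p[y])
    return total

# One toy sequence of length S = 2 with k = 2 classes.
probs = [[{"O": 0.9, "ENT": 0.1}, {"O": 0.3, "ENT": 0.7}]]
labels = [["O", "ENT"]]
ll = token_classification_log_likelihood(probs, labels)
```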
A.3.2 Data statistics of public datasets
Table 9 shows statistics for the public datasets we used for downstream tasks.

A.4.1 Details about tokenization
For phonetic data, we train our own sub-word tokenizer using sentencepiece 6 . For text data, we use the pretrained IndicBART tokenizer for CoMix-BART and the IndicBERT tokenizer for CoMix-BERT. We consider a sub-word's POS to be the same as the POS of the word from which it was created.
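The sub-word POS assignment described above can be sketched as follows. The toy splitter stands in for a trained sentencepiece model and the Hinglish words are illustrative; only the inheritance rule (each piece takes its parent word's tag) reflects the paper's scheme.

```python
def propagate_pos_to_subwords(words, word_pos, subword_fn):
    """Assign each sub-word the POS tag of its parent word.

    words:      list of surface words.
    word_pos:   parallel list of POS tags, one per word.
    subword_fn: callable splitting a word into sub-word pieces
                (stands in for a trained sentencepiece model).
    """
    subwords, subword_pos = [], []
    for word, pos in zip(words, word_pos):
        pieces = subword_fn(word)
        subwords.extend(pieces)
        subword_pos.extend([pos] * len(pieces))  # inherit parent POS
    return subwords, subword_pos

# Toy splitter: break words longer than 4 characters into two pieces.
toy_split = lambda w: [w[:4], w[4:]] if len(w) > 4 else [w]
toks, tags = propagate_pos_to_subwords(
    ["khareedna", "hai"], ["VERB", "AUX"], toy_split
)
```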
A.4.2 More details on pretraining with DKGA and WSG
Figure 9 shows the learning curve for pretraining CoMixBART. The loss stabilizes after 25k steps and does not change much thereafter. Figure 10 shows the learning curve for pretraining CoMixBART with WSG. Table 10 shows a few sample inputs fed to the model during WSG training and the corresponding targets constructed by the model. These generated sentences can be used for data augmentation, which we plan to explore in the future.

6 https://github.com/google/sentencepiece

Figure 11 shows the convergence speed of the IndicBART and CoMix models on the LINCE English-Hinglish translation task. As the curves show, CoMix outperforms the baseline IndicBART throughout training.

A.4.5 Set of sentences for cosine similarity distribution
Figure 13 shows the 20 sentences from which every word pair was manually labelled similar/dissimilar and then used to create Figure 5, which shows the cosine similarity score distribution of contextual embeddings obtained from the encoders of CoMix and IndicBART.

A.4.6 CoMix vs IndicBART Cosine Similarity Distribution of Contextual Embeddings
Figure 11 shows the mean and variance of the cosine similarity distribution of 3655 word pairs constructed from the 20 sentences in Figure 13, along different subsets of positive pairs and close negative pairs. We observe that the cosine scores for positive pairs based on CoMix text embeddings have a bimodal distribution: high scores for pairs with the same language and spelling, but relatively low scores otherwise. However, even these low-scoring positive pairs score comparably to or higher than close negatives. For IndicBART, we again observe a bimodal distribution for the negative pairs, with high scores for pairs that have different semantics but share either spelling or phonetic representation, which makes them difficult to separate from the positive pairs.
CoMix phonetic embeddings by themselves do not appear very discriminative, but they help compensate for the shortcomings of CoMix text embeddings in handling phonetic variations.
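The cosine similarity scores underlying these distributions are computed pairwise over embedding vectors. A minimal sketch (toy 3-dimensional vectors stand in for the actual encoder embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine score between two embedding vectors, as used to
    compare the contextual embeddings of a word pair."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings for a positive pair (same word, spelling variant)
# and a close negative pair; real vectors come from the encoder.
pos_score = cosine_similarity([1.0, 2.0, 0.5], [0.9, 2.1, 0.4])
neg_score = cosine_similarity([1.0, 2.0, 0.5], [-2.0, 0.1, 1.0])
```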
Since WSG nudges the model to code-mix, that behaviour is also visible in the translations generated by the two models. Figure 14 shows a few randomly sampled translations generated by the DKGA and DKGA+WSG models. The translations show that the DKGA+WSG model switches between the matrix and embedded language more often, because of its pretraining.

Figure 14: Comparing 10 randomly sampled translations generated by DKGA vs. DKGA+WSG. The DKGA+WSG model switches between matrix and embedded language more often than DKGA because of how WSG nudges the model to code-mix during pretraining.

Figure 15: DKGA attention matrix and contextual embedding construction for an example sentence. Green ticks mark where DKGA guides attention. If we choose to intermix an embedded-language token at a position, the "Choice 2" tokens are considered; otherwise the "Choice 1" tokens are considered for constructing contextual embeddings. C1, C2, C3 are the POS groups defined in Appendix A.1.