Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

Code-switching (CSW) text generation has been receiving increasing attention as a solution to data scarcity. In light of this growing interest, more comprehensive studies comparing different augmentation approaches are needed. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of the augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, both trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove effective in the absence of CSW parallel data, where both approaches achieve similar results.


Introduction
Code-switching (CSW) is the alternation of languages in text or speech, which can occur across different levels of granularity: sentences, words, and morphemes. CSW is a common phenomenon in Arabic-speaking countries, as in other multilingual communities. Given that Arabic is a morphologically rich language (Habash et al., 2012), speakers produce morphological CSW, as illustrated below: algorithm+ implement+ ⇐ 'Okay, I'll+implement the+algorithm right away'. CSW introduces a set of challenges to NLP systems, not least of which is data scarcity. This is attributed to CSW being a predominantly spoken phenomenon, only recently increasing in written form on social media. Data augmentation has proved to be a successful workaround for this limitation. Researchers have investigated several techniques for CSW data augmentation, including learning CSW points (Solorio and Liu, 2008; Gupta et al., 2021), lexical replacements (Appicharla et al., 2021; Xu and Yvon, 2021; Gupta et al., 2021; Hamed et al., 2022c), linguistic theories (Pratapa et al., 2018; Lee et al., 2019; Hussein et al., 2023), neural-based approaches (Chang et al., 2018; Winata et al., 2018, 2019; Menacer et al., 2019; Song et al., 2019; Li and Vu, 2020), and machine translation (MT) (Vu et al., 2012; Tarunesh et al., 2021). With increasing efforts in this area, we need more comparative studies to better understand the merits and requirements of different approaches.
Efforts along these lines include the work of Pratapa and Choudhury (2021), where different linguistic-driven and lexical replacement techniques were compared through human evaluation, but not on NLP tasks. Winata et al. (2018) propose the use of a pointer-generator network and compare it against the equivalence constraint (EC) theory (Poplack, 1980) and random lexical replacement for LM, without human evaluation. Hamed et al. (2022c) compare multiple lexical replacement techniques, covering human evaluation and performance on language modeling (LM), automatic speech recognition (ASR), MT, and speech translation. Hussein et al. (2023) compare using the EC theory and random lexical replacement for LM and ASR, also reporting human assessments.
In this work, we compare three main approaches: lexical replacements, linguistic theories, and back-translation (BT). We evaluate the approaches for both the naturalness of CSW generations and performance on MT, where we focus on CSW Egyptian Arabic-English to English translation. The rationale for our focus on MT is the scarcity of work on data augmentation for this task as opposed to LM and ASR. Furthermore, previous work on MT focuses on lexical replacements (Menacer et al., 2019; Song et al., 2019; Appicharla et al., 2021; Xu and Yvon, 2021; Gupta et al., 2021; Hamed et al., 2022c) and BT (Tarunesh et al., 2021), without substantial comparison between approaches. Through our comparative study, we provide answers to the following research questions: • RQ1: Which augmentation techniques perform best in zero-shot and non-zero-shot settings (with/without the availability of CSW parallel corpora) for MT?
• RQ2: Does generating more natural synthetic CSW sentences entail improvements in MT?
Data Augmentation Techniques

We provide an overview of the investigated techniques. Our aim is to augment Arabic-to-English parallel sentences, converting the source side of the parallel data from monolingual Arabic to CSW Arabic-English, further extending the MT training data with CSW instances. In Figure 1, we provide an example showing possible augmentations across techniques; more examples are shown in Table 4.

Lexical Replacements
We investigate the following three approaches:

Dictionary Replacement: We replace x random Arabic words on the source side with English gloss entries. We obtain the gloss entries using MADAMIRA (Pasha et al., 2014), an Arabic morphological analyzer and tagger. Such a specialized analysis system is required for this task, as Arabic is morphologically rich and orthographically ambiguous. We refer to this approach as LEX_Dict.
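As a sketch of this replacement step, the following uses a hypothetical toy gloss dictionary standing in for MADAMIRA's analyses (the entries below are assumptions for illustration, not real analyzer output):

```python
import random

# Toy gloss dictionary standing in for MADAMIRA's analyses
# (hypothetical entries; real glosses come from the morphological analyzer).
GLOSSES = {"ktAb": "book", "kbyr": "big", "mdrsp": "school"}

def lex_dict_augment(src_tokens, x, seed=0):
    """Replace x random source words that have a gloss entry with the gloss."""
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(src_tokens) if tok in GLOSSES]
    chosen = set(rng.sample(candidates, min(x, len(candidates))))
    return [GLOSSES[tok] if i in chosen else tok
            for i, tok in enumerate(src_tokens)]
```

For example, `lex_dict_augment(["ktAb", "kbyr", "gdA"], 2)` replaces the two dictionary words and leaves the out-of-dictionary token untouched.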
Aligned with Random CSW Point Assignment: We augment the Arabic-to-English parallel sentences by randomly picking x source-target aligned words (using intersection alignments) and replacing the source words with their counterpart words on the target side. In Hamed et al. (2022c), the authors investigated two types of alignments for performing source-target replacements: (1) word replacements using intersection alignments and (2) segment replacements, where grow-diag-final alignments are used to identify aligned segments. Given that segment replacements were shown to be superior, we follow that setup in our experiments. We refer to this approach as LEX_Rand.
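A minimal sketch of the word-replacement variant, assuming intersection alignments are given as a source-to-target index map (toy data, not the actual Giza++ pipeline):

```python
import random

def lex_rand_augment(src, tgt, alignment, x, seed=0):
    """Replace x randomly chosen aligned source words with their English
    counterparts, given intersection word alignments (src index -> tgt index)."""
    rng = random.Random(seed)
    idxs = rng.sample(sorted(alignment), min(x, len(alignment)))
    out = list(src)
    for i in idxs:
        out[i] = tgt[alignment[i]]  # insert the aligned target word
    return out
```

The segment-replacement variant used in the paper generalizes this from single aligned words to aligned spans obtained from grow-diag-final alignments.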
Aligned with Learnt CSW Point Prediction: Similar to the previous approach, we perform target-to-source replacements; however, the choice of words on the target side to be inserted into the source side is based on a CSW predictive model (Appicharla et al., 2021; Hamed et al., 2022c). The model is trained to identify words on the target side that would be plausible CSW words on the source side. The task of CSW point prediction is modeled as a sequence-to-sequence classification task. The neural network takes as input the target sentence word sequence x = {x_1, x_2, ..., x_N}, where N is the length of the sentence. The network outputs a sequence y = {y_1, y_2, ..., y_N}, where y_n ∈ {0,1} represents whether the word x_n is a plausible CSW word or not. To obtain the training data for the predictive model, we utilize a limited amount of CSW Egyptian Arabic-English to English parallel sentences, where we tag the words on the target side as 0 or 1 based on whether they appear as CSW words on the source side or not. This is done using a matching algorithm described in Hamed et al. (2022c). The CSW predictive model is then trained by fine-tuning mBERT on this data. Afterwards, to augment Arabic-to-English parallel data, we use the model to identify CSW candidates on the target side, which are inserted into the source side using segment replacements. For a detailed description of this approach, see Hamed et al. (2022c). We refer to this approach as LEX_Pred.
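The tagging step that produces training labels for the predictive model can be sketched as follows; this simple case-insensitive matcher is an illustrative stand-in for the matching algorithm of Hamed et al. (2022c):

```python
def tag_targets(csw_src_tokens, tgt_tokens):
    """Label each target word 1 if it appears (case-insensitively) as a CSW
    word on the source side, else 0 -- a simplified stand-in for the matching
    algorithm of Hamed et al. (2022c)."""
    src_lower = {t.lower() for t in csw_src_tokens}
    return [1 if t.lower() in src_lower else 0 for t in tgt_tokens]
```

Applied to a CSW source containing the English word "algorithm" and its target translation, only that shared word receives label 1; these (token, label) sequences then serve as fine-tuning data for the mBERT classifier.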

Linguistic Theories
We cover the following two linguistic theories:

Equivalence Constraint (EC) Theory: The EC theory (Poplack, 1980) is an alternational model of CSW, where there are no defined matrix and embedded languages. Instead, the theory states that code-switching can occur at points where the surface structures of both languages map onto each other. In the example in Figure 1, the permissible alternations are indicated by dotted lines. Generating " Italian" and "Italian " is not allowed, as the syntactic rules of both languages differ (Arabic adjectives follow the nouns they modify).
Matrix Language Frame (MLF) Theory: The MLF theory (Myers-Scotton, 1997), on the other hand, is an insertional model. It is based on the identification of a matrix language, into which constituents of the embedded language are inserted such that the sentence follows the grammatical structure of the matrix language, and the embedded language is inserted at grammatically correct points. Unlike the EC theory, replacements from the embedded to the matrix language are not allowed within nested sub-trees. Replacements of closed-class constituents are also not allowed, including determiners, quantifiers, prepositions, possessives, auxiliaries, tense morphemes, and helping verbs.
For both linguistic theories, we use the GCM tool (Rizvi et al., 2021). The tool provides multiple augmentations per source-target parallel sentence, following a linguistic theory. To sample from these generations, it provides two sampling approaches: random and Switch-Point Fraction (SPF) (Pratapa et al., 2018). In random sampling, k generations are picked randomly. In SPF sampling, the generations are ranked based on their SPF distribution compared to a reference distribution obtained from real CSW data, and the top-k generations are chosen. SPF is calculated as the number of switch points divided by the total number of language-dependent tokens in a sentence. We set k to 1, which is unified across all techniques. We include both sampling approaches, where we refer to the variants as EC_Rand, EC_SPF, ML_Rand, and ML_SPF.
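SPF and SPF-based sampling can be sketched as follows, operating on per-token language tags; this is a simplified stand-in for the GCM tool's sampler, which compares full distributions rather than a single reference value:

```python
def spf(lang_tags):
    """Switch-Point Fraction: number of switch points divided by the number
    of language-dependent tokens. lang_tags holds the language label of each
    language-dependent token in order."""
    switches = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))
    return switches / len(lang_tags)

def sample_by_spf(generations, ref_spf, k=1):
    """Rank candidate generations (each a list of language tags) by closeness
    of their SPF to a reference SPF and keep the top k."""
    return sorted(generations, key=lambda g: abs(spf(g) - ref_spf))[:k]
```

With the paper's reference SPF of 0.22, the candidate whose SPF is nearest 0.22 is kept; random sampling simply skips the ranking step.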

Table 1: The number of CSW generations (#Aug) obtained from the different BT setups: (1) a BT model trained on English to Arabic and English to CSW Arabic-English parallel sentences, (2) same as (1), followed by fine-tuning on the English to CSW Arabic-English parallel data, (3) same as (2), with English sentences appended to both sides of the training data, and (4) same as (3), utilizing the top-19 hypotheses.

Back-translation
Despite BT (Sennrich et al., 2015) being a well-known data augmentation technique, it has received little attention in the scope of CSW (Tarunesh et al., 2021). In this approach, we train a BT model to translate English sentences to CSW Arabic-English. We then use this model to translate the target side of the Arabic-to-English parallel sentences, generating synthetic CSW Arabic-English to English parallel sentences. The BT model is trained on a limited amount of English to CSW Arabic-English parallel sentences and a larger amount of English to Arabic parallel data. However, when using this model to translate 309k English sentences, only 109 CSW sentences are generated, with the rest of the translations being monolingual Arabic. This is because only 0.7% of the BT model's training sentences contain CSW. We boost the number of generated CSW synthetic sentences through the following steps:

1. We fine-tune the model using the English to CSW Arabic-English parallel data.
2. In the BT model training data, we further append the English sentences in the parallel corpus to both source and target sides.
3. At inference, instead of obtaining the top-1 hypothesis for each English sentence, we utilize the top-k hypotheses and obtain the CSW translation with the highest confidence score.
We set k to 19, as we could not increase k further due to computational constraints.
In Table 1, we show the effect of each step on the number of obtained CSW generations, reaching a total of 151k CSW augmentations when applying all three steps (augmenting 49% of the original sentences).
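Step 3 above can be sketched as follows; the script-based CSW detector below is a crude assumption used purely for illustration (the actual system works with full model hypotheses and their scores):

```python
def pick_csw_hypothesis(nbest, is_csw):
    """From an n-best list of (hypothesis, score) pairs, return the
    highest-scoring hypothesis judged to be code-switched, else None."""
    csw = [(h, s) for h, s in nbest if is_csw(h)]
    return max(csw, key=lambda p: p[1])[0] if csw else None

def has_latin_and_arabic(sent):
    """Crude CSW detector: the sentence mixes Arabic-script and Latin-script
    words (an illustrative assumption, not the paper's method)."""
    has_ar = any('\u0600' <= c <= '\u06FF' for c in sent)
    has_la = any(c.isascii() and c.isalpha() for c in sent)
    return has_ar and has_la
```

Hypotheses that are fully monolingual Arabic are skipped, so a lower-ranked but code-switched hypothesis can be selected instead of the top-1 translation.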
Experimental Setup

Data
We use two sources of data: (1) ArzEn-ST (Hamed et al., 2022b), which is a CSW-focused parallel corpus, and (2) monolingual Egyptian Arabic-to-English parallel corpora. ArzEn-ST contains English translations of a CSW Egyptian Arabic-English speech corpus (Hamed et al., 2020) gathered through informal interviews with bilingual speakers. The corpus is divided into train, dev, and test sets having 3.3k, 1.4k, and 1.4k sentences (containing 2.2k, 0.9k, and 0.9k CSW sentences).

Setup of Augmentation Approaches
Through augmentation, we convert the source side of the 309k Arabic-to-English parallel sentences to CSW Arabic-English. For word alignments, we use Giza++ (Och and Ney, 2003). For the augmentation approaches that require CSW parallel sentences, we utilize ArzEn-ST. In BT, we train the model on the train sets of the parallel corpora outlined in Section 3.1, with reversed source and target sides. The predictive model in LEX_Pred is trained on the portion of the ArzEn-ST train set having CSW sentences. That subset is also utilized in the linguistic theories to obtain the reference SPF distribution (= 0.22). It is also utilized in LEX_Dict and LEX_Rand, where the value of x is set to 19% of the source words, based on the percentage of English words in the ArzEn-ST train set CSW sentences, which is 18.8%. However, the average percentage calculated over sentences is 22.1%, with a standard deviation of 17.5%. The choice of 19% is in agreement with Hussein et al. (2023), where the authors report LM perplexities achieved by embedding different percentages of English words in Arabic text using random lexical replacement and decide on a percentage of 20%. In future work, we believe an interesting direction is to model the CSW distribution to obtain a wider coverage of CSW levels, rather than targeting a single percentage for all sentences.

Machine Translation System
We train a Transformer model using Fairseq (Ott et al., 2019) on a single GeForce RTX 3090 GPU. We use the hyperparameters from the FLORES benchmark for low-resource machine translation (Guzmán et al., 2019); the hyperparameters are given in Appendix C. We use a BPE model trained jointly on the source and target sides with a vocabulary size of 16k (which outperforms 1, 3, 5, 8, 32, and 64k). The BPE model is trained with character_coverage set to 1.0. For MT training data, we use the train sets of the corpora outlined in Section 3.1. For the augmentation experiments, we append the synthetically generated CSW Arabic-English to English parallel sentences. For development and evaluation of the MT models, we use the ArzEn-ST dev and test sets.

Evaluation
In this section, we present intrinsic evaluation, human evaluation, and extrinsic evaluation.

Intrinsic Evaluation
In Table 2, we report the number of CSW sentences generated per technique, as well as CSW statistics. The number of augmentations varies considerably across techniques: LEX_Dict > LEX_Rand > BT > EC > LEX_Pred > ML.
With regards to CSW metrics, we report the Code-Mixing Index (CMI) (Gambäck and Das, 2016), SPF, and the average percentage of English tokens over sentences. CMI reflects the level of mixing between multiple languages and is calculated on the sentence level as follows:

CMI(x) = (N − max_{L_i ∈ L}{t_{L_i}} + P) / (2N)

where N is the number of language-dependent tokens in sentence x; L_i ∈ L is the set of languages in the corpus; max_{L_i ∈ L}{t_{L_i}} is the number of tokens in the dominating language in x; and P is the number of switch points in x, where 0 ≤ P < N. The corpus-level CMI is calculated as the average of the sentence-level CMI values. We observe that, in general, LEX_Rand and LEX_Pred provide the closest figures to ArzEn-ST with regards to CSW metrics. It is to be noted that, unlike LEX_Rand and the SPF-based linguistic theories, no explicit CSW heuristics were provided to LEX_Pred; the predictive model learnt to imitate the CSW frequency in ArzEn-ST. In the case of linguistic theories, we note that SPF sampling provides CMI and SPF figures that are closer to ArzEn-ST than random sampling. Finally, we report that the linguistic theories and BT augmentations contain high percentages of English words.
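The sentence- and corpus-level CMI computations can be sketched as follows, operating on per-token language tags (a minimal sketch of the formula above; a monolingual sentence scores 0):

```python
from collections import Counter

def sentence_cmi(lang_tags):
    """Per-sentence CMI (Gambäck and Das, 2016):
    CMI(x) = (N - max tokens in the dominating language + P) / (2N)."""
    n = len(lang_tags)
    if n == 0:
        return 0.0
    dominant = max(Counter(lang_tags).values())
    p = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))  # switch points
    return (n - dominant + p) / (2 * n)

def corpus_cmi(corpus):
    """Corpus-level CMI: average of the sentence-level values."""
    return sum(map(sentence_cmi, corpus)) / len(corpus)
```

For instance, a fully alternating four-token sentence (ar, en, ar, en) yields (4 − 2 + 3) / 8 = 0.625, while a monolingual one yields 0.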

Human Evaluation
In order to assess the quality of the synthetically generated CSW sentences, we perform a human evaluation study. Out of the original sentences that get augmented by all techniques, we randomly sample 150 sentences. These sentences are evaluated by three annotators across the eight augmentation techniques against two measures: understandability and naturalness. All three annotators are female Egyptian Arabic-English bilingual speakers, in the age range of 33-39, all graduates of private English schools. We follow the rubrics introduced by Pratapa and Choudhury (2021), outlined in Table 3.

Table 3:
Understandability
1: No, this sentence doesn't make sense.
2: Not sure, but I can guess the meaning of this sentence.
3: Certainly, I get the meaning of this sentence.
Naturalness
1: Unnatural, and I can't imagine people using this style of code-mixed Arabic-English.
2: Weird, but who knows, it could be some style of code-mixed Arabic-English.
3: Quite natural, but I think this style of code-mixed Arabic-English is rare.
4: Natural, and I think this style of code-mixed Arabic-English is used in real life.
5: Perfectly natural, and I think this style of code-mixed Arabic-English is very frequently used.
Understandability is rated on a scale of 1-3 and naturalness on a scale of 1-5, where scores of 3-5 are assigned to natural sentences with different levels of commonality of being encountered in real life. A total of 1,200 augmentations are annotated by each of the three annotators for both understandability and naturalness, giving a total of 7,200 annotations. For each augmentation, we calculate the mean opinion score (MOS) as the average of the scores given by the three annotators. The full results are provided in Appendix E, where the percentage of sentences falling under each MOS range per technique is presented in Table 7. In Figure 2, we show the percentage of sentences perceived as natural by annotators across techniques (the sum of the last two rows in Table 7). We observe the following ranking between techniques: BT > LEX_Pred > ML > EC > LEX_Rand > LEX_Dict.
With regards to linguistic theories, as noted by Doğruöz et al. (2021), computational implementations of linguistic theories do not necessarily generate natural CSW sentences that mimic human CSW generation. We elaborate on this point in Section 6. While ML achieves higher naturalness ratings than EC, we do not observe superiority across the different sampling techniques, which can be due to the SPF values only changing slightly between both techniques in our case. This can be different in other setups with different reference SPF distributions. With regards to understandability, there is less variability across the techniques (91-96% of the augmentations are given ratings between 2 and 3), except for LEX_Dict (65%). We measure inter-annotator agreement by applying pairwise Cohen's kappa (Cohen, 1960), reporting 0.25-0.28 (fair agreement) on naturalness between annotator pairs. Low agreement on this task is expected, as CSW attitude is speaker-dependent (Vu et al., 2013). The pairwise Cohen's kappa scores for understandability are higher (0.33-0.35), yet still show fair agreement. We also apply Fleiss' kappa (Fleiss, 1971) across all annotators, scoring fair agreement of 0.312 and 0.249 for understandability and naturalness, respectively.
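Pairwise Cohen's kappa can be computed from two annotators' label sequences as follows (a minimal sketch, not the exact evaluation script; it assumes the usual chance-corrected agreement formula):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)
```

Perfect agreement yields 1.0; values around 0.21-0.40 are conventionally read as "fair agreement," matching the ranges reported above.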

Extrinsic MT Evaluation
The augmentation techniques covered in this study vary in terms of requirements. One main difference is the reliance on CSW parallel data, which is only available for a few CSW language pairs (Hamed et al., 2022b). To have a fair comparison and to show the effectiveness of the techniques in both cases (availability and lack of CSW-focused parallel corpora), we run two sets of experiments:

• Zero-shot setting: In this setting, our baseline system is trained only on the 309k monolingual Arabic-to-English parallel sentences.
We extend the training data with augmentations generated using the techniques that do not require CSW parallel data, namely: LEX_Dict, LEX_Rand, EC, and ML.
• Non-zero-shot setting: In this setting, we assume the availability of CSW parallel data. We train our baseline system using the monolingual Arabic-to-English parallel sentences in addition to the ArzEn-ST corpus. We then append the augmentations generated by each of the investigated techniques.
In the following sections, we present our baseline systems and the results for the zero-shot and non-zero-shot settings. The full results are reported in Table 5, showing BLEU (Papineni et al., 2002), chrF, chrF++ (Popović, 2017), and BERTScore (F1) (Zhang et al., 2019). BLEU, chrF, and chrF++ are calculated using SacreBLEU (Post, 2018). We report performance on the ArzEn-ST test set, on all sentences as well as on CSW sentences only. Our analysis in this section is based on chrF++. This choice is based on chrF++ showing higher correlation with human judgments than chrF (Popović, 2017) and chrF showing higher correlation than BLEU (Kocmi et al., 2021). We focus on performance on the ArzEn-ST test set CSW sentences, as this is our main concern. Statistical significance tests for the zero- and non-zero-shot settings are shown in Table 6.

Baselines
We develop the following MT baselines, showing the improvements achieved by each source of data:

• BL_CSW: trained solely on the ArzEn-ST train set, having 3.3k parallel sentences.
• BL_Mono: trained on the 309k monolingual Arabic-to-English parallel sentences.
• BL_MonoTgt: In BL_Mono, we observe that English words on the source side get dropped in translation. This issue has been previously tackled using techniques including direct copying (Song et al., 2019) or the use of a pointer network (Menacer et al., 2019). We propose a simple technique of including target-target pairs in training: in addition to the source-target sentences used in BL_Mono, we append the English (target) sentences on both the source and target sides, ending up with 617k parallel sentences. Our hypothesis is that, by doing so, the model learns to retain the English words on the source side through translation.
• BL_All: We include the same data as in BL_MonoTgt, in addition to the ArzEn-ST train set, giving a total of 620k parallel sentences.
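The target-target augmentation step of BL_MonoTgt amounts to the following sketch over (source, target) sentence pairs:

```python
def add_target_target_pairs(pairs):
    """Append each English (target) sentence to both the source and target
    sides, doubling the corpus, so the model learns to copy English words
    through translation."""
    return pairs + [(tgt, tgt) for _, tgt in pairs]
```

Applied to the 309k-pair corpus, this yields the roughly 617k pairs of BL_MonoTgt; at training time the model then sees English input mapped to identical English output, encouraging it to retain embedded English words.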
The chrF++ scores are shown in Figure 3 (full results in Table 5, Exp 1-4). The effectiveness of the simple step of adding target-target pairs during training is confirmed: BL_MonoTgt achieves an increase of +15.6 chrF++ points over BL_Mono. Adding the ArzEn-ST train set (BL_All) results in a further +2.3 chrF++ points, reaching 57.3 chrF++.

Zero-shot Setting Experiments
This setting is tailored to the majority of CSW language pairs, which are under-resourced and lack CSW-focused parallel corpora. We demonstrate the effectiveness of the augmentation techniques in a zero-shot setting. Given that LEX_Pred and BT rely on CSW parallel data, they are excluded from this comparison. We include the following approaches: LEX_Dict, LEX_Rand, EC_Rand, EC_SPF, ML_Rand, and ML_SPF. We acknowledge that some of these approaches rely on heuristics obtained from CSW data, such as the SPF value or the enforced CSW percentage. However, we argue that these figures can be obtained from textual data (which is more easily accessible than parallel data). The baseline in this setting is BL_MonoTgt, which is our best baseline that does not utilize real CSW data.
We report that LEX_Dict degrades the MT performance, falling 3.2 chrF++ points below the baseline. We present the chrF++ scores for the other techniques in Figure 4 (full results in Table 5, Exp 5-10). We observe that the linguistic-based models and LEX_Rand perform equally well, despite LEX_Rand generating more data. As shown in Table 6, there is no statistically significant difference between LEX_Rand and the linguistic-based models. Comparing the linguistic theories, EC performs better than ML; however, there is no difference between the SPF and random sampling strategies. Overall, EC_Rand performs best, with statistical significance over ML_Rand and ML_SPF, achieving +1.3 chrF++ points over BL_MonoTgt.

Non-zero-shot Experiments
In this setting, we assume the availability of CSW-focused parallel data, and thus compare all augmentation techniques. The baseline for this setting is BL_All. The chrF++ scores are shown in Figure 5 (full results in Table 5, Exp 11-18). LEX_Dict falls below BL_All by 1.4 chrF++ points; we thus exclude it from Figure 5. We observe that LEX_Pred and BT outperform LEX_Rand and the linguistic theories. The best performance is achieved by BT, with +1.3 chrF++ points over BL_All. We also report that LEX_Rand and the linguistic theories are unable to achieve significant improvements over BL_All. We examine the amount of real in-domain CSW data that would yield performance equivalent to that achieved by LEX_Rand and the linguistic theories in the zero-shot setting. In Figure 6, we show a learning curve obtained by adding different amounts of ArzEn-ST train set CSW sentences to the BL_MonoTgt training data, and show that LEX_Rand and the linguistic theories (generating 98-192k CSW synthetic sentences) perform on par with 50% of the ArzEn-ST train set CSW sentences (≈1,080 sentences).

Discussion
In this section, we revisit our RQs.

RQ1 - Which augmentation techniques perform best for MT? In the zero-shot setting, LEX_Rand and the linguistic theories achieve similar performance, with EC outperforming the ML models. In the non-zero-shot setting, BT outperforms all techniques, followed by LEX_Pred. Both techniques, being trained on real CSW data, are able to generate more natural CSW sentences, which could also be closer in CSW style to ArzEn-ST.

RQ2 -Does generating more natural synthetic CSW sentences entail improvements in MT?
Here, we look into the relation between MT scores and naturalness ratings. In the non-zero-shot setting, we report a correlation of 0.97 between the chrF++ scores (presented in Figure 5) and the percentage of sentences perceived as natural (presented in Figure 2). This demonstrates a strong positive correlation between MT performance and the naturalness of augmentations.
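Such a system-level correlation between chrF++ scores and naturalness percentages can be computed as follows, assuming a Pearson coefficient (the paper does not name the coefficient, so this is an assumption) over toy values:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Feeding in one (chrF++, %natural) point per augmentation technique yields a single coefficient; values near 1 indicate that more natural augmentations go hand in hand with better MT scores.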
Given that the number of augmentations varies considerably across techniques (shown in Table 2), this variation could empower some techniques over others, affecting performance as an effect of quantity rather than quality. Therefore, we perform another set of experiments where we control for this variable. We report results under a constrained setup, where we restrict the augmentations appended to the baseline training data to only those sentences that are successfully augmented by all techniques (= 24.8k sentences). We first append the constrained augmentations per technique to the BL_MonoTgt training data. The results are presented in Table 5, Exp 19-26. The ranking based on chrF++ is: BT > LEX_Pred > [LEX_Rand and linguistic theories] > LEX_Dict. The correlation between the chrF++ scores achieved on the ArzEn-ST test set CSW sentences and the percentage of sentences perceived as natural is 0.95. We replicate the constrained experiments by appending the constrained augmentations to the BL_All training data. Given the availability of CSW data in the training data, and with the constrained amounts of augmented data, the majority of the models show no improvements over BL_All. We therefore cannot use this setup to draw conclusions on the relation between quality and performance. However, from the previously discussed findings, we confirm a positive relation between the naturalness of generated synthetic sentences and MT performance.

Insights into Augmentations
In this section, we present insights into the augmentations produced by the different techniques, further elaborating on their strengths and weaknesses. All examples mentioned in this section refer to the examples demonstrated in Table 4.
Lexical Replacements: The main drawback of LEX_Dict is that the replaced words might not be correct translations within context, which can negatively affect the MT model. As shown in Table 4, Example 1, AwlE 'turn on', in the context of turn on this light, is replaced by 'kindle'. This drawback is also observed in the case of ambiguous words, as shown in Example 2, where the word TAbE 'stamp' is replaced by 'impression'. With regards to LEX_Rand, CSW can occur at unnatural locations, such as replacing dh 'this' in Example 1. This is less likely for LEX_Pred, which is reflected in the human evaluation.
Linguistic Theories: We observe that applying linguistic theories does not guarantee naturalness; e.g., the augmentation provided by EC_SPF shown in Example 3, despite being a correct augmentation under the EC theory, was given a rating of '2' by all annotators. Moreover, the effectiveness of these techniques is tied to (and currently restricted by) the performance of the available tools that implement them. We observe that the augmentations obtained from the GCM tool in some cases violate the EC or ML theories. For the EC theory, in Example 4, we demonstrate a case where an Arabic-to-English alternation occurs at the word 'station', which is a point of syntactic divergence between mHTp AlAwtwbys and 'the coach station'. For the ML theory, in Example 5, the augmentation includes the stand-alone CSW segment 'in', while replacements of closed-class constituents (including prepositions) are prohibited. As the tool relies on generating Arabic parse trees from English parse trees using alignments, errors are likely to be introduced. Furthermore, as noted by Hussein et al. (2023), the augmentations are sometimes missing information from the original sentences.

BT: We observe that BT is capable of generating correct morphological code-switching (MCS). As shown in Example 6, the MCS construction bt+handle 'handles' is correctly composed of bt 'progressive-imperfect-2nd-masc-sing' preceding the verb 'handle', and 'field' is correctly preceded by the definite article Al 'the'. While researchers have provided insights into common Arabic-English MCS constructs (Kniaź, 2017; Kniaź and Zawrotna, 2021; Hamed et al., 2022a), there is no current research that allows for modeling Arabic-English MCS in a rule-based approach.
Therefore, the ability of neural-based approaches to generate MCS is an advantage. On the other hand, similar to the partial transcription issue noted by Chowdhury et al. (2021) for ASR models using BPE, the BT approach can produce partial translations of words, such as 'modifications' translated to tEdyl+s 'modification+s' (Example 7). BT might also provide literal translations. With both issues combined, we find cases such as 'locker' being translated to qfl+er 'lock+er'.

Conclusion and Future Work
We present a comparative study of different CSW data augmentation techniques and their effectiveness for MT in both zero-shot and non-zero-shot settings. We show that in the zero-shot setting, random lexical replacement performs equally well as linguistic theories. In the non-zero-shot setting, back-translation performs best, followed by CSW predictive-based lexical replacement. Both approaches also stand out in human evaluation, where we confirm a positive correlation between the naturalness of augmentations and MT performance. However, both approaches rely on expensive and limited CSW parallel data. Overall, the set of examined approaches proves useful in alleviating data scarcity. Each approach comes with particular merits and requirements, guiding the choice for different research needs. In future work, we plan on enhancing the back-translation approach to leverage larger amounts of English data. In parallel, we will investigate the effectiveness of generative AI to broaden the benchmark of approaches, and expand our study to cover other NLP tasks.

Limitations
One limitation of the presented work is that the models were evaluated on only one test set; therefore, we cannot say how the models will perform on other sets covering other domains and sources (spoken versus written). Another limitation is that the study involves only one language pair. Further research is needed to investigate whether the findings hold for other language pairs. A third limitation is the low variability in the annotators' demographics, as the three annotators are female, in the same age group, and received similar levels of education. Including a broader set of annotators would enrich the research with insights on the level of agreement between annotators with wider background differences.

A Augmentation Examples
In Table 4, we present examples of augmentations generated by the different techniques. These examples are discussed in Section 6.

B Data Preprocessing
Following Hamed et al. (2022c), we remove corpus-specific annotations, remove URLs and emoticons through tweet-preprocessor, tokenize numbers, apply lowercasing, run Moses' (Koehn et al., 2007) tokenizer as well as MADAMIRA (Pasha et al., 2014) simple tokenization (D0), and perform Alef/Ya normalization. For entries with words having literal and intended translations, we opt for one translation having all literal translations and another having all intended translations. For LDC2017T07, we utilize the work of Shazal et al. (2020), where the authors used a sequence-to-sequence model to transliterate the corpus text from Arabizi (where Arabic words are written in Roman script) to Arabic orthography. For the Egyptian Arabic-to-English parallel corpora discussed in Section 3.1, we only utilize the 309k monolingual Egyptian Arabic-to-English parallel sentences available in these corpora; we do not utilize the parallel sentences with code-switching within the scope of this work. In future work, it would be interesting to investigate how the effectiveness of data augmentation varies with the availability of different amounts of real CSW parallel data, to draw further conclusions under different levels of low-resourcefulness. Also, for MADAR and LDC2012T09, we only utilize the Egyptian Arabic subsets of both corpora.

C Hyperparameters
For finetuning mBERT for the CSW predictive model, we set the number of epochs to 5, the dropout rate to 0.1, the warmup steps to 500, the batch size to 13, and the learning rate to 0.0001. Training and inference together took ≈12 hours.
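The learning-rate schedule implied by these hyperparameters can be sketched as below. The peak rate (1e-4) and the 500 warmup steps come from the paper; the linear warmup shape and the constant rate after warmup are our assumptions, as the paper does not specify the schedule.

```python
# Hedged sketch: linear warmup to the peak learning rate over the
# first 500 optimizer steps, then constant. Only lr=1e-4 and
# warmup_steps=500 are stated in the paper; the shape is assumed.
PEAK_LR = 1e-4
WARMUP_STEPS = 500

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR

print(learning_rate(0), learning_rate(249), learning_rate(1000))
```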

D MT Results

In Table 5, we report the MT results, showing BLEU, chrF, chrF++, and BERTScore (F1) scores.

E Human Evaluation

We present the full results of the human evaluation study discussed in Section 4.2. For each evaluated augmentation, we calculate the mean opinion score (MOS) as the average of the scores given by the three annotators. In Table 7, we present the percentage of sentences falling under each MOS range for understandability and naturalness per augmentation technique. In Table 8, we present the average MOS scores per technique. In Figure 1, we provide an example showing possible augmentations across techniques; more examples are shown in Table 4.
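The MOS computation can be sketched as follows. The three-annotator averaging on a 1-5 scale is from the paper; the half-open range labels used for bucketing are an illustrative choice, not necessarily the exact ranges of Table 7.

```python
from statistics import mean

# Sketch of the MOS computation in Appendix E: each augmentation is
# scored by three annotators on a 1-5 scale, and the MOS is their mean.
# The bucketing into ranges is our illustrative choice.

def mos(scores):
    assert len(scores) == 3, "three annotators per augmentation"
    return mean(scores)

def mos_range(m, edges=(1, 2, 3, 4, 5)):
    """Map a MOS value to a half-open range label such as '[3, 4)'."""
    for lo, hi in zip(edges, edges[1:]):
        if m < hi:
            return f"[{lo}, {hi})"
    return "[4, 5]"

print(mos([3, 4, 5]), mos_range(mos([3, 4, 5])))
```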

Figure 1: An example showing possible augmentations by the different techniques. We show the parse tree for the English sentence and the word alignments. The permissible switching points under the EC theory are shown by the dotted lines.

Figure 2: The percentage of augmentations with 3 ≤ MOS ≤ 5 (from "quite natural but rarely used" to "perfectly natural and frequently used") per technique.

Figure 3: chrF++ scores of the different baselines on the ArzEn-ST test set CSW sentences.

Figure 5: The effectiveness of the augmentation techniques in a non-zero-shot setting. We show the chrF++ scores on the ArzEn-ST test set CSW sentences. The solid and dashed lines represent BL_MonoTgt and BL_All.

Figure 6: The effectiveness of the augmentation techniques in a zero-shot setting. We show the chrF++ scores on the ArzEn-ST test set CSW sentences. The solid and dashed lines represent BL_MonoTgt and BL_All.

Table 2: The number of generated sentences per technique, together with their CMI and SPF mean and standard deviation (SPF/SPF_σ) values and the average percentage of English words (%En). We also report the figures for the CSW sentences in the ArzEn-ST train set as a reference.
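The CMI and SPF statistics reported in Table 2 can be computed from token-level language tags. The sketch below assumes the common formulations (CMI following Das and Gambäck, 2014; SPF as the fraction of adjacent language-tagged token pairs whose languages differ); the paper's exact variants may differ.

```python
from collections import Counter

# Hedged sketch of the Code-Mixing Index (CMI) and Switch-Point
# Fraction (SPF) for a sentence with token-level language tags
# ("ar", "en", or "other" for language-independent tokens).

def cmi(tags):
    lang = [t for t in tags if t != "other"]
    n, u = len(tags), len(tags) - len(lang)
    if not lang:
        return 0.0
    dominant = Counter(lang).most_common(1)[0][1]
    return 100.0 * (1.0 - dominant / (n - u))

def spf(tags):
    lang = [t for t in tags if t != "other"]
    if len(lang) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(lang, lang[1:]))
    return switches / (len(lang) - 1)

tags = ["ar", "ar", "en", "ar", "other"]
print(cmi(tags), spf(tags))  # higher values = more mixing/switching
```

A monolingual sentence gets CMI 0 and SPF 0; a sentence alternating languages at every word maximizes both, which is why the two statistics together summarize how much and how often the generated data switches.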

The statistical significance between the models in the zero-shot and non-zero-shot settings, for the chrF++ scores achieved on the ArzEn-ST test set CSW sentences, is shown in Table 6. The number of parameters in the models is 39,712,768 for Exp 1 and 44,967,936 for Exp 2-26. The training time was ≈8 minutes for Exp 1, ≈2.6 hours for Exp 2, and ≈5.2-6.5 hours for Exp 3-26.

Table 4: Examples of synthetic CSW sentences generated by the different augmentation techniques, demonstrating the strengths and weaknesses of the techniques. Given that Arabic is written from right to left, we display all augmentations in a right-to-left orientation.