Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages


One longstanding challenge in this area involves acquiring high-quality and low-cost code-mixed data. For one, code-mixing is observed more frequently in colloquial settings and spoken communication, which makes procuring and curating extensive datasets logistically demanding and costly (Chan et al., 2009; Winata et al., 2021). Moreover, despite code-mixing's prevalence across social media and digital messaging platforms, consolidating such data may be curtailed by legal guardrails and scalability issues. Recognizing these challenges, we explore the feasibility of using generative Large Language Models (LLMs) to ameliorate data scarcity in code-mixing research. As recent work shows that LLMs can successfully generate synthetic data (Taori et al., 2023; He et al., 2023; Tang et al., 2023; Whitehouse et al., 2023), here we evaluate whether multilingual LLMs can be prompted to create code-mixed data that look natural to native speakers (and if so, to what extent).

[Figure: example model outputs for an English-Tagalog code-mixing prompt about Artificial Intelligence, annotated with code-mixing ratings in the original figure:
(ChatGPT) "Artificial Intelligence ay isang exciting na teknolohiya na makakapagbigay ng maraming benepisyo sa society." Translation: "Artificial Intelligence is an exciting technology that can provide many benefits sa lipunan."
(Davinci-003) "AI is revolutionizing the way we live, trabaho, at mabuhay ngayon sa pamamagitan ng pagbibigay ng modernong solusyon sa mga proyekto at pagkukuha ng ilang mga trabahong naghihintay na ma-automate."
(Davinci-002) "Artificial Intelligence is something that we should be thankful for."
(BLOOMZ) "An expert in artificial intelligence explains their area of expertise to a student with the help of a colleague who speaks both languages."
(Flan-T5-XXL) "Ang pagsusuri ng mas maraming kapasira ng hariko ay nag-aaral ng sasakyan ng pag-aaral ng pangangalambot, na magtitipon para sa pamamagitan ng pagpapasagawa ng karangalan ng isang lalaking."]
To this end, we home in on languages in South East Asia (SEA). The region is home to more than 680 million people and over 1,200 languages, and code-mixing is particularly prevalent there due to its countries' extended histories of language and cultural cross-fertilization and colonialism (Figure 1) (Goddard, 2005; Bautista and Gonzalez, 2006; Reid et al., 2022). Marked by its distinctive multilingual and multiracial composition today, SEA presents an opportunity to further research numerous marginalized languages and linguistic practices in NLP research (Migliazza, 1996; Goddard, 2005; Joshi et al., 2020; Aji et al., 2022; Winata et al., 2023; Cahyawijaya et al., 2022). Nonetheless, publicly available code-mixed datasets relevant to SEA communities remain limited (Lyu et al., 2010; Winata et al., 2022).
We prompt multilingual LLMs to generate text that mixes English with either Malay, Indonesian, Chinese, Tagalog, Vietnamese, or Tamil. All six of these SEA languages (alongside English) are used across six SEA countries, namely Singapore, Malaysia, Brunei, the Philippines, Indonesia, and Vietnam. Furthermore, they belong to different language families: Indo-European, Austronesian, Sino-Tibetan, Austro-Asiatic, and Dravidian. An example of a prompt we used is: "Write an English and Tamil code-mixed sentence about Artificial Intelligence." In addition, we prompt these LLMs to generate texts in Singlish, an English-based creole widely spoken in Singapore that combines multiple SEA languages such as Malay, Chinese, and Tamil. We ask native speakers to annotate the naturalness (i.e., whether a native speaker would speak as such) and the level of code-mixing in the outputs.
To the best of our knowledge, our work marks the first attempt at studying the generation of synthetic code-mixed data through prompting LLMs in a zero-shot fashion, without any monolingual reference texts or explicit linguistic constraints (Solorio and Liu, 2008; Tarunesh et al., 2021; Rizvi et al., 2021; Mondal et al., 2022). We find that publicly available multilingual language models such as BLOOMZ and Flan-T5-XXL are only capable of code-mixing with loanwords or topic-related nouns. Most of the time, they fail to code-mix at all (despite being advertised as multilingual). While ChatGPT stands out in its ability to generate code-mixed texts, it is extremely sensitive to the prompt template and exhibits considerable variance in its success at generating natural-sounding code-mixed texts across different language pairs. Additionally, it may erroneously introduce additional languages not specified in the prompt and wrongly explain the code-mixing of the text.
Our results lead us to conclude that code-mixing, at least as of today, is not considered an essential component of many multilingual LLMs. Moreover, the opaque creation of models like ChatGPT makes it difficult to ascertain the mechanisms that enable code-mixing. By highlighting the limited promise of LLMs in a specific form of low-resource data generation, we advise NLP researchers against using existing systems to produce synthetic code-mixed data without extensive human evaluation.
In our prompts, we specify code-mixing between English and either Indonesian, Malay, Mandarin, Tagalog, Vietnamese, or Tamil. We focused on code-mixing English with SEA languages for two reasons: (1) extensive literature on code-mixed English provides a relevant point of comparison, and (2) English is one of the most widely used languages in code-mixing across SEA countries (Kirkpatrick, 2014). We additionally prompt with sentences in Singlish, a creole language, to evaluate how sensitive LLMs are to the diversity of language practices in the SEA region. In total, we submitted 210 unique prompts per language model.
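As a rough illustration of how such a prompt grid can be built, the sketch below takes the cross-product of languages, topics, and templates. The language list follows the paper, but the topics and the two templates are illustrative placeholders, not the study's exact set (whose full grid yields the 210 unique prompts per model).

```python
from itertools import product

# Languages follow the paper; TOPICS and TEMPLATES are placeholder examples.
LANGUAGES = ["Indonesian", "Malay", "Mandarin", "Tagalog", "Vietnamese", "Tamil"]
TOPICS = ["Artificial Intelligence", "food", "traffic", "weather"]
TEMPLATES = [
    "Write an English and {lang} code-mixed sentence about {topic}.",
    ("Code-mixing refers to a phenomenon of combining two or more languages in "
     "a single sentence. Can you generate a code-mixed English-{lang} sentence "
     "about {topic}?"),
]

def build_prompts(languages=LANGUAGES, topics=TOPICS, templates=TEMPLATES):
    """Return the full cross-product of languages x topics x templates."""
    return [t.format(lang=lang, topic=topic)
            for lang, topic, t in product(languages, topics, templates)]

prompts = build_prompts()
# This toy grid yields 6 languages x 4 topics x 2 templates = 48 prompts.
```

Each prompt is then submitted to every model under comparison, so the annotation effort grows linearly with the number of models.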

Level of Code-Mixing
To evaluate outputs, we ask whether LLMs can produce intrasentential code-mixed text. We adopt the definition of intrasentential code-mixing from Berk-Seligson (1986), which covers mixing small constituents (such as noun and verb phrases) and large constituents (such as coordinate clauses and prepositional phrases). Native speakers are then tasked to manually annotate the collected responses on a scale from 0 to 3, using the following coding guidelines to denote the degree of code-mixedness:

• 0 - No code-mixing: The generated text is written purely in one language or only exhibits intersentential code-mixing (i.e., switching at sentence boundaries, including interjections, idioms, and tags), for which we likewise adopt the definition from Berk-Seligson (1986).
• 1 - Loanwords: The generated text uses loanwords for common terminologies. We consider a word a loanword if it is listed in Wiktionary. For example, in the sentence "I like eating pho," "pho" is a loanword.
• 2 - Topic-related nouns: The generated text uses nouns related to the topic specified in the prompt in another language. For instance, for the topic of traffic, an example would be "今天的 traffic 真的很糟糕，我开了一个小时."

• 3 - Linguistic entities: The generated text code-mixes beyond loanwords and topic-related nouns, mixing units such as prefixes/suffixes, phrases, and clauses.

We use this scale instead of popular word-level metrics such as CMI (Gambäck and Das, 2014) because our scale more holistically evaluates the ability of LLMs to code-mix. The lower end of this scale reflects a lower complexity of code-mixing. Code-mixing with loanwords is arguably less challenging, as loanwords are often used in a monolingual context to begin with. Likewise, code-mixing topic-related nouns is not as complex, as there is presumably a correspondence between the nouns in the two languages and the mixing is primed by the prompts. On the other hand, code-mixing prefixes/suffixes, phrases, and clauses requires a good grasp of the intricate morphosyntactic structures of both languages and can produce syntactically diverse code-mixed data. Therefore, we consider the LLM to have successfully generated code-mixed text only if the text belongs to this last category.
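For contrast with the categorical scale above, here is a minimal sketch of the word-level CMI metric (Gambäck and Das, 2014) mentioned in the discussion. It assumes per-token language tags are already available; in practice these would come from a language-identification step.

```python
def code_mixing_index(lang_tags):
    """Code-Mixing Index (Gambäck and Das, 2014) from per-token language tags.

    Tokens tagged "other" are treated as language-independent (e.g., named
    entities, numbers). Returns 0 for monolingual text and approaches 100 as
    the mix of languages becomes more balanced.
    """
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == "other")  # language-independent tokens
    counts = {}
    for t in lang_tags:
        if t != "other":
            counts[t] = counts.get(t, 0) + 1
    if n == u or not counts:                       # no language-tagged tokens
        return 0.0
    max_w = max(counts.values())                   # tokens in dominant language
    return 100.0 * (1.0 - max_w / (n - u))

# e.g., 3 English + 2 Indonesian tokens -> 100 * (1 - 3/5) = 40.0
score = code_mixing_index(["en", "en", "id", "en", "id"])
```

Note how CMI only counts tokens per language: a sentence mixed entirely with loanwords can score the same as one mixing whole clauses, which is precisely why the categorical scale is used instead.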

Naturalness
We asked native speakers to annotate the naturalness of the generated text on a rating scale of 1 to 3: 1 - not natural at all; 2 - an annotator may see someone else from other regions/cultures speaking it; and 3 - an annotator may see themselves speaking it. If the text is not considered natural (a rating of 1), we requested that the annotators document their reasons, including fluency issues, syntactical errors, and semantic incorrectness.
Code-mixing practices vary across race, gender, class, and geography, among other axes. Therefore, in line with sociolinguistic research, we emphasize that the "naturalness" of code-mixed text (as perceived by an annotator) is subjective and depends on the background and social location of the annotator. All annotators are AI researchers and native speakers of their respective SEA languages. We refer to Appendix G for more information about our annotators and their inter-annotator agreement scores.
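The inter-annotator agreement referenced above can be quantified with a standard statistic such as Cohen's kappa. The sketch below is a generic two-annotator implementation, not the paper's exact procedure (the paper reports its scores in Appendix G).

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa between two annotators rating the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement),
    where chance agreement comes from each annotator's marginal label
    frequencies.
    """
    assert len(ratings_a) == len(ratings_b), "annotators must rate same items"
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if expected == 1.0:          # degenerate case: both always give one label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For more than two annotators or ordinal scales like ours, weighted kappa or Krippendorff's alpha would be the more common choices; this two-rater version is only meant to make the notion of "agreement scores" concrete.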

Accurateness
We additionally annotate the accurateness of the generations to account for task failures or incorrect explanations. Inaccurate outputs are defined by:

• Failure to follow instructions: Some LLMs fail to accurately carry out the prompt instruction (for example, generating monolingual sentences). We also observe some code-mixed outputs that include additional languages unspecified in the prompts.
• Inaccurate explanations: ChatGPT, among other LLMs, tends to explain the code-mixed phrases, but occasionally provides incorrect explanations (see Figure 10).

English-SEA Languages
Figure 3 demonstrates that ChatGPT outperforms other language models in generating code-mixed data across six different language pairs, and is particularly proficient at code-mixing linguistic expressions. For InstructGPT, the davinci-003 model demonstrates some ability to generate syntactically varied code-mixed texts, whereas davinci-002 mostly code-mixes with loanwords. Conversely, the other two publicly available multilingual LLMs have extremely limited code-mixing capabilities. More specifically, Flan-T5-XXL can only code-mix with loanwords for the Indonesian-English language pair, and most of its non-English monolingual outputs suffer from severe fluency issues (see Appendix D). BLOOMZ is only capable of code-mixing topic-related nouns for Tamil-English, even though its multilingual pretraining data ROOTS (Laurençon et al., 2022) and instruction-tuning data xP3 (Muennighoff et al., 2022) cover Indonesian, Chinese, Tamil, and Vietnamese. We observe no direct effect of the proportions of these languages in the training sets on BLOOMZ's ability to code-mix (Appendix E).
We further break down the performance of ChatGPT in Figure 4. (Detailed analyses for davinci-002, davinci-003, Flan-T5-XXL, and BLOOMZ can be found in the Appendix: Figures 11, 12, 13, and 14.) In Figure 4(a), we see that
ChatGPT is least proficient at mixing linguistic elements for English-Tagalog. This may be due to syntactic differences between the two languages; for example, English exhibits Subject-Verb-Object (SVO) word order, whereas Tagalog exhibits a verb-initial structure. Moreover, English demonstrates nominative-accusative alignment, whereas Tagalog, being a symmetrical-voice language, utilizes a case system with a typological classification that "remains controversial among Austronesian linguists" (Aldridge, 2012, 192). In contrast, ChatGPT performs best for English-Indonesian code-mixing, which may be due to the training data distribution and to similarities between the two languages regarding word order and morphosyntactic alignment. We also find that ChatGPT is capable of using either English or a SEA language as the matrix language, i.e., as the main language of a sentence as per the Matrix Language Frame model (Myers-Scotton, 1997).
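As an aside on the Matrix Language Frame analysis above, a crude token-counting heuristic can approximate the matrix language of a sentence when per-token language tags are available. The MLF model proper is defined over frame-setting morphemes rather than raw token counts, so this is only a hedged approximation for illustration.

```python
def matrix_language(token_langs):
    """Heuristic guess at the matrix language of a code-mixed sentence.

    Simply returns the language contributing the most tokens, ignoring
    language-independent tokens tagged "other". This is a rough stand-in for
    a proper Matrix Language Frame analysis, which inspects which language
    supplies the morphosyntactic frame (system morphemes, word order).
    """
    counts = {}
    for lang in token_langs:
        if lang != "other":
            counts[lang] = counts.get(lang, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)

# e.g., a mostly-Indonesian sentence with one embedded English word
lang = matrix_language(["id", "id", "en", "id"])  # -> "id"
```

Ties between languages are resolved arbitrarily here; a real analysis would fall back to inspecting function words and inflectional morphology.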
Figure 4(b) shows ChatGPT's code-mixing proficiency based on topics. ChatGPT tends to code-mix with loanwords when the topic is "AI," mixing in the English loanword "Artificial Intelligence" or its short form "AI." For food, it tends to code-mix with food-related terms (topic-related nouns) in SEA languages, such as "bánh mì" (Vietnamese sandwich). We also observe some representation biases in specific language-topic pairs. For instance, when it comes to food, ChatGPT uses the word "nasi goreng" (fried rice) in all English-Indonesian responses. For other topics, such as traffic and weather, it tends to code-mix phrases related to traffic congestion and hot weather.
In Figure 4(c), we find that the prompt template with the highest-quality results is the one where the term code-mixing is explicitly defined. In contrast, the worst-performing template asks the model to generate conversations between two bilingual speakers, with the term code-mixing left unmentioned. In Figure 5, we see that ChatGPT generates an uncommon pattern of conversation in which one interlocutor speaks in English and the other speaks in another language entirely (top example). Furthermore, ChatGPT may assume there are four speakers even though the prompt asks for a conversation between two speakers (bottom example).
In terms of naturalness, we observe considerable variance in ChatGPT's outputs, with English-Tamil being the least natural (Figure 6). Further analysis shows that ChatGPT either commits grammatical mistakes (such as comma splices and redundancy) or generates semantically confusing sentences. We also observe unnatural text patterns that mix two different script systems of the same language in a single sentence (for example, the Tamil script and its transliterated Latin script). We document these naturalness and fluency issues in Table 4 (Appendix F) and report the inter-annotator agreement scores in Appendix G.

Singlish
Prompt: "Imagine that you are a Singaporean speaker. How would you talk about food using Singlish in a single sentence?"
(ChatGPT) "Wah, the hawker center got so many sedap food, like chicken rice, laksa, and rojak, can die for sia!" (Translation: "Wow, the hawker center has so many delicious foods, like chicken rice, laksa, and rojak, they are to die for!")

Given that Singlish is an English-based creole language that has lexicalized many words from languages such as Malay, Chinese, Tamil, Cantonese, and Hokkien (e.g., "lah," "leh," and "loh"), it is difficult to disambiguate loanwords and linguistic elements. Therefore, we classify model success on a binary scale of whether the generated sentences contain Singlish expressions (e.g., interjections and lexicalized phrases from languages without a standardized writing system). In Figure 7, we see that ChatGPT and InstructGPT (davinci-003) have up to a 96% success rate in generating Singlish sentences, whereas Flan-T5-XXL and BLOOMZ have a near-zero success rate. Furthermore, we find that ChatGPT is also capable of translating Singlish expressions into Standard American English (Figure 8).
While ChatGPT is capable of generating fluent and natural Singlish (Figure 6), we observe semantic inaccuracies in its word choices that are discernible by native speakers. In Figure 2(b), ChatGPT generates an English-Malay code-mixed phrase "sotong and chilli sauce," where "sotong" is the Malay word for "squid." Upon first inspection, the sentence appears syntactically sound, but native speakers found the analogy between familial connections and the dish (squid with chili sauce) to be semantically confusing.
OpenAI's documentation of model differences (more specifically, that davinci-003 can produce higher-quality writing, process more complex instructions, and generate longer content than davinci-002) does not adequately explain why ChatGPT and davinci-003 significantly outperform davinci-002 and other multilingual LLMs in generating Singlish text. We hypothesize that the performance gap may be due to a larger presence of Singlish in OpenAI's training data. Singlish is not only one of the most well-researched dialects of English (Sin, 2017), but it also uses the English alphabet and Latin script (unlike many other SEA languages), which potentially makes it easier for LLMs to scrape and parse data in this language.

Failures of ChatGPT
We notice that ChatGPT may fail to produce correctly code-mixed text when the nationality of the speaker is mentioned. Invoking nationality may induce ChatGPT to generate code-mixed texts with an additional national language of the country. In the Figure 9 example below, because the national language of Malaysia is Malay (although Mandarin Chinese is also widely spoken in the country), ChatGPT not only returns a code-mixed response that includes the Malay word "teruk" when prompted for an English-Chinese utterance by a Malaysian, but also misclassifies the word "teruk" as Chinese.

(Figure 9)
Prompt: "Imagine that you are a Malaysian speaker. How would you talk about traffic using both English and Chinese in a single sentence?"
(ChatGPT) Assuming that you are asking for a sentence that combines both English and Chinese in a natural and commonly used way by Malaysian speakers, here is an example: "今天的 traffic 很 teruk，我要赶紧走去 KLCC for meeting。" This sentence incorporates the English word "traffic" and the Chinese word "teruk" (which means "bad" or "terrible" in English) to describe the situation of the traffic being very bad. The speaker then switches back to English to express their intention to hurry to KLCC for a meeting. This kind of code-switching is common in Malaysia, where speakers often mix languages in their daily conversations.
While ChatGPT can explain how its generated text is code-mixed (as illustrated in Figure 2(a)), we find its justifications unreliable for two reasons. First, it may hallucinate words that do not exist in the code-mixed text. For instance, in the top example in Figure 10, the English word "cold" does not exist in the code-mixed text. Second, it may assign words or phrases to the wrong language. In the bottom example in Figure 10, the phrase "nak minum air sejuk je" is in Malay, not English. This observation corroborates Bang et al.'s (2023) finding that ChatGPT performs poorly on language identification tasks for low-resource languages.

Implications for LLMs and Ways Forward
In this section, we walk through various implications of our work for the deployment of LLMs for synthetic data generation in low-resource settings, multilinguality in LLMs, the importance of code-mixing ability in generative LLMs, and research transparency.
Deploying LLMs for Low-Resourced Data Generation By putting LLMs' generative capabilities to the test, we ask in this work whether they can generate high-quality and low-cost code-mixed texts for researchers working on a topic plagued by limited data availability. While we conclude that ChatGPT has shown relative success in generating code-mixed texts for some SEA languages, we advise researchers to exercise heavy caution when using this data generation technique. Even for Singlish, which outperforms the other languages examined, we find that syntactically sound responses may contain semantic inaccuracies that are difficult for non-native speakers to detect. Furthermore, its explanations may be misleading. Due to this lack of reliability, we strongly suggest that researchers implement extensive human checks with native speakers if they wish to pursue this method of data generation.

(Figure 10, top example)
Prompt: "Imagine that you are an Indonesian speaker. How would you talk about weather using both English and Indonesian in a single sentence?"
(ChatGPT) If I were to talk about weather using both English and Indonesian in a single sentence, I might say something like: "Hari ini sangat hot dan panas sekali, saya rasa aku butuh minum air dingin." This sentence includes the English words "hot" and "cold" to describe the weather, and the Indonesian words "panas" and "dingin" which have similar meanings. Additionally, the sentence also includes a personal statement in Indonesian, "saya rasa aku butuh minum air dingin," which means "I feel like I need to drink cold water."

(Figure 10, bottom example)
Prompt: "Code-mixing refers to a phenomenon of combining two or more languages in a single sentence. Can you generate a code-mixed English-Malay sentence about weather?"
(ChatGPT) "The weather hari ni sangat panas, I rasa macam nak minum air sejuk je." (Translation: "The weather today is very hot, I feel like drinking cold water.") In this sentence, "hari ni" and "sangat panas" are in Malay, while "I" and "nak minum air sejuk je" are in English.
Multilingual ≠ Code-Mix Compatible Our results with BLOOMZ and Flan-T5-XXL show that the ability to code-mix is not acquired by LLMs after pretraining and/or finetuning with multilingual data (Laurençon et al., 2022; Muennighoff et al., 2022; Chung et al., 2022). In other words, for most NLP models, multilinguality simply means that the same system can process tasks and generate outputs in multiple languages, but not necessarily in the same sentence. By highlighting this limitation, we echo previous research motivating the inclusion of code-mixing abilities in NLP models. Doing so requires NLP models to capture the dynamics of combining languages with different degrees of typological affinity, as well as pragmatic and contextual features such as tone, formality, and other cultural nuances (Winata et al., 2020; Lai and Nissim, 2022; Kabra et al., 2023).

Towards More Inclusive Language Technology
Recognizing that generative LLMs are the primary driving force behind the advancement of AI conversational agents and speech technology (Thoppilan et al., 2022; SambaNova Systems, 2023; Pratap et al., 2023), we emphasize the significance of incorporating code-mixed recognition and generation capabilities in LLMs in order to enhance the inclusivity and humaneness of language technology. By enabling conversational agents to reflect the language-mixing patterns of their users, people can communicate in ways that are more comfortable and authentic to their linguistic identities. In fact, a recent study by Bawa et al. (2020) has shown that multilingual users strongly prefer chatbots that can code-mix. Removing the need for people to adjust their speech patterns to become legible to machines would not only mitigate the effects of linguistic profiling (Baugh, 2005; Dingemanse and Liesenfeld, 2022) and hegemonic, Western-centric technological designs, but also enable users to develop more trust in language technology through naturalistic dialogue interactions.
Research Transparency Aside from showing that ChatGPT and InstructGPT can code-mix, we cannot confidently identify how these models do so due to the lack of transparency in how they are developed. Without a window into the training data and engineering processes behind models like ChatGPT, we can only speculate that their training data includes a substantial amount of code-mixed text. To help facilitate greater levels of transparency and accountability, we urge the developers of forthcoming LLMs to be more open about how their models were developed and to document the training data accurately and comprehensively.

Related Work
Code-Mixed Data in SEA Unlike monolingual data, there are only a limited number of human-curated code-mixed datasets. This resource limitation is more severe in SEA due to its marginalization in NLP research (Winata et al., 2022). Popular current code-mixing evaluation benchmarks (Aguilar et al., 2020; Khanuja et al., 2020) do not include SEA languages, and existing code-mixing studies in SEA only cover a limited number of language pairs and creoles, e.g., English-Tagalog (Oco and Roxas, 2012), English-Indonesian (Barik et al., 2019; Yulianti et al., 2021), Javanese-Indonesian (Tho et al., 2021), Chinese-English (Lyu et al., 2010; Lovenia et al., 2022; Zhang and Eickhoff, 2023), and Singlish (Chen and Min-Yen, 2015; Lent et al., 2021). The current corpus does not even scratch the surface of the sheer amount of code-mixedness in SEA (Redmond et al., 2009), where deployable data is practically non-existent. In this work, we try to close this gap by exploring the potential of generating synthetic code-mixed data for the SEA region by prompting LLMs.

(2021) leveraged parse tree structure for such replacements. Another approach is to perform neural machine translation to translate monolingual sentences into code-mixed ones (Appicharla et al., 2021; Gautam et al., 2021; Jawahar et al., 2021; Dowlagar and Mamidi, 2021). In this work, we assess a novel way of generating synthetic code-mixed sentences through prompting multilingual LLMs.

Conclusion
To ameliorate the scarcity of code-mixed data for South East Asian languages, we explore generating synthetic code-mixed data using state-of-the-art multilingual Large Language Models (LLMs). On one hand, we find that publicly available LLMs such as BLOOMZ and Flan-T5-XXL have limited capability in generating syntactically diverse code-mixed data. On the other hand, closed-source models such as ChatGPT and InstructGPT are better at generating natural code-mixed text, but their performance varies substantially depending on the prompt template and language pairing. Furthermore, many outputs suffer from syntactic, semantic, and reliability issues. Therefore, we caution against using LLM-generated synthetic code-mixed data without the involvement of native speakers in annotating and editing it.

Limitations
Effectiveness of Synthetic Code-Mixed Data on Downstream Tasks

In our study, we did not evaluate how much our synthetically generated code-mixed data improves the ability of language models to handle code-mixed text in downstream NLP tasks. While previous findings have shown that finetuning models with synthetic code-mixed data yields smaller performance gains than with naturally occurring code-mixed data (Santy et al., 2021), we believe that this performance gap will diminish as the quality of synthetic data generation improves with future multilingual LLMs.

Lack of Human-Generated Data
While we annotated the degree of code-mixedness and naturalness, we did not have human-generated, naturally occurring code-mixed sentences in response to the prompt topics. Therefore, we could not systematically compare the data distribution of our synthetic data against human-generated data. However, since there are multiple ways in which a sentence can be code-mixed, our focus in this work is on how human-like the sentences are, and this, we believe, was adequately captured by our evaluation.

Monolingual Zero-Shot Prompting
Our study only uses prompt templates written in English to prompt language models in a zero-shot manner.In future follow-ups, we will (1) use code-mixed prompt templates such as "Generate an English-Bahasa sentence" instead of "Generate an English-Malay sentence" and (2) investigate LLMs' capabilities in generating code-mixed data with in-context few-shot examples.

Instruction-Tuned Language Models
Our work only covers instruction-tuned language models.In future work, we will include a comparison between multilingual models that are not finetuned with instructions-for example, GPT3 (davinci) (Brown et al., 2020) and BLOOM (Scao et al., 2022)-to explore the effects of instruction tuning in generating code-mixed data.

English-Centric Code-Mixing
Our study focuses on generating code-mixed data only for English-SEA language pairs.For future studies, we plan to investigate generating codemixed data for non-English language pairs commonly spoken in SEA countries (such as Malay-Chinese and Indonesian-Javanese).

Failures of BLOOM and Flan-T5-XXL
Given the lack of research transparency on why ChatGPT performs better at code-mixed text generation, we assume that the publicly available models such as BLOOM and Flan-T5-XXL are unable to code-mix due to the lack of code-mixed texts in the pretraining corpora and code-mixing tasks in the instruction-tuning datasets.Further investigation is warranted to understand the effects of code-mixed text in pretraining and instruction-tuning data on code-mixed text generation.

Presence of Synthetic Code-Mixed Data in Future Pretraining Data

As we advocate for the code-mixing ability in future generations of LLMs, we are aware of the potential risks of data feedback, where generative models that recursively train on data generated by previous generations may amplify biases and lose information about the tails of the original distribution (Shumailov et al., 2023; Taori and Hashimoto, 2022). Since these negative effects can be mitigated through human-generated content (Shumailov et al., 2023), it becomes imperative for the NLP community to collect natural code-mixed data for low-resource languages.

Ethical Considerations
Code-mixing reflects the linguistic, social, and cultural identity of a multilingual community.Researchers and practitioners should approach synthetic code-mixing with sensitivity and respect, and be cognizant of the potential risks of cultural appropriation or misrepresentation when generating code-mixed data using LLMs.Since LLMs are trained on web data, they may encode biases perpetuating stereotypes, discrimination, or marginalization of specific languages or communities.Prior work has also documented how synthetic data may play a role in feedback loops that amplify the presence of biased language generation (Taori and Hashimoto, 2022).Therefore, collaboration with linguists, language experts, and community representatives is necessary to avoid the unintentional perpetuation of stereotypes and cultural insensitivity.

A Languages Spoken in SEA
There are more than 1,200 languages spoken in SEA (Redmond et al., 2009; Maliwat, 2021), 700 of which are spoken in Indonesia (Aji et al., 2022; Cahyawijaya et al., 2022). We describe the SEA languages used in this study in the following paragraphs.
Mandarin Chinese Mandarin Chinese (zh-Hans), which belongs to the Sino-Tibetan language family and uses the Hanzi script, is widely spoken in SEA due to the migration of Chinese people from the coastal provinces of southeastern China, such as Fujian, Guangdong, and Hainan. People of Chinese heritage in SEA frequently use the term "华人" (huá rén) to express their cultural identity as an ethnic group, instead of "中国人" (zhōng guó rén), which is primarily associated with nationality, even though both terms can be translated as "Chinese (people)." Singapore has the largest Chinese ethnic group among all SEA countries, and Mandarin Chinese is one of the official languages of Singapore.
The language is characterized as linguistically "isolating" in that each Chinese character corresponds to one morpheme and the language uses very little grammatical inflection. It uses a logographic writing system, in which characters represent meaning. Chinese is also a tonal language with four pitched tones and one neutral tone. It commonly displays a basic SVO word order and, instead of conjugating verbs to express tense, uses aspect particles such as 了 (le) and 着 (zhe) to indicate the temporal location of the sentence.
Indonesian Indonesian (ind) is the national language of Indonesia (Indonesia, 2002). It is spoken by around 300 million speakers worldwide. Indonesian developed from the literary 'Classical Malay' of the Riau-Johor sultanate (Sneddon, 2003) and has many regional variants. Indonesian is written in Latin script with a lexical similarity of over 80% to Standard Malay. Indonesian is non-tonal and has 19 consonants, 6 vowels, and 3 diphthongs. Stress falls on the penultimate syllable, and the word order is SVO. It has three optional noun classifiers. Indonesian has two social registers and a rich affixation system, including a variety of prefixes, suffixes, circumfixes, and reduplication. Most of the affixes in Indonesian are derivational (Pisceldo et al., 2008).
Standard Malay Standard Malay (msa) is the national language of Malaysia, Brunei, and Singapore, and it is spoken by approximately 290 million speakers worldwide. The word order of Standard Malay is SVO, with four types of affixes: prefixes (awalan), suffixes (akhiran), circumfixes (apitan), and infixes (sisipan). Even though Standard Malay and Indonesian originate from the same Malay language and are mutually intelligible, they can differ in spelling and vocabulary. One example is loanwords: due to their different colonial histories under the Dutch and the British, Indonesian primarily absorbed Dutch loanwords whereas Malay absorbed English loanwords. The two languages can also assign different meanings to the same written words, commonly referred to as interlingual homographs. For instance, "polisi" means "police" in Indonesian but "policy" in Standard Malay.
Tagalog Tagalog (tgl) is an Austronesian language spoken in the Philippines by around 82 million native speakers. It is both agglutinative and pitch-accented, giving it rich and complex morphology (Kroeger, 1993). Tagalog's standardized form, known as Filipino, is the country's official national language. The difference between Filipino and Tagalog is more sociopolitical than sociolinguistic: Commonwealth Act No. 184 of 1936 created a national committee whose purpose was to "develop a national language." This resulted in the standardization of the Tagalog language into Filipino. In practice, Filipino is indistinguishable from Tagalog, albeit with the addition of the letters f, j, c, x, and z, plus loanwords (Commonwealth of the Philippines, 1936).
Vietnamese Vietnamese (vie), the national language of Vietnam, is spoken by around 85 million people worldwide. It is a tonal language belonging to the Austroasiatic language family and uses accents to denote six distinctive tones. The sentence structure of Vietnamese displays the SVO word order, and, due to heavy influence from Chinese, it also uses a rich set of classifiers that are required in the presence of quantifiers. For instance, instead of writing "bốn gà," which literally translates into "four chickens," it should be "bốn con gà," where "con" is a classifier for non-human animate things.
Tamil Tamil (tam) is a Dravidian language originating from Tamil Nadu and Sri Lanka. It is spoken by the sizeable Tamil diasporas of Singapore (2.5% of the population (Singapore, 2020)) and Malaysia (9% of the population (Schiffman, 1998)), which resulted from histories of trade, migration, indentured servitude, and civil unrest. Tamil is an official language of Singapore (Singapore, 2020), and the only one originating from India. Tamil is notably diglossic, maintaining a formal literary register alongside the colloquial one; it also lacks lexically distinctive stress and is non-rhotic (Armstrong). Tamil uses SOV sentence structure. Tamil-English code-mixing exhibits interesting linguistic phenomena such as nonce borrowing, wherein many nonce borrowings from English serve as objects of Tamil verbs, and vice versa (Sankoff et al., 1990).
Singlish Singlish is a widely used conversational language in Singapore. It is an English-based creole language that arose out of prolonged language contact between speakers of many different languages in the country, including Hokkien, Malay, Teochew, Cantonese, and Tamil. Singlish is spoken by around 4 million speakers, and one unique feature of the language is its heavy use of pragmatic particles borrowed from Southern Chinese dialects. One example is "lah," which in the sentence "Her dress is too short lah" emphasizes the statement.

B HuggingFace Inference API
We use HuggingFace's Inference API to prompt multilingual LLMs since we do not have sufficient local compute to host models with hundreds of billions of parameters, such as the 176B-parameter BLOOMZ model (Muennighoff et al., 2022). The text-to-text task is treated identically to a text-generation task, and we set max_new_tokens (the number of new tokens to generate) to 100, temperature to 0.7, and repetition_penalty to 1.2.
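As a rough sketch of what such a call looks like (the model ID, token placeholder, and helper names here are illustrative, not our exact setup; the JSON payload shape follows the Inference API's documented text-generation format):

```python
import json
from urllib import request

# Illustrative model ID; any hosted text-generation model works the same way.
API_URL = "https://api-inference.huggingface.co/models/bigscience/bloomz"

def build_payload(prompt):
    # Generation parameters described above.
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "repetition_penalty": 1.2,
        },
    }

def generate(prompt, token):
    # Sends the prompt to the hosted model; requires a valid API token.
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```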

C OpenAI Inference API
We use OpenAI's official API to prompt both davinci-003 and davinci-002. Specifically, we use openai.Completion.create with a maximum generation length of 128. We use the default values for all other parameters.
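A minimal sketch of such a request (the helper name and prompt are our own illustration; openai.Completion.create is the legacy completions call of the openai Python package):

```python
# Request parameters for openai.Completion.create; everything other than
# model, prompt, and max_tokens is left at its default.
def completion_kwargs(prompt, model="text-davinci-003"):
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 128,  # maximum generation length
    }

# Usage (requires the openai package and an API key):
# import openai
# response = openai.Completion.create(**completion_kwargs("Generate a ..."))
# text = response["choices"][0]["text"]
```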

D Flan-T5-XXL Non-English Outputs
We observe that when Flan-T5-XXL generates non-English outputs, most of them are nonsensical.
Here are some examples and their translations.
Indonesian: Ini adalah sebuah udara untuk pengobatan minyak dan di sekitar kehidupan.
Translation: This is an air for oil treatment and around life.
Malay: Artificial Intelligence adalah sebuah kantor keamanan yang digunakan untuk mengidentifikasi penduduk yang memiliki anak-anak dalam diri.
Translation: Artificial intelligence is a security office used for identifying residents who have children inside.
Tagalog: Weather niya ang nagsimula sa pagsasagawa ng kaniyang kargahan ng panahon.
Translation: It was his weather that started carrying out his weather load.
Vietnamese: Nhà ng tài ra mt ngi dy xut trn o trng h nhng ngi ng thng u c thit v.
Translation: The artist has created an outstanding talent in the field of talented people.

E BLOOMZ's Training Language Distribution
BLOOMZ is created by finetuning BLOOM (Scao et al., 2022), a multilingual 176B-parameter language model pretrained on the ROOTS corpus (Laurençon et al., 2022), on a collection of prompt instructions known as xP3 (Muennighoff et al., 2022). Table 1 and Table 2 show the proportions of the SEA languages investigated in our paper in the ROOTS and xP3 datasets, respectively. Even though Indonesian and Chinese are higher in proportion than Tamil, BLOOMZ code-mixes better in Tamil than in the former two languages, with around a 20% performance difference.

F Naturalness and Fluency Issues of ChatGPT's Generation
We document a non-exhaustive list of syntactic and semantic errors, as well as reasons for unnaturalness, in ChatGPT's generation in Table 4.

G Annotators and Inter-annotator Agreement
We have a total of 13 annotators, some of whom speak more than one SEA language. All of them are native speakers of their respective SEA languages, and most grew up in SEA. Many of our annotators are AI researchers and reside in the Global North.
All the annotators are the authors of the paper.
In Table 3, we report the inter-annotator agreement scores for naturalness annotations using Fleiss' Kappa κ (Fleiss, 1971), which measures the agreement between a fixed number of raters when assigning categorical ratings to items. It can be applied to settings with multiple annotators, and not all raters are required to annotate all items. The closer the score is to 1, the higher the agreement among annotators.
According to standard guidelines (Landis and Koch, 1977; Altman, 1990), the English-Indonesian annotations show fair agreement, English-Chinese and Singlish show substantial agreement, and English-Tagalog shows almost perfect agreement among the annotators.
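For concreteness, the statistic can be computed as follows; this is a minimal sketch using a toy rating matrix, not the paper's actual annotation data:

```python
# Fleiss' Kappa (Fleiss, 1971) for categorical ratings.

def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning item i to category j.
    Assumes every item is rated by the same number of raters n."""
    N = len(ratings)        # number of items
    n = sum(ratings[0])     # raters per item
    k = len(ratings[0])     # number of categories

    # Mean per-item observed agreement P_bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Chance agreement P_e from marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)

# Toy example: 2 items, 2 raters, 2 categories, perfect agreement
print(fleiss_kappa([[2, 0], [0, 2]]))  # → 1.0
```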

English-Indonesian
Unnatural phrasing "chance of hujan" and "cuacanya bakal cloudy" sound unnatural; they should be "chance of raining" and "cuacanya akan menjadi cloudy". Saya suka spend time bersama family saya, especially bila kita makan makanan yang sedap seperti nasi lemak or roti canai for breakfast. I like spending time with my family, especially when we eat delicious food such as nasi lemak and roti canai.

English-Malay
Gerund; Conjunction "suka spend time" should be changed to "suka spending time", as the word "suka" (like, v.) should be followed by a gerund. "or" should also be changed to "and".
The break from English "My family" to Tagalog "ay nagplano" is unnatural. When Tagalog is the matrix language, we use the Tagalog possessive, so the correct form would be "Ang family ko ay nagplano ..." Yesterday, tôi đã đi out với gia đình của mình để celebrate my parents' wedding anniversary.
Yesterday, I went out with my family to celebrate my parents' wedding anniversary.
English-Vietnamese Verb phrase Instead of "đi out", it should be either "đi" or "went out".
AI, you know, can do many things lah, like make our lives easier, but also can be very pai seh if we don't use it properly.
You know, while AI can do many things such as making our lives easier, it can also be very embarrassing if we don't use it properly. Singlish Incorrect use of Singlish expression: "pai seh" is a Hokkien word that describes a person feeling shy, sorry, or embarrassed. Using it to describe AI feeling embarrassed is inappropriate.
Eh, you know lah, family very important one leh, must always stick together like sotong and chilli sauce, can or not? Do you know that family is very important, so we must always stick together like squid and chili sauce?

Analogy
Using "sotong and chilli sauce" (squid and chili sauce) as an analogy to familial bond is an unnatural expression.No one in Singapore uses such an expression.
Traffic romba kasta pattu irukku today, it's taking forever to reach my destination.
Traffic is bad today; it's taking forever to reach my destination.

English-Tamil
Adjective "Traffic romba kasta pattu irukku today" means that the traffic is suffering, which is not the same as the traffic being congested. In English: Through the use of Artificial Intelligence in web applications, I can create content that is shared in both English and Tamil. (I do not have any information specific to Artificial Intelligence to mention.) Note: This sentence mixes both English and Tamil, using English for the terms "Artificial Intelligence" and "web applications" while using Tamil for the verb "create" and the phrase "shared in both English and Tamil". Code-mixing is common in bilingual communities and can enhance communication by incorporating the strengths of both languages. Data identification in artificial web applications is effective in any language, Artificial Intelligence is revolutionizing the way we interact with technology.

English-Tamil
Comma splice Both the Tamil and English independent clauses are joined by a comma, which is a grammatical error known as a comma splice. Now I am here eating is a bad mistake, but I still crave some hot fritters and pongal soup.
English-Tamil Different script system "Ippo naan inga" uses the Latin transliteration system, but the rest of the Tamil words use the Tamil script, which makes the writing unnatural.

Figure 1:
Figure 1: Depiction of SEA regions, which consist of a total of 11 countries. We prompt LLMs to generate code-mixed data of languages used in six South East Asian countries (colored in dark blue): Brunei, Indonesia, Malaysia, Philippines, Singapore, and Vietnam.

Figure 17:
Figure 17: Prompt templates with different language and topic fields, and responses from different LLMs containing code-mixed / non-code-mixed sentences. We also include a template where we specify the nationality of the speaker. Note that the explanations are part of ChatGPT's original generation.

Figure 5:
Figure 5: Failure cases when we prompt ChatGPT to imagine two bilingual speakers conversing about a specific topic.

Figure 6:
Figure 6: Naturalness of code-mixed (CM) text generated from ChatGPT. "Semi-Natural" indicates that the annotators can see someone else from another region/culture speaking it, whereas "Natural" indicates that the annotators can see themselves speaking it.

Figure 8:
Figure 8: Singlish sentence and its translation to Standard English, both generated by ChatGPT.

Figure 9:
Figure 9: ChatGPT's failure to follow the prompt. Underlined texts show its language misclassification.

Figure 10:
Figure 10: Underlined texts show wrong explanations about the code-mixed text. We italicize the code-mixed sentences to make them explicit to the reader.
Generation of synthetic code-mixed data to address the data scarcity problem has been explored previously. Solorio and Liu (2008), Winata et al. (2019), and Tan and Joty (2021) attempted to generate synthetic code-mixed sentences through word alignment and candidate selection from a parallel corpus. Liu et al. (2020) and Adilazuarda et al. (2022) similarly generated synthetic code-mixed sentences by replacing words in monolingual sentences with their machine-translated counterparts, whereas Pratapa et al. (2018), Rizvi et al. (2021) and Santy et al.

Figure 16: Prompt templates with different language and topic fields and responses from different LLMs containing code-mixed / non-code-mixed sentences. We also include a template where we specify the nationality of the speaker.


Xingwei He, Zhenghao Lin, Yeyun Gong, A Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen, et al. 2023. AnnoLLM: Making large language models to be better crowdsourced annotators. arXiv preprint arXiv:2303.16854.
Cliff Goddard. 2005. The Languages of East and Southeast Asia: An Introduction. Oxford University Press on Demand.
F. Grosjean. 1982. Life with Two Languages: An Introduction to Bilingualism. Harvard University Press.

Table 1:
Proportion of languages in the ROOTS corpus (Laurençon et al., 2022).

Table 2:
Proportion of languages in the xP3 datasets.

Table 3:
Inter-annotator agreement scores for the naturalness of ChatGPT's generated code-mixed text. N(annotators) indicates the total number of annotators, and κ refers to the Fleiss' Kappa agreement score.

Table 4:
Naturalness issues with explanations for ChatGPT's code-mixed text generation.