The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

Code-switching, a common phenomenon in written text and conversation, has been studied for decades by the natural language processing (NLP) research community. Initially, code-switching was explored intensively through the lens of linguistic theories; more recently, research has shifted toward machine-learning-oriented approaches to model development. We introduce a comprehensive systematic survey of code-switching research in NLP to trace the progress of the past decades and to conceptualize the challenges and tasks in this area. Finally, we summarize the trends and findings and conclude with a discussion of future directions and open questions for further investigation.


Introduction
Code-switching is the linguistic phenomenon in which multilingual speakers use more than one language in the same conversation (Poplack, 1978). The fraction of the worldwide population that can be considered multilingual, i.e., speaks more than one language, far outnumbers monolingual speakers (Tucker, 2001; Winata et al., 2021a). This alone makes a compelling argument for developing NLP technology that can successfully process code-switched (CSW) data. However, it was not until the last couple of years that CSW-related research became more popular (Sitaram et al., 2019; Jose et al., 2020; Dogruöz et al., 2021), and this increased interest has been motivated to a large extent by: 1) the need to process social media data: before the proliferation of social media platforms, code-switching was more commonly observed in spoken language than in written language; this is no longer the case, as multilingual users tend to combine the languages they speak on social media; and 2) the increasing release of voice-operated devices: now that smart assistants are becoming more and more accessible, we have started to realize that assuming users will interact with NLP technology as monolingual speakers is very restrictive and does not fulfill the needs of real-world users; multilingual speakers also prefer to interact with machines in a CSW manner (Bawa et al., 2020). We show quantitative evidence of the upward trend in CSW-related research in Figure 1.
In this paper, we present the first large-scale comprehensive survey of CSW NLP research in a structured manner by collecting more than 400 papers published in open repositories, such as the ACL Anthology and ISCA proceedings (see §2). We manually coded these papers to collect coarse- and fine-grained information (see §2.1) on CSW research in NLP, including the languages covered (see §3), the NLP tasks that have been explored, and new and emerging trends (see §4). In addition, motivated by the fact that linguistics, socio-linguistics, and related fields have studied CSW since the early 1900s, we also investigate to what extent theoretical frameworks from these fields have influenced NLP approaches (see §5) and how the choice of methods has evolved over time (see §5.4). Finally, we discuss the most pressing research challenges and identify a path forward to continue advancing this exciting line of work (see §6).

Table 1: Categories in the annotation scheme.
The area of NLP for CSW data is thriving, covering an increasing number of language combinations and tasks, and it is clearly advancing from a niche field to a common research topic, making our comprehensive survey timely. We expect the survey to provide valuable information to researchers new to the field and to motivate further work from researchers already engaged in NLP for CSW data.

Exploring Open Proceedings
To develop a holistic understanding of the trends and advances in CSW NLP research, we collect research papers on CSW from the ACL Anthology and ISCA proceedings. We focus on these two sources because they encompass the top venues for publishing in speech and language processing in our field. In addition, we also look into personal repositories from researchers in the community that contain curated lists of CSW-related papers. We discuss the search process for each source below.

ACL Anthology
We crawled the entire ACL Anthology repository up to October 2022. We then filtered papers using the following keywords related to CSW: "codeswitch", "code switch", "code-switching", "code-switched", "code-switch", "code-mix", "code-mixed", "codemixing", "code mix", "mixed-language", "mixedlingua", "mixed language", "mixed lingua", and "mix language".

ISCA Proceedings We manually reviewed the publicly available proceedings on the ISCA website and searched for papers related to CSW using the same set of keywords as above.
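As an illustration, the keyword-filtering step described above can be sketched in a few lines of Python. The keyword list mirrors the one used in the survey; the helper name and paper titles are our own illustration, not the authors' actual crawling code.

```python
# Hedged sketch of the survey's keyword-based filtering step: a paper is
# kept if its title (or abstract) contains any CSW-related keyword.

CSW_KEYWORDS = [
    "codeswitch", "code switch", "code-switching", "code-switched",
    "code-switch", "code-mix", "code-mixed", "codemixing", "code mix",
    "mixed-language", "mixedlingua", "mixed language", "mixed lingua",
    "mix language",
]

def is_csw_paper(text: str) -> bool:
    """Return True if the text mentions any CSW-related keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in CSW_KEYWORDS)

papers = [
    "A Survey of Code-Switching in NLP",
    "Neural Machine Translation for Low-Resource Languages",
    "Language Identification in Code-Mixed Social Media Text",
]
csw_papers = [p for p in papers if is_csw_paper(p)]
```

Substring matching keeps the filter simple but can over-match (e.g., "code switch" also matches "code switched"); in practice the retrieved set still needs the manual annotation pass described in §2.1.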
Web Resources To extend the coverage of paper sources, we also gathered data from existing curated repositories, which include multiple linguistics papers studying CSW.

Annotation Process
Three human annotators annotated all collected papers according to the categories shown in Table 1. All papers are coded by at least one annotator. To extract the specific information we are looking for, the annotator needs to read through the paper, as most of the information is not contained in the abstract. The full list of the annotations we collected is available in the Appendix (see §A).
To facilitate our analysis, we annotated the following aspects: • Languages: CSW is not restricted to pairs of languages; thus, we divide papers by the number of languages covered into bilingual, trilingual, and 4+ (at least four languages). For a more fine-grained classification, we categorize languages by geographical location (see Figure 2).
• Venues: There are multiple venues for CSW-related publications. We consider the following types of venue: conference, workshop, symposium, and book. As we discuss later, the publication venue is a reasonable indicator of the wider spread of CSW research in recent years.
• Papers: We classify papers by the type and nature of their contribution. We expect a high proportion of dataset/resource papers, as the lack of resources has been a major bottleneck in the past.
• Datasets: If the paper uses a dataset, we identify the source and modality (i.e., written text or speech) of the dataset.
• Methods: We identify the type of methods presented in the work.
• Tasks: We identify the downstream NLP tasks (including speech processing tasks) presented in the work.

Language Diversity
Here, we show the languages covered in CSW resources. While focusing on the CSW phenomenon increases the language diversity of NLP technology, as we will see in this section, future efforts are needed to provide significant coverage of the most common CSW language combinations worldwide.

Variety of Mixed Languages
Figure 3 shows the distribution of languages represented in the NLP for CSW literature. Most papers use datasets covering a single pair of languages. However, we did find a few papers that address CSW scenarios with more than two languages.

CSW in two languages
We group the number of publications focusing on bilingual CSW by world region in Figure 3 (bottom). We can see that the majority of research in CSW has focused on South Asian-English, especially Hindi-English, Tamil-English, and Malayalam-English, as shown in Table 2. The other common language pairs are Spanish-English and Chinese-English. That table also shows that many of the publications are shared task papers. This probably reflects efforts from a few research groups to motivate more research on CSW, such as the group behind the CALCS workshop series.

Trilingual
The number of papers addressing CSW in more than two languages is still small (see Figure 3, top) compared to the papers looking at pairs of languages. Not surprisingly, this smaller set of papers focuses on world regions where either there are more than two official languages or these languages are widely used in the region, for example, Arabic-English-French (Abdul-Mageed et al., 2020), Hindi-Bengali-English (Barman et al., 2016), Tulu-Kannada-English (Hegde et al., 2022), and Darija-English-French (Voss et al., 2014).
4+ Looking at papers that cover more than three languages, we found that many use the South East Asian Mandarin-English (SEAME) dataset (Lyu et al., 2010a), which contains Chinese dialects and Malay or Indonesian words. Most of the other datasets are machine-generated using rule-based or neural methods.

Language-Dialect Code-Switching
Based on Figure 4, we find some papers on language-dialect CSW, such as Chinese-Taiwanese Dialect (Chu et al., 2007; Yu et al., 2012) and Modern Standard Arabic (MSA)-Arabic Dialect (Elfardy and Diab, 2012; Samih and Maier, 2016; El-Haj et al., 2018). In these cases, the dialect is a variety of the language specific to the region where this CSW style is spoken.

Tasks and Datasets
In this section, we summarize our findings, focusing on CSW tasks and datasets. Table 3 shows the distribution of CSW tasks for ACL papers with at least ten publications. The two most popular tasks are language identification and sentiment analysis. Researchers mostly use the shared tasks from 2014 (Solorio et al., 2014) and 2016 (Molina et al., 2016) for language identification, and the SemEval 2020 shared task (Patwa et al., 2020) for sentiment analysis. For ISCA, the most popular tasks are, unsurprisingly, ASR and TTS. This strong correlation between task and venue shows that the speech processing and *CL communities remain somewhat fragmented, working in isolation from one another for the most part.

A wide range of datasets has been explored in CSW research. Public datasets such as HinGE (Srivastava and Singh, 2021b) and SEAME (Lyu et al., 2010a), as well as shared task datasets (Solorio et al., 2014; Molina et al., 2016; Aguilar et al., 2018; Patwa et al., 2020), have been widely used. Some work, however, uses new datasets that are not publicly available, thus hindering adoption (see Table 4). There are two well-known benchmarks in CSW: LinCE (Aguilar et al., 2020) and GLUECoS (Khanuja et al., 2020b). These two benchmarks cover a handful of tasks and are built to encourage transparency and reliability of evaluation, since the test set labels are not publicly released; evaluation is performed automatically on their websites. However, the languages they support are mostly limited to popular CSW language pairs, such as Spanish-English, Modern Standard Arabic-Egyptian Arabic, and Hindi-English, the exception being Nepali-English in LinCE.
Dataset Source Table 5 shows statistics on dataset sources in the CSW literature. We found that most ACL papers work with social media data. This is expected, considering that social media platforms host informal interactions among users, making them natural places for users to code-switch. Naturally, most ISCA papers work with speech data, much of which consists of recordings of conversations and interviews. Some datasets come from speech transcription, news, dialogues, books, government documents, and treebanks. Some papers only release the URLs or IDs needed to download the data, especially for datasets from social media (e.g., Twitter), since redistribution of the actual tweets is not allowed (Solorio et al., 2014; Molina et al., 2016), which makes reproducibility harder: social media users can delete their posts at any time, resulting in considerable data attrition. Very few papers present demos, theoretical work, position papers, or new evaluation metrics.

Paper Category
From Linguistics to NLP

Notably, a number of papers pursue approaches inspired by linguistic theories to enhance the processing of CSW text. In this survey, we find three linguistic constraints used in the literature: the Equivalence Constraint, the Matrix-Embedded Language Framework (MLF), and the Functional Head Constraint. In this section, we briefly introduce these constraints and list the papers that utilize them.

Linguistic-Driven Approaches
Equivalence Constraint In a well-formed code-switched sentence, switching takes place at points where the grammatical constraints of both languages are satisfied (Poplack, 1980). Li and Fung (2012, 2013) incorporate this syntactic constraint into a statistical code-switch language model (LM) and evaluate the model on Chinese-English code-switched speech recognition. In the same line of work, Pratapa et al. (2018a) and Pratapa and Choudhury (2021) implement the constraint for Hindi-English CSW data by producing parse trees of parallel sentences and matching the surface order of child nodes in the trees. Winata et al. (2019c) apply the constraint to generate synthetic CSW text and find that combining real CSW data with synthetic CSW data can effectively improve perplexity. They treat parallel sentences as a linear structure and only allow switching on non-crossing alignments.
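The linearized, alignment-based reading of the Equivalence Constraint can be sketched as follows: switching is only allowed at positions where the word alignment does not cross, so the surface order of both languages stays intact. The sentences, alignment, and function names below are toy illustrations, not the surveyed systems' actual code.

```python
# Hedged sketch of equivalence-constraint-style CSW generation over
# word-aligned parallel sentences: a switch point is valid only when
# every alignment link on its left lands before every link on its right.

def non_crossing_switch_points(alignment):
    """Return source positions where a switch keeps alignments monotonic."""
    points = []
    for k in range(1, len(alignment)):
        left = [j for i, j in alignment if i < k]
        right = [j for i, j in alignment if i >= k]
        # Valid if no alignment link crosses the proposed boundary.
        if left and right and max(left) < min(right):
            points.append(k)
    return points

def generate_csw(src_words, tgt_words, alignment, k):
    """Keep source words before position k, switch to target words after."""
    tgt_indices = sorted(j for i, j in alignment if i >= k)
    return src_words[:k] + [tgt_words[j] for j in tgt_indices]

src = ["I", "like", "green", "tea"]          # English
tgt = ["me", "gusta", "el", "té", "verde"]   # Spanish
# The adjective-noun alignment crosses: green<->verde, tea<->té.
alignment = [(0, 0), (1, 1), (2, 4), (3, 3)]

points = non_crossing_switch_points(alignment)
```

Here `points` excludes position 3: switching inside the noun phrase "green tea" is blocked because Spanish places the adjective after the noun, mirroring the intuition that switches must respect both grammars.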
Matrix-Embedded Language Framework (MLF) Myers-Scotton (1997) proposed that in bilingual CSW there exists an asymmetrical relationship between the dominant matrix language and the subordinate embedded language. The matrix language provides the frame of the sentence by governing all or most of the grammatical morphemes as well as word order, whereas syntactic elements that bear no or only limited grammatical function can be provided by the embedded language (Johanson, 1999; Myers-Scotton, 2005).

Functional Head Constraint Belazi et al. (1994) posit that it is impossible to switch languages between a functional head and its complement because of the strong relationship between the two constituents. Li and Fung (2014) apply the constraint to the LM by first expanding the search network with a translation model and then using parsing to restrict paths to those permissible under the constraint.
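The MLF intuition described above, a matrix-language frame with embedded-language content insertions, can be sketched minimally. The POS tags, toy lexicon, and romanized Hindi-English sentence are our own illustrative assumptions, not a surveyed system.

```python
# Hedged MLF-style generation sketch: the matrix language fixes word
# order and grammatical morphemes; selected content words are supplied
# by the embedded language where a translation is available.

CONTENT_TAGS = {"NOUN", "ADJ", "VERB"}  # lexical, not grammatical, elements

def mlf_generate(tagged_matrix_sentence, embedded_lexicon):
    """Replace content words with embedded-language forms where available."""
    output = []
    for word, tag in tagged_matrix_sentence:
        if tag in CONTENT_TAGS and word in embedded_lexicon:
            output.append(embedded_lexicon[word])  # embedded-language insertion
        else:
            output.append(word)  # matrix language keeps the frame
    return output

# Toy Hindi (matrix, romanized) with an English (embedded) insertion.
sentence = [("mujhe", "PRON"), ("naya", "ADJ"), ("phone", "NOUN"),
            ("chahiye", "VERB")]
lexicon = {"naya": "new"}  # only the adjective is switched
mixed = mlf_generate(sentence, lexicon)
```

Note the asymmetry: pronouns and verb morphology stay in the matrix language even when content words switch, which is exactly what the MLF predicts.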

Learning from Data Distribution
Linguistic constraint theories have been used for decades to generate synthetic CSW sentences to address the lack of data. However, this approach requires external word alignments or constituency parsers, which can produce erroneous results. Instead of applying linguistic constraints to generate new synthetic CSW data, Winata et al. (2019c) build a pointer-generator model to learn the real distribution of code-switched data. Chang et al. (2019) propose generating CSW sentences from monolingual sentences using a Generative Adversarial Network (GAN) (Goodfellow et al., 2020), where the generator learns to predict CSW points without any linguistic knowledge.
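The data-driven alternative can be illustrated with a deliberately simple stand-in: rather than hand-coded constraints, estimate where switches actually occur from real CSW data (here, word-level language-ID sequences) and reuse that empirical switch-point distribution. The corpus and bucketing scheme are toy assumptions; the surveyed systems learn this distribution neurally.

```python
# Hedged sketch: empirical switch-point statistics from language-ID
# sequences, bucketed by relative position in the sentence.

from collections import Counter

def switch_point_distribution(tag_sequences):
    """Relative frequency of a language switch at each relative position."""
    counts, totals = Counter(), Counter()
    for tags in tag_sequences:
        for k in range(1, len(tags)):
            bucket = round(k / len(tags), 1)  # position as fraction of length
            totals[bucket] += 1
            if tags[k] != tags[k - 1]:
                counts[bucket] += 1
    return {b: counts[b] / totals[b] for b in totals}

corpus = [
    ["en", "en", "hi", "hi"],
    ["en", "en", "en", "hi"],
    ["hi", "hi", "en", "en"],
]
dist = switch_point_distribution(corpus)
```

In this toy corpus, switches concentrate mid-sentence; a generator sampling from such a distribution needs no parser or alignment, which is the appeal of the learned approaches.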

The Era of Statistical Methods
Research on CSW has also been influenced by the progress of machine learning. According to Figure 5, statistical methods began to be adopted in CSW research in 2006; before that year, approaches were mainly rule-based. Common statistical methods for text classification in the literature include Naive Bayes (Solorio and Liu, 2008a) and Support Vector Machines (SVM) (Solorio and Liu, 2008b). Conditional Random Fields (CRF) (Sutton et al., 2012) are also widely used for sequence labeling, such as part-of-speech (POS) tagging (Vyas et al., 2014), named entity recognition (NER), and word-level language identification (Lin et al., 2014; Chittaranjan et al., 2014; Jain and Bhat, 2014). HMM-based models have been used in speech-related tasks, such as speech recognition (Weiner et al., 2012a; Li and Fung, 2013) and speech synthesis (Qian et al., 2008; Shuang et al., 2010; He et al., 2012).
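As a flavor of the statistical era, a word-level language identifier can be built from scratch with a character-bigram Naive Bayes classifier. The tiny training set and add-one smoothing are illustrative assumptions, not a reproduction of any surveyed system.

```python
# Hedged sketch: character-bigram Naive Bayes for word-level language ID.

import math
from collections import Counter, defaultdict

def bigrams(word):
    padded = f"#{word}#"  # pad so word-initial/final characters count
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

class NaiveBayesLID:
    def __init__(self, labeled_words):
        self.counts = defaultdict(Counter)  # language -> bigram counts
        self.vocab = set()
        for word, lang in labeled_words:
            for bg in bigrams(word.lower()):
                self.counts[lang][bg] += 1
                self.vocab.add(bg)

    def predict(self, word):
        def score(lang):
            total = sum(self.counts[lang].values())
            # Add-one (Laplace) smoothing over the bigram vocabulary.
            return sum(
                math.log((self.counts[lang][bg] + 1) / (total + len(self.vocab)))
                for bg in bigrams(word.lower())
            )
        return max(self.counts, key=score)

train = [("hello", "en"), ("water", "en"), ("their", "en"),
         ("hola", "es"), ("agua", "es"), ("gracias", "es")]
lid = NaiveBayesLID(train)
```

Real systems of the period added contextual features (neighboring labels via CRFs, dictionaries, casing), but the character-level signal sketched here already carries much of the discriminative power for word-level LID.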

Utilizing Neural Networks
Following general NLP trends, we see the adoption of neural methods and pre-trained models growing in popularity over time, while statistical and rule-based approaches are diminishing. Compared to ISCA, we see more adoption of pre-trained models in ACL venues. This is because ACL work is more text-focused, and pre-trained LMs are more widely available for text.
Neural-Based Models Figure 5 shows that the trend of using neural-based models started in 2013, and the use of rule-based/linguistic-constraint and statistical methods has gradually diminished over time, though they are still used at a low rate. RNN and LSTM architectures are commonly used in sequence modeling, such as language modeling (Adel et al., 2013; Vu and Schultz, 2014; Adel et al., 2014c; Winata et al., 2018a; Garg et al., 2018a; Winata et al., 2019c) and CSW identification (Samih et al., 2016a). DNN-based and hybrid HMM-DNN models are used in speech recognition (Yilmaz et al., 2018; Yılmaz et al., 2018).
Pre-trained Embeddings Pre-trained embeddings are used to complement neural-based approaches by initializing the embedding layer. Common pre-trained embeddings in the literature are monolingual subword-based embeddings such as fastText (Joulin et al., 2016) and aligned embeddings such as MUSE (Conneau et al., 2017). A standard method for utilizing monolingual embeddings is to concatenate or sum two or more embeddings from different languages (Trivedi et al., 2018). A more recent approach applies an attention mechanism to merge embeddings into meta-embeddings (Winata et al., 2019a,b). Character-based embeddings have also been explored to address out-of-vocabulary issues in word embeddings (Winata et al., 2018b; Attia et al., 2018; Aguilar et al., 2021). Another approach is to train bilingual embeddings on real and synthetic CSW data (Pratapa et al., 2018b). In the speech domain, Lovenia et al. (2022) use wav2vec 2.0 (Baevski et al., 2020) as a starting model before fine-tuning.
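The attention-based meta-embedding idea can be sketched without any deep learning framework: embeddings of the same word from different monolingual spaces are merged with softmax-normalized attention weights instead of being concatenated or summed. The toy vectors and fixed scores below are illustrative assumptions; in the actual models the scores come from a trained scoring layer.

```python
# Hedged sketch of attention-based meta-embeddings: an attention-weighted
# sum of per-language embeddings for a single word.

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def meta_embedding(embeddings, scores):
    """Merge equal-length vectors (one per language space) by attention.

    scores: unnormalized attention score per embedding, assumed given here.
    """
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * vec[d] for w, vec in zip(weights, embeddings))
            for d in range(dim)]

en_vec = [1.0, 0.0, 2.0]   # toy English-space embedding of a word
es_vec = [0.0, 1.0, 2.0]   # toy Spanish-space embedding of the same word
merged = meta_embedding([en_vec, es_vec], scores=[0.0, 0.0])
# equal scores -> weights 0.5/0.5 -> merged == [0.5, 0.5, 2.0]
```

The appeal for CSW is that the attention weights can shift per token, letting the model lean on the embedding space of whichever language the current word belongs to.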
Language Models Many pre-trained model approaches utilize multilingual LMs, such as mBERT or XLM-R, to deal with CSW data (Khanuja et al., 2020b; Aguilar and Solorio, 2020; Pant and Dadu, 2020; Patwa et al., 2020; Winata et al., 2021a). These models are often fine-tuned on the downstream task or on CSW text to better adapt to the languages. Some downstream fine-tuning approaches use synthetic CSW data due to the lack of available datasets. Aguilar et al. (2021) propose a character-based subword module (char2subword) for mBERT that learns subword embeddings suitable for modeling noisy CSW text. Winata et al. (2021a) compare the performance of multilingual LMs versus language-specific LMs in the CSW context. While XLM-R provides the best results, it is also computationally heavy, and larger models remain underexplored. We see that pre-trained LMs provide better empirical results on current benchmark tasks and enable an end-to-end approach. Therefore, one can theoretically work on CSW tasks without any linguistic understanding of the languages involved, assuming a dataset for model fine-tuning is available. However, the downside is that there is little understanding of how and when these LMs fail; we therefore encourage more interpretability work on LMs in the CSW setting.
Recent Challenges and Future Directions

More Diverse Exploration on Code-Switching Styles and Languages
A handful of language pairs, such as Spanish-English, Hindi-English, and Chinese-English, dominate CSW research and resources. However, many countries and cultures rich in CSW usage remain under-represented in NLP research (Joshi et al., 2020; Aji et al., 2022; Yong et al., 2023), especially with respect to different CSW variations. CSW style can vary across regions of the world, and gathering more datasets on unexplored styles would be useful for further research in linguistics and NLP. One future direction is therefore to broaden the language scope of CSW research.

Datasets: Access and Sources
According to our findings, more than 60% of the datasets are private (see Table 4) and have not been released to the public. This hampers the progress of CSW research, particularly the reproducibility, credibility, and transparency of results. Moreover, many studies in the literature do not release the code needed to reproduce their work. We therefore encourage researchers who build a new corpus to release it publicly. In addition, the fact that some researchers only provide URLs to download the data is also problematic due to the data attrition issue raised earlier. Data attrition is bad for reproducibility, and it also wastes annotation effort. Perhaps we should work on identifying alternative means to collect written CSW data in an ecologically valid manner.

Model Scaling
To the best of our knowledge, little work has investigated how well scaling laws hold for code-mixed datasets.

Zero-Shot and Few-Shot Exploration
The majority of pre-trained model approaches fine-tune their models on the downstream task. However, CSW data remains considerably limited. With the rise of multilingual LMs, especially those fine-tuned with prompts/instructions (Muennighoff et al., 2022; Ouyang et al., 2022; Winata et al., 2022), one direction is to examine whether these LMs can handle CSW input in a zero-shot fashion. This direction might also tie in with model scaling, since larger models have shown better capability in zero-shot and few-shot settings (Winata et al., 2021b; Srivastava et al., 2022).

Robustness Evaluation
Since CSW is a widespread linguistic phenomenon, we argue that cross-lingual NLP benchmarks, such as XGLUE (Liang et al., 2020) and XTREME-R (Ruder et al., 2021), should incorporate CSW evaluation (Aguilar et al., 2020; Khanuja et al., 2020b). One reason is that CSW is a cognitive ability that multilingual human speakers perform with ease (Beatty-Martínez et al., 2020). CSW evaluation also examines the robustness of multilingual LMs in learning cross-lingual alignment of representations (Conneau et al., 2020; Libovický et al., 2020; Pires et al., 2019; Adilazuarda et al., 2022). On the other hand, catastrophic forgetting is observed both in pre-trained models (Shah et al., 2020) and in human speakers in a CSW environment (Hirvonen and Lauttamus, 2000; Du Bois, 2009), where it is known as language attrition. We argue that fine-tuning LMs on code-mixed data is a form of continual learning that can produce a more generalized multilingual LM. Thus, we encourage CSW research to report the performance of fine-tuned models on both CSW and monolingual text.

Task Diversity
We encourage creating reasoning-based tasks for CSW text for two reasons. First, code-mixed datasets for tasks such as NLI, coreference resolution, and question answering are far fewer than those for tasks such as sentiment analysis, part-of-speech tagging, and named entity recognition. Second, comprehension tasks with CSW text incur greater processing costs for human readers (Bosma and Pablos, 2020).

Conversational Agents
There has been a recent focus on developing conversational agents with LMs and speech models such as ChatGPT and Whisper (Radford et al., 2022). An open question is how well such agents can understand and produce CSW input in conversation.

Conclusion
We present a comprehensive systematic survey of code-switching research in natural language processing to explore the progress of the past decades and to understand the existing challenges and tasks in the literature. We summarize the trends and findings and conclude with a discussion of future directions and open questions for further investigation.
We hope this survey encourages NLP researchers and guides them toward promising directions in code-switching research.

Limitations
The numbers in this survey are limited to papers published in the ACL Anthology and ISCA Proceedings. However, we also included papers from other sources as related work when they are publicly available and accessible. In addition, our annotation scheme does not include the code-switching type (i.e., intra-sentential, inter-sentential, etc.), since some papers do not provide this information.

Ethics Statement
We use publicly available data with permissive licenses in our survey. We see no potential ethical issues in this work.

A Annotation Catalog
We release the annotation of all papers we use in the survey.

A.1 *CL Anthology
Bilingual Table 7 shows the annotation for papers with African-English languages. Table 8 shows the annotation for papers with East Asian-English languages. Table 9 shows the annotation for papers with European-English languages. Table 10 shows the annotation for papers with Middle Eastern-English languages. Table 11 and Table 12 show the annotation for papers with South Asian-English languages. Table 13 shows the annotation for papers with South East Asian-English languages. Table 14 shows the annotation for papers with a combination of a language with a dialect. Table 15 shows the annotation for papers with languages in the same family. Table 16 shows the annotation for papers with languages in different families.
Trilingual Table 17 shows the annotation for papers with three languages.
4+ Table 18 shows the annotation for papers with four or more languages.

A.2 ISCA Proceeding
Bilingual Table 19 shows the annotation for papers with African-English languages. Table 20 shows the annotation for papers with East Asian-English languages. Table 21 shows the annotation for papers with European-English languages. Table 22 shows the annotation for papers with Middle Eastern-English languages. Table 23 shows the annotation for papers with South Asian-English languages. Table 24 shows the annotation for papers with South East Asian-English languages. Table 25 shows the annotation for papers with a combination of a language with a dialect. Table 26 shows the annotation for papers with languages in the same family.
Table 27 shows the annotation for papers with languages in different families.
Trilingual Table 28 shows the annotation for papers with three languages.
4+ Table 29 shows the annotation for papers with four or more languages.

Figure 1 :
Figure 1: Number of publications over time in *CL and ISCA venues. We collected the papers in October 2022; papers published after that date are not included. Top: relative to all *CL and ISCA papers. Bottom: absolute number, broken down into conferences vs. workshops. The graphs do not show publications in journals or symposiums.

Figure 2 :
Figure 2: Language categories. *NE denotes Non-English. We show fine-grained categories in green and blue.

Figure 3 :
Figure 3: (Top) Number of publications across types of language combination (bilingual, trilingual, or 4+). (Bottom) Number of publications in the fine-grained bilingual category with English as the L2 language.

Figure 4 :
Figure 4: Number of publications on bilingual code-switched languages that do not include English. *msa stands for Modern Standard Arabic. The first two are combinations of a language with its dialect.

Table 2 :
Most common code-switching languages in *CL and ISCA venues.
‡The count does not include the dialect or South East Asian Mandarin-English (Lyu et al., 2010a), since they contain more than two languages (i.e., words from Chinese dialects).

Table 3 :
Most common tasks in ACL venues. ST denotes shared task.

Table 4 :
Publications that introduce new corpus.

Table 5 :
The source of the CSW dataset in the literature.

Table 6 :
Paper Type of the CSW papers.

Table 8 :
*CL Catalog in East Asian-English.

Table 10 :
*CL Catalog in Middle Eastern-English.

Table 19 :
ISCA Catalog in African-English.