SentMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Sentiment Analysis

Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most available datasets contain code-mixing between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data between three languages: Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L. We show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.


Introduction
Code-mixing and code-switching are very commonly observed in both text and speech. Code-mixing is the practice of using words from multiple languages within a single utterance, sentence, or discourse, while code-switching refers to the deliberate alternation between multiple languages within the same context (Thara and Poornachandran, 2018). The former is spontaneous and the latter is purposeful. Both, however, are widely observed in bilingual and multilingual communities.
According to Anastassiou (2017), several factors are behind these two phenomena, including social, convenience, linguistic, and cognitive reasons. Socially, code-mixing often serves as a sign of group identity, which allows individuals to navigate multiple social and cultural affiliations. Linguistically, it is very common to find no word for a specific term in one language, whereas a word from another language communicates the idea better. Additionally, there are several cases, even in a monolingual community, when code-mixing is the most convenient way to express something.
*These two authors contributed equally to this work.
In most occurrences, code-mixing is bilingual. In an earlier study, Byers-Heinlein and Lew-Williams (2013) state that it is very likely that by the year 2035, over half of the children enrolled in kindergarten will have grown up speaking a language other than English. Another study by Jeffery and van Beuningen (2020) shows that bilingualism is very common in practice in European countries like Germany, Spain, and Italy. However, in cosmopolitan cities and areas like New York, London, Singapore, and others, code-mixing with three or even more languages is fairly common. This is also observed in countries like Luxembourg and in regions such as West Bengal and South-East India, where more than two languages are commonly used on a daily basis.
Several research works have been conducted on building code-mixed datasets and performing several downstream tasks on such datasets. These datasets include both synthetic and natural ones. However, most of them are bilingual in nature. In this paper, we present SentMix-3L, a Bangla-English-Hindi dataset annotated for sentiment analysis.
The main contributions of our work are as follows:
• We introduce SentMix-3L, a novel three-language code-mixed test dataset with gold-standard labels in Bangla-Hindi-English for the task of sentiment analysis, containing 1,007 instances.
• We provide a comprehensive experimental analysis with several monolingual, bilingual, and multilingual models on SentMix-3L.
We present this dataset exclusively as a test set due to the unique and specialized nature of the task. Such data is very difficult to gather and requires significant expertise to access. The size of the dataset, while limiting for training purposes, offers a high-quality testing environment with gold-standard labels that can serve as a benchmark in this domain. Given the scarcity of similar datasets and the challenges associated with data collection, SentMix-3L provides an important resource for the rigorous evaluation of text-based models, filling a critical gap in multi-level code-mixing research.
In our experiments, we also prepare synthetic training and development datasets to evaluate several models.

Related Work
There have been some works conducted on Bangla-English, Hindi-English, and Bangla-Hindi code-mixing and code-switching separately. Most of them are case studies and surveys that show the common occurrences of Bangla-English (Alam, 2006; Hasan et al., 2015; Hossain and Bar, 2015; Begum and Haque, 2013; Mahbub-ul Alam and Quyyum, 2016), Hindi-English (Singh, 1985; Bali et al., 2014; Thara and Poornachandran, 2018), and Bangla-Hindi (Ali et al., 2019; Jose et al., 2020) code-mixing in a wide variety of areas and situations.
Few works address sentiment analysis for these types of cases. Khan et al. (2022) present a Bangla-English sentiment analysis dataset primarily related to COVID-19. Their dataset is called CoVaxBD and contains 1113 samples. Their experiments show that the best result on the dataset is obtained by BERT, with a development accuracy of 97.3%. However, much of the data is purely in Bangla, and they experiment only with BERT and multilingual BERT (Devlin et al., 2019) while providing development accuracy as the only performance metric. Another recent work by Tareq et al. (2023) presents 18,074 Bangla-English code-mixed samples collected online for the purpose of sentiment analysis. They augment the dataset using their own approach and achieve their best result, an 87% weighted F1 score, with XGBoost and fastText embeddings. Their experiments lack an evaluation of transformer-based models.
An early work by Sitaram et al. (2015) focuses on a Hindi-English code-mixing dataset for sentiment analysis. However, it contains 345 data samples in total and only 180 of them are code-mixed. They get their best result of 91.01% accuracy using a Recursive Neural Tensor Network (RNTN). Joshi et al. (2016) compile a dataset of 3879 texts and get their best results using a subword-LSTM approach. The work of Yadav et al. (2020) includes a dataset of 6357 texts, on which a Bi-LSTM gives their best result. However, none of these works report how transformer-based models perform on their datasets.
In summary, there are no works or datasets on sentiment analysis for code-mixed text combining Bangla, English, and Hindi. SentMix-3L is a novel addition to this particular domain of research.

The SentMix-3L Dataset
In generating the dataset, we choose a controlled data collection method, asking volunteers to freely contribute data in Bangla, English, and Hindi. This decision stems from several challenges of extracting such specific code-mixed data from the vast corpus available on social media or other online platforms. While the data are not rare, identifying and isolating them from large, unstructured corpora is a very labor-intensive and error-prone process. Our approach ensures data quality and sidesteps the ethical concerns associated with using publicly available online data. Datasets of this type are often used when the target data is very difficult to mine from existing corpora. For example, semi-natural datasets like Databricks (2023) and Nie (2023) have become popular for fine-tuning LLMs on instructions and conversations.
Data Collection We recruit a group of 10 undergraduate students who are fluent in all three languages across all four language skills: listening, reading, writing, and speaking. We ask each of them to prepare 250 to 300 social media posts or tweets. They are allowed to use any language, including Bangla, English, and Hindi, to prepare posts on several daily topics like politics, sports, education, social media rumors, etc. We also ask them to switch languages if and wherever they feel comfortable doing so. The inclusion of emojis, hashtags, and transliteration is also encouraged. The students had the flexibility to prepare the data as naturally as possible. Upon completion of this stage, we filter 1863 samples that contain at least one word or subword from each of the three languages using langdetect (Mazzocchi, 2012), an open-source Python tool for language detection.
Data Annotation We annotate the dataset in two steps to prepare high-quality labels. First, we recruit three students from social science, computer science, and linguistics as annotators, who are also fluent in all three languages across all four language skills. They annotate all 1863 samples with one of three labels (Positive, Neutral, and Negative), with a raw agreement of 65.3%. We keep only the 1182 samples on which all three annotators agree. Second, we gather a second group of annotators consisting of two NLP researchers with the same level of fluency and skills. After their annotation, we calculate a raw agreement of 0.85 and a Cohen's kappa score of 0.78, and keep only the samples on which both annotators agree. After the two stages, we end up with a total of 1007 samples.
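The agreement statistics used in the second annotation stage can be illustrated with a short sketch. It assumes each annotator's labels are stored as a list aligned with the samples; the function names (`raw_agreement`, `cohen_kappa`, `keep_unanimous`) are hypothetical helpers for illustration and are not taken from the authors' code.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two annotators give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = raw_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def keep_unanimous(samples, *label_lists):
    """Keep (sample, label) pairs on which every annotator agrees."""
    return [
        (sample, labels[0])
        for sample, labels in zip(samples, zip(*label_lists))
        if len(set(labels)) == 1
    ]


if __name__ == "__main__":
    ann1 = ["Positive", "Negative", "Neutral", "Positive"]
    ann2 = ["Positive", "Negative", "Positive", "Positive"]
    print(raw_agreement(ann1, ann2), cohen_kappa(ann1, ann2))
    print(keep_unanimous(["t1", "t2", "t3", "t4"], ann1, ann2))
```

The same `keep_unanimous` helper works for the first stage by passing three label lists instead of two.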


Synthetic Train and Development Set
We present SentMix-3L as a test dataset; hence, for experimental purposes, we build a synthetic train and development set that contains code-mixing for Bangla, English, and Hindi. We take the Amazon Review Dataset (Ni et al., 2019) as seed data and pick 100K data instances randomly. The dataset labels are ratings on a 1 to 5 scale. We convert them into Positive (rating > 3), Neutral (rating = 3), and Negative (rating < 3) for our task. We choose an equal number of instances for the Positive, Neutral, and Negative labels. We then use two separate methodologies, the Random Code-mixing Algorithm by Krishnan et al. (2021) and r-CM by Santy et al. (2021), to generate the synthetic code-mixed dataset.
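The rating-to-label conversion and the class-balanced selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: reviews are given as `(text, rating)` pairs, and `balanced_sample`, `per_label`, and `seed` are hypothetical names; the code-mixing step itself (Krishnan et al., 2021; Santy et al., 2021) is not reproduced here.

```python
import random
from collections import defaultdict

LABELS = ("Positive", "Neutral", "Negative")


def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment label using the stated thresholds."""
    if rating > 3:
        return "Positive"
    if rating == 3:
        return "Neutral"
    return "Negative"


def balanced_sample(reviews, per_label, seed=0):
    """Draw the same number of reviews for each label.

    `reviews` is an iterable of (text, rating) pairs; `per_label` and `seed`
    are illustrative parameters, not values reported in the paper.
    """
    by_label = defaultdict(list)
    for text, rating in reviews:
        by_label[rating_to_label(rating)].append(text)

    rng = random.Random(seed)
    sample = []
    for label in LABELS:
        pool = by_label[label]
        if len(pool) < per_label:
            raise ValueError(f"not enough {label} reviews: {len(pool)}")
        sample.extend((text, label) for text in rng.sample(pool, per_label))
    return sample


if __name__ == "__main__":
    toy = [("great", 5), ("good", 4), ("ok", 3), ("meh", 3), ("bad", 2), ("awful", 1)]
    for text, label in balanced_sample(toy, per_label=2):
        print(label, text)
```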
Bilingual Models BanglishBERT (Bhattacharjee et al., 2022) and HingBERT (Nayak and Joshi, 2022) are used as bilingual models. They are trained on Bangla-English and Hindi-English respectively, and are thus suited to code-mixing tasks that involve any two of these languages.
Multilingual Models We use mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) as multilingual models, which are trained on 104 and 100 languages respectively. Both models cover Bangla, Hindi, and English, which makes them suitable for our trilingual domain. Moreover, we use IndicBERT (Kakwani et al., 2020) and MuRIL (Khanuja et al., 2021), which cover 12 and 17 Indian languages respectively, including Bangla, English, and Hindi. The language coverage of these two models justifies their inclusion for our targeted trilingual code-mixing task.
We also perform hyper-parameter tuning while using all the models to prevent overfitting and ensure optimal F1 score.
Prompting We use prompting with the GPT-3.5-turbo model (OpenAI, 2023) from OpenAI for this task. We use the API for zero-shot prompting (see Figure 1) and ask the model to label the test set.
Role: "You are a helpful AI assistant. You are given the task of sentiment analysis.
Definition: Sentiment Analysis is the process of identifying and categorizing opinions expressed in a piece of text. You will be given a text to label either 'Positive' or 'Negative' or 'Neutral'.
Task: Generate the label for this "text" in the following format: <label> Your_Predicted_Label <\label>. Thanks."
Additionally, we run the same experiments separately on the synthetic and natural datasets, splitting each 60-20-20 into training, development, and test sets.
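As an illustration of how a response in the requested `<label> ... <\label>` format might be handled, the sketch below builds the prompt and extracts the predicted label from a model reply. It makes no API call. The way the input text is appended to the prompt, the tolerance for a `</label>` closing tag, and the names `build_prompt` and `parse_label` are assumptions for illustration, not details given in the paper.

```python
import re

# Prompt wording follows the zero-shot prompt quoted above.
PROMPT_TEMPLATE = (
    "You are a helpful AI assistant. You are given the task of sentiment analysis.\n"
    "Definition: Sentiment Analysis is the process of identifying and categorizing "
    "opinions expressed in a piece of text. You will be given a text to label "
    "either 'Positive' or 'Negative' or 'Neutral'.\n"
    "Task: Generate the label for this \"text\" in the following format: "
    "<label> Your_Predicted_Label <\\label>. Thanks."
)

LABELS = ("Positive", "Negative", "Neutral")

# Accept both "<\label>" (as in the prompt) and "</label>" as the closing tag.
_LABEL_RE = re.compile(r"<label>\s*(.*?)\s*<[\\/]label>", re.IGNORECASE | re.DOTALL)


def build_prompt(text):
    """Attach the post to the prompt (the exact placement is an assumption)."""
    return f"{PROMPT_TEMPLATE}\ntext: {text}"


def parse_label(response):
    """Return one of LABELS, or None if the reply cannot be mapped to a label."""
    match = _LABEL_RE.search(response)
    candidate = (match.group(1) if match else response).strip().lower()
    for label in LABELS:
        if candidate == label.lower():
            return label
    return None


if __name__ == "__main__":
    print(parse_label("<label> Positive <\\label>"))
    print(parse_label("Sorry, I cannot decide."))
```

Returning `None` for unparseable replies keeps such cases visible so they can be counted or handled explicitly during evaluation.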

Results
In this experiment, synthetic data is used as the training set and natural data is used as the test set. The F1 scores of monolingual models range from 0.47 to 0.55, with BERT performing the best. Of the two bilingual models, BanglishBERT scores 0.56, which is better than HingBERT. XLM-R is the best multilingual model with an F1 score of 0.59. Zero-shot prompting with GPT-3.5-turbo performs the best overall, with a 0.62 weighted F1 score. These results are available in Table 3.
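For reference, the weighted F1 score reported here averages the per-class F1 scores, weighting each class by its number of gold instances. The following is a minimal pure-Python sketch of that standard definition; the function name `weighted_f1` is illustrative, and the paper does not state which implementation the authors used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    support = Counter(y_true)  # number of gold instances per class
    total = 0.0
    for label, n_gold in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_gold - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_gold
    return total / len(y_true)


if __name__ == "__main__":
    gold = ["Positive", "Positive", "Negative", "Neutral"]
    pred = ["Positive", "Negative", "Negative", "Neutral"]
    print(weighted_f1(gold, pred))
```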

Error Analysis
We observe Other tokens in almost 40% of the whole dataset, as shown in Table 2. These tokens occur due to transliteration, which poses a challenge for most of the models since not all of them are pre-trained on transliterated tokens. BanglishBERT does better than HingBERT since it recognizes both Bangla and English tokens, and the total number of Hindi-English tokens is smaller than the number of Bangla-English tokens (see Table 3). Misspelled words and typos are also observed in the datasets, making the task even more difficult. Some examples that are classified wrongly by all the models are available in Appendix A.

Conclusion and Future Work
In this paper, we presented SentMix-3L, a Bangla-English-Hindi code-mixed sentiment analysis dataset containing 1,007 instances. We also created 100,000 synthetic data instances in the same three languages for training. We evaluated various monolingual, bilingual, and multilingual models on these two datasets. Our results show that prompting GPT-3.5 generates the best result on SentMix-3L. When using synthetic data for both training and testing, multilingual models such as mBERT and XLM-R perform well. In the future, we would like to expand SentMix-3L so that it can serve as both training and testing data. Additionally, we are working on pre-training Bangla-English-Hindi trilingual code-mixing models for sentiment analysis.

Table 2: SentMix-3L Data Card. The row Avg represents the average number of tokens, with its standard deviation in row Std Dev.

Table 3: Weighted F1 score for different models: training on synthetic, testing on natural data.

Table 4: Weighted F1 score for different models: training and testing on synthetic data.