OffMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Offensive Language Identification

Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several works have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages. We experiment with several models on this dataset and observe that BanglishBERT outperforms other transformer-based models and GPT-3.5.


Introduction
Code-mixing and code-switching are common linguistic phenomena observed in both speech and text. While the two terms are often used interchangeably, code-mixing is defined as the use of words or morphemes from multiple languages within a single utterance, sentence, or discourse, whereas code-switching refers to the deliberate alternation between multiple languages within the same context (Thara and Poornachandran, 2018). The first is often spontaneous while the second is purposeful. However, both are widely observed in bilingual and multilingual communities.
As described in Anastassiou (2017), several social, linguistic, and cognitive factors are behind these two phenomena. Socially, they often serve as a marker of group identity, allowing individuals to navigate multiple social and cultural affiliations. Linguistically, bilingual speakers may be unable to find a word for a specific concept in one language and therefore borrow a word from another language to aid communication. Additionally, there are several cases, even in monolingual communities, where code-mixing is the most convenient way to express a concept, as with English loanwords such as feedback used in various languages.

*These two authors contributed equally to this work.
WARNING: This paper contains examples that are offensive in nature.
Most commonly, code-mixing is a bilingual phenomenon. Byers-Heinlein and Lew-Williams (2013), for example, estimate that by the year 2035, over half of children enrolled in kindergarten in California will have grown up speaking a language other than English. Another study, conducted by Jeffery and van Beuningen (2020), shows that bilingualism is a common practice in European countries such as Germany and Spain. However, in cosmopolitan cities such as New York, London, and Singapore, code-mixing with three or more languages is fairly common. This is also observed in countries like Luxembourg, and in regions such as West Bengal and South-East India, where more than two languages are commonly used on a daily basis.
Several papers have presented code-mixed datasets for various NLP tasks (KhudaBukhsh et al., 2020; Krishnan et al., 2022). However, most of these datasets are bilingual, leaving the processing of code-mixing in three or more languages largely unexplored. In this paper, we present a Bangla-Hindi-English dataset annotated for offensive language identification. To the best of our knowledge, this is one of the first datasets to contain code-mixing between more than two languages.
The main contributions of this paper are as follows:
• We introduce OffMix-3L, a novel three-language code-mixed test dataset in Bangla-Hindi-English for offensive language identification. OffMix-3L contains 1,001 instances annotated by speakers of the three languages. We make OffMix-3L freely available to the community.
• We provide a comprehensive evaluation of several monolingual, bilingual, and multilingual models on OffMix-3L.
We present OffMix-3L exclusively as a test set due to the unique and specialized nature of the task. The size of the dataset, while limiting for training purposes, offers a high-quality testing environment with gold-standard labels that will serve as a benchmark in this domain. Given the scarcity of similar datasets and the challenges associated with data collection, OffMix-3L provides an important resource for the rigorous evaluation of offensive language identification models, filling a critical gap in multilingual code-mixing research.
Related Work

Code-mixing between these three languages has also been studied in NLP, mostly in pairwise settings.
There have been a few studies on offensive language identification for Bangla-English code-mixed data. The work by Jahan et al. (2019) focused on detecting Bangla-English code-mixed and transliterated offensive comments on Facebook. Another Bangla-English dataset was gathered by Wadud et al. (2021), who collected 2,200 instances.
Comparatively more work has been carried out for Hindi-English code-mixing. Sreelakshmi et al. (2020) use fastText (Joulin et al., 2016) to represent 10,000 instances collected from different sources. Other offensive language datasets collected from Facebook and Twitter were introduced by Bohra et al. (2018); Kumar et al. (2018); Rani et al. (2020). Mundra and Mittal (2022) propose the Fused Attention-based Network (FA-Net), which introduces a fusion of attention mechanisms for collective and mutual learning between local and sequential features for Hindi-English offensive language and hate speech classification. Gupta et al. (2021) apply character-level embeddings, GRUs, and an attention layer to offensive language identification in Hindi-English code-mixed data.
To the best of our knowledge, no existing work focuses specifically on Hindi-Bangla code-mixing.
However, some studies have focused on code-mixing across multiple Indian languages, including Bangla and Hindi. The work by Vasantharajan and Thayasivam (2021) focuses on offensive language identification in Dravidian languages. A few similar works include Ravikiran and Annamalai (2021); Sai and Sharma (2020); Kumar et al. (2020); Kumari and Singh (2020); Ranasinghe and Zampieri (2021).
In summary, to the best of our knowledge, there has been no work on offensive language identification for code-mixed Bangla-English-Hindi, and no offensive language datasets have been made available for these three languages together. OffMix-3L fills this gap, providing the community with a novel resource for evaluating how state-of-the-art models perform on Bangla-English-Hindi code-mixed data.

The OffMix-3L Dataset
We choose a controlled data collection method, asking volunteers to freely contribute data in Bangla, English, and Hindi. This decision stems from the challenges of extracting such specific code-mixed data from social media and other online platforms. Our approach ensures data quality and sidesteps the ethical concerns associated with using publicly available online data. Such datasets are often used when it is difficult to mine the target phenomenon from existing corpora; for example, for fine-tuning LLMs on instructions and conversations, semi-natural datasets such as Databricks (2023) and Nie (2023) have become popular.
Data Collection A group of 10 undergraduate students fluent in the three languages was asked to prepare 250 to 300 social media posts each. They were allowed to use any of the languages, including Bangla, English, and Hindi, to prepare posts on daily topics such as politics, sports, education, and social media rumors. We also asked them to switch languages if and wherever they felt comfortable doing so. The inclusion of emojis, hashtags, and transliteration was also encouraged. The students had the flexibility to prepare the data as naturally as possible. Upon completion of this stage, we gathered 1,734 samples that contained at least one word or sub-word from each of the three languages, using langdetect (Mazzocchi, 2012), an open-source Python tool for language identification.

Data Annotation We annotate the dataset in two steps. First, we recruited three students from social science, computer science, and linguistics, fluent in the three languages, to serve as annotators. They annotated all 1,734 samples with one of two labels (Non-Offensive and Offensive) with a raw agreement of 63.7%. We then take the 1,106 instances where all three annotators agree on the labels and use them in a second step. To further ensure high-quality annotation, we recruited a second group of annotators consisting of two NLP researchers fluent in the three languages. After their annotation, we calculate a raw agreement of 91% (Kvålseth, 1989) and a Cohen's Kappa score of 0.82. After the two stages, we keep only the instances where both annotators agree, ending up with a total of 1,001 instances. The label distribution is shown in Table 1.
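The trilingual filtering step above can be approximated with a short script-detection sketch. Note that the paper uses langdetect for this; the Unicode-block heuristic below is only an illustrative stand-in and, unlike a real language identifier, cannot catch transliterated (Latin-script) Bangla or Hindi tokens.

```python
# Illustrative sketch only: keep a sample if it contains characters from all
# three scripts (Bengali, Devanagari, basic Latin). The paper's actual
# pipeline uses langdetect; this heuristic misses transliterated tokens.

def scripts_in(text):
    """Return the set of scripts whose characters appear in `text`."""
    scripts = set()
    for ch in text:
        cp = ord(ch)
        if 0x0980 <= cp <= 0x09FF:            # Bengali Unicode block
            scripts.add("bangla")
        elif 0x0900 <= cp <= 0x097F:          # Devanagari block (Hindi)
            scripts.add("hindi")
        elif "a" <= ch.lower() <= "z":        # basic Latin letters (English)
            scripts.add("english")
    return scripts

def is_trilingual(text):
    """True if the sample mixes all three scripts."""
    return scripts_in(text) == {"bangla", "hindi", "english"}
```

A real filter would run token-level language identification so that transliterated words also count toward their source language.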

Table 1: Label distribution (Non-Offensive vs. Offensive) of OffMix-3L.
Synthetic Train and Development Set
We present OffMix-3L as a test dataset and build a synthetic train and development set containing code-mixing for Bangla, English, and Hindi. We use two English training datasets annotated with the same labels as OffMix-3L, namely OLID (Zampieri et al., 2019) and SOLID (Rosenthal et al., 2021). We randomly select 100,000 instances, carefully balancing Non-Offensive and Offensive classes. We then use the Random Code-mixing Algorithm (Krishnan et al., 2022) and r-CM (Santy et al., 2021) to generate the synthetic code-mixed dataset.
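The core idea of random code-mixing can be sketched as follows: each word is replaced, with some probability, by a translation in a randomly chosen target language. The toy lexicon below is a hypothetical placeholder; the actual generation follows Krishnan et al. (2022) and Santy et al. (2021).

```python
import random

# Toy bilingual lexicon -- hypothetical entries for illustration only.
TOY_LEXICON = {
    "good": {"bn": "ভালো", "hi": "अच्छा"},
    "very": {"bn": "খুব", "hi": "बहुत"},
}

def random_code_mix(sentence, p=0.5, rng=None):
    """Replace each known word, with probability `p`, by a random-language
    translation; unknown words pass through unchanged."""
    rng = rng or random.Random(0)             # seeded for reproducibility
    out = []
    for word in sentence.split():
        entry = TOY_LEXICON.get(word.lower())
        if entry and rng.random() < p:
            lang = rng.choice(sorted(entry))  # pick "bn" or "hi" at random
            out.append(entry[lang])
        else:
            out.append(word)
    return " ".join(out)
```

With p=1.0 every lexicon word is swapped; with p=0.0 the sentence is returned unchanged, which makes the mixing ratio easy to control when generating the synthetic split.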
Bilingual Models BanglishBERT (Bhattacharjee et al., 2022) and HingBERT (Nayak and Joshi, 2022) are used as bilingual models, as they are trained on Bangla-English and Hindi-English, respectively.
Multilingual Models We use mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) as multilingual models, which are trained on 104 and 100 languages, respectively, including Bangla, English, and Hindi. We also use IndicBERT (Kakwani et al., 2020) and MuRIL (Khanuja et al., 2021), which cover 12 and 17 Indian languages, respectively, including Bangla, English, and Hindi. We perform hyper-parameter tuning for all models to prevent overfitting.
Prompting We use prompting with the GPT-3.5-turbo model (OpenAI, 2023) from OpenAI for this task. We use the API for zero-shot prompting (see Figure 1) and ask the model to label the test set.
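The zero-shot setup can be sketched as two pieces: assembling a chat-style prompt and parsing the `<label>...</label>` span out of the model's reply. The prompt wording below paraphrases the paper's prompt; the actual API call is omitted so the snippet stays self-contained, and the function names are our own.

```python
import re

SYSTEM_PROMPT = (
    "You are a helpful AI assistant. You are given the task of "
    "offensive text classification."
)
TASK_TEMPLATE = (
    "An offensive text is something that upsets or embarrasses people "
    "because it is rude or insulting. You will be given a text to label "
    "either 'Offensive' or 'Non-Offensive'.\n"
    "Generate the label for this \"{text}\" in the following format: "
    "<label> Your_Predicted_Label </label>. Thanks."
)

def build_messages(text):
    """Assemble the zero-shot prompt as a chat-style message list."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": TASK_TEMPLATE.format(text=text)},
    ]

def parse_label(response_text):
    """Extract the predicted label from the model's tagged reply,
    tolerating a backslash in the closing tag; None if absent."""
    m = re.search(r"<label>\s*(Non-Offensive|Offensive)\s*<[/\\]label>",
                  response_text)
    return m.group(1) if m else None
```

The message list would then be passed to the chat completion endpoint; constraining the output format this way makes the reply machine-parseable.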
Additionally, we run the same experiments separately on the synthetic and natural datasets, splitting both 60-20-20 into training, development, and test sets.
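The 60-20-20 split can be sketched as a simple shuffled partition; the seed value here is arbitrary, chosen only so the split is reproducible.

```python
import random

def split_60_20_20(data, seed=42):
    """Shuffle and partition into 60% train, 20% dev, 20% test."""
    items = list(data)
    random.Random(seed).shuffle(items)   # deterministic shuffle
    n = len(items)
    train_end, dev_end = int(0.6 * n), int(0.8 * n)
    return items[:train_end], items[train_end:dev_end], items[dev_end:]
```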

Results
In this experiment, synthetic data is used as the training set and natural data (OffMix-3L) as the test set. F1 scores of the monolingual models range from 0.43 to 0.66, with BERT performing best. mBERT is the best multilingual model with an F1 score of 0.63. Zero-shot prompting of GPT-3.5-turbo yields a weighted F1 score of 0.57. The best task-fine-tuned model is HateBERT with an F1 score of 0.60. Among all models, BanglishBERT achieves the highest F1 score at 0.68. These results are reported in Table 3. We perform the same experiment using synthetic data for both training and testing and present the results in Table 4. Here, mBERT and XLM-R, each with an F1 score of 0.88, are the best-performing models.

Figure 1: Zero-shot prompt used with GPT-3.5-turbo. Role: "You are a helpful AI assistant. You are given the task of offensive text classification." Definition: "An offensive text is something that upsets or embarrasses people because it is rude or insulting. You will be given a text to label either 'Offensive' or 'Non-Offensive'." Task: "Generate the label for this "text" in the following format: <label> Your_Predicted_Label </label>. Thanks."
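All scores above are weighted F1, i.e., per-class F1 averaged with each class weighted by its support in the gold labels. A minimal stdlib sketch of that metric (equivalent in spirit to scikit-learn's `f1_score` with `average="weighted"`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each
    class's support (count) in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for lab in set(y_true):
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[lab] / total * f1
    return score
```

Weighting by support matters here because, as Table 1 shows, the Non-Offensive and Offensive classes need not be balanced in the natural test set.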

Error Analysis
We observe Other tokens in almost 39% of the whole dataset, as shown in Table 2. These tokens occur due to transliteration, which poses a challenge for most of the models, since not all of them are pre-trained on transliterated tokens. BanglishBERT did well since it recognizes both Bangla and English tokens. However, the total number of Hindi-English tokens is smaller than that of Bangla-English tokens, which explains HingBERT's inferior performance compared to BanglishBERT (see Table 3). Misspelled words and typos are also observed in the dataset; these are, for the most part, unknown tokens for the models, making the task even more difficult. Some examples that are misclassified by all models are available in Appendix A.

Conclusion and Future Work
In this paper, we presented OffMix-3L, a Bangla-English-Hindi code-mixed offensive language identification dataset containing 1,001 instances. We also created 100,000 synthetic instances in the same three languages for training. We evaluated various monolingual, bilingual, and multilingual models on these two datasets. Our results show that when training on synthetic data and testing on OffMix-3L, BanglishBERT performs best. When using synthetic data for both training and testing, multilingual models such as mBERT and XLM-R perform well. In the future, we would like to expand OffMix-3L so that it can serve as both training and testing data. Additionally, we are working on pre-training Bangla-English-Hindi trilingual code-mixing models for offensive language identification.

Table 2 :
OffMix-3L data card. The row Avg gives the average number of tokens; the row Std Dev gives its standard deviation.

Table 3 :
Weighted F1 scores for different models, trained on synthetic data and tested on natural data (OffMix-3L).


Table 4 :
Weighted F1 scores for different models, trained and tested on synthetic data.