My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

Research on code-mixed data is limited by the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi, which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification respectively. Each of these evaluation datasets consists of ~12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work that presents artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP .


Introduction
The modern world has been engulfed by the presence of social media platforms like Twitter and Facebook (Salem and Mourtada, 2011). Moreover, websites like YouTube have witnessed considerable user interaction in the comments section of videos (Siersdorfer et al., 2010). These posts and comments closely reflect the thoughts of the general public. It is common amongst users to discuss social, political, and other topics over such social media. This leads to users using a mixed language for communicating over social media platforms.
Code-mixing is the mixing of words from multiple languages while retaining the script of a single language. Most commonly, the Latin script is used to encapsulate the words of multiple languages. For example, a given text can be in the Marathi language but written in the Latin script, as opposed to the Devanagari script, which is the original script of the Marathi language (Joshi, 2022a). Code-mixed data is inherently difficult to process and analyze due to its linguistic complexity, variance in spelling and grammar, and long-tailed distribution of uncommon terms and phrases, which are often specific to the geography and demographic of the source location. It is observed that a large number of tweets, comments, and posts on social media are code-mixed in nature. Thus, with the advent of social media analytics, effectively analyzing code-mixed data has gained the utmost importance.
Marathi is a language which has its origins in Maharashtra, a state in India. Due to the state's geographic and demographic expanse, Marathi has evolved into a language with multiple varieties and dialects. Recently, there has been some focus on Marathi NLP based on the Devanagari script (Joshi, 2022b,a; Kulkarni et al., 2021; Litake et al., 2022). However, a large chunk of tweets, posts, and comments in Marathi are in code-mixed form. In spite of this, no efforts have been made in the past to curate models and datasets pertaining to Marathi code-mixed data. This work presents the following.
2. The supervised datasets contain labels for code-mixed Marathi-English hate classification, sentiment detection, and language identification. These datasets were manually annotated by native Marathi speakers.
3. Finally, we release a plethora of code-mixed MeBERT-based pre-trained and fine-tuned models for downstream tasks trained on these novel corpora. These models include MeBERT, MeBERT-Mixed, MeBERT-Mixed-v2, MeRoBERTa-Mixed, and MeRoBERTa. The supervised models include MeSent-RoBERTa, MeHate-RoBERTa, and MeLID-RoBERTa.
This work is a major milestone towards democratizing NLP for the Marathi language. Additionally, we present several ablations with fine-tuned models. This is the first work to present a large unsupervised corpus, multiple pre-trained models, and high-quality supervised datasets. This work is a strong foundation in the domain of Marathi and code-mixed Marathi NLP.

Related Works
The use of regional scripts, such as Devanagari, Gurmukhi, Bengali, etc., presents a significant challenge in India due to keyboards primarily designed for the Roman script and the population's familiarity with it. The demand for code-mix datasets and models tailored to regional languages has increased exponentially. These resources play a crucial role in enabling enhanced analysis and moderation of social media content that is code-mixed.
In the realm of language models, BERT-based architectures (Vaswani et al., 2017), including variations such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019), have gained popularity due to their application in pre-training and fine-tuning on various tasks. Multilingual models like multilingual-BERT and XLM-RoBERTa (Conneau et al., 2019) have specifically focused on data representations that are multilingual and cross-lingual in nature, offering improvements in accuracy and latency. However, these models are pre-trained on fewer than a hundred thousand real code-mixed texts. While previous research efforts have addressed code-mixing in other Indian languages, the specific domain of code-mixed Marathi remains largely unexplored. Notably, there is a scarcity of prior work and an absence of a dedicated code-mixed Marathi dataset. Other Indian languages, however, have seen some notable contributions. For instance, Hande et al. (2020) present KanCMD, a code-mixed Kannada dataset for sentiment analysis and offensive language detection. Chakravarthi et al. (2021) have released datasets encompassing Tamil-English and Malayalam-English code-mixed texts. Nayak and Joshi (2022) have made available HingCorpus, a Hindi-English code-mixed dataset, and also open-sourced pre-trained models trained on code-mixed corpora. Srivastava and Singh (2021) provide HinGE, a dataset for the generation and evaluation of code-mixed Hinglish text, and demonstrate techniques for algorithmically creating synthetic Hindi code-mixed texts.
In the realm of transliteration, there have been attempts to pre-train language models using transliterated texts. However, these models often underperform due to the rule-based nature of most transliteration techniques, which struggle to account for the diverse spelling variations present in real-life code-mixed texts (Santy et al., 2021).

MeCorpus -Pretraining Data Creation
We introduce MeCorpus, a new pre-training corpus of 5 million code-mixed Marathi sentences. These sentences are extracted from the social media platforms YouTube and Twitter. We also use synthetic data obtained by transliterating Devanagari tweets. The complete data collection process is illustrated in Figure 1.

Twitter data
A part of the pretraining corpus is obtained from the social networking site Twitter. We utilize snscrape, a scraper for social networking sites, to scrape the data from Twitter. We use a keyword-based approach to curate the data. We use frequently used Marathi words as keywords and fetch all the tweets containing the given word. Proper care is taken to manually check that the word is predominantly exclusive to the Marathi language and doesn't occur in texts from other languages. We fetch a fixed number of tweets belonging to a certain keyword and discard the keyword if the tweets fail to satisfactorily meet the aforementioned conditions after manual verification. Otherwise, we scrape all the tweets containing the keyword and add them to our corpus. This manual verification process ensures that the curated data largely contains Marathi text.
Figure 1: Dataset creation process for our 5 million code-mixed corpora.
The Twitter data amounts to over a million tweets. All of the tweets at least partly contain code-mixed Marathi. A significant number of the tweets exhibit code-switching between Marathi and English. A small portion of the tweets also contains code-switched Hindi-Marathi text. We anonymize the data before using it for pre-training. The username mentions are replaced with the '@USER' text. We also remove links and hashtags from the tweets. The Twitter corpus contains 50M tokens.
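The anonymization and cleaning steps described above can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing code: the regular expressions and the function name `clean_tweet` are assumptions made for this example.

```python
import re

# Illustrative patterns (assumptions); the exact rules used for MeCorpus are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Replace user mentions with '@USER' and drop links and hashtags."""
    text = URL_RE.sub(" ", text)          # remove links first so '#' inside URLs is not split
    text = HASHTAG_RE.sub(" ", text)      # remove hashtags
    text = MENTION_RE.sub("@USER", text)  # anonymize username mentions
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_tweet("@rahul_99 aaj match khup bhari hota! #cricket https://t.co/abc123"))
```

Removing links before hashtags keeps URL fragments from being half-deleted, and collapsing whitespace at the end avoids gaps where tokens were dropped.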

YouTube data
YouTube comments are an excellent source of code-mixed Marathi data. We scrape all the comments from 200 Marathi YouTube channels using the youtube-comments-scraper library. This gives us a mix of English, Devanagari, and code-mixed Marathi comments, which we filter in two steps. First, we identify Devanagari words by checking their UTF-8 encoding and remove all comments in which more than 80% of the words are Devanagari. This leaves comments that are either English or Marathi-English code-mixed. Second, we use a fastText classifier to identify the English comments and remove them. At the end of both filtering steps, we are left with 2,278,097 of the original 7,599,588 comments, which are preprocessed and added to our pre-training dataset.
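The Devanagari filter can be sketched as below. This is a hedged illustration under stated assumptions: a word is counted as Devanagari if any of its characters falls in the Unicode Devanagari block (U+0900 to U+097F), words are split on whitespace, and the function names are hypothetical. The subsequent English-detection step is not shown, since the classifier model used by the authors is not specified.

```python
DEVANAGARI_START = "\u0900"  # first code point of the Unicode Devanagari block
DEVANAGARI_END = "\u097f"    # last code point of the Unicode Devanagari block


def is_devanagari_word(word: str) -> bool:
    """True if the word contains at least one Devanagari character (assumed rule)."""
    return any(DEVANAGARI_START <= ch <= DEVANAGARI_END for ch in word)


def devanagari_ratio(comment: str) -> float:
    """Fraction of whitespace-separated words that are Devanagari."""
    words = comment.split()
    if not words:
        return 0.0
    return sum(is_devanagari_word(w) for w in words) / len(words)


def keep_comment(comment: str, threshold: float = 0.8) -> bool:
    """Drop comments with more than `threshold` Devanagari words, keep the rest."""
    return devanagari_ratio(comment) <= threshold


if __name__ == "__main__":
    for c in ["ha video khup chan ahe", "हा व्हिडिओ खूप छान आहे", "हा video खूप छान"]:
        print(keep_comment(c), c)
```

Comparing single characters against the block boundaries works because Python compares strings by code point.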

Transliteration
We scraped around 1.7 million Marathi Devanagari tweets from Twitter using the snscrape library. We then used the indic-trans Python library to transliterate 1,685,233 of these Devanagari tweets into Latin-script (code-mixed) Marathi and added them to our dataset.

MeEval -Downstream dataset creation
We aim to create a large dataset of code-mixed Marathi-English data, annotated with sentiment and hatefulness. In this study, we selected a set of tweets from a larger corpus of 1,037,659 tweets obtained from the social media platform Twitter. Half of the tweets chosen were posted on Twitter before 2013, and the other half were posted after 2013. This provides a more diachronic distribution of tweets, as the tweets posted in the past few years far outnumber the older tweets. Apart from this criterion, the tweets were selected randomly. This ensures a realistic representation of the sentiment, hate, and profanity distributions present in real-world data across the past several years. To annotate the data, four annotators fluent in Marathi, Hindi, and English were chosen. The Cohen's Kappa (Cohen, 1960) for the annotators is 0.86. The collected data was labeled for three distinct tasks: sentiment, hate, and language identification. We followed a set of guidelines while labeling the data to ensure the veracity of the annotation. We annotated the data after anonymizing it, which removed any bias from knowing the entity that posted it. We also disregarded any additional information that we could infer from external context but that is not apparent from reading the text by itself. Here, we outline the dataset statistics and annotation procedure. The dataset statistics are described in Table ??.
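For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below computes it for one pair of annotators. How the agreement of the four annotators was aggregated into the reported 0.86 is not stated in the text, so this only illustrates the pairwise statistic; the labels in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neg", "neu", "neu"]  # toy labels
    annotator_2 = ["pos", "pos", "neg", "neu", "neu", "neu"]  # toy labels
    print(round(cohen_kappa(annotator_1, annotator_2), 2))
```

In this toy example the annotators agree on 5 of 6 items and chance agreement is 1/3, which gives a Kappa of 0.75.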

MeSent Dataset
The code-mixed Marathi-English sentiment dataset is termed MeSent. Tweets expressing good or heartening emotions such as thankfulness, happiness, applause, and appreciation are labeled positive. Tweets expressing negative or disheartening emotions like strong dissent, disappointment, sorrow, derision, and hate are labeled negative. Plain facts, statements, and simple responses are labeled neutral. If a tweet contains conflicting emotions, the stronger emotion is chosen.
While annotating the data, we removed unsuitable and ambiguous tweets. Finally, we selected 4,000 tweets from each sentiment category, leading to the dataset containing a total of 12,000 tweets.

MeHate Dataset
For the hatefulness annotation, we labeled any tweet expressing strongly negative content such as insults, mockery, abuse, intimidation, or threats as hateful. Any tweet not containing such hateful content is labeled as non-hateful. We use 1 for hateful content and 0 for non-hateful content. The MeHate dataset contains 1,384 hateful and 1,384 non-hateful tweets, totaling 2,768 tweets. We also release the full 12k labeled tweets, in which non-hate labels form the majority.

MeLID dataset
Additionally, a Language Identification (LID) dataset is created. Each word within the selected tweets is labeled based on its language as Marathi, English, or Other. The Other category contains invalid words, words from languages other than English or Marathi, and literals such as numbers and proper nouns. The MeLID dataset contains 11,814 tweets. For all three supervised datasets, we provide a pre-defined train, test, and validation split of 80:10:10.
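An 80:10:10 split of this kind can be reproduced with a seeded shuffle, as sketched below. The released datasets ship with pre-defined splits, so this is only an illustration of the ratio; the function name, the seed, and the absence of stratification are assumptions.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle and split examples into train, validation, and test sets (80:10:10)."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG keeps the split reproducible
    n_train = len(items) * 8 // 10      # integer arithmetic avoids float rounding
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]    # remainder goes to the test set
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_80_10_10(range(12000))
    print(len(train), len(valid), len(test))
```

For a dataset of 12,000 tweets this yields 9,600 training, 1,200 validation, and 1,200 test examples.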

Models trained on code-mixed MeCorpus
We train several well-known models on our novel pretraining corpora. In this section, we outline these models and their training details. We used pre-trained BERT, RoBERTa, mBERT, MuRIL, and XLM-RoBERTa as the base models and trained them on the novel MeCorpus using the Masked Language Modelling (MLM) objective. For MLM training, we train the models for two epochs at a learning rate of 2e-5, with a weight decay of 0.01 and a mask probability of 0.15.
The resulting models are named after the original models with the prefix "Me", which stands for Marathi-English. Therefore, the MeBERT and MeRoBERTa models are the BERT, mBERT, MuRIL, XLM-RoBERTa, and RoBERTa models trained on MeCorpus. Note that these models are further "fine-tuned" on MeCorpus using the MLM training objective.
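To make the role of the mask probability concrete, the following sketch applies MLM-style masking to a list of tokens. Only the 0.15 mask probability comes from the text. The 80/10/10 replacement rule (mask token, random token, unchanged token) is the standard BERT recipe and is assumed here rather than confirmed for these models; the function name and toy vocabulary are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for MLM; labels are None at unselected positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                 # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)      # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)             # 10%: keep the token unchanged
        else:
            labels.append(None)                # position is ignored by the loss
            inputs.append(tok)
    return inputs, labels


if __name__ == "__main__":
    sentence = "mala ha movie khup avadla".split()
    toy_vocab = ["mala", "ha", "movie", "khup", "avadla", "nahi"]  # toy vocabulary
    print(mask_tokens(sentence, toy_vocab, mask_prob=0.15, seed=0))
```

With a mask probability of 0.15, roughly one token in seven is selected for prediction on average, so most of each sentence remains visible as context.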

Results
We fine-tune our MeBERT models on the MeSent, MeHate, and MeLID datasets as mentioned in section 4 and test them on the respective test data. The same process is repeated for their base models and a few state-of-the-art Marathi models like Indic-BERT (Kakwani et al., 2020), Marathi-Tweets-BERT (Gokhale et al., 2022), and Marathi Codemixed Abusive MuRIL (Das et al., 2022). The results obtained from this are showcased in Table 2. It is observed that MeRoBERTa-Mixed outperforms all other models on the MeHate evaluation set with an F1 score of 78.07%. For the sentiment analysis corpus MeSent, MeRoBERTa outperforms the others by obtaining an F1 score of 67.27%. Testing the models on the MeLID dataset, MeRoBERTa outperforms the other models by obtaining an F1 score of 88.41%. The newly pre-trained code-mixed MeBERT-based models consistently outperform their base models as well as the state-of-the-art Marathi models.
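The F1 scores above can be computed from predictions as sketched below. The text does not state which averaging is used, so the macro average shown here is an assumption; the labels in the example are made up and do not correspond to any reported result.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    gold = [1, 1, 0, 0]       # toy hate / non-hate labels
    predicted = [1, 0, 0, 0]  # toy model output
    print(round(macro_f1(gold, predicted), 4))
```

In the toy example the hateful class scores 2/3 and the non-hateful class 0.8, so the macro F1 is their mean, 11/15.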

Conclusion
This work lays the necessary groundwork for future work on code-mixed Marathi. We introduce a novel pretraining corpus of 5 million code-mixed tweets. In addition to that, we present five new models trained on this code-mixed corpus. Furthermore, we present three supervised datasets of 12,000 tweets for hate classification, sentiment analysis, and language identification annotated by native Marathi speakers. We also present thorough ablations and show that our code-mixed MeBERT models outperform the previous state-of-the-art models by a considerable margin.

Limitations
A major problem while dealing with Romanized Marathi is the lack of a single correct spelling for each word. A Marathi word can be written in several ways in the Latin script, all of which are equally valid and correctly convey meaning despite having significantly different spellings. Developing efficient approaches to tackle this issue will lead to a significant increase in performance on NLP tasks dealing with code-mixed languages. Our keyword-based scraping method uses words primarily from the western Maharashtra dialect of Marathi, which might not sufficiently represent samples from other Marathi dialects. Efforts to extend the dataset with examples from other dialects will make it more diverse and robust.

Ethics Statement
All of the data used in our experiments has been scraped by legal and valid means, adhering to the provided guidelines. We anonymized the data before usage to protect the privacy of the original authors of the data. This data might contain biases and thus must be used with care. This data also contains strong language which might be unsuitable for some applications. This data should be used only for research purposes and not for training any model for deployment.