MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

Multi-sentential, long-sequence textual data opens up several interesting research directions in natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there is no significant effort toward building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose the novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, MUTANT. We propose a token-level language-aware pipeline, extend existing metrics for the degree of code-mixing to a multi-sentential framework, and automatically identify MCTs in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we will make the dataset and the code publicly available upon publication.


Introduction
Over the years, we have seen numerous downstream applications of multi-sentential datasets in areas such as question-answering (Joshi et al., 2017; Tapaswi et al., 2016), summarization (Sharma et al., 2019; Cachola et al., 2020), and machine translation (Bao et al., 2021). The existing state-of-the-art methods prove challenging to scale effectively and efficiently to multi-sentential long-sequence text (Ainslie et al., 2020), which opens up several exciting research avenues. Unfortunately, to a large extent, research on multi-sentential data is dominated by a few popular monolingual languages such as English, Chinese, and Spanish. Due to this, code-mixed languages (among other low-resource and under-explored languages) suffer from a near absence of work in the aforementioned areas of interest.

We posit that, due to several inherent challenges, the NLP community holds back on building multi-sentential datasets for low-resource and code-mixed languages. One of the most significant bottlenecks in building such resources is the unavailability of MCT on traditional and widely popular data sources such as social media platforms, where short and noisy code-mixed text is available in abundance. This makes it difficult to curate a large-scale multi-sentential dataset with ease. Another major challenge is the lack of metrics to measure the degree of code-mixing in a multi-sentential framework. The existing metrics, such as the code-mixing index (Das and Gambäck, 2014) and the multilingual index (Barnett et al., 2000), already suffer from major limitations (Srivastava and Singh, 2021a) even for short texts. In such a scenario, it becomes difficult to build a retrieval pipeline to identify MCT, and we need to depend heavily on the expertise of human annotators, which is a time- and cost-demanding exercise.

In this work, we address both of these challenges. As a representative use case, we base our work on Hinglish, a popular code-mixed language in the Indian subcontinent, but the insights from our exploration could be extended to other code-mixed language pairs. To address the first challenge, we identify two non-traditional multilingual data sources, i.e., political speeches and press releases along with Hindi daily news articles (discussed in detail in Section 3). Figure 1 shows example Hinglish MCTs from two multilingual data sources. To address the second challenge, we propose a token-level language-aware pipeline and extend a widely popular metric (i.e., the code-mixing index) measuring the degree of code-mixing to a multi-sentential framework. We demonstrate the effectiveness of the proposed pipeline with minimal task-specific annotation, which significantly reduces the overall human effort (discussed in detail in Section 4).
Eventually, we build a novel multi-sentential dataset for the Hinglish language with 85k MCTs identified from 67k articles. In Table 1, we compare MUTANT with four other Hinglish datasets (Srivastava and Singh, 2020; Khanuja et al., 2020; Mehnaz et al., 2021; Srivastava and Singh, 2021b) proposed for a variety of tasks such as machine translation, natural language inference, generation, and evaluation. The MUTANT dataset has a significantly higher average number of sentences along with longer MCTs (a higher average number of tokens). In addition, the dataset contains a notably higher number of data instances, which is a rarity for code-mixed datasets (Srivastava and Singh, 2021a).

Dainik Jagran (DJ): Dainik Jagran is another popular Indian Hindi newspaper. According to World Press Trends 2016, DJ is ranked 5th in the world by circulation. Similar to the DB website, they have also created a repository of articles on their official website. We extract 311,836 of these articles uploaded to the website between April 2013 and May 2022. In Table 2, we present the category-wise distribution of the articles scraped from the DJ website.

Experimental Setup
Problem definition: Given a multilingual article A comprising q multi-sentential text spans (MSTs), i.e., A = {M_1, M_2, ..., M_q}, we predict a binary outcome L_CM for each MST in A. Figure 2 shows the architecture of the MCT identification pipeline. Next, we discuss the various components of this pipeline in detail.

Token-level language annotation (TLA)
We exploit token-level language information to identify MCT given a multilingual article A. We annotate the words in A using a code-mixed language identification tool; specifically, we use L3Cube-HingLID (Nayak and Joshi, 2022) for this task. A word w_i ∈ A can take one of three language tags from the set {English, Hindi, Other}. Given that L3Cube-HingLID works only on Roman-script text, we use a Devanagari-to-Roman transliteration tool for tokens written in the Devanagari script. In Table 3, we report the percentage of Hindi and English tokens. With the exception of the AAP dataset, Hindi is the predominant language in all the data sources.
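As a rough illustration of this step, the sketch below tags each token with a language label, transliterating Devanagari tokens first. The devanagari_to_roman and hinglid_tag helpers are hypothetical placeholders standing in for the transliteration tool and L3Cube-HingLID; their real APIs are not specified here, so this is a minimal sketch under those assumptions rather than the actual implementation.

import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def devanagari_to_roman(token: str) -> str:
    # Placeholder: in practice, call a Devanagari-to-Roman transliteration tool.
    return token

def hinglid_tag(token: str) -> str:
    # Placeholder: in practice, query L3Cube-HingLID for one of
    # {"English", "Hindi", "Other"}.
    return "Other"

def annotate_tokens(article_text: str) -> list[tuple[str, str]]:
    # Assign a language tag from {English, Hindi, Other} to every token.
    tagged = []
    for token in article_text.split():
        # The LID tool expects Roman script, so transliterate Devanagari tokens first.
        roman = devanagari_to_roman(token) if DEVANAGARI.search(token) else token
        tagged.append((token, hinglid_tag(roman)))
    return tagged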

Code-Mixing Index (CMI)
In the literature, several metrics have been proposed to measure the degree of code-mixing in text, such as the code-mixing index (CMI, (Das and Gambäck, 2014)), the multilingual index (M-index, (Barnett et al., 2000)), and the integration index (I-index, (Guzmán et al., 2017)). Each of these metrics has its own merits and limitations (Srivastava and Singh, 2021a). In this work, we use the most widely used CMI metric due to its ease of interpretation and its suitability for the task. CMI, by definition, measures the degree of code-mixing in a text as:

    CMI = 100 × [1 − max{w_i} / (n − u)] if n > u, and 0 otherwise    (1)

Here, w_i is the number of words of language i, max{w_i} is the number of words of the most prominent language, n is the total number of tokens, and u is the number of language-independent tokens (such as named entities, abbreviations, mentions, and hashtags). The CMI score ranges from 0 to 100. A low CMI score suggests the prevalence of only one language in the text, whereas a high CMI score indicates a high degree of code-mixing.
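The CMI computation in Equation 1 can be implemented directly on top of the token-level tags from the TLA step. The helper below is a small sketch assuming the tags come from the {English, Hindi, Other} set, with Other counted as language-independent.

from collections import Counter

def cmi(tags: list[str]) -> float:
    # Code-Mixing Index of a tagged sentence, in [0, 100].
    n = len(tags)
    u = sum(1 for t in tags if t == "Other")  # language-independent tokens
    if n == 0 or n == u:
        return 0.0
    lang_counts = Counter(t for t in tags if t != "Other")
    max_wi = max(lang_counts.values())  # most prominent language
    return 100.0 * (1.0 - max_wi / (n - u))

For example, a sentence tagged [Hindi, Hindi, English, Other] yields 100 × (1 − 2/3) ≈ 33.3.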

Small annotated dataset (SAnD)
We create a small manually annotated dataset covering all seven data sources. The objective of the annotation is to assign a binary label to each MST so that we can identify, from the assigned label, whether the MST is code-mixed or not.
More formally, SAnD = {A_1: l_1, A_2: l_2, ..., A_u: l_u} represents u manually annotated MSTs, where l_i ∈ {0, 1} ∀ i ∈ [1, u]. Here, l_i = 1 if A_i is code-mixed, and 0 otherwise. For this annotation task, we randomly select a small number of articles (60 each from D_speech and D_news) from the scraped articles. We leave it to the judgment of the annotator to decide whether a sentence (and subsequently the MST) is code-mixed or not. The annotator has expert-level proficiency in Hindi, English, and Hinglish. In Table 4, we show the distribution of the annotated articles for each data source. In total, we annotate 120 articles and 568 MSTs, of which 121 MSTs (21.3%) are identified as code-mixed.

Estimating multilinguality
Though CMI is widely used in numerous previous works, we could not find any discussion of an ideal CMI-score thresholding criterion for identifying good code-mixed text. The problem becomes even more challenging when we use the CMI metric in a multi-sentential framework along with constraints P1 and P2 (ref §2). Various works (Khanuja et al., 2020) have used empirically identified CMI thresholds to measure the degree of code-mixing in text, but we could not find any experimental justification for these choices.

Dual MEC score: Here, we propose a novel adaptation of the CMI metric to a constrained multi-sentential framework. For an MST M_p with k sentences, we compute the scores for the dual multilinguality estimation criteria (MEC) as follows:

1. Sentence-level CMI (CMI): We compute CMI(s_i) for each sentence s_i ∈ M_p using the language information of all the words in s_i and the formulation given in Equation 1.

2. Multilinguality ratio (MR): We compute MR for the MST M_p as:

    MR(M_p) = N_cm / k    (2)

Here, N_cm and k are the number of code-mixed sentences and the total number of sentences in M_p, respectively. Figure 3 shows the mean and standard deviation of the dual MEC scores on the seven data sources.

Formulation: We identify whether a sentence s_i is code-mixed or monolingual using its CMI(s_i) score as:

    f_cm(s_i) = 1 if CMI(s_i) ≥ α, and 0 otherwise    (3)

Here, α ∈ [0, 100] is the sentence-level CMI score threshold and f_cm(.) estimates the code-mixing status (1 being code-mixed and 0 being monolingual) of the sentence under consideration. Using Equation 3, we compute N_cm as:

    N_cm = Σ_{i=1}^{k} f_cm(s_i)    (4)

Using Equations 2 and 4, we compute MR(M_p) as:

    MR(M_p) = (1/k) Σ_{i=1}^{k} f_cm(s_i)    (5)

We formulate the following function to identify whether an MST M_p with k sentences is code-mixed:

    g_cm(M_p) = 1 if MR(M_p) ≥ β, and 0 otherwise    (6)

Here, β ∈ [0, 1] is the multilinguality ratio threshold and g_cm(.) estimates the code-mixing status (1 being code-mixed and 0 being monolingual) of the MST under consideration.
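The dual MEC formulation translates into a few lines of code. The sketch below reuses the cmi helper from earlier and assumes non-strict (≥) threshold comparisons, which the text does not pin down; treat it as one possible reading of Equations 2-6 rather than the exact implementation.

def f_cm(sentence_tags: list[str], alpha: float) -> int:
    # 1 if the sentence is code-mixed (CMI(s_i) >= alpha), else 0.
    return int(cmi(sentence_tags) >= alpha)

def multilinguality_ratio(mst: list[list[str]], alpha: float) -> float:
    # MR(M_p) = N_cm / k: fraction of code-mixed sentences in the MST.
    k = len(mst)
    n_cm = sum(f_cm(sentence, alpha) for sentence in mst)
    return n_cm / k if k else 0.0

def g_cm(mst: list[list[str]], alpha: float, beta: float) -> int:
    # 1 if the MST is code-mixed (MR(M_p) >= beta), else 0.
    return int(multilinguality_ratio(mst, alpha) >= beta)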

Dual MEC threshold computation
The dual MEC formulation helps us identify MCT in a constrained setting by jointly modeling the sentence-level and MST-level multilinguality information. As discussed in Section 4.4, the ideal thresholds α and β remain an open question that needs further exploration. Here, we propose to use the SAnD dataset to identify the dual MEC thresholds (α and β). Algorithm 1 shows the procedure to compute the thresholds. The algorithm takes the SAnD dataset D with u labeled MSTs. We represent the parameter search spaces for α and β with α_cand and β_cand, respectively. α_cand ranges from α_low to α_high with a step size of α_step, whereas β_cand ranges from β_low to β_high with a step size of β_step. Based on our empirical observations, we set (α_low, α_high, α_step) to (0, 50, 1) and (β_low, β_high, β_step) to (0, 0.5, 0.025).
We perform a grid search over each threshold combination (α_i, β_j) to identify the best combination. For each threshold combination, we compute the accuracy of identifying MCT in D using the f_cm(.) and g_cm(.) formulations. We select the threshold combination with the highest accuracy as the final thresholds (α and β). Table 5 shows the best-identified thresholds on the various data sources of the SAnD dataset. Figure 4 shows the mean and standard deviation of the accuracy over the dual MEC threshold combinations for the different data sources.
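A hedged sketch of the grid search in Algorithm 1 is given below. It assumes the SAnD dataset is available as (MST, label) pairs, with each MST represented as a list of tagged sentences; the candidate ranges mirror the (α_low, α_high, α_step) and (β_low, β_high, β_step) values reported above.

def frange(lo: float, hi: float, step: float) -> list[float]:
    # Inclusive float range for the alpha/beta candidate grids.
    values, v = [], lo
    while v <= hi + 1e-9:
        values.append(round(v, 6))
        v += step
    return values

def dual_mec_thresholds(sand, alpha_range=(0, 50, 1), beta_range=(0, 0.5, 0.025)):
    # sand: list of (mst, label) pairs; label is 1 for code-mixed MSTs, 0 otherwise.
    best_alpha, best_beta, best_acc = 0.0, 0.0, -1.0
    for alpha in frange(*alpha_range):
        for beta in frange(*beta_range):
            correct = sum(g_cm(mst, alpha, beta) == label for mst, label in sand)
            accuracy = correct / len(sand)
            if accuracy > best_acc:
                best_alpha, best_beta, best_acc = alpha, beta, accuracy
    return best_alpha, best_beta, best_acc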

Dual MEC threshold generalization
As evident from Table 5, the thresholds α and β vary across the data sources. It is therefore important to identify which of these thresholds yields robust and stable performance across datasets. Here, we experiment with five dual MEC threshold generalization techniques (a sketch of the MDG voting step follows this list):

1. Local Average (LA): For a data source D_i, we take the mean sentence-level CMI score and the mean MR score as the dual MEC thresholds.

2. Global Average (GA): For a data source D_i, we take the mean sentence-level CMI score and the mean MR score of the corresponding category of data sources (D_speech or D_news) as the dual MEC thresholds.

3. Average of LA and GA (ALG): For a data source D_i, we take the average of the LA- and GA-identified thresholds as the dual MEC thresholds.

4. Single data source generalization (SDG): In this approach, we generalize the dual MEC thresholds identified locally on a single data source D_i (using Algorithm 1) to identify MCT globally on the other data sources.

5. Multi data source generalization (MDG): In this approach, we use the dual MEC threshold information from multiple sources and apply majority voting. For a data source D_i, we use the thresholds identified (using Algorithm 1) on three data sources, namely D_i, D_speech (if D_i ∈ D_speech, else D_news), and D_speech + D_news. We then make an independent prediction with each of the three threshold pairs and take the majority vote as the final classification of M_p.
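The MDG voting step referenced above amounts to three independent g_cm predictions followed by a majority vote. The sketch below assumes the three fitted (α, β) pairs are passed in; the actual values would come from Table 5.

def mdg_predict(mst: list[list[str]], threshold_pairs: list[tuple[float, float]]) -> int:
    # Majority vote over thresholds fitted on D_i, its category source
    # (D_speech or D_news), and D_speech + D_news.
    votes = [g_cm(mst, alpha, beta) for alpha, beta in threshold_pairs]
    return int(2 * sum(votes) > len(votes))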

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset
We evaluate the performance of the MCT identification pipeline and the five dual MEC threshold generalization techniques using the three subsets of the SAnD dataset: D_speech, D_news, and D_speech + D_news. We report the following metric scores on each of the seven data sources (a small sketch of these metrics appears at the end of this section):

1. Accuracy: We compute accuracy as the ratio of the total correct predictions of MCT and non-MCT to the total number of MSTs. We multiply this ratio by 100 and report the accuracy percentage. A high accuracy % is preferred.

2. False MCT Rate (FMR): We define FMR as the ratio of incorrectly identified MCTs to the total number of actual monolingual MSTs. We report the FMR%, and a low FMR% is preferred.

3. Diversity@10 (D@10): We define D@10 as the percentage of articles in data source D_i having more than 10% correctly identified MCTs. A high D@10 score is preferred.

We report the results in Tables 6, 7, and 8. The mean-based threshold generalization techniques (LA, GA, and ALG) consistently show poor performance on all the metrics. Given the nature of the problem, we prefer a low rate of misidentifying monolingual MSTs as MCTs while, at the same time, identifying a high number of actual MCTs. The MDG threshold generalization technique satisfies both conditions, with low FMR and high accuracy on all the datasets. D@10 indicates whether a threshold generalization technique is influenced by the presence of a few outliers in the dataset. SDG and MDG both show competitive results on the D@10 metric, outperforming the mean-based threshold generalization techniques by a large margin. The consistently poor performance of the mean-based threshold generalization against SDG and MDG also shows the efficacy of the proposed threshold computation strategy (Algorithm 1).

Finally, to build the MUTANT dataset, we use the MCT identification pipeline with the MDG threshold generalization technique. Table 9 shows the statistics of the MUTANT dataset. To facilitate future work on this novel task of MCT identification, we will release the MUTANT dataset along with the initially scraped data from all the data sources and the annotated SAnD dataset. The MUTANT dataset can be used for various tasks including, but not limited to, question-answering, text summarization, and machine translation for Hinglish text. It could also serve as a pre-training dataset for training efficient NLU models for various tasks on Hinglish data.
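A small sketch of the three evaluation metrics is given below. It assumes binary predictions and gold labels per MST (1 = MCT), grouped per article for D@10; the "more than 10% correctly identified MCT" criterion is one plausible reading of the D@10 definition, not a confirmed implementation.

def accuracy_pct(preds: list[int], golds: list[int]) -> float:
    # Accuracy %: correct MCT / non-MCT predictions over all MSTs.
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def fmr_pct(preds: list[int], golds: list[int]) -> float:
    # False MCT Rate %: MSTs wrongly flagged as MCT over actual monolingual MSTs.
    monolingual = sum(1 for g in golds if g == 0)
    false_mct = sum(1 for p, g in zip(preds, golds) if p == 1 and g == 0)
    return 100.0 * false_mct / monolingual if monolingual else 0.0

def diversity_at_10(articles: list[list[tuple[int, int]]]) -> float:
    # articles: one list of (pred, gold) pairs per article. An article counts
    # if more than 10% of its MSTs are correctly identified MCTs.
    hits = 0
    for pairs in articles:
        correct_mct = sum(1 for p, g in pairs if p == 1 and g == 1)
        if pairs and correct_mct / len(pairs) > 0.10:
            hits += 1
    return 100.0 * hits / len(articles) if articles else 0.0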

Analysis and Discussion
In this section, we qualitatively evaluate the MUTANT dataset by employing two human evaluators, different from the annotator used for SAnD, to avoid any biases in the evaluation. Both evaluators are proficient in English, Hindi, and Hinglish. We randomly sample five articles from each of the seven data sources and share the originally scraped articles, containing both the identified MCTs and the monolingual MSTs, with both evaluators.
During the evaluation, we do not disclose which of the MSTs are identified as MCTs and share the following guidelines: (1) any MST containing only Hindi words or only English words is monolingual; (2) any named entity, date, number, or word common to both English and Hindi should be considered a language-independent word. In Table 10, we report our findings from the qualitative evaluation study. Out of a total of 419 MSTs, we observe complete agreement on 321 monolingual MSTs and 55 code-mixed MSTs, resulting in ≈90% complete agreement. Complete agreement (CA) means that both evaluators agree on whether a particular MST is code-mixed or not. On the MSTs with CA, we further compute the three metric scores using MDG. The results strengthen our earlier findings from Section 5. In Figure 5, we report two example MCTs incorrectly identified by our MCT identification pipeline (Figure 5: false positive MCTs; tokens are color-coded as Hindi, English, and language independent). In the first example, both evaluators show complete agreement, whereas in the second example there is disagreement between the evaluators. We attribute this behavior to the poor state of current code-mixed LID systems (Srivastava and Singh, 2021a); since the CMI metric and our dual MEC formulation depend heavily on code-mixed LID tools, the final results are affected. This limitation provides an opportunity for future work to explore the problem from different perspectives, such as a token-level language-independent MCT identification pipeline. It will also be interesting to see how this pipeline performs with other code-mixed languages, especially in a low-resource setting.

Conclusion
In this paper, we present a novel task of identifying MCT in multilingual documents. We propose an MCT identification pipeline by extending CMI to the multi-sentential framework, and leveraging this pipeline, we build a dataset for the Hinglish language. We highlight several challenges in building such resources, and our insights will be useful for future work on code-mixed and low-resource languages.

Limitations
The limitations of the MUTANT dataset include, but are not limited to, the following:

• Contrary to previous works, all the data sources are non-social-media sites. This could potentially limit the diversity of the code-mixed text compared to what is observed on social media platforms.

• In its current form, the dataset is limited to only one code-mixed language. We believe the proposed technique to extract MCT could be extended to other code-mixed languages in the future.

• The data sources could potentially have their own biases (topical, style of writing, etc.). We expect future works to be cautious while generalizing results obtained on this dataset.

Figure 1 :
Figure 1: Example MCTs and the corresponding article titles from two multilingual data sources: (A) Dainik Jagran news article and (B) Man-ki-baat speech transcript. We color code the tokens as: English, Hindi, and language independent.

Figure 3 :
Figure 3: The mean and standard deviation of the dual MEC scores for different data sources. The CMI score is scaled between 0 and 1.
Table 5 :
The best-identified dual MEC thresholds (α and β) along with the accuracy of identifying MCT on various data sources in the SAnD dataset.

Figure 4 :
Figure 4: The mean and standard deviation of the accuracy over the various dual MEC threshold combinations. The red dot corresponding to each data source indicates the accuracy against the best-identified thresholds.

Table 1 :
Comparison of the MUTANT dataset with the currently available datasets in the Hinglish language.

Table 2 :
Number of articles in various news categories in the DB and DJ datasets.

PM speech (PMS): The majority of the Indian Prime Minister's speeches (different from the MKB speeches) are stored digitally on the PM India website. We extract 694 of these speeches, recorded between November 2016 and October 2021.

Hindi news articles: Here, we scrape data from two major Hindi daily news websites. Collectively, we denote this data source as D_news.

Dainik Bhaskar (DB): Dainik Bhaskar is one of the most popular Hindi newspapers in India. It is ranked 4th in the world by circulation according to World Press Trends 2016. They have digitized the daily newspapers on their website. Articles on the DB website are divided into many categories such as 'Entertainment' and 'Sports'. We extract 115,324 articles uploaded to the website between February 2019 and May 2022. In Table 2, we present the category-wise distribution of the articles scraped from the DB website.

Table 3 :
Distribution of the scraped articles from the various data sources. AW: average number of words. AC: average number of characters. %E: percentage of English tokens. %H: percentage of Hindi tokens.

Table 6 :
Results on the D_speech dataset.

Table 9 :
MUTANT dataset statistics. A: Articles, M: MCTs, and H: Headings. The INC and MKB datasets contain generic and very uninformative headlines, and we do not include them in the final dataset.

Table 10 :
Qualitative evaluation of the MUTANT dataset.