MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation

Code-mixing is the phenomenon of mixing words and phrases from two or more languages in a single utterance of speech or text. Due to its high linguistic diversity, code-mixing presents several challenges in evaluating standard natural language generation (NLG) tasks. Various widely popular metrics perform poorly on code-mixed NLG tasks. To address this challenge, we present a metric-independent evaluation pipeline, MIPE, that significantly improves the correlation between evaluation metrics and human judgments on generated code-mixed text. As a use case, we demonstrate the performance of MIPE on machine-generated Hinglish (code-mixing of the Hindi and English languages) sentences from the HinGE corpus. The proposed evaluation strategy extends to other code-mixed language pairs, NLG tasks, and evaluation metrics with minimal to no effort.


Introduction
Code-mixing (hereafter 'CM') is a commonly observed communication pattern in which a multilingual speaker mixes words and phrases from multiple languages. CM is widespread across language pairs around the globe, such as Spanish-English (Spanglish) and Hindi-English (Hinglish). Various studies (Baldauf, 2004) have predicted high growth in the number of CM speakers, which would surpass the number of native speakers of several globally popular languages (e.g., English).
We observe a growing interest in the computational linguistics community in studying CM NLG tasks. Recently, various resources and systems have been proposed that explore different dimensions of CM NLG (Yang et al., 2020; Gautam et al., 2021; Gupta et al., 2021; Rizvi et al., 2021; Gupta et al., 2020; Jawahar et al., 2021). Evaluation of CM NLG tasks is challenging due to high linguistic diversity and a lack of standardization. To address this challenge, Srivastava and Singh (2021b) proposed the HinGE corpus for Hinglish CM text generation and evaluation (see Section 2 for details). The HinGE corpus demonstrates the inefficacy of various widely popular metrics on CM data.
In this paper, we choose five evaluation metrics (see Section 3 for details), as discussed in Srivastava and Singh (2021b), to demonstrate the efficacy of MIPE. Our proposed metric-independent pipeline (MIPE) augments these metrics and addresses four major linguistic bottlenecks: (i) spelling variations, (ii) language switching, (iii) missing words, and (iv) the limited number of reference sentences associated with CM NLG systems. The main contributions are:
• We identify four major reasons for the poor performance of various widely popular evaluation metrics in code-mixed NLG evaluation.
• We propose a metric independent evaluation pipeline MIPE that addresses the identified bottlenecks in CM NLG evaluation. Furthermore, we show its efficacy in generating highly correlated metric scores against human scores.
The rest of the paper is organized as follows. In Section 2, we discuss the dataset for the CM NLG evaluation task. In Section 3, we present the MIPE pipeline, which addresses the four major bottlenecks for effective CM NLG evaluation. We discuss the results in Section 4. In Section 5, we discuss the current state and future directions. We conclude the discussion in Section 6.

Dataset
Various recent works address the underlying challenges of CM NLG, and numerous resources and systems have been proposed to advance the field. In our experiments, we use the HinGE corpus proposed by Srivastava and Singh (2021b). The HinGE corpus contains 1,976 English-Hindi parallel sentences from the IIT-B parallel corpus (Kunchukuttan et al., 2018). Corresponding to each English-Hindi parallel sentence pair, HinGE has two variants of CM Hinglish sentences:
• Human-generated Hinglish sentences: Srivastava and Singh (2021b) employed eight human annotators to generate the Hinglish sentences. Each parallel sentence pair is annotated by a single human annotator, who generated at least two Hinglish sentences for the pair. On average, 2.5 Hinglish sentences are generated for each parallel sentence pair.
• Machine-generated Hinglish sentences: Srivastava and Singh (2021b) propose two rule-based algorithms to generate CM sentences. They leverage matrix-frame theory to generate Hinglish sentences in which Hindi is the matrix language and English tokens are embedded. The proposed algorithms differ significantly in their level of granularity (i.e., word vs. phrase). We use the acronyms WAC (word-aligned code-mixing) and PAC (phrase-aligned code-mixing) for the two algorithm variants in the rest of the paper.

Figure 1: Example of the CM sentences generated by the annotator, along with the WAC- and PAC-generated CM sentences from the parallel English-Hindi sentence pair. Two human annotators rate each machine-generated sentence on a scale of 1-10.
In addition to the machine-generated Hinglish sentences, HinGE has human ratings corresponding to each generated sentence. The human rating varies between 1-10, indicating low to high generation quality. Two human annotators have rated each machine-generated CM sentence. Figure 1 shows example CM sentences generated by humans and by the two rule-based algorithms, along with the ratings of the machine-generated CM sentences. Figure 2 shows the distribution of human ratings for the machine-generated Hinglish sentences. WAC-generated sentences receive relatively high ratings (> 6) compared to PAC, and WAC shows a lower degree of human disagreement than PAC. Note that, unlike the other metrics, lower WER and TER values represent better generation performance. Tables 1 and 2 compare the five metric scores against the human ratings for WAC and PAC (see the scores in the columns headed 'Without MIPE'). In addition, Srivastava and Singh (2021b) present a correlation study between the human ratings and the metric scores. For this purpose, they divide the ratings into three buckets:
• Bucket 3: Human rating between 6-10.
Table 4 shows the correlation between the human ratings and the metric scores for WAC and PAC (see the scores in the columns headed 'Without MIPE'). The correlation scores indicate scope to build systems that show a higher correlation with human judgment.

MIPE
As discussed in the previous section, the widely used evaluation metrics fail to capture the linguistic diversity of CM data. Based on empirical observations over the 10 datasets used in Srivastava and Singh (2021a), we identify four major reasons for the failure of NLG evaluation metrics on CM data, and we propose a metric-independent evaluation pipeline, MIPE, for effective evaluation. Using MIPE, we first reduce the spelling variations (see Section 3.1) and the language switching (see Section 3.2) in the candidate Hinglish sentence. Next, we introduce a penalty (see Section 3.3) on the evaluation score based on the degree of importance of the missing words in the candidate Hinglish sentence. Finally, we address the challenge of a limited number of reference sentences (see Section 3.4) by segmenting the candidate and the reference sentences into phrases and leveraging their paraphrasing capability. Figure 3 shows the architecture of the proposed evaluation pipeline.
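The way these stages combine can be sketched as a single score-adjustment step. This is a minimal sketch, not the authors' implementation: `metric`, `mwp`, and `phrase_score` are placeholders for the base metric and the two adjustment terms computed by the later stages (Sections 3.3 and 3.4), and the sign convention follows the paper's note that adjustments flip for error metrics such as WER and TER.

```python
def mipe_score(metric, candidate, references, mwp, phrase_score,
               lower_is_better=False):
    """Combine a base metric with MIPE's two adjustment terms.

    `candidate` and `references` are assumed to already be normalized by
    the spelling-variation and language-switching stages; `mwp` and
    `phrase_score` come from Sections 3.3 and 3.4 respectively.
    """
    base = metric(candidate, references)
    # The missing-word penalty worsens the score and PhraseScore improves
    # it; for error metrics such as WER/TER, "worse" means higher, so the
    # signs of both adjustments flip.
    if lower_is_better:
        return base + mwp - phrase_score
    return base - mwp + phrase_score
```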

Spelling variations
The first challenge to effective evaluation is the non-standard spelling of code-mixed words. E.g., the words kanekt, connect, and connekt convey the same meaning in a Hinglish sentence. Due to the lack of writing standards for code-mixed languages, speakers often use their phonetic understanding of the source languages to write CM sentences. Hence, in most spelling variations, the addition, omission, and substitution of letters leave the phonetics almost unchanged. Specifically, we observe three major reasons for spelling variations:
• R1: character repetition
• R2: replacement with a similar-sounding character
• R3: vowel omission
To address these problems, we normalize words such that similar-sounding words are grouped together. We leverage the concept of Phonetic Dissimilarity (PDS, Toutanova and Moore (2002)) to address the spelling variations in the CM language. Our proposed PDS algorithm is a variant of the popular dynamic-programming-based edit distance algorithm. Like edit distance, PDS quantifies the dissimilarity between two strings by counting the minimum number of edit operations (addition, deletion, and substitution) required to transform one string into the other. In PDS, we assign different costs to each edit operation based on the phonetic characteristics of the corresponding characters of the two words and the edit operation under consideration. To access the phonetic characteristics, we use a corpus of all possible pronunciations of the English alphabet¹. Algorithm 1 describes PDS between a word w1 (in the candidate CM sentence) and w2 (in the reference CM sentences). To address R1, we remove repeating characters from both words. By default, we keep the addition and deletion costs = 1 and the substitution cost = 2. To address R2, we decrease the substitution cost to ρ_sub for similar-sounding characters, as the substitution of one such character for the other is highly likely. To address R3, we decrease the addition cost of vowels to ρ_add and the deletion cost of vowels to ρ_del, where ρ_add > ρ_del. This reflects the empirical observation that the omission of a vowel is much more likely than an addition. Further, we decrease the addition and deletion costs of a possible silent character to ρ_sil. We consider the minimum of PDS(w1, w2) and PDS(w2, w1) as the final PDS score identifying the spelling variation between words w1 and w2. In our experiments, we keep ρ_sub = ρ_add = ρ_sil = 0.75 and ρ_del = 0.25.

Figure 3: The architecture of the proposed CM NLG evaluation system MIPE. Machine-generated candidate CM sentences are generated by two rule-based algorithms (WAC and PAC). We reduce the spelling variation and language switching for both the candidate and reference sentences based on phonetics. A penalty is applied to the words in the candidate sentence that are not present in any of the reference sentences. We account for limited reference sentences by chunking the candidate and reference sentences into trigram phrases; the words in the candidate trigram phrases are assigned a score based on their presence in the reference phrases, and this candidate phrase score is used to account for the limited reference sentences. The metric score is calculated on the modified reference and candidate sentences. A missing-word penalty and a penalty for limited reference sentences are added to (or subtracted from) the modified metric score. Note that a penalty is subtracted from the modified metric score if a lower metric score indicates better performance (e.g., WER and TER).
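The weighted edit distance described above can be sketched as follows. The cost parameters follow the paper (ρ_sub = ρ_add = ρ_sil = 0.75, ρ_del = 0.25), but the similar-sounding pairs and the silent-character set are illustrative assumptions, not the authors' exact phonetic resources.

```python
import re

# Cost parameters from the paper; SIMILAR and SILENT below are
# illustrative assumptions, not the authors' exact phonetic corpus.
RHO_SUB, RHO_ADD, RHO_DEL, RHO_SIL = 0.75, 0.75, 0.25, 0.75
VOWELS = set("aeiou")
SIMILAR = {frozenset("ck"), frozenset("sz"), frozenset("jz"), frozenset("fv")}
SILENT = set("h")  # e.g. 'h' is often silent in romanized Hindi spellings

def collapse_repeats(word):
    """Address R1: 'connnect' -> 'conect'."""
    return re.sub(r"(.)\1+", r"\1", word.lower())

def pds_directed(w1, w2):
    """Weighted edit distance transforming w1 into w2."""
    w1, w2 = collapse_repeats(w1), collapse_repeats(w2)

    def del_cost(c):  # deleting a character of w1 (R3: cheap vowel omission)
        if c in SILENT:
            return RHO_SIL
        return RHO_DEL if c in VOWELS else 1.0

    def add_cost(c):  # inserting a character of w2
        if c in SILENT:
            return RHO_SIL
        return RHO_ADD if c in VOWELS else 1.0

    def sub_cost(a, b):  # R2: cheap swap of similar-sounding characters
        if a == b:
            return 0.0
        return RHO_SUB if frozenset((a, b)) in SIMILAR else 2.0

    # Standard dynamic-programming edit distance with weighted operations.
    n, m = len(w1), len(w2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(w1[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + add_cost(w2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + del_cost(w1[i - 1]),
                           dp[i][j - 1] + add_cost(w2[j - 1]),
                           dp[i - 1][j - 1] + sub_cost(w1[i - 1], w2[j - 1]))
    return dp[n][m]

def pds(w1, w2):
    """Final score: the minimum over both directions, as in the paper."""
    return min(pds_directed(w1, w2), pds_directed(w2, w1))
```

With these costs, `pds("connect", "connekt")` is 0.75 (one similar-sounding substitution after repeats are collapsed), well below the cost of an arbitrary substitution.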

Similar Words
Identifying similar words within the same language or across languages is a challenging task in CM languages. For example, the two phrases "in the market" and "in the bazaar" convey the same semantics, but most automatic evaluation metrics fail to identify this semantic similarity. To address the challenge of token-level similarity, we need a common representation of words across the source languages. To this end, we propose a Similar Word Search (SWS) procedure; Algorithm 2 describes it. Given a word from the candidate CM sentence as input, the SWS procedure returns a similar word from the corresponding reference sentences, selecting the word from the reference sentences that yields the minimum PDS value. The SWS procedure returns a word from the reference set if the minimum PDS value is less than σ_thres. Otherwise, it computes the pairwise cosine distance (in a cross-lingual word-embedding space) between each word in the set of reference words and the input word. To create the cross-lingual embedding space, we use the pretrained word vectors of dimension 300 for English and Hindi from fastText (Bojanowski et al., 2017). For the shared representation, we use VecMap (Artetxe et al., 2018) to learn the mapping in an unsupervised fashion with the default settings, using the English and Hindi sentences from the IIT-B parallel corpus (Kunchukuttan et al., 2018). If the cosine similarity is greater than σ_cos, the SWS procedure returns the word from the reference set; otherwise, we assume that no similar word exists in the reference set. In our experiments, we keep σ_thres = 2 and σ_cos = 0.5.
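The SWS procedure can be sketched as a two-stage lookup. Here `pds` is the phonetic-dissimilarity function of Section 3.1 and `embed` is any word-to-vector lookup; the paper uses fastText vectors aligned with VecMap, while this sketch accepts an arbitrary embedding function so it stays self-contained.

```python
import numpy as np

SIGMA_THRES, SIGMA_COS = 2.0, 0.5  # thresholds used in the paper

def sws(word, reference_words, embed, pds):
    """Similar Word Search: phonetic match first, embedding fallback.

    `pds` is the phonetic-dissimilarity function of Section 3.1; `embed`
    maps a word to its cross-lingual vector (fastText + VecMap in the
    paper; any lookup works for this sketch).
    """
    # Stage 1: reference word with the minimum phonetic dissimilarity.
    best = min(reference_words, key=lambda r: pds(word, r))
    if pds(word, best) < SIGMA_THRES:
        return best

    # Stage 2: cosine similarity in the shared embedding space.
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    best = max(reference_words, key=lambda r: cos(embed(word), embed(r)))
    if cos(embed(word), embed(best)) > SIGMA_COS:
        return best
    return None  # no sufficiently similar word in the reference set
```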

Missing words
Generally, a generated candidate sentence misses some words, which significantly impacts the automatic evaluation scores. Some words are more important than others, but most metrics treat them equally (M1). Furthermore, most metrics match exact words, with no flexibility for spelling variations or language switching (M2). Here, we address both problems by applying a missing-word penalty to the metric score with some writing-style flexibility. To address M1, we use the WAC procedure² to generate a large Hinglish corpus (hereafter 'ParallelCorp') of 2,132,184 sentences. For creating the parallel corpus, we collect English sentences from multiple sources³,⁴,⁵,⁶ and translate them (if not already translated) into Hindi using the Google Translate API. We calculate the IDF value (inverse document frequency) of each word in the Hinglish corpus. Words with high IDF values occur rarely and hence carry more semantic information. If a word is not present in ParallelCorp, we consider it semantically important. To address M2, we relax the exact-match condition by postulating that either the word itself or a variant of it is present in the candidate sentence. Here, we allow two types of variation: (i) minor spelling variations and (ii) a language switch (for more details, see Sections 3.1 and 3.2). We use the SWS procedure to find a word variant, keeping a maximum distance value of 1. Algorithm 3 describes the missing-word penalty (MWP) in detail. For each word w in a reference sentence, we check for the presence of w or one of its variants in the candidate sentence. If w is not present, we add w's IDF value as a penalty for its absence. We repeat the procedure for each reference sentence and take the minimum penalty among all reference sentences. Finally, we subtract the MWP score from the metric score for a given evaluation metric.

² We employ WAC due to its capability to generate high-quality sentences (as shown in Srivastava and Singh (2021b)). Also, a Hinglish sentence generated by WAC contains words from only the source English and Hindi sentences, which in turn does not influence the IDF values of the generated words to a large extent.
³ https://www.kaggle.com/kazanova/sentiment140
⁴ https://www.kaggle.com/arkhoshghalb/twitter-sentiment-analysis-hatred-speech
⁵ http://www.cfilt.iitb.ac.in/iitb_parallel/
⁶ https://bit.ly/2XQjrU6
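The IDF computation and the MWP procedure can be sketched as follows. This is a hedged sketch: `find_variant` stands in for the SWS-based spelling/language-variant lookup, and assigning the high default `mu_miss` to words absent from the IDF dictionary is an assumption consistent with the paper's treatment of rare words.

```python
import math

def build_idf(corpus_sentences):
    """IDF over ParallelCorp, treating each Hinglish sentence as a document."""
    n = len(corpus_sentences)
    df = {}
    for sent in corpus_sentences:
        for tok in set(sent.lower().split()):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / c) for tok, c in df.items()}

def missing_word_penalty(candidate, references, idf, find_variant,
                         mu_miss=20.0):
    """Minimum, over reference sentences, of the IDF-weighted penalty
    for reference words absent from the candidate.

    `find_variant(word, cand_tokens)` stands in for the SWS-based
    variant lookup; `mu_miss` is an assumed high default for words
    missing from the IDF dictionary.
    """
    cand = candidate.lower().split()
    penalties = []
    for ref in references:
        penalty = 0.0
        for w in set(ref.lower().split()):
            if w in cand or find_variant(w, cand) is not None:
                continue  # the word (or a variant) is present
            penalty += idf.get(w, mu_miss)
        penalties.append(penalty)
    return min(penalties)
```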

Limited Reference Sentences
A sentence can be paraphrased in numerous ways: by interchanging subject and predicate, active and passive voice, or first-, second-, and third-person perspectives. With code-mixed text, the paraphrasing possibilities increase significantly. For automatic evaluation, it is infeasible to generate all possible paraphrases as reference sentences. Even though the HinGE dataset has at least two reference sentences for each candidate sentence, this is insufficient to cover all the possibilities. Thus, paraphrasing drastically limits the evaluation capabilities of various metrics. To address this problem, we present the PhraseScore algorithm; Algorithm 4 describes it. We split the candidate sentence and the set of reference sentences into trigram phrases. If a word w in a candidate phrase exists in one of the reference phrases, we add the IDF value of w to the score for that phrase; otherwise, we subtract the IDF value as a penalty. The phrase scores are aggregated, normalized by the number of phrases in the candidate sentence, and divided by the missing-word penalty of the candidate sentence (adding 0.0001 to the penalty to prevent division by zero). If a word is not present in the IDF dictionary, we assign it a relatively high value (µ_miss) to indicate that it is a rare word of high importance. Finally, we increase the metric score by adding the candidate sentence's PhraseScore. In our experiments, we keep µ_miss = 20. Due to the unavailability of a paraphrasing system for code-mixed language, the formulation of the PhraseScore algorithm relies on the assumption that the trigram phrases of a sentence can be reordered to create new sentences.
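PhraseScore can be sketched under stated assumptions: the paper only says sentences are split into trigram phrases, so the sliding trigram windows below are one plausible reading, and `mwp` is the missing-word penalty of Section 3.3.

```python
def phrase_score(candidate, references, idf, mwp, mu_miss=20.0):
    """IDF-weighted reward for candidate trigrams covered by the references.

    Sliding trigram windows are an assumption (the paper only says the
    sentences are chunked into trigram phrases); `mwp` is the
    missing-word penalty of Section 3.3.
    """
    def windows(sent):
        toks = sent.lower().split()
        return [toks[i:i + 3] for i in range(max(1, len(toks) - 2))]

    # Every word occurring in any reference trigram phrase.
    ref_words = set()
    for ref in references:
        for phrase in windows(ref):
            ref_words.update(phrase)

    cand_phrases = windows(candidate)
    total = 0.0
    for phrase in cand_phrases:
        for w in phrase:
            weight = idf.get(w, mu_miss)  # rare/unknown words weigh more
            total += weight if w in ref_words else -weight

    # Normalize by the number of candidate phrases and dampen by the
    # missing-word penalty; 0.0001 prevents division by zero.
    return total / len(cand_phrases) / (mwp + 0.0001)
```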

Results and Evaluation
We evaluate the WAC and PAC procedures augmented with the MIPE pipeline against all five metrics (as discussed in Section 2). Tables 1 and 2 show the effect of MIPE on the five metrics. As expected, all metrics show better scores with the MIPE augmentation. The metric scores after MIPE show a high correlation with the metric scores without MIPE (see Table 3), indicating that the improvements in the metric scores are consistent throughout and not by chance. Table 4 shows the effect of MIPE on the correlation with the human scores. We use the same criteria to bucket the human ratings as discussed in Section 2. We observe a higher correlation in all three buckets for WAC augmented with MIPE, and this improvement is consistent across all the metrics. For PAC augmented with MIPE, we observe a decrease in correlation in the second bucket, which can be attributed to (i) the relatively large number of poor-quality (low human score) sentences generated by PAC, and (ii) the fact that rating a poor-quality CM sentence is a challenging task for humans due to the sentence's lower readability. For the remaining buckets, PAC with MIPE shows a higher correlation with the human scores.
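The bucket-wise correlation analysis can be sketched as follows. The paper does not name the correlation coefficient used, so Pearson correlation here is an assumption, and the bucket ranges are supplied by the caller rather than hard-coded.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def bucket_correlations(human, metric, buckets):
    """Correlation between human ratings and metric scores per bucket.

    `buckets` maps a bucket name to an inclusive (low, high) rating range.
    """
    out = {}
    for name, (lo, hi) in buckets.items():
        pairs = [(h, m) for h, m in zip(human, metric) if lo <= h <= hi]
        hs, ms = zip(*pairs)
        out[name] = pearson(hs, ms)
    return out
```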

Current State and Future Directions
The results discussed in Section 4 demonstrate the need to build metrics, theories, and experiments for better CM NLG evaluation. Some of the challenges and limitations of the proposed MIPE pipeline for effective CM NLG evaluation include:
• Due to the unavailability of resources for other CM language pairs, the MIPE pipeline is tested on a single CM language. We need to extend the proposed evaluation strategy to other CM language pairs.
• The presence of two different languages in a single CM sentence increases the paraphrasing possibilities to a much larger extent. We need metrics that attend to CM sentences beyond the bag-of-words model and that can also account for paraphrasing.
• There are various other factors (beyond the four reasons discussed in this paper) that influence the evaluation of CM NLG tasks, such as named entities and transliteration. The MIPE pipeline does not currently account for these factors.
• The code-mixed sentences in the HinGE dataset are not collected from social media platforms. Code-mixed data from social media tends to be noisier and more distorted, which could influence the performance of the MIPE pipeline.
As discussed, there are currently several limitations in CM NLG evaluation that need to be addressed in order to build effective CM NLG systems for multilingual societies. Some of the lessons learned and future directions for CM NLG evaluation are:
• Limited resource availability is one of the major bottlenecks in CM NLG tasks and evaluation. Currently, the available resources are small compared to those for monolingual NLG tasks.
• In contrast to the MIPE augmentation pipeline, which addresses the various challenges independently and attempts to reconstruct the noisy CM text for effective evaluation, we need systems that can work with the noisy nature of code-mixed text directly.
• The two languages participating in CM influence various constructs of the target CM sentence, such as grammar and syntax. The current experimentation with only one CM language needs to be extended to other CM languages.
• Recently, we have observed a rise in the availability of multilingual language models (LMs). These LMs could be used to build effective CM NLG evaluation systems.
• The current evaluation metrics seem to perform poorly on CM languages. We need to build dedicated metrics for CM NLG evaluation tasks that can leverage the linguistic diversity of CM data.

Conclusion
In this paper, we presented a metric-independent evaluation pipeline (MIPE) for effective code-mixed NLG evaluation. The proposed pipeline yields a high correlation between human scores and the underlying evaluation metrics. Besides the four significant challenges to CM NLG evaluation addressed here, in the future we also plan to address other challenges in code-mixed text, such as the presence of named entities, informal writing style, and missing context.