Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criterion (e.g., coverage) and perturb the output such that the quality gets affected only along this specific criterion (e.g., the coverage drops). We show that existing evaluation metrics are not robust against even such simple perturbations and disagree with scores assigned by humans to the perturbed output. The proposed templates thus allow for a fine-grained assessment of automatic evaluation metrics, exposing their limitations, and will facilitate better design, analysis and evaluation of such metrics. Our templates and code are available at https://iitmnlp.github.io/EvalEval/


Introduction
As the number of tasks and benchmarks for NLG has increased (Gehrmann et al., 2021), the challenges in evaluating NLG systems have also continued to grow (Liu et al., 2016; Nema and Khapra, 2018; Sai et al., 2019). One reliable way of evaluating NLG systems is to collect human judgements. However, this is a time-consuming and expensive process (Freitag et al., 2021; Deriu et al., 2019; Howcroft et al., 2020). Hence, automatic evaluation metrics such as BLEU (Papineni et al., 2002), which are quicker to compute, have become popular despite being less reliable (Callison-Burch et al., 2006; Reiter, 2018).
The survey by Sai et al. (2020b) shows that more than 35 automatic evaluation metrics have been proposed for NLG since 2014. However, there is no careful evaluation of the ability of such metrics to assess the quality of the output of an NLG system on multiple desired criteria. For example, consider the task of dialog evaluation, where humans are asked to score the output on multiple criteria such as fluency, adequacy, coherence, informativeness, engagingness, consistency, etc. Contrast this with automatic evaluation metrics such as BLEU, BLEURT (Sellam et al., 2020), DEB (Sai et al., 2020a), ADEM (Lowe et al., 2017), etc., which assign a single score to the output. What does this score indicate? More specifically, does a low DEB score indicate that the output is not fluent, that it is fluent but not coherent, or that it is neither fluent nor coherent? Hence, a single overall score assigned by automatic evaluation metrics is not very informative in deciding which aspects of model improvement one should focus on.
The question then is why current automatic evaluation metrics produce only a single overall score. This is simply because of a conscious choice made while designing automatic evaluation metrics. In particular, current works only focus on evaluating whether the scores assigned by the proposed metric correlate well with the overall quality scores assigned by humans as opposed to all relevant criteria. In this work, we make a case for shifting the focus to all relevant criteria while evaluating such metrics. To this end, we first do a systematic study involving 6 NLG tasks, 18 different human evaluation criteria (fluency, coverage, coherence, consistency, etc.) and 25 automatic evaluation metrics. We take existing English datasets containing human judgements for various tasks and criteria and make two important observations. First, for a given task, human scores for different criteria often have a low correlation, thereby suggesting that these criteria cannot be clubbed together and evaluated using a single score assigned by an automatic evaluation metric. Second, none of the automatic evaluation metrics have a high correlation with human scores for any of the desired criteria for a given task.
The above results highlight a lacuna in the evaluation of automatic evaluation metrics: their ability to assess the output on multiple criteria is not evaluated. In this work, we propose a flexible framework which allows a systematic evaluation of the capabilities of an automatic evaluation metric. In particular, we propose CheckList-style templates (Ribeiro et al., 2020) which evaluate the robustness of the metrics to certain perturbations targeting specific criteria. We illustrate this idea with an example in Table 1. In row 2 of Table 1, a gold standard output is perturbed by changing named entities, thereby affecting its factual correctness, which is important for data-to-text generation. If an automatic evaluation metric indeed evaluates factual correctness, then its score should drop when presented with such a perturbed output.
For the 6 NLG tasks mentioned earlier, we create 34 such perturbation templates covering 18 different evaluation criteria. We then instantiate these templates to create large-scale test cases. For every perturbation, we also collect human judgements to understand how much a human would change their score when shown such a perturbed output. We find that for several perturbations, the scores assigned by automatic evaluation metrics do not agree with the scores assigned by humans, thereby indicating that current automatic evaluation metrics are not robust to such perturbations (i.e., they do not really evaluate the desired criteria). Overall, we believe that the proposed templates provide a better framework for a more fine-grained evaluation of automatic evaluation metrics which goes much beyond computing correlations with human scores.

Criteria used in Human Evaluations
The goal of this work is to carefully evaluate automatic evaluation metrics with a focus on their ability to capture the diverse set of criteria used by humans while assessing NLG systems. To begin with, we describe the criteria that the output of an NLG system must satisfy for the 6 NLG tasks that we consider in this work, viz., machine translation (MT), dialog generation (DG), automatic summarisation (AS), question generation (QG), data-to-text generation (D2T) and image captioning (IC). Over the years, different works have proposed different criteria for evaluating NLG systems. In this work, we consider a popular set of criteria for each task as summarised in Sai et al. (2020b) and presented in Table 2. Given the wide variety of criteria used for each task, one obvious question to ask is whether we really need so many criteria or whether a single overall score is enough. One could argue that it is obvious from the definitions of the criteria that each of them is unique and a good score on one (say, fluency) may not necessarily imply a good score on another (say, coverage). However, we provide a quantitative argument for this by computing the correlations between human scores for different criteria as described below.

Correlations between different criteria
We use existing publicly available datasets containing human judgement scores on multiple criteria for each of the 6 tasks described earlier. For example, the dataset of Castro Ferreira et al. (2020) contains 3025 samples of outputs generated by data-to-text generation systems that participated in the WebNLG 2020 challenge. For each of these samples, the organisers asked humans to rate the output based on 5 criteria, viz., fluency, data coverage, relevance, correctness and text structure. We use these scores to compute the correlations between the scores of all the $\binom{5}{2}$ pairs of criteria. We repeat this for the other tasks using the datasets described in Table 3 [2][3]. Using these annotations, we compute the pairwise Kendall tau correlations between all criteria for each task, as seen in the accompanying figure.

[2] For AS, we could not find a dataset containing human judgements for the set of criteria in Sai et al. (2020b). Hence, we use the dataset provided by Fabbri et al. (2020).
[3] Note that all of the datasets mentioned in Table 3 were collected using well established methods to ensure that the annotations were of high quality. Some of these datasets do not explicitly report the Inter Annotator Agreement (IAA) scores whereas others (Fabbri et al., 2020; Castro Ferreira et al., 2020; Nema and Khapra, 2018) report a good IAA score ranging from 0.63 to 0.71.

[Table fragment, Image Captioning] Relevance: The caption should be specific and related to the image.

We see that, across tasks, for most pairs of criteria, the correlation is moderate (between 0.3 and 0.5) to low (< 0.3). The highest correlation of 0.76 is observed between interestingness and enjoyability for dialogue generation. However, other criteria such as avoiding repetition, inquisitiveness, and making sense have low correlations with most of the other criteria. We make similar observations for the correlations between the criteria for other tasks. Even for IC, the correlation between the 2 criteria of thoroughness and correctness is 0.41. For MT, the commonly used criteria of fluency and adequacy were found to be highly correlated, with a Pearson correlation coefficient of 0.69 (Banchs et al., 2015). This justifies why WMT evaluations now ask humans to give only a single score indicating overall quality. However, given the low to moderate correlations between criteria for other tasks, a similar strategy is not prudent for these tasks.
Takeaway: For tasks whose linguistic criteria show a low or at best moderate correlation with each other, a single score assigned by an automatic metric is inadequate for a comprehensive assessment.
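The pairwise comparison of criteria described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the human scores are available as one equal-length list per criterion, the ratings shown are made up, the function names `kendall_tau_b` and `pairwise_criteria_correlations` are hypothetical, and the tie-corrected tau-b variant is an assumption because the text does not say which variant of Kendall tau was used.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Tie-corrected Kendall tau (tau-b) between two equal-length score lists."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pair tied on both criteria: excluded from both terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def pairwise_criteria_correlations(scores):
    """Map every unordered pair of criteria to its Kendall tau-b."""
    return {
        (a, b): kendall_tau_b(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical human ratings of five system outputs (not data from the paper).
human_scores = {
    "fluency": [5, 4, 4, 2, 1],
    "data_coverage": [2, 5, 3, 4, 1],
    "correctness": [4, 5, 3, 3, 1],
}
correlations = pairwise_criteria_correlations(human_scores)
for pair, tau in correlations.items():
    print(pair, round(tau, 2))
```

With n criteria the dictionary holds n(n-1)/2 entries, which for the 5 WebNLG criteria gives the 10 pairs mentioned above.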

Perturbation Checklists
So far we have established that if automatic evaluation metrics are to be used as a substitute for human evaluations as a whole, then they should be capable of evaluating the output on multiple desired criteria. However, the current recipe of proposing and evaluating evaluation metrics does not take this into account. To enable such a systematic evaluation of automatic evaluation metrics, we propose perturbation checklists. Similar to the original Checklist paper (Ribeiro et al., 2020), the idea is to evaluate the performance of the evaluation metric in detecting criteria-specific changes in the output.

[Table 4 (fragment): sample perturbations, given as template: original -> perturbed]

All tasks, Invariance
  Contractions: "We are going to embark on an adventure." -> "We're going to embark on an adventure."
  Numerals to words: "The flight will be delayed by 2 hours." -> "The flight will be delayed by two hours."

MT, Adequacy
  Dropping out words or phrases: "I was being followed." -> "I followed."
  Add extra text: "This book is so inspiring." -> "This book is so inspiring, I forgot."
  Negation/antonyms: "It will rain on Monday." -> "It will not rain on Monday."

Relevance
  Repeat phrases: "Beethoven was a German musician" -> "Beethoven was a German musician and German musician."
  Perturb names: "Phillips was a child prodigy." -> "James was a child prodigy."
We design such perturbation templates for each relevant criterion for each of the 6 tasks as shown in Table 4. For example, consider the criterion of fluency, which is relevant for all the tasks. Now consider a perturbation template for this criterion which simply drops the stop words in the output. Such a perturbation would definitely affect the fluency of the output. If an automatic evaluation metric is capable of assessing fluency, then this drop in the fluency of the output should get reflected in the score assigned by the metric. More formally, let $p$ be the original output and $p^t_c$ be the output obtained by applying the perturbation template $t$ for the criterion $c$. Further, let $f_e(p)$ be the score assigned by a given evaluation metric $e$ to the output $p$, normalised to be in the range [0, 1]. If the metric $e$ is capable of assessing fluency then we would expect $f_e(p^t_c)$ to be lower than $f_e(p)$. Now, further let $h(p)$ and $h(p^t_c)$ be the scores (also normalised to have range [0, 1]) assigned to the original and perturbed outputs by human annotators. We then define a metric $s^t_c(e)$ which captures the ability of the metric $e$ to detect the perturbation $t$ for the desired criterion $c$.
The score $s^t_c(e)$ as defined above thus captures the deviation between a human's perception of the drop in quality and the metric $e$'s perception of the drop in quality.
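To make the deviation concrete, the sketch below computes it for one perturbation. The displayed form of the equation is not reproduced in this text, so the expression used here, $s^t_c(e) = (h(p) - h(p^t_c)) - (f_e(p) - f_e(p^t_c))$, is an assumption inferred from the verbal description (human-perceived drop minus metric-perceived drop), including its sign convention. The function names and the example scores are hypothetical.

```python
def quality_drop(original_score, perturbed_score):
    """Drop in a score normalised to [0, 1] when moving from p to p^t_c."""
    for score in (original_score, perturbed_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be normalised to the range [0, 1]")
    return original_score - perturbed_score


def deviation_score(h_orig, h_pert, f_orig, f_pert):
    """Assumed form of s^t_c(e): human-perceived drop minus metric-perceived drop.

    A value near 0 means the metric reacts to the perturbation about as much
    as humans do; a large positive value means the metric under-penalises it.
    """
    return quality_drop(h_orig, h_pert) - quality_drop(f_orig, f_pert)


# Hypothetical scores: humans see a large drop, the metric only a small one.
s = deviation_score(h_orig=1.0, h_pert=0.25, f_orig=1.0, f_pert=0.75)
print(s)
```

For an invariant template the human drop is close to zero, so under this reading any drop in the metric score shows up as a negative deviation.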
We design a total of 34 such perturbation templates across all the criteria and all the tasks. Each template is manually created by us and targets a specific criterion. We also present invariant templates that do not affect any criteria although they modify the sentences. For perturbations resulting from such invariant templates the score of the metric should not drop. The invariant and fluency-based templates are common for all the tasks considered in this work. Table 4 shows sample perturbations generated by each of the templates (please refer to appendix C for a more comprehensive list of the proposed perturbations with examples for each task). These perturbed sentences are generated automatically using the Checklist framework (Ribeiro et al., 2020). This framework contains modules for performing simple string manipulations such as dropping stop words, replacing/dropping named entities, masking words or replacing them by other words/phrases. We also extend the framework with additional modules for jumbling words, changing numbers to words, subject-verb disagreement, changing gender, reordering sentences, adding spurious text, and adding redundancy at the word/phrase/sentence level.

[Figure fragment, classes of automatic evaluation metrics] End-to-End Trained: BLEURT (Sellam et al., 2020); AS: SUPERT, BLANC (Vasilyev et al., 2020); IC: SPICE (Anderson et al., 2016), TIGEr.
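Two of the string-manipulation modules named above, changing numbers to words and jumbling words, can be sketched with the standard library alone. This is an illustrative stand-in and not the released implementation: the function names are hypothetical, the number converter only handles standalone integers below 1000, and whitespace splitting replaces proper tokenization.

```python
import random
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if not 0 <= n < 1000:
        raise ValueError("only integers from 0 to 999 are supported")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    words = f"{ONES[hundreds]} hundred"
    return words if rest == 0 else f"{words} {number_to_words(rest)}"


def numerals_to_words(sentence):
    """Invariance-style template: replace standalone 1-3 digit integers."""
    return re.sub(
        r"\b\d{1,3}\b",
        lambda match: number_to_words(int(match.group())),
        sentence,
    )


def jumble_words(sentence, seed=0):
    """Fluency-style template: shuffle whitespace-separated tokens."""
    tokens = sentence.split()
    random.Random(seed).shuffle(tokens)  # seeded so the result is repeatable
    return " ".join(tokens)


print(numerals_to_words("The flight will be delayed by 2 hours."))
print(jumble_words("The flight will be delayed by 2 hours."))
```

Longer digit strings such as years are left untouched by the pattern, and decimals are not handled; a fuller version would need both.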

Experimental setup
We first do a coarse-grained evaluation of several metrics by computing their correlations with the scores assigned by humans for multiple criteria. Note that unlike existing studies which study such correlations for a small number of metrics (typically, n-gram based metrics) for a specific task (say, MT) and a single criterion (typically, overall quality), we do a more comprehensive study involving a combination of 6 tasks, 25 metrics and multiple criteria. Apart from this coarse-grained evaluation which simply looks at correlations, we also do a more fine-grained evaluation of the robustness of these metrics to different criteria-specific perturbations as summarised in Table 4. This fine-grained evaluation augments the coarse-grained evaluation and helps us understand the evaluation capabilities of these metrics. Below, we describe the datasets and evaluation metrics used in our work.
Datasets. For the coarse-grained evaluation, we use the datasets containing human judgements as described earlier in Table 3 in Section 2. For the fine-grained evaluation, we use datasets containing multiple ground truth references which can then be perturbed using our templates. For MT, we use the expanded version of newstest2017 Chinese to English dataset (Hassan et al., 2018) which contains two references for each sentence. For QG, we use the SQuAD dataset (Rajpurkar et al., 2016) which contains multiple questions for each passage. For AS, we use the curated personal narrative corpus (Ouyang et al., 2017). For DG, we use DailyDialog++ (Sai et al., 2020a) which contains two-speaker conversations on generic topics. For IC, we use the COCO component of the Composite dataset containing 5 reference captions for each image (Aditya et al., 2015). Lastly, for D2T, we use the Triples-to-Text data of the WebNLG 2020 challenge dataset (Castro Ferreira et al., 2020).
Applying perturbations. We take the reference sentences from the above task-specific datasets and apply perturbations using the Checklist framework described earlier in Section 3. We first preprocess the sentences by performing tokenization, part-of-speech tagging, named entity recognition, etc. The targeted part of the sentence is then modified either by leveraging simple string manipulation functions or by masking and generating the words/phrases using the predictions by RoBERTa (Liu et al., 2019). We provide more implementation details in appendix B.
Automatic Evaluation Metrics. We study a total of 25 evaluation metrics belonging to different classes as shown in Figure 2. For BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004), CIDEr (Vedantam [...] Fabbri et al. (2020). For all the task-specific metrics in Figure 2, we use the official code from the respective papers.
Collecting human judgements. For Equation 1, we need human judgement scores. We collect these with the help of 15 annotators who were computer science graduates with a background in the field of Natural Language Processing (NLP) and are also proficient in English. For each task and criterion, the annotators were provided with the corresponding perturbation templates and asked to provide a penalty score indicating by how much a perturbation would alter the meaning/essence of a sentence on a scale of 0-10. A score of 0 indicates that there is no difference between the original and modified sentences upon application of the perturbation, whereas a score of 10 indicates that the perturbation drastically alters sentences. These scores are normalised to be in the range [0, 1]. The term $h(p^t_c)$ in Equation 1 for perturbation $t$ is computed by subtracting the mean of all normalised human penalty scores from 1.
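The conversion from annotator penalties to $h(p^t_c)$ described above amounts to a normalisation followed by a complement. A minimal sketch, assuming the penalties for one perturbation are given as a list of numbers on the 0-10 scale; the function name and the example penalties are hypothetical:

```python
from statistics import mean


def perturbed_human_score(penalties, scale_max=10):
    """h(p^t_c): 1 minus the mean of the penalties normalised to [0, 1]."""
    if not penalties:
        raise ValueError("at least one penalty score is required")
    if any(not 0 <= p <= scale_max for p in penalties):
        raise ValueError(f"penalties must lie between 0 and {scale_max}")
    normalised = [p / scale_max for p in penalties]
    return 1.0 - mean(normalised)


# Hypothetical penalties from four annotators for a negation perturbation.
h_perturbed = perturbed_human_score([8, 9, 7, 8])
print(round(h_perturbed, 2))
```

A penalty of 0 from every annotator leaves the score at 1, and a unanimous penalty of 10 drives it to 0.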
The standard deviation of the normalised annotator scores for each perturbation lies between 0.03 and 0.2. We also measure the inter-annotator agreement by splitting annotators randomly into 2 groups and computing the Kendall tau correlation score between the average scores of the 2 groups, following Liu et al. (2016). This process was run 5 times with different seeds and the final inter-annotator correlation score was found to be 0.79.
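The split-half agreement procedure can be sketched as below. Several details are assumptions made for illustration: the annotator scores are stored as one list of per-perturbation penalties per annotator, the final score is taken to be the mean over the seeds (the text only says the process was run 5 times), and tau-b is used as the Kendall variant. All names and numbers are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import mean


def kendall_tau_b(x, y):
    """Tie-corrected Kendall tau between two equal-length lists."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def split_half_agreement(scores_by_annotator, seeds=(0, 1, 2, 3, 4)):
    """Mean Kendall tau between the average scores of two random halves."""
    annotators = sorted(scores_by_annotator)
    n_items = len(scores_by_annotator[annotators[0]])
    taus = []
    for seed in seeds:
        shuffled = annotators[:]
        random.Random(seed).shuffle(shuffled)
        half = len(shuffled) // 2
        group_means = [
            [mean(scores_by_annotator[a][i] for a in group) for i in range(n_items)]
            for group in (shuffled[:half], shuffled[half:])
        ]
        taus.append(kendall_tau_b(group_means[0], group_means[1]))
    return mean(taus)


# Hypothetical penalties from four annotators for five perturbations.
penalties = {
    "ann1": [1, 3, 5, 7, 9],
    "ann2": [0, 2, 6, 8, 10],
    "ann3": [1, 2, 4, 7, 9],
    "ann4": [2, 3, 5, 6, 8],
}
agreement = split_half_agreement(penalties)
print(agreement)
```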

Results and Discussion
We now discuss the results of our experiments. Figure 3 shows the correlations of different automatic evaluation metrics with multiple evaluation criteria, across 6 different NLG tasks. Our main observations from the figure are:
Mostly poor correlations of metrics across criteria. We observed that across all tasks and all criteria, most of the metrics have poor correlations. In particular, out of the 271 correlation values reported in Figure 3, 228 are poor (<0.3), 35 are moderate (between 0.3 and 0.5) and only 2 are high (>0.5). Surprisingly, even some very recently proposed metrics such as DEB (Sai et al., 2020a), BLANC (Vasilyev et al., 2020) and MaUde (Sinha et al., 2020) do not correlate well with human judgements on other criteria. This is despite the fact that these are task-specific metrics which use the modern machinery of pre-trained BERT-based models and are fine-tuned on human judgements for overall quality. This vindicates our stand that simply tuning for overall quality does not lead to good correlations with other criteria. We do observe a few decently correlated metrics for some of the tasks along a few criteria. Specifically, the moderate correlations are found (i) in D2T for BLEURT along the dimensions of fluency, correctness and text structure, (ii) in IC for all of the task-specific metrics, the majority of the embedding-based metrics and for METEOR, (iii) in QG for most metrics (except SMS) along fluency, for BERTScore, MoverScore, SMS and BLEURT along answerability, as well as along completeness, on which we find some of the highest correlations.
Pre-training and/or training often helps. While almost all the metrics have poor correlations with different criteria across tasks, we observe that the ones which use a pre-trained component such as static or contextualised word/sentence embeddings and/or use task-specific training data perform better. For example, BLEURT, which uses a pre-trained BERT and is fine-tuned using human judgements for MT, is among the top performing metrics across all the tasks. These findings are also consistent with those reported in the WMT20 shared task on evaluation metrics for MT (Mathur et al., 2020b).
Task-agnostic metrics versus task-specific metrics. For the tasks of QG, AS, IC and D2T we find that task-agnostic metrics such as BERTScore, MoverScore and BLEURT are consistently among the top performing metrics (i.e., their correlation scores are either the best or close to the best scores for a given task and criterion) along with the task-specific metrics. This is interesting as these task-agnostic metrics were not fine-tuned on any task-specific human judgements and were originally proposed for a different task (MT). Among the task-specific metrics, SUPERT for AS has a relatively better correlation with consistency than all other metrics. Overall, there seems to be scope for more work/improvements on task-specific metrics to capture the criteria peculiar to each task.
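The poor/moderate/high banding used in the discussion above can be expressed as a small helper. This is only a sketch: the thresholds (below 0.3, 0.3 to 0.5, above 0.5) come from the text, while the decision to band the signed value rather than its magnitude, the function name, and the sample values are assumptions.

```python
from collections import Counter


def correlation_band(value, low=0.3, high=0.5):
    """Label a correlation as poor (< low), moderate (low..high) or high (> high)."""
    if value < low:
        return "poor"
    if value <= high:
        return "moderate"
    return "high"


# Hypothetical metric-criterion correlations, not values from Figure 3.
sample_correlations = [0.12, 0.31, 0.05, 0.55, -0.10, 0.50]
band_counts = Counter(correlation_band(v) for v in sample_correlations)
print(dict(band_counts))
```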

Insights from fine-grained evaluation
We now complement the above analysis with a more fine-grained analysis using perturbation templates. To do so, we use the perturbation templates described in Section 3 and plot the deviation between metric scores and human scores using the formula in Equation 1. These results are presented in Figure 4 and summarised below.
Correlations do not reveal everything. In the previous section we observed that BERT-based metrics such as BERTScore, BLEURT and MoverScore are among the top performing metrics across tasks and criteria. However, our analysis with perturbation templates reveals that even these metrics are not robust to very simple perturbations. For example, for the task of MT consider the perturbations of adding negations, changing names, changing numeric values or replacing by antonyms in the output, which can significantly alter the meaning of the sentence and thereby affect adequacy. However, BERTScore, BLEURT and MoverScore are not able to perceive this drop in quality and have a substantial deviation from human scores. We make similar observations across tasks that such metrics are not able to detect these simple perturbations.
Task-specific nuances are not captured. Our analysis also shows that existing metrics are not capable of addressing well-known task-specific grievances. For example, for the task of DG, it is known that many NLG systems generate generic responses, which leads to poor engagement with the users. However, none of the metrics are sensitive to perturbations producing 'generic responses' such as ok, thanks or back-off responses such as I'm sorry, can you repeat? Similarly, for the task of QG, Nema and Khapra (2018) show that the answerability of a question is affected if we drop/replace question words or change named entities in the question. However, we find that most metrics (including the task-specific QBLEU4/QROUGE) are not sensitive to such perturbations, with very high deviation from human scores. On similar lines, ViLBERTScore, which is a state-of-the-art evaluation metric for IC, is not sensitive to perturbations in gender, order of objects or attributes used for describing objects. This is of concern as many IC systems are known to produce generic captions containing genders, attributes and objects which are most prevalent in the training data. Similarly, for the task of D2T, where coverage and factual correctness are important, we observe that most metrics are unable to detect perturbations which add extra/random text to the output or drop named entities (which often contain the most important information). Lastly, for AS it is important that an evaluation metric should penalise summaries which are not coherent or contain redundant sentences or do not have referential clarity. However, we observe that most metrics are not sensitive to perturbations which reorder the sentences or repeat sentences/phrases or replace nouns with pronouns (affecting referential clarity).
Different metrics have different skills. While no single metric is capable of detecting all types of perturbations, we observe that some metrics are more robust to certain perturbations. BLEURT and MoverScore are robust to jumbling of words, but BERTScore is not, revealing their differences in detecting fluency. MoverScore, BERTScore and the embedding-based metrics like Greedy Matching and Embedding Average are quite robust to simple transformations of converting numbers to corresponding words, which is an important criterion for the task of D2T, while BLEURT is relatively less robust to it. Similarly, while BERTScore performs poorly for many perturbations, it is able to respect alternative references, i.e., similar to humans, it does not drop its score when presented with alternative correct references from the dataset (last row in Figure 4a to 4f). An interesting observation from the IC task is that SoTA metrics like SPICE and ViLBERTScore show a complementary behaviour on our set of perturbation criteria (third to last and last column in Figure 4f). This opens up interesting avenues for future research where different automatic metrics could be combined to take advantage of their relative strengths.
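The kind of insensitivity discussed in this section is easy to reproduce with a toy overlap metric. The sketch below is illustrative only: `unigram_f1` is a hypothetical stand-in and not one of the 25 metrics studied, the human drop of 0.8 is an invented value, and the deviation is computed as human drop minus metric drop, which is an assumed reading of Equation 1.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Toy overlap metric: F1 over lower-cased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "It will rain on Monday."
perturbed = "It will not rain on Monday."  # negation template

# Metric-perceived drop: score of the unperturbed output minus perturbed score.
metric_drop = unigram_f1(reference, reference) - unigram_f1(reference, perturbed)

human_drop = 0.8  # hypothetical: annotators judge the meaning to be reversed
deviation = human_drop - metric_drop

print(round(metric_drop, 3), round(deviation, 3))
```

A single inserted token barely changes the overlap, so the toy metric's score stays high even though the meaning is inverted, which is the pattern the perturbation templates are designed to expose.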

Related Work
Some of the related work, particularly the relevant datasets, human evaluation criteria, and automatic metrics, was already discussed earlier and hence is not covered again here. We refer the readers to two recent surveys (Sai et al., 2020b; Çelikyilmaz et al., 2020) for a detailed overview of automatic evaluation metrics as well as related work on criticising the use of automatic evaluation metrics. We mention a few such important works here. BLEU is one of the most widely analysed metrics, with several studies showing that it does not correlate well with human judgements for machine translation (Callison-Burch et al., 2006). This issue of poor correlations of metrics with human judgements has been reported not just for BLEU, but also for various other metrics, across several NLG tasks including Question Generation (Nema and Khapra, 2018), Data-to-Text generation (Dhingra et al., 2019), Dialogue generation (Liu et al., 2016), and Summarisation (Kryscinski et al., 2019). Apart from poor correlations, Kryscinski et al. (2019) criticize the automatic metrics for abstractive summarization since they do not check for factual inconsistencies in the summaries. Similarly, Wiseman et al. (2017) discuss the lack of a reliable measurement of faithfulness in the context of Data-to-Text Generation. In the case of dialogue, several n-gram-based and embedding-based metrics have been shown to fall short in capturing the diversity of the valid responses (Liu et al., 2016; Sai et al., 2020a). The alternative, trained metrics such as ADEM, has been shown to be susceptible to adversarial attacks (Sai et al., 2019).
Similar to the main message of our work, some recent works have also called for a more robust evaluation of automatic evaluation metrics (Choshen and Abend, 2018; Mathur et al., 2020a). Ethayarajh and Jurafsky (2020) also critically examine the current approaches towards NLP leaderboards and point towards having multiple metrics along different dimensions such as fairness, efficiency, robustness, etc.

Conclusion
We conduct a large-scale study involving 6 tasks, 25 automatic evaluation metrics and 18 human evaluation criteria and observe that (i) different criteria such as fluency, coverage, etc., are often not correlated and (ii) existing metrics have a low correlation with most criteria across different tasks. Based on these observations, we suggest an alternative framework for evaluating evaluation metrics which goes beyond computing correlations with the human scores for overall quality. More specifically, we propose perturbation templates which allow a more fine-grained evaluation of such metrics and help in understanding their strengths and, more importantly, their limitations. We hope that future work on designing evaluation metrics will use our perturbation checklist for evaluating the effectiveness of the proposed metric in assessing different relevant criteria.

A Criteria correlations
The Pearson correlations among the criteria are presented in Figure 6. Most of the correlation ranges are similar for Pearson correlation and Kendall tau correlation, except for the D2T task. We refer the reader to studies on such correlations (Mathur et al., 2020a), which discuss various points such as the influence of outliers and noisy points on the correlations. Additionally, we observe that the expertise of the annotators also influences the criteria-criteria correlations. In particular, we were able to study this in the case of AS using the data released by Fabbri et al. (2020) containing both expert and crowdsourced annotations. From Figure 5, we observe that the scores by expert annotators have far lower correlations among various criteria than the crowdsourced annotations.

Our perturbation templates mainly draw from the official GitHub repository of the Checklist paper and are also publicly available. The implementation involves preprocessing with the help of tokenization, POS tagging, named entity recognition, etc. Synonyms, antonyms, etc., are obtained with the help of the WordNet framework. Additionally, the masked language model of RoBERTa is used to mask and predict replacements for the targeted words.

For example, the application of the template for 'dropping stop words' involves the tokenization of the sentence using the NLTK word tokenizer as the first step. The list of tokens is compared with the set of stopwords provided by NLTK to filter out the stop words from the list of tokens. The modified sentence is then reconstructed using the string join function by iterating over the tokens in the modified list.

Similarly, for the template of 'changing the attributes' in the case of image captioning, the sentence is first tokenized, then the adjectives are identified using part-of-speech tagging (again a functionality provided by NLTK). The list of 'related words' (i.e., hyponyms of hypernyms or 'sibling words') is obtained using the WordNet framework. Unless the list returns empty, one of the entries in the list is used to replace the original adjective.

In order to 'change question to an assertive statement', the question words (such as who, what, why, when, etc.) are replaced with a 'mask' token and the '?' character at the end is replaced with '.' using the string replace function. This modified sentence is then fed to the RoBERTa model, which generates different predictions to be used in place of the 'mask' token. One of the suggested words is used to form the modified assertive sentence.

In the case of perturbations involving dropping words, we additionally decide whether we are dropping stop words, adjectives, question words, etc., in the particular perturbation and estimate the extent of the effect it will have on each criterion. The perturbation of adding text appends random words/phrases/sentences to a given text to account for not just the cases where there is missing information, but also cases where there is spurious wrong information, even if it accompanies/follows the correct version. The complete implementations/details of our perturbation templates are hosted publicly. Note that some of the perturbations cannot be applied to every sentence in the dataset. For example, the template of "changing names" cannot be applied if there are no named entities in a particular sentence. We hence shortlist only the successfully modified samples from the dataset for analysing the metrics' performance on each perturbation.

Figure 7 is a more comprehensive version of Figure 3. It shows the correlations of the complete set of metrics considered in this study with various criteria across different tasks.

  Dropping words (such as prepositions/articles, etc.): "The bank is willing to approve the loan." -> "Bank willing to approve the loan."

All tasks, Fluency
  Spelling errors: "Make the most of every opportunity presented to you." -> "Make the most of evry opportunity presented to you."

MT, Adequacy
  Dropping out words or phrases: "I was being followed." -> "I followed."
  Adding extra wrong information: "This book is so inspiring." -> "This book is so inspiring, I forgot."
  Negation / antonyms: "It will rain on Monday." -> "It will not rain on Monday."
  Repeat phrases: "My relatives are in town." -> "My relatives are in town, my relatives."

Informativeness
  Dropping words: "Here is the no parking sign." -> "Here is the sign."
  Negation and antonyms: "This book is so inspiring." -> "This book is so uninspiring."
  Use hyponyms to create misinformation: "The girl my brother Andy met through MySpace turned out to be completely made up." -> "The girl my friend Andy met through MySpace turned out to be completely made up."

Flow / coherence
  Reorder sentences: "The pandemic was spreading uncontrollably. Vaccines are being developed and tested rapidly." -> "Vaccines are being developed and tested rapidly. The pandemic was spreading uncontrollably."

Non-Redundancy
  Repeat sentences: "My relatives are in town." -> "My relatives are in town. My relatives in town."

Referential clarity
  Replace nouns by pronouns: "The pandemic was spreading uncontrollably. Vaccines are being developed rapidly." -> "It was spreading uncontrollably."

Relevance
  Repeat phrases: "Beethoven was a German musician" -> "Beethoven was a German musician and German musician."
  Random text: "Beethoven was a German musician" -> "The cricketer was born in 1990."
  Perturb names: "Phillips was a child prodigy." -> "James was a child prodigy."

All tasks, Invariance
  Replace with synonyms: "The mangoes are delicious." -> "The mangoes are tasty."
  Contractions: "We are going to embark on an adventure." -> "We're going to embark on an adventure."
  Expansions: "There weren't any clear winners of the contest" -> "There were not any clear winners of the contest."
  Numerals to words: "Aron Ralston who was trapped for 127 hours." -> "Aron Ralston who was trapped for one hundred twenty seven hours."
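The 'dropping stop words' template and the shortlisting of successfully modified samples, both described in the implementation notes of this appendix, can be sketched as follows. This is a simplified stand-in rather than the released code: whitespace splitting and a small hand-written stop word set replace the NLTK tokenizer and stop word list so that the example runs without extra downloads, and the names `STOP_WORDS`, `drop_stop_words` and `apply_template` are hypothetical.

```python
# Illustrative subset only; the described implementation uses NLTK's stop word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "by"}


def drop_stop_words(sentence):
    """Remove stop words and rebuild the sentence with a string join."""
    tokens = sentence.split()  # stand-in for a proper word tokenizer
    kept = [token for token in tokens if token.lower() not in STOP_WORDS]
    return " ".join(kept)


def apply_template(template, sentences):
    """Apply a template and keep only the samples it actually modified."""
    pairs = []
    for sentence in sentences:
        perturbed = template(sentence)
        if perturbed and perturbed != sentence:
            pairs.append((sentence, perturbed))
    return pairs


sentences = [
    "The bank is willing to approve the loan.",
    "Vaccines work.",  # contains no stop words, so it is not shortlisted
]
shortlisted = apply_template(drop_stop_words, sentences)
for original, perturbed in shortlisted:
    print(original, "->", perturbed)
```

Filtering on whether the output differs from the input mirrors the shortlisting step: a template that finds nothing to change for a given sentence contributes no test case for that sentence.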