DEMETR: Diagnosing Evaluation Metrics for Translation

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.


Introduction
Automatically evaluating the output quality of machine translation (MT) systems remains a difficult challenge. The BLEU metric (Papineni et al., 2002), which is a function of n-gram overlap between system and reference outputs, is still used widely today despite its obvious limitations in measuring semantic similarity (Fomicheva and Specia, 2019; Marie et al., 2021; Kocmi et al., 2021; Freitag et al., 2021). Recently-developed learned evaluation metrics such as BLEURT (Sellam et al., 2020a), COMET (Rei et al., 2020), MOVERSCORE (Zhao et al., 2019), or BARTSCORE (Yuan et al., 2021a) seek to address these limitations by either fine-tuning pretrained language models directly on human judgments of translation quality or by utilizing contextualized word embeddings. While learned metrics exhibit higher correlation with human judgments than BLEU (Barrault et al., 2021), their relative lack of interpretability leaves it unclear why they assign a particular score to a given translation. This is a major reason why some MT researchers are reluctant to employ learned metrics to evaluate their MT systems (Marie et al., 2021; Gehrmann et al., 2022; Leiter et al., 2022).

SOURCE (de): Murray verlor den ersten Satz im Tiebreak, nachdem beide Männer jeden einzelnen Aufschlag im Satz gehalten hatten.
REF: Murray lost the first set in a tie break after both men held each and every serve in the set.
MT: Murray lost the first set in the tiebreak after both men held every single serve in the set.
PERTURBED MT: Murray won the first set in the tiebreak after both men held every single serve in the set.

Figure 1: An example perturbation (antonym replacement) from our DEMETR dataset. We measure whether different MT evaluation metrics score the unperturbed translation higher than the perturbed translation; in this case, BLEURT and BERTSCORE accurately identify the perturbation, while COMET-QE fails to do so.

In this paper, we build on previous metric explainability work (Specia et al., 2010; Macketanz et al., 2018; Fomicheva and Specia, 2019; Kaster et al., 2021; Sai et al., 2021a; Barrault et al., 2021; Fomicheva et al., 2021; Leiter et al., 2022) by introducing DEMETR, a dataset for Diagnosing Evaluation METRics for machine translation, which measures the sensitivity of an MT metric to 35 different types of linguistic perturbations spanning common syntactic (e.g., incorrect word order), semantic (e.g., undertranslation), and morphological (e.g., incorrect suffix) translation error categories. Each example in DEMETR is a tuple containing {source, reference, machine translation, perturbed machine translation}, as shown in Figure 1. The entire dataset contains 31K total examples across 10 different source languages (the target language is always English). The perturbations in DEMETR are produced semi-automatically by manipulating translations produced by commercial MT systems such as Google Translate, and they are manually validated to ensure that the only source of variation is the desired perturbation.
We measure the accuracy of a suite of 14 evaluation metrics on DEMETR (as shown in Figure 1), discovering that learned metrics perform far better than string-based ones. We also analyze the relative sensitivity of metrics to different grades of perturbation severity. We find that metrics struggle at times to differentiate minor errors (e.g., punctuation removal or word repetition) from semantics-warping errors such as incorrect gender or numeracy. We also observe that the reference-free COMET-QE learned metric is more sensitive to word repetition and misspelled words than to severe errors such as entirely unrelated translations or named entity replacement. We publicly release DEMETR and the associated code to facilitate more principled research into MT evaluation.

Diagnosing MT evaluation metrics
Most existing MT evaluation metrics compute a score for a candidate translation t against a reference sentence r. These scores can be either a simple function of character or token overlap between t and r (e.g., BLEU), or they can be the result of a complex neural network model that embeds t and r (e.g., BLEURT). While the latter class of learned metrics provides more meaningful judgments of translation quality than the former, they are also relatively uninterpretable: the reason a particular translation t receives a high or low score is difficult to discern. In this section, we first explain our perturbation-based methodology for better understanding MT metrics before describing the collection of DEMETR, a dataset of linguistic perturbations.

Using translation perturbations to diagnose MT metrics
Inspired by prior work in minimal pair-based linguistic evaluation of pretrained language models such as BLIMP (Warstadt et al., 2020), we investigate how sensitive MT evaluation metrics are to various perturbations of the candidate translation t. Consider the following example, which is designed to evaluate the impact of word order in the candidate translation:

reference translation r: Pronunciation is relatively easy in Italian since most words are pronounced exactly how they are written.
machine translation t: Pronunciation is relatively easy in Italian, as most words are pronounced exactly as they are spelled.
perturbed machine translation t′: Spelled pronunciation as Italian, relatively are most is as they pronounced exactly in words easy.
If a particular evaluation metric SCORE is sensitive to this shuffling perturbation, SCORE(r, t′), the score of the perturbed translation, should be lower than SCORE(r, t). Note that while other minor translation errors may be present in t, the perturbed translation t′ differs only in a specific, controlled perturbation (in this case, shuffling).
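This check can be made concrete with any of the metrics studied in this paper. The following minimal sketch runs it with the HuggingFace `evaluate` implementation of SacreBLEU (a string-based metric also used in our experiments), under the assumption that the `evaluate` and `sacrebleu` packages are installed:

```python
# Minimal sketch of the minimal-pair sensitivity check: a metric "passes"
# if it scores the actual translation above its perturbed counterpart.
import evaluate

bleu = evaluate.load("sacrebleu")

r = ("Pronunciation is relatively easy in Italian since most words are "
     "pronounced exactly how they are written.")
t = ("Pronunciation is relatively easy in Italian, as most words are "
     "pronounced exactly as they are spelled.")
t_perturbed = ("Spelled pronunciation as Italian, relatively are most is "
               "as they pronounced exactly in words easy.")

score_t = bleu.compute(predictions=[t], references=[[r]])["score"]
score_tp = bleu.compute(predictions=[t_perturbed], references=[[r]])["score"]

print(score_t > score_tp)  # True if BLEU detects the shuffling perturbation
```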

Creating the DEMETR dataset
To explore the above methodology at scale, we create DEMETR, a dataset that evaluates MT metrics on 35 different linguistic phenomena with 1K perturbations per phenomenon. Each example in DEMETR consists of (1) a sentence in one of 10 source languages, (2) an English translation written by a human translator, (3) a machine translation produced by Google Translate, and (4) a perturbed version of the Google Translate output which introduces exactly one mistake (semantic, syntactic, or typographical).

Data sources and filtering: We utilize X-to-English translation pairs from two different datasets, WMT (Callison-Burch et al., 2009; Bojar et al., 2013, 2014, 2015; Barrault et al., 2020; Akhbardeh et al., 2021) and FLORES (Guzmán et al., 2019), aiming at a wide coverage of topics from different sources. WMT has been widely used over the years as a popular MT shared task, while FLORES was recently curated to aid MT evaluation. We consider only the test split of each dataset to prevent possible leaks, as both current and future metrics are likely to be trained on these two datasets. We sample 100 sentences (50 from each of the two datasets) for each of the following 10 languages: French (fr), Italian (it), Spanish (es), German (de), Czech (cs), Polish (pl), Russian (ru), Hindi (hi), Chinese (zh), and Japanese (ja). We pay special attention to the language selection, as newer MT evaluation metrics, such as COMET-QE or PRISM-QE, employ only the source text and the candidate translation. We control for sentence length by including only sentences between 15 and 25 words long, measured by the length of the tokenized reference translation. Since we re-use the same sentences across multiple perturbations, we did not include shorter sentences because they are less likely to contain multiple linguistic phenomena of interest. As the quality of sampled sentences varies, we manually check each source sentence and its translation to make sure they are of satisfactory quality.

Translating the data: Given the filtered collection of source sentences, we next translate them into English using the Google Translate API. We manually verify each translation, editing or resampling the instances where the machine translation contains critical errors. Through this process, we obtain 1K curated examples per perturbation (100 sentences × 10 languages), each consisting of source and reference sentences along with a machine translation of reasonable quality.
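The filtering step described above is simple enough to sketch directly. Here, `pairs` is a hypothetical list of (source, reference) tuples loaded from a WMT or FLORES test split, and whitespace splitting stands in for the actual tokenization:

```python
# Sketch of the sentence-length filter (15-25 words, measured on the
# tokenized reference) and per-dataset sampling used to build DEMETR.
import random

def length_filter(pairs, min_len=15, max_len=25):
    return [(src, ref) for src, ref in pairs
            if min_len <= len(ref.split()) <= max_len]

def sample_pairs(pairs, n=50, seed=0):
    # 50 sentences per dataset x 2 datasets = 100 sentences per language.
    rng = random.Random(seed)
    return rng.sample(length_filter(pairs), n)
```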

Perturbations in DEMETR
We perturb the machine translations obtained above in order to create minimal pairs, which allow us to investigate the sensitivity of MT evaluation metrics to different types of errors. Our perturbations are loosely based on the Multidimensional Quality Metrics (MQM) framework (Burchardt, 2013) developed to identify and categorize MT errors. Most perturbations were performed semi-automatically by utilizing STANZA (Qi et al., 2020), SPACY, or GPT-3 (Brown et al., 2020), applying handcrafted rules and then manually correcting any errors. Some of the more elaborate perturbations (e.g., translation by a too-general term, where one had to be sure that a better, more precise term exists) were performed manually by the authors or by linguistically-savvy freelancers hired on the Upwork platform. Special care was given to the plausibility of perturbations (e.g., numbers for replacement were selected from a probable range, such as 1-12 for months). See Table 2 for descriptions and examples of most perturbations; the full list appears in Appendix A. We roughly categorize our perturbations into the following four categories, plus a set of baselines:

• ACCURACY: Perturbations in the accuracy category modify the semantics of the translation by either incorporating misleading information (e.g., by adding plausible yet inadequate text or changing a word to its antonym) or omitting information (e.g., by leaving a word untranslated).
• FLUENCY: Perturbations in the fluency category focus on grammatical accuracy (e.g., word form agreement, tense, or aspect) and on overall cohesion. Compared to the mistakes in the accuracy category, the true meaning of the sentence can usually be recovered from the context more easily.
• MIXED: Certain perturbations can be classified as both accuracy and fluency errors. Concretely, this category consists of omission errors that not only obscure the meaning but also affect the grammaticality of the sentence. One such error is subject removal, which results not only in an ungrammatical sentence, leaving a gap where the subject should be, but also in information loss.
• TYPOGRAPHY: This category concerns punctuation and minor orthographic errors. Examples of mistakes in this category include punctuation removal, tokenization, lowercasing, and common spelling mistakes (a simplified sketch of a few such automatic perturbations follows this list).
• BASELINE: Finally, we include both upper and lower bounds, since learned metrics such as BLEURT and COMET do not have a specified range that their scores can fall into. Specifically, we provide three baselines: as lower bounds, we either change the translation to an unrelated one or provide an empty string, while as an upper bound, we set the perturbed translation t′ equal to the reference translation r, which should return the highest possible score for reference-based metrics.
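As referenced above, the sketch below illustrates, in simplified form (not the exact scripts used to build DEMETR), how a few of the automatic perturbations can be implemented:

```python
# Illustrative implementations of three automatic perturbations:
# final-punctuation removal and lowercasing (typography), and the
# neighboring-word swap (fluency/grammar).
import random
import string

def remove_final_punctuation(t: str) -> str:
    # Strip trailing punctuation (and any trailing spaces) from the sentence.
    return t.rstrip(string.punctuation + " ")

def lowercase(t: str) -> str:
    return t.lower()

def swap_neighboring_words(t: str, seed: int = 0) -> str:
    # Swap one randomly chosen pair of neighboring words to mimic a
    # word order error; assumes the sentence has at least two words.
    words = t.split()
    rng = random.Random(seed)
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```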
Error severity: Our perturbations can also be categorized by their severity (see Table 1). We use the following categorization scheme for our analysis experiments:

• MINOR: In this type of error, which includes perturbations such as dropping punctuation or using the wrong article, the meaning of the source sentence can be easily and correctly interpreted by human readers.
• MAJOR: Errors in this category may not affect the overall fluency of the sentence but will result in some missing details. Examples of major errors include undertranslation (e.g., translating "church" as "building") or leaving a word in the source language untranslated.
• CRITICAL: These are catastrophic errors that result in crucial pieces of information going missing or incorrect information being added in a way unrecognizable to the reader, and they are also likely to cause severe fluency issues. Errors in this category include subject deletion or replacement of a named entity.

Table 2: A subset of perturbations in DEMETR (each entry gives the severity, the annotation method, a description, and an illustrative change to the machine translation). A full list of perturbations is provided in Table A1 and Table A2 in Appendix A.

• repetition [minor; automatic]: The last word is repeated (twice or four times), and punctuation is added after the last repeated word.
• hypernym [major; manual with suggestions from GPT-3]: A word is translated by a too-general term (undertranslation). Special care was given to ensure that the word used in the perturbed text is a more general, and incorrect, translation of the original word. Example: "Latin is often used in religious ceremonies" → "Latin is often used in religious activities".
• untranslated [major; manual]: One word is left untranslated; we manually ensure that each time only one word is left untranslated. Example: "fighters manufactured by Lockheed Martin" → "fighters produkowane by Lockheed Martin".
• completeness [major; automatic (Stanza) with manual check]: One prepositional phrase is removed. Whenever possible, we remove the shortest prepositional phrase so that the perturbed sentence is not much shorter than the original translation. Example: "She is in custody pending prosecution and trial" → "She is pending prosecution and trial".
• addition [critical; manual]: One word is added. We make sure that the added word does not disturb the grammaticality of the sentence but changes the meaning in a significant way. Example: "Plants look their best when they are in a natural environment" → "Power plants look their best when they are in a natural environment".
• antonym [critical; manual]: One word (noun, verb, adjective, or adverb) is replaced with its antonym. Example: "unable to relieve the pain with medication" → "unable to relieve the pleasure with medication".
• number [critical; manual]: A number is replaced with an incorrect one. Special attention was given to keep the numerals within a reasonable/common range for the given category (e.g., 0-100 for percentages; 1-12 for months). We also ensure that the replacement does not create an illogical sentence (e.g., replacing "1920" with "1940" in "from 1920 to 1930").
• mistranslation (gender) [critical; automatic with manual check]: Exactly one feminine pronoun in the sentence (such as "she" or "her") is replaced with a masculine pronoun (such as "he" or "him"), or vice-versa. This includes reflexive pronouns (i.e., "him/herself") and possessive adjectives (i.e., "his/her"). Example: "He has been unable to relieve the pain" → "She has been unable to relieve the pain".
• cohesion: A conjunction, such as "thus" or "therefore", is removed. Special attention was given to keep the rest of the sentence unperturbed. Example: "how planets have formed since a comet collided with Earth" → "how planets have formed a comet collided with Earth".
• part-of-speech shift [minor; manual]: The affix of a word is changed while the stem is kept constant (e.g., "bad" to "badly"), which results in a part-of-speech shift. The degree to which the original meaning is affected varies; however, the intended meaning is easily retrievable from the stem and context.
• grammar, word order [minor; automatic (spaCy)]: Two neighboring words are swapped to mimic a word order error. Example: "imported into this country" → "imported this into country".
• grammar, case [minor; automatic (spaCy) with manual check]: One pronoun in the sentence is changed into a different, incorrect case (e.g., "he" to "him"). Example: "She announced that" → "Her announced that".
• grammar, function word [minor-major; automatic with manual check]: A preposition or article is changed into an incorrect one to mimic a mistake in function word usage. While most of these perturbations result in minor mistakes (i.e., the original meaning is easily retrievable), some may be more severe. Example: "a presidential committee" → "an presidential committee".
• grammar, tense [major; manual]: A tense is changed into an incorrect one. We consider the past, present, and future tense (although the latter may be classified as a modal verb in English). Example: "were both found" → "are both found".
• grammar, aspect [major; manual]: Aspect is changed to an incorrect one (e.g., perfective to progressive) without changing the tense. Example: "He has been unable to relieve the pain" → "He is being unable to relieve the pain".
• grammar, interrogative [major; manual]: Affirmative mood is changed to interrogative mood. Example: "This is the tenth time ... surpassed 100." → "Is this the tenth time ... surpassed 100?"
• omission of adjective/adverb (mixed): An adjective or adverb is removed. Example: "Rangers closely monitor shooters" → "Rangers monitor shooters".

Performance of MT evaluation metrics on DEMETR
We test the accuracy and sensitivity of 14 popular MT evaluation metrics on the perturbations in DEMETR. We include both traditional string-based metrics, such as BLEU or CHRF, as well as newer learned metrics, such as BLEURT and COMET. Within the latter category, we also include two reference-free metrics, which rely only on the source sentence and the translation, opening possibilities for more robust MT evaluation. The rest of this section provides an overview of the evaluation metrics before analyzing our findings. Detailed results of each metric on every perturbation are found in Table A3.

Evaluation metrics
String-based metrics can be used to evaluate any language, provided the availability of a reference translation (see Table 3). Their score is a function of string overlap or edit distance, though it may not always be easily interpretable (Müller, 2020). Only BLEU allows for multiple references in order to account for the many possible translations of a sentence; however, it is rarely used with more than one reference due to the lack of multi-reference datasets (Mathur et al., 2020). Learned metrics, on the other hand, are much less transparent. BERTSCORE relies on contextualized embeddings, while PRISM employs zero-shot paraphrasing. COMET and BLEURT directly fine-tune pretrained language models on human judgments provided as Direct Assessments or MQM annotations.

Implementation details: For all string-based metrics we use the HuggingFace implementations available at https://huggingface.co/evaluate-metric; in the case of BLEU, we use SacreBLEU version 2.1.0 (Post, 2018). We likewise use the HuggingFace implementations of BERTSCORE, BLEURT, COMET, and COMET-QE. For BLEURT, we use BLEURT-20, the most recent and recommended checkpoint; for COMET and COMET-QE, we use the SOTA models from the WMT21 shared task (the wmt21-comet-mqm and wmt21-comet-qe-mqm checkpoints); and for BERTSCORE, we use roberta-large. For PRISM, we use the implementation available at https://github.com/thompsonb/prism.

String-based metrics (any language): BLEU (Papineni et al., 2002), CER (Morris et al., 2004), CHRF (Popović, 2015), CHRF2 (Popović, 2017), METEOR (Banerjee and Lavie, 2005), ROUGE-2 (Lin, 2004), TER (Snover et al., 2006).
Pre-trained metrics: BARTSCORE (Yuan et al., 2021b; 406M parameters, 50 languages), BERTSCORE (Zhang et al., 2020; 355M, 104), BLEURT-20 (Sellam et al., 2020b; 579M, 104), COMET (Rei et al., 2021; 580M, 100), PRISM (Thompson and Post, 2020; 745M, 39).
Pre-trained reference-free metrics: COMET-QE (Rei et al., 2021; 569M, 100), PRISM-QE (Thompson and Post, 2020; 745M, 39).

Table 3: Details of metrics tested on DEMETR. We report the parameter count for the largest available checkpoint of each learned metric, along with the maximum number of languages each metric can accept as input. While most of the learned metrics leverage pretrained multilingual language models (e.g., mBERT), it is important to note that they have not been validated against human judgments of MT quality on all of these languages (e.g., BLEURT-20 is only validated on 13 languages).

Perturbation accuracy
First, we measure the accuracy of each metric on DEMETR. For each perturbation, we define the accuracy as the percentage of examples for which SCORE(r, t) > SCORE(r, t′). Since all perturbed sentences are less correct versions of the original machine translation, we expect all metrics to perform well on this task.
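Concretely, the per-perturbation accuracy amounts to the following sketch, where `score_fn` stands in for any of the SCORE functions and `examples` is a hypothetical list of (reference, translation, perturbed translation) triples for one perturbation type:

```python
# Sketch of the per-perturbation accuracy: the percentage of examples
# where the metric scores the actual translation above its perturbed
# counterpart.
def perturbation_accuracy(score_fn, examples):
    hits = sum(score_fn(r, t) > score_fn(r, t_p) for r, t, t_p in examples)
    return 100.0 * hits / len(examples)
```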
Table 4 contains the accuracies averaged across error severity as well as overall. Interesting results include:

String-based metrics struggle on DEMETR: BLEU, often the only metric employed to evaluate MT output (Marie et al., 2021), achieves an overall accuracy of only 78.70%. To illustrate, the accuracy of string-based metrics ranges from 54% to 84% on the adjective/adverb removal perturbation, where a single adjective or adverb is omitted. The best performing string-based metric is CHRF2, which corroborates results reported in Kocmi et al. (2021).

PRISM-QE achieves better accuracy than COMET-QE among reference-free metrics: Of the two reference-free metrics we evaluate, COMET-QE struggles with some perturbations. Most notably, its accuracy when given a random translation (i.e., a translation that does not match the source sentence) oscillates around 50% (chance level) across all languages. Furthermore, COMET-QE shows low accuracy on gender (i.e., masculine pronouns replaced with feminine pronouns or vice-versa), number (i.e., a number replaced with another, reasonable number), and interrogatives (i.e., a change of affirmative mood into interrogative mood). COMET-QE also strongly prefers (88% of the time) the translation stripped of final punctuation over the complete sentence, compared to 0% for PRISM-QE. In terms of accuracy, PRISM-QE performs exceptionally well on all perturbations, achieving lower accuracies (yet still around 80%) only for Hindi, a language it was not trained on.

Sensitivity analysis
While the accuracy of a metric on DEMETR is useful to know, it also obscures the sensitivity of a metric to a particular perturbation. Are metrics more sensitive to CRITICAL errors than MINOR ones? Are different learned metrics comparatively more or less sensitive to a particular perturbation? In this section, we explore these questions and highlight interesting observations, focusing primarily on the behavior of learned metrics.
Measuring sensitivity: Since each of our metrics has a different score range, we cannot naïvely compare their score differences to analyze sensitivity. Instead, we compute a ratio that intuitively answers the following question: how much does SCORE drop on this perturbation compared to the catastrophic error of producing an empty string? We choose the empty string as a control since it is the perturbation that results in the largest SCORE drop for most metrics. Concretely, for a given reference translation r_i, machine translation t_i, and perturbed translation t′_i, we compute a ratio z_i as:

z_i = [SCORE(r_i, t_i) − SCORE(r_i, t′_i)] / [SCORE(r_i, t_i) − SCORE(r_i, ε)]    (1)

where ε denotes the empty string. Then, for each perturbation category, we aggregate the example-level ratios by simply taking a mean, z = (1/N) Σ_i z_i, where N is the number of examples for that perturbation (in most cases, 1K). Figure 2 contains a heatmap plotting this z ratio for each perturbation and learned metric, and forms the core of the following analysis.
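A sketch of this computation follows; `score_fn` again stands in for any SCORE function, and the empty string serves as the control:

```python
# Sketch of the sensitivity ratio z_i from Equation 1: the score drop
# caused by a perturbation, normalized by the drop caused by an empty
# translation (assumes the empty string scores below the translation).
def sensitivity_ratio(score_fn, r, t, t_perturbed):
    base = score_fn(r, t)
    drop = base - score_fn(r, t_perturbed)
    empty_drop = base - score_fn(r, "")
    return drop / empty_drop

def mean_sensitivity(score_fn, examples):
    # Aggregate example-level ratios for one perturbation category.
    ratios = [sensitivity_ratio(score_fn, r, t, t_p) for r, t, t_p in examples]
    return sum(ratios) / len(ratios)
```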
BERTSCORE is relatively more sensitive to some minor errors than it is to critical errors: Although BERTSCORE drops only by a small absolute amount for most perturbations, it is actually quite sensitive to many of them, especially when passed an unrelated translation or a shuffled version of the existing translation, two of the most drastic perturbations. It also shows higher sensitivity to untranslated words (i.e., code-mixing) than to the remaining perturbations, which is to be expected as BERTSCORE uses a multilingual model. However, its sensitivity to incorrect numbers (0.044), gender information (0.067), or aspect change (0.099) is lower than its sensitivity to less severe errors, such as a tokenized sentence (0.26) or a lower-cased sentence (0.33), a trend visible in other metrics, though not to such an extent.
COMET-QE, a metric adapted to MQM scoring, does not perform well on DEMETR: COMET-QE, though trained on MQM ratings (i.e., on the identification of mistakes similar to those included in DEMETR), varies in its sensitivity to perturbations. While it is sensitive to a sentence with shuffled words, it is not sensitive to a different, unrelated translation (an observation in line with its accuracy). COMET-QE also seems to be insensitive not only to minor errors such as the removal of the final punctuation, but also to some major or critical errors such as gender and number replacement. Furthermore, COMET-QE is much more sensitive to word repetition (0.46-0.72) and word swap (0.41) than to some critical or major errors, such as named entity replacement (0.16) or sentence negation (0.16). Overall, COMET-QE behaves very differently from most of the other metrics, and in ways that are difficult to explain.
Overall, all metrics struggle to differentiate between minor and critical errors: While all metrics other than COMET-QE are very sensitive to the two baselines (unrelated translation and shuffled words) compared to other perturbations (0.44-2.20), they struggle to differentiate the severity of some critical errors, such as the addition of a plausible but meaning-changing word (0.032-0.12) or an incorrect number (0.0038-0.07). These ratios are lower than those of some minor errors, such as a word repeated four times (0.086-0.72). In fact, BERTSCORE, COMET, and COMET-QE are more sensitive to word repetition than to the addition of a word that critically changes the meaning.

Related Work
Our work builds on previous efforts to analyze the performance of MT evaluation metrics, as well as efforts to curate diagnostic datasets for NLP.
Analysis of MT evaluation metrics: Fomicheva and Specia (2019) show that metric performance varies significantly across different levels of MT quality. Freitag et al. (2020) demonstrate the importance of reference quality during evaluation. Kocmi et al. (2021) investigate the performance of pretrained and string-based metrics, and conclude that learned metrics outperform string-based metrics, with COMET being the best-performing metric at the time. However, Amrhein and Sennrich (2022) explore COMET models in more depth, finding, as in the current study, that the models are not sensitive to number and named entity errors. Hanna and Bojar (2021), on the other hand, find that BERTSCORE is more robust to errors in major content words, and less so to small errors. Finally, Kasai et al. (2021) introduce a leaderboard for generation tasks that ensembles many of the metrics used here.
Diagnostic datasets: A number of previous studies have employed diagnostic tests to explore the performance of NLP models. Marvin and Linzen (2018) evaluate the ability of LSTM-based language models to rate grammatical sentences higher than ungrammatical ones by curating a dataset of minimal pairs in English. Warstadt et al. (2020) also utilize the concept of linguistic minimal pairs to evaluate the sensitivity of language models to various linguistic errors. Ribeiro et al. (2020) curate a checklist of perturbations to test the robustness of general NLP models. Specia et al. (2010) introduce a simplified dataset of translations by four MT systems annotated for their quality in order to evaluate MT evaluation metrics. Sai et al. (2021b) also propose a checklist-style method to test the robustness of evaluation metrics for MT; however, they limit themselves to Chinese-to-English translation. Furthermore, many of the perturbations introduced by Sai et al. (2021b) do not control for a single aspect, as DEMETR does, and are not manually verified. Macketanz et al. (2018), on the other hand, design a linguistic test suite to evaluate the quality of MT from German to English, which WMT21 (Barrault et al., 2021) utilizes as a challenge dataset for MT evaluation metrics. Finally, Barrault et al. (2021) create a nine-category challenge set from a Chinese-to-English corpus to test the MT evaluation metrics submitted to the shared task.

Conclusion
We present DEMETR, a dataset designed to diagnose MT evaluation metrics. DEMETR consists of 31K semi-automatically generated perturbations that cover 35 different linguistic phenomena. Our experiments show that learned metrics are notably better than string-based metrics at distinguishing perturbed from unperturbed translations, which confirms results reported in other studies (Kocmi et al., 2021; Fomicheva and Specia, 2019). We further explore the sensitivity of learned metrics, showing that even the best-performing metrics struggle to distinguish between minor errors such as word repetition and critical errors such as incorrect number, aspect, and gender. We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.

Limitations
While DEMETR incorporates a wide range of linguistic phenomena, including various semantic, pragmatic, and morphological errors, all examples included in DEMETR are translations into English. Other translation directions may introduce other errors, and metrics may be more or less sensitive to them. Furthermore, we decided to use sentence-level translations, both because most metrics evaluate translations at the sentence level and to highlight specific errors, which could be less apparent in a paragraph-level setup. However, sentence-level data cannot model discourse-level errors, which remain an open problem in both machine translation and its evaluation. Furthermore, as DEMETR was constructed using WMT and FLORES, the domains incorporated in DEMETR are restricted to the ones present in these two datasets (i.e., mostly news and informational materials). Finally, even though in most cases multiple correct translations of the source sentence exist, we provide only one reference. We decided not to include multiple references due to time restrictions as well as the fact that the only metric currently supporting multiple references is BLEU.

Ethical Considerations
Some perturbations were conducted manually with the help of freelancers hired on Upwork. The freelancers were informed of the purpose of this experiment. They were paid an equivalent of $15 per hour; we adjusted this hourly rate to cover the 20% fee that Upwork charges freelancers.

[Tables A1 and A2 (Appendix A): the full list of the 35 perturbations in DEMETR. Each entry gives an example, a description of the perturbation, the annotation method (automatic, automatic with manual check, or manual, in some cases with suggestions from GPT-3), and an error severity label (minor, major, critical, or baseline). The baseline perturbations are an unrelated translation, a shuffled ("unintelligible") translation, and the reference passed as the translation.]
Figure 2: A heatmap of the sensitivity of learned metrics to different perturbations in DEMETR. The numbers are the ratios z computed as described in Section 4. Higher values denote higher relative sensitivity to the perturbation and are marked by a darker color. The error severity categories are arranged from minor (bottom part) through major (middle part) to critical (upper part). The last two errors are baselines.

Table 1: List of perturbations included in DEMETR with their corresponding error severity. Details can be found in Appendix A.
Table A3: A two-sample Welch's t-test is conducted on each metric to compare SCORE(r, t) and SCORE(r, t′) (see Section 2.1) for each perturbation type. The tests are implemented in Python using the scipy package (Virtanen et al., 2020). Degrees of freedom (DF) are estimated using the Welch-Satterthwaite equation. The accuracy on baseline perturbation 35 (reference as translation) was reversed, as one can expect the metric to prefer the translation identical to the reference.
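The test itself is a one-liner with scipy; a sketch under the assumption that per-example scores for the original and perturbed translations have been collected into two arrays:

```python
# Sketch of the significance test from Table A3: a two-sample Welch's
# t-test (unequal variances). scipy's ttest_ind with equal_var=False
# estimates the degrees of freedom via the Welch-Satterthwaite equation.
from scipy import stats

def welch_test(scores_original, scores_perturbed):
    result = stats.ttest_ind(scores_original, scores_perturbed,
                             equal_var=False)
    return result.statistic, result.pvalue
```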