MuLER: Detailed and Scalable Reference-based Evaluation

We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can lead to targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability in MT evaluation and in other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few do not follow this trend (their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.


Introduction
Reference-based evaluation of text generation plays a uniquely important role in the development of machine translation (Papineni et al., 2002), summarization (Lin, 2004), and simplification (Xu et al., 2016), among many other sub-fields of NLP. It allows a scalable, cheap evaluation that often correlates at the system level with human evaluation.
However, reference-based evaluation metrics tend to produce a bottom-line score, allowing little to no ability for a fine-grained analysis of a system's strengths and weaknesses. Such an analysis is important, for example, for targeted development efforts that focus on improving specific phenomena, or for better identifying scenarios in which the system is reliable (Liu et al., 2021). We propose a novel evaluation methodology, Multi-Level Evaluation with Reference (MuLER), that presents a detailed picture of a text generation system's performance. Given a feature that can be detected automatically on the target side, and a reference-based metric, MuLER scalably measures the system's performance on words and spans that contain this feature. Our codebase is found here: https://github.com/tai314159/MuLER.
MuLER thus yields a decomposition of any evaluation metric into more focused measurements of the system's performance on span-level and word-level features, such as POS tags, named entity types, and sentence sentiment. Moreover, the methodology and code can be extended to features of choice.
In providing a per-phenomenon picture of system performance, MuLER is similar to challenge set approaches to evaluation (see §6). However, MuLER takes a more naturalistic approach, narrowing the evaluation to the test examples that contain a particular feature.
Given an evaluation metric (e.g., BLEU) for a text generation task (e.g., MT) and a feature of interest in the system's output (e.g., performance on adjectives), MuLER operates as follows (see §2.1): it masks the feature in both the reference and the prediction with the same token (e.g., replacing each adjective with a placeholder "ADJ"). This can be seen as an oracle adaptation of the output, which changes the span bearing the feature to agree with the reference. MuLER's score is the (normalized) difference between the metric score over the masked texts and the score over the original ones.
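To make the procedure concrete, here is a minimal sketch of the masking step, assuming spaCy for POS tagging and sacrebleu as the base metric; the helper name is ours, for illustration, and not the API of the released MuLER library:

```python
# Minimal sketch of MuLER's masking step (illustrative only): mask a POS
# feature in both the reference and the candidate, then rescore.
import sacrebleu
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_feature(sentence: str, pos: str = "ADJ") -> str:
    """Replace every token tagged with `pos` by a placeholder of that name."""
    return " ".join(pos if tok.pos_ == pos else tok.text for tok in nlp(sentence))

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The fast brown fox jumps over the sleepy dog."

base = sacrebleu.sentence_bleu(candidate, [reference]).score
masked = sacrebleu.sentence_bleu(
    mask_feature(candidate), [mask_feature(reference)]
).score
# Masking turns the ADJ mismatches (quick/fast, lazy/sleepy) into matches,
# so `masked` >= `base`; the normalized gap is the basis of MuLER's score.
print(base, masked)
```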
We present results from MT as well as summarization and synthetic paraphrasing. In addition, we perform synthetic experiments to validate MuLER's effectiveness and usability. Our experiments validate that MuLER can measure performance on a particular feature (§5), and reveal previously unreported patterns in established MT systems (§4). For example, while translation of nouns and verbs improves over the years, translation of named entities does not necessarily improve (§4.2).

Methodology
The MuLER methodology seeks to gain insight into the performance of a text generation system s according to a given metric σ on instances with the feature f. The feature is a dimension along which the system is evaluated, and one that can be automatically detected in text. Examples of features are POS tags, named entity types, and morphological categories, among others.
MuLER operationalizes this notion as the improvement in the score of s according to σ, had s correctly predicted all instances of this feature. For scale, this improvement is compared to the overall possible improvement (the score is defined in §2.3). To assess these quantities, MuLER creates an oracle in which the feature f is rendered perfectly and an anti-oracle in which it is rendered entirely wrong (cf. §2.2).

Feature Tagger - Formal Definition
Let f be a feature of interest. Let C = {c_1, ..., c_n} be a corpus of output (candidate) sentences produced by the evaluated system, and let R = {r_1, ..., r_n} be a set of corresponding references. Let τ be a function over sentences x ∈ C ∪ R that replaces each span bearing the feature f with a special mask token M_f (we assume the spans bearing f are non-overlapping). Denote the i-th token of τ(x) by τ(x)^(i). Then, for each token:

$$\tau(x)^{(i)} = \begin{cases} M_f & \text{if the $i$-th token belongs to a span bearing } f \\ x^{(i)} & \text{otherwise} \end{cases} \qquad (1)$$

Oracle and Anti-Oracle Masking
Let σ be a reference-based evaluation metric that takes a set of system outputs C and a set of references R and returns a real value. By applying τ to C and R, we can define two masking strategies that represent either the best possible performance on the sub-spans marked by f, or the worst.
We refer to the optimistic masking strategy as oracle masking and denote it by τ_max: the same mask token M_f is applied to both the references and the outputs, so masked spans always agree; this coincides with eq. 1. For example, if we take f to be common nouns:

Reference: John likes apples and oranges.
Output: John loves bananas and apples.
τ_max(reference) = John likes NOUN and NOUN.
τ_max(output) = John loves NOUN and NOUN.

Now, to minimize rather than maximize σ(R, C) by masking spans bearing the feature f, we apply different masks to the outputs and to the references. This strategy generally decreases σ, as it deletes existing correspondences between the references and the outputs. We refer to this masking strategy as anti-oracle masking and denote it by τ_min.
Repeating the example above (NOUN and NOUN' are different tokens):

reference: John likes apples and oranges.
output: John loves bananas and apples.
τ_min(reference) = John likes NOUN and NOUN.
τ_min(output) = John loves NOUN' and NOUN'.

The corpus-level score under each masking strategy is then:

max_σ(R, C) = σ(τ_max(R), τ_max(C)),  min_σ(R, C) = σ(τ_min(R), τ_min(C)).
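A small sketch of the two strategies, using a toy noun list in place of a real tagger (all names here are ours, for illustration):

```python
# Toy tagger: in practice a real POS tagger marks the NOUN spans.
TOY_NOUNS = {"apples", "oranges", "bananas"}

def tau(sentence: str, mask: str) -> str:
    """Replace every span bearing the feature (here: nouns) with `mask`."""
    words = sentence.rstrip(".").split()
    masked = [mask if w.lower() in TOY_NOUNS else w for w in words]
    return " ".join(masked) + "."

ref = "John likes apples and oranges."
out = "John loves bananas and apples."

# Oracle masking: the same token on both sides, so masked spans agree.
ref_max, out_max = tau(ref, "NOUN"), tau(out, "NOUN")
# Anti-oracle masking: different tokens, so masked spans never agree.
ref_min, out_min = tau(ref, "NOUN"), tau(out, "NOUN'")
```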

MuLER Score
Using these definitions, we may now define the MuLER score. Let I ⊆ {1, ..., n} be the set of indices for which both r_i ∈ R and c_i ∈ C contain a span bearing the feature f, and let R_I, C_I denote the corresponding subsets. We define the MuLER score as:

$$\mathrm{MuLER}(f) = \frac{\max_\sigma(R_I, C_I) - \sigma(R_I, C_I)}{\max_\sigma(R_I, C_I) - \min_\sigma(R_I, C_I)}$$

We compute MuLER variants only on indices in which both the reference and the output contain f (this prevents division by zero).
Intuitively, MuLER captures the potential gains obtained by the best possible rendering of f, where the numerator of the score captures the absolute gains from improving f. MuLER is therefore a unitless metric that measures how much of the potential gain would be realized by improving the generated spans bearing the feature f.
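As a minimal sketch, assuming the corpus-level scores defined above, the normalized score can be computed as:

```python
def muler(score: float, score_max: float, score_min: float) -> float:
    """Normalized MuLER score, following the definition above: the share
    of the attainable gain on feature f that the system fails to realize
    (0 means the system already matches the oracle on f)."""
    assert score_max != score_min, "degenerate interval; see App. G"
    return (score_max - score) / (score_max - score_min)

# E.g., base BLEU 30, oracle-masked 40, anti-oracle-masked 20:
print(muler(30.0, 40.0, 20.0))  # 0.5: half the potential gain is unrealized
```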
For simplicity of notation, we assume a single reference per sentence, but the formulation generalizes straightforwardly to multi-reference settings.

Motivating Discussion
MuLER intends to assess a system's ability per feature exhibited in the text. Ideally, features could be analyzed both within a single system (§4.1) and across systems (§4.2). However, the latter may require special treatment. To illustrate this claim, imagine two MT systems, one nearly perfect and one that produces random outputs. The near-perfect system has little to gain from masking spans bearing a feature f, and hence the numerator of MuLER will be around zero. However, this is also the case for the random system, since there is hardly any margin for improvement: even if some words are correctly predicted, the malformed context means a low sentence score. This hints that the numerator is not comparable between systems with substantially different performance, and should therefore be normalized.
In order to better capture a system's overall performance, we leverage the anti-oracle masking, noting that σ(R, C) lies in the interval [min_σ(R, C), max_σ(R, C)] (except for edge cases, see App. §G). The length of this max-min interval can be interpreted as the quality with which the system translates the contexts of spans bearing the feature f: the farther apart the oracle and the anti-oracle are, the better the system is at translating the contexts. To illustrate this point, consider the two extremes. For a high-performing system, the distance between min_σ(R, C) and max_σ(R, C) is expected to be substantial: there is a lot to lose from an error. A very weak system, in contrast, will have a small distance, as the minimum and the maximum will both be around zero.
Hallucination Score. MuLER is defined only for sentences in which both the reference and the candidate contain the feature f. Hence, it measures the quality of generation but not cases of over- or under-generation. To account for such cases, and to ensure the system generates the feature at all, we define a hallucination score.
The hallucination score consists of 3 numbers: add (η_1(f)), hit (η_2(f)), and miss (η_3(f)) scores. η_1(f) is the number of sentences in which the feature f appears more times in the reference than in the output, η_2(f) is the number of sentences in which f appears more times in the output than in the reference, and η_3(f) is the number of remaining sentences, in which f appears an equal number of times in both. See §4.6 for a usage example of the score.
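A sketch of these counts, given per-sentence occurrence counts of f in the references and outputs (the function name is ours):

```python
def hallucination_scores(ref_counts, out_counts):
    """Our reading of the three counts above, from per-sentence
    occurrence counts of f in references and outputs:
    eta1: sentences where f occurs more often in the reference,
    eta2: sentences where f occurs more often in the output,
    eta3: sentences where the counts agree."""
    eta1 = sum(r > o for r, o in zip(ref_counts, out_counts))
    eta2 = sum(o > r for r, o in zip(ref_counts, out_counts))
    eta3 = sum(r == o for r, o in zip(ref_counts, out_counts))
    return eta1, eta2, eta3
```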

Leveraging Sentence Scorers
Often, instead of a tagger, a continuous scoring function is available for f. A scorer operates on tokens or sentences and captures a certain aspect of the text (such as sentiment or concreteness). We propose a way to utilize such scorers to analyze a system's generation abilities.
Let σ : S → ℝ be a scoring function over sentences. For a set of references R = {r_1, ..., r_n} and a set of candidates C = {c_1, ..., c_n}, where c_i is the candidate for r_i, we define a score s_σ comparing the scorer's values on the references and the candidates.
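One natural way to instantiate s_σ, stated here as our assumption rather than the paper's exact definition, is the mean absolute difference between the scorer's values on each reference and its candidate:

$$ s_\sigma(R, C) = \frac{1}{n} \sum_{i=1}^{n} \left| \sigma(r_i) - \sigma(c_i) \right| $$

Under this reading, a small value indicates that the system reproduces the scored aspect of the references (e.g., their sentiment).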

Experimental Setup
Evaluation Metrics. As reference-based metrics, we consider BLEU (Papineni et al., 2002), BERTScore (Zhang et al., 2019) and ROUGE (Lin, 2004). BLEU was developed to measure machine translation quality and focuses on precision. ROUGE was made for summarization and focuses on recall. Both are based on overlapping n-grams, while BERTScore, a metric for text generation quality, is based on similarity between contextualized embeddings. For these metrics, the basic unit of evaluation is a sentence: they compare a reference sentence (a human translation) with a candidate sentence (a system output).
Features. We experiment with several feature types, each separated into individual features: POS tags, named entity types, and dependency features (see Table 9 for a full description).
Sentence scorers. As dedicated scorers, we look at sentiment analysis, concreteness, valence, dominance and arousal (cf. App. §A).

Released Library Specifications. Upon acceptance, we will share a library of code. The library allows using the metrics used in this paper as well as easily defining new ones. It reports MuLER variants as well as hallucination scores (§2.4).
Gender. We make use of the WinoGender dataset (Rudinger et al., 2018), in which each sentence has male, female and neutral variants.
Paraphrases. We use the minimal paraphrase pairs corpus by Patel et al. (2021). It contains parallel corpora with two types of syntactic variation: active versus passive sentences, and adverbial clauses versus noun phrases. The changes to the sentences are minimal; in particular, the semantic meaning remains identical. See App. §E.1 for more details.
Experiments with Naturalistic Data

Single Model Analysis
A key feature of MuLER is the ability to compare performance across features for a single model. Such an analysis can reveal a system's strengths and weaknesses, potentially leading to targeted development efforts on specific features, or be used for debugging purposes. It enables users to decide where to invest their efforts. Fig. 2 shows a standard MuLER report for two systems.

Comparison Across Systems
We compare WMT systems across years, architectures and performance levels.
MuLER Similarity to Other Measures. We compute the Pearson correlation between negative MuLER scores and BLEU, for every source language, over all submissions to WMT (2014−2017). We use negative MuLER so that a high correlation means improvement in both performance measures (e.g., BLEU and MuLER), as reference-candidate similarity is indicated by high BLEU but low MuLER. Fig. 3 shows that BLEU and MuLER are not always correlated. We see that the arousal, concreteness, dominance, sentiment and valence scores are in high agreement between MuLER and BLEU. However, some features, e.g., most of the named entity types, are not. This suggests that overall BLEU improvements do not necessarily mean better named entity translations.
We also see that different languages behave differently with respect to the types of features for which MuLER and BLEU are highly correlated. For example, in Chinese, BLEU is more correlated with MuLER over many different POS tags. This could be explained by differences in the structure of the languages (e.g., syntax). A possible explanation is that Chinese is simpler to translate in terms of overlapping unigrams (i.e., when syntax is ignored). We perform the same analysis comparing MuLER to indices-BLEU (BLEU over the indices in which the feature appears in both the reference and the output) and to the max_σ(R, C) − min_σ(R, C) term, obtaining similar results (see Fig. 10).
Systems Over Time. We compare WMT systems (see §3.1) from different years and language pairs with MuLER. Overall, there is a consistent trend (see Figs. 4, 5 and 6): as BLEU improves, MuLER improves. However, this trend is not uniform across all features. For certain phenomena, improvement is not consistent with system quality. This is shown by a near-zero or positive correlation between MuLER and the max_σ(R, C) − min_σ(R, C) term (indicative of the system's performance on the sentences containing f).
Surprisingly, we find that nouns and verbs are among the hardest POS tags to translate. On the face of it, this is unexpected, as they account for the most frequent POS tokens in training. Potentially, being open-class makes them harder: nouns as a class are common, but each individual noun is rare. This may also explain why determiners, which are frequent, are easy, and why adverbs are harder than the more frequent auxiliaries. Similar trends appear when comparing MuLER to the total BLEU score of the systems (Fig. 6).

Manual Analysis
To verify the effectiveness of MuLER, we perform a manual analysis, comparing pairs of systems that are roughly equal in their overall performance (under BLEU) but differ greatly on a given feature f (under MuLER). We compare 5 pairs of systems and a total of 201 sentences (see App. §F).
We consistently see that systems with lower MuLER scores (i.e., better performance) translate the feature f better (see Table 1). This means that the neighborhood of f in the candidate sentence is more similar to the reference, not only the masked span itself. Interestingly, we encounter many cases in which the span of f is identical in the reference and in both candidates, but the overall translation (i.e., the neighborhood) is better in the candidate with the lower MuLER. Table 10 shows that out of 97 sentences where quality differed, the system MuLER predicts to be better indeed translates better in 91.3% of the sentences.

MuLER with ROUGE -Summarization
We compute MuLER on 3 summarization models (App. §B) and various features. Fig. 7 shows a standard MuLER report, computed under the ROUGE metric. We see that strengths and weaknesses vary between the different systems. Moreover, we see that the concreteness score is always lower than the other scores provided by the sentence scorers (i.e., valence, dominance, arousal and sentiment). Inherently, we expect summarization outputs to be concrete, and this is indeed revealed by MuLER.

MuLER with LM-based Metrics
To validate that MuLER can be easily adapted to LM-based metrics, in addition to BLEU we also apply our MT analysis with BERTScore. We randomly choose 5 systems from the WMT-2020 Chinese-English submissions. Preliminary experiments show that MuLER can be straightforwardly extended to such metrics (App. §C).

Paraphrases and Gender
We apply MuLER to special cases to demonstrate its usefulness.
Minimal Paraphrases. We compare minimal paraphrases (App. §E.1) as if they were an output and a reference. Evidently, the hallucination score identifies phrasing differences (see Fig. 8). Sentences with adverbial clauses contain more verbs, while noun-phrase variants contain more nouns; thus their miss and hit scores complement each other, while the use of auxiliaries remains similar. The scores also recognize voice changes from active to passive, which require additional auxiliaries while keeping the same verbs and nouns.
WinoGender. Gender choice is critical for many applications. While sentences that differ only in gender receive a high BLEU score (0.8), the gender feature of MuLER is 1.0, reflecting the system's complete failure to capture the correct gender.

Validation Experiments
In this section, we perform various synthetic experiments to check the validity of MuLER. For a given feature f, let F be the set of words tagged as f (e.g., nouns) under τ, and let α ∈ [0, 1].

Range and Monotonicity of MuLER.
We expect the masked score to fall in the interval [min_σ(R, C), max_σ(R, C)] and to improve as the quality of translation of the feature f improves (monotonicity). That is, if a system outputs the right translation for a fraction α of the cases in F (and the wrong one for the remaining 1 − α), then we expect its score to lie at the corresponding intermediate point of the interval. We support this claim using synthetic data experiments. We define a hybrid version of MuLER using a combination of the oracle (O) and anti-oracle (AO) masking strategies (§2.1). We split F into two sets roughly containing α and 1 − α of its elements, by partitioning according to the sorted first letter. That is, we choose η to be the first letter in the English alphabet for which the set of all words in F that start with a letter in a-η is of size ≥ α|F|. We then split F into two sets: one containing all words that start with a letter in a-η, and its complement. We mask the α portion of the occurrences of f using the AO-strategy, and the rest using the O-strategy, both in the reference and in the candidate. This construction emulates a range of systems that improve on f as a function of α (see the sketch below).
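A sketch of this alphabetical split, under our reading of the procedure (the helper name is ours):

```python
# Sketch of the alphabetical split from §5 (our reading; helper name is
# ours). F is the set of words tagged with feature f, alpha in [0, 1].
import string

def alphabetical_split(F, alpha):
    """Return (head, tail): head holds the words whose first letter is in
    a..eta, where eta is the first cutoff giving |head| >= alpha * |F|."""
    for eta in string.ascii_lowercase:
        head = {w for w in F if w[0].lower() <= eta}
        if len(head) >= alpha * len(F):
            return head, F - head
    return set(F), set()

head, tail = alphabetical_split({"apple", "banana", "melon", "zebra"}, 0.5)
# head -> {'apple', 'banana'}; mask `head` with the AO-strategy and `tail`
# with the O-strategy to emulate a system that errs on an alpha-fraction of F.
```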
Tables 4, 12 and 13 show that this hybrid score is indeed always located within the interval [min_σ(R, C), max_σ(R, C)], at the point corresponding to the split (e.g., for a 50−50 split it lies in the middle of the interval).
Specificity of MuLER. We set out to verify that MuLER is not sensitive to random features in the text. We expect features that appear in random subsets of the text with the same frequency to receive roughly the same score. To verify this, we create synthetic features with the same frequency in F as real ones (e.g., nouns/verbs) and compute MuLER over them. Let U be the list of unique words in the union of R and C. For 1 ≤ j ≤ 1000, we split U into p equally sized groups {U_1, ..., U_p} (ignoring the remainder). Indeed, as seen in Table 2, the average proportion of U_i in R and C is roughly the same. For 1 ≤ i ≤ p we compute MuLER(R, C) by masking only the words in U_i (both in R and C). Each run yields p scores {(m_1, ..., m_p)_j}, from which we choose one at random; in total, we get 1000 scores M = {m_1, ..., m_1000}, for which we compute the variance and standard deviation (see Table 2). We find that the variance and standard deviation are around zero across values of p, for p ∈ {2, ..., 6} (see Table 14). That is, MuLER does not respond to random pseudo-features. Moreover, the results differ from those of real linguistic phenomena with the same frequency (e.g., nouns/verbs; see Table 2). These findings suggest that MuLER is not sensitive to variation that does not reflect variation in quality. The code sketch below illustrates the procedure.
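In this sketch, `muler_for_subset` is a hypothetical callback that computes MuLER with a given word set treated as the masked "feature":

```python
import random

def specificity_scores(U, p, muler_for_subset, runs=1000):
    """Partition vocabulary U into p equal groups of words, treat each
    group as a synthetic 'feature', score each with MuLER, and keep one
    score per run at random (our reading of the §5 procedure)."""
    chosen = []
    for _ in range(runs):
        words = list(U)
        random.shuffle(words)
        size = len(words) // p  # the remainder is ignored, as in the paper
        groups = [words[i * size:(i + 1) * size] for i in range(p)]
        chosen.append(random.choice([muler_for_subset(set(g)) for g in groups]))
    return chosen  # near-zero variance means MuLER ignores random "features"
```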
Robustness to Feature Frequency. We next validate that the MuLER score is relatively insensitive to the frequency of f.
We split F into two sets roughly containing α and 1 − α of its elements, by partitioning according to the sorted first letter (as explained above). We then mask the α portion of F and ignore the remaining instances. This allows us to test MuLER on a feature with similar performance (a random sample of the original feature) but different frequency, namely a fraction α of the feature f across F (this would not hold if the split were done at the sentence level). Table 3 shows that MuLER is robust to changes in the frequencies of nouns and verbs, compared to abl-MuLER, an ablated version of MuLER defined as MuLER's numerator. This holds across various frequencies and features (see Table 15). This suggests that MuLER is a more suitable score for measuring system performance and that its signal is not due to the frequency of the feature (frequency may play a role, but not a central one).

Related Work
Automatic metrics are useful for assessing systems, and we base our work on them (see §3). Other lines of work study a specific property and propose evaluation measures for it, for example, addressing hallucinations (Kryscinski et al., 2020), asserting factual consistency (Gabriel et al., 2020; Honovich et al., 2021; Pagnoni et al., 2021), or measuring grammaticality (Vadlapudi and Katragadda, 2010) or meaning preservation (Choshen and Abend, 2018b). We share with these works the aspiration to a more fine-grained form of evaluation.
Other methods analyze performance in a more fine-grained manner, for example, evaluation with minimal changes to the input (Warstadt et al., 2020), challenge sets (Macketanz et al., 2018; Emelin and Sennrich, 2021), evaluation dependent on the domain of the data (Choshen and Abend, 2018a), understanding the inner workings of networks (Tenney et al., 2019; Slobodkin et al., 2021; Voita et al., 2020), dedicated sets of metrics (Gehrmann et al., 2021) and more (Ribeiro et al., 2020). A few methods highlight patterns rather than predefined properties, by contrasting texts (e.g., reference and output) (Gralinski et al., 2019; Lertvittayakumjorn et al., 2021). In a sense, MuLER stands in the middle: it highlights a closed set of traits, but is extendable.

Conclusion
We presented a novel methodology (MuLER) to decompose any reference-based score into fine-grained components. MuLER filters and dissects naturalistic data to highlight phenomena in the generated text. We validated MuLER using a set of synthetic experiments (§5). Applying MuLER to off-the-shelf systems, we saw (§4) that systems' strengths and weaknesses vary, even when their overall performance is alike, and we detected interesting trends over the years. Our work opens an avenue for further research into more fine-grained evaluation metrics, and provides a tool for understanding system behaviour. In future work, we plan to extend MuLER to more complex features such as long-distance syntactic dependencies and discourse phenomena.

Limitations
Among MuLER's appealing traits is its reliance on existing, accepted and easily exchanged components.
This is also a limitation: where the base metric is invariant to a trait, MuLER will be too, and where tagging or scoring is not available (e.g., for endangered languages), the corresponding features cannot be extracted. In general, detecting a feature (e.g., a POS tag) is usually easier than evaluating the quality of its generation, and MuLER leverages this to make such evaluation more accessible.
We showcase MuLER on BLEU and ROUGE as they are still among the most widely adopted metrics in their respective tasks.The concept of MuLER can be straightforwardly extended to LMbased metrics and we intend to explore it in future work.For now, we shared initial results on BERTScore suggesting this is indeed the case.
Some of our analysis builds on manual impressions; a quantitative analysis would give a fuller picture. However, evaluation of certain aspects of the new evaluation methodology was unavailable.
Similarly, for some validations we use synthetic experiments, which make for well-controlled settings but sometimes lack characteristics of natural data.
Overall, we aim to evaluate intrinsically, extrinsically via use cases, manually and synthetically, to present a full view in which the whole is greater than the sum of its parts.
Although we use MuLER to compare models, it is not clear whether such a comparison is meaningful for systems with very different overall performance; if one system's overall performance is very low, then even if it somehow translates a specific feature well, the quality of its output remains bad. However, comparing systems with similar overall performance is the more common use case, and hence useful, for example, when choosing between top-performing systems for a task, or when analyzing the differences between systems.

A Scorers used
In this section, we elaborate on the scorers we use and their origins.
Sentiment. Sentiment analysis is the process of determining whether a piece of text is positive, negative or neutral. We follow the method of Khoo and Johnkhan (2018), which relies on per-word scores and a rule-based combination (mainly handling negation). The method was shown to outperform other lexicons and to work well without the need for neural networks. We selected this method as it strikes a good balance between accuracy and running time, and we defer the application of neural metrics to future work. We consider 4 token-level scores, which we aggregate into a sentence score by averaging, ignoring words that do not appear in the lexicons (see the sketch below).
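As a sketch of this aggregation, with a tiny hypothetical lexicon standing in for the real ones:

```python
# Hypothetical mini-lexicon; the paper uses Khoo and Johnkhan (2018) for
# sentiment and similar per-word lexicons for the other scorers.
LEXICON = {"great": 0.8, "terrible": -0.7, "movie": 0.1}

def sentence_score(sentence):
    """Average the per-word lexicon scores, ignoring out-of-lexicon words."""
    scores = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else None

print(sentence_score("A great movie overall"))  # (0.8 + 0.1) / 2 = 0.45
```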
Concreteness. The concreteness rating of a word represents the extent to which the word is concrete, i.e., how perceptible it is. For example, "fruit" is less concrete than "banana", and "tomorrow" is more concrete than "sometime". The lexicon (Brysbaert et al., 2014) contains 40K lemmas, each with a concreteness score.
Valence, Arousal and Dominance. In psychology, it is common to discuss three characteristics of how we perceive others (e.g., in recognizing faces (Jones et al., 2021)): valence (pleasure vs. displeasure), arousal (active vs. passive), and dominance (dominant vs. submissive). These were shown to be mostly independent dimensions of word meaning (Osgood et al., 1957; Russell, 1980, 2003). The lexicon (Mohammad, 2018) contains 20K words and their respective scores along each of these axes.

C LM-based Metrics
We perform preliminary experiments using BERTScore, a language-model (LM) based metric for measuring generation quality.
We use it with the "bert-base-uncased" model. To adapt BERTScore to MuLER, we alter the similarity matrix between the reference and candidate embeddings that is calculated during the score's computation. To compute max_σ(R, C), after the similarity matrix between the un-masked reference and the un-masked candidate is computed, we set the ij-th entry to 1 if both the i-th word in the reference and the j-th word in the candidate are masked (if a masked word is split into multiple tokens by the BERT tokenizer, we set the corresponding entry to 1 for each of them). To compute min_σ(R, C), we instead set the i-th row to zeros if the i-th word in the reference is masked, and the j-th column to zeros if the j-th word in the candidate is masked. Indeed, in this setting we also get that min_σ(R, C) ≤ σ(R, C) ≤ max_σ(R, C) (this holds for 1000 randomly sampled sentences from the submissions we analyzed). We randomly sampled 5 submissions to WMT-2020 for Chinese-English (Tencent_Translation.1249, Online-B.1605, DeepMind.381, Huoshan_Translate.919 and OPPO.1422). They exhibit trends similar to the results obtained by MuLER with BLEU.
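The following numpy sketch captures our reading of these alterations; `sim` and the boolean mask arrays are illustrative stand-ins for BERTScore's internal similarity matrix, not its actual API:

```python
import numpy as np

def adapt_similarity(sim, ref_masked, cand_masked, oracle=True):
    """Alter a reference-by-candidate token similarity matrix.
    sim: (len_ref, len_cand) array; ref_masked/cand_masked: boolean arrays
    marking tokens that fall inside a masked span."""
    sim = sim.copy()
    if oracle:
        # max_sigma: every masked reference token matches every masked
        # candidate token perfectly.
        sim[np.ix_(ref_masked, cand_masked)] = 1.0
    else:
        # min_sigma: masked tokens match nothing on either side.
        sim[ref_masked, :] = 0.0
        sim[:, cand_masked] = 0.0
    return sim
```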

D Data
We provide the complete MuLER database, containing the results for all WMT submissions (2014−2020) on all features (see §3), in the supplementary materials (App. §E). We will release it together with our code upon acceptance.

E Supplementary Materials
The complete MuLER database (scores for all WMT submissions, 2014−2020) and the tagged manual analysis are in the supplementary materials submitted with the paper.

Table 5: Examples of minimal paraphrases

E.2 WinoGender
WinoGender (Rudinger et al., 2018) consists of sentence pairs that differ only in the gender of one pronoun; see examples in Table 6.

F Manual Analysis
We perform a small-scale manual analysis to validate that MuLER indeed indicates the quality of performance on a given feature. We chose 5 systems from different years and language pairs (see Table 10 for full details). We compare pairs of systems that are roughly equal in their overall performance (under BLEU), but differ greatly on a given feature f under MuLER (see §4.3). One of the authors annotated the data. For every pair of submissions, the data was shuffled so that the sentences appeared side by side without revealing which system is the better one.

G Negative MuLER
Intuitively, we expect to always gain by masking a certain proportion of a given feature in the text (i.e., a positive MuLER score). However, there are edge cases in which max_σ(R, C) − BLEU(R, C) is negative. This can be due to a mistake of the tagger or to the sentence structure (for example, a word that is a noun in the reference is used in the candidate as a verb, etc.). In Table 7 we present examples of such cases, pairing sentences with their NOUN-masked versions:

NOUN is being used for NOUN in NOUN NOUN.
Nitromethane is used, for example, drag racing.
NOUN is used, for NOUN, drag NOUN.

The film will premiere in Finland in September 2015.
The NOUN will premiere in Finland in September 2015.
The film will have its Finnish premiere in September 2015.
The NOUN will have its Finnish NOUN in September 2015.

Its unpredictability unsettled people's nerves.
Its unpredictability made people nervous.
Its NOUN made NOUN nervous.

Our whole house moved, we were trembling with fear.
Our whole NOUN moved, we were trembling with NOUN.
We need the whole of our house moved: vapisimme fear.
We need the NOUN of our NOUN moved: NOUN NOUN.

H Graphs

We supply here multiple graphs that were mentioned in the text; the rest of the analysis graphs can be found in the supplementary files.

Table 10: Manual Analysis. System A is the system with the lower MuLER score (i.e., better performance on the feature). A=B/A>B/A<B indicates the number of sentences where the translation of the feature was of equal quality between systems A and B (or better/worse, accordingly). "BLEU indices A/B" is the BLEU score of system A/B on the sentences whose reference and output contain the feature.

Figure 1 :
Figure 1: Illustration of MuLER for the feature NOUN. Two masking strategies are employed on the reference and the candidate: oracle masking max(R, C), and anti-oracle masking min(R, C). σ is the task's metric (e.g., BLEU, ROUGE).

Figure 3 :
Figure 3: Similarity of Measures. Correlation between BLEU and −MuLER per feature (column) and source language (row). Positive values suggest that systems that are better by BLEU also translate the feature better.

Figure 4 :
Figure 4: MuLER vs. max(R,C) minus min(R,C), calculated on selected POS tags. All submissions to WMT (2014−2020) for German-English. Next to each POS tag is the correlation between all x-axis and y-axis values for that POS tag.

Figure 5 :
Figure 5: MuLER vs. max minus min, calculated on named entities. All submissions to WMT (2017−2020) for Chinese-English. Next to each entity is the correlation between the x-axis and y-axis values for that entity.

Figure 6 :
Figure 6: POS-tag MuLER vs. BLEU. All submissions to WMT (2014−2020) for Russian-English. Next to each POS tag is the correlation between all x-axis and y-axis values for that POS tag.

Figure 7 :
Figure 7: MuLER for summarization. The MuLER score is calculated for various features, under ROUGE. We compare 3 models: T5-small, T5-base and DistilBART.

Figure 8 :
Figure 8: Hallucination scores of verbs, nouns and auxiliaries for minimal syntactic paraphrases.
Figure 10: Similarity of Measures. Represents correlation of score achievements, e.g., positive values between BLEU and MuLER suggest that BLEU increases as MuLER decreases, and vice versa.

Figure 11 :
Figure 11: Frequency of MuLER entities. For each language pair, we chose the submission with the best BLEU score (from WMT 2014−2020) and calculated the average frequency of each feature.

Figure 12 :
Figure 12: Uniqueness of MuLER entities. For each language pair, we chose the submission with the best BLEU score (from WMT 2014−2020). For each feature, we calculate its average uniqueness, defined as the number of unique times the feature appears in the text divided by the total number of times it appears.

Table 1 :
Example sentences from WMT's submissions. System A has a lower MuLER score than system B. We indicate whether the chosen feature is consistent or inconsistent with the reference.

Table 2:
Specificity of MuLER. The two leftmost columns are the average proportion of the synthetic features in the reference and the output; the "average proportion" column indicates the average frequency of the features (e.g., NOUN/VERB) in the reference and the output (as described in §5).

Table 3:
Robustness to Feature Frequency. Presented here are 3 submissions from WMT 2019, translating from German to English (see Table 15 for more results). We compare MuLER and abl-MuLER (MuLER's numerator, an ablated version of MuLER) with 50%/100% of nouns/verbs masked.

Table 6 :
Female-Male pairs from the WinoGender dataset. For example:
The technician told the customer that she could pay with cash. / The technician told the customer that he could pay with cash.
The supervisor gave the employee feedback on her stellar performance. / The supervisor gave the employee feedback on his stellar performance.
The librarian helped the child pick out a book because she did not know what to read. / The librarian helped the child pick out a book because he did not know what to read.

Table 8 :
Example sentences from WMT's submissions. System A has a lower MuLER score than system B. We indicate whether the chosen feature is consistent or inconsistent with the reference.

Table 9 :
Features we use in the paper.

Table 11 :
Range and Monotonicity of MuLER. Presented here are MuLER scores on nouns and verbs in 5 randomly chosen systems from WMT: oracle (O) and anti-oracle (AO) masking strategies vs. the hybrid masking strategy (as described in §5) at a 50−50 split (50% of nouns/verbs are masked with the O-strategy, and the rest with the AO-strategy).

Table 12 :
Range and Monotonicity of MuLER. Presented here are MuLER scores on nouns and verbs in 5 randomly chosen systems from WMT: oracle (O) and anti-oracle (AO) masking strategies vs. the hybrid masking strategy (as described in §5) at a 40−60 split (40% of nouns/verbs are masked with the O-strategy, and the rest with the AO-strategy).

Table 13 :
Range and Monotonicity of MuLER. Presented here are MuLER scores on nouns and verbs in 5 randomly chosen systems from WMT: oracle (O) and anti-oracle (AO) masking strategies vs. the hybrid masking strategy (as described in §5) at a 30−70 split (30% of nouns/verbs are masked with the O-strategy, and the rest with the AO-strategy).