Difficulty-Aware Machine Translation Evaluation

The high-quality translation results produced by machine translation (MT) systems still pose a huge challenge for automatic evaluation. Current MT evaluation pays the same attention to each sentence component, while the questions of real-world examinations (e.g., university examinations) have different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric, expanding the evaluation dimension by taking translation difficulty into consideration. A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function, and conversely. Experimental results on the WMT19 English-German Metrics shared tasks show that our proposed method outperforms commonly used MT metrics in terms of human correlation. In particular, our proposed method performs well even when all the MT systems are very competitive, which is when most existing metrics fail to distinguish between them. The source code is freely available at https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation.


Introduction
The human labor needed to evaluate machine translation (MT) is expensive. To alleviate this, various automatic evaluation metrics are continuously being introduced to correlate with human judgements. Unfortunately, cutting-edge MT systems are too close in performance and generation style for such metrics to rank them reliably. Even for a metric whose correlation is reliable in most cases, empirical research has shown that it correlates poorly with human ratings when evaluating competitive systems (Ma et al., 2019; Mathur et al., 2020), limiting the development of MT systems. Current MT evaluation still faces the challenge of how to better evaluate the overlap between the reference and the model hypothesis while taking adequacy and fluency into consideration. In doing so, all the evaluation units are treated the same, i.e., all the matching scores have an equal weighting. However, in real-world examinations, the questions vary in their difficulty. Questions that are easily answered by most subjects tend to have low weightings, while those that are hard to answer have high weightings. A subject who is able to solve the more difficult questions can receive a high final score and gain a better ranking. MT evaluation is also a kind of examination. To bridge the gap between human examination and MT evaluation, it is advisable to incorporate a difficulty dimension into the MT evaluation metric.
In this paper, we take translation difficulty into account in MT evaluation and test its effectiveness on a representative MT metric, BERTScore (Zhang et al., 2020), to verify the feasibility. More specifically, the difficulty is first determined across the systems with the help of pairwise similarity, and then exploited as the weight in the final score function to distinguish the contributions of different sub-units. Experimental results on the WMT19 English↔German evaluation task show that difficulty-aware BERTScore correlates better with human judgement than the existing metrics. Moreover, it agrees very well with the human rankings when evaluating competitive systems.

Related Work
The existing MT evaluation metrics can be categorized into the following types according to their underlying matching sub-units: n-gram based (Papineni et al., 2002; Doddington, 2002; Lin and Och, 2004; Han et al., 2012; Popović, 2015), edit-distance based (Snover et al., 2006; Leusch et al., 2006), alignment-based (Banerjee and Lavie, 2005), embedding-based (Zhang et al., 2020; Chow et al., 2019; Lo, 2019) and end-to-end based (Sellam et al., 2020). BLEU (Papineni et al., 2002) is widely used as a vital criterion in the comparison of MT system performance, but its reliability has been doubted since the advent of neural machine translation (Shterionov et al., 2018; Mathur et al., 2020). Because BLEU and its variants only assess surface linguistic features, some metrics leverage contextual embeddings and end-to-end training to bring semantic information into the evaluation, which further improves the correlation with human judgement. Among them, BERTScore (Zhang et al., 2020) has achieved remarkable performance across MT evaluation benchmarks while balancing speed and correlation. In this paper, we choose BERTScore as our testbed.

Motivation
In real-world examinations, the questions are empirically divided into various levels of difficulty. Since the difficulty varies from question to question, so does the role each question plays in the evaluation. A simple question, which can be answered by most of the subjects, usually receives a low weighting. A difficult question, which has more discriminative power, can only be answered by a small number of good subjects, and thus receives a higher weighting. Motivated by this evaluation mechanism, we measure the difficulty of a translation by viewing the MT systems and the sub-units of the sentence as the subjects and questions, respectively. From this perspective, the sentence-level sub-units should differ in their impact on the evaluation results. Those sub-units that may be incorrectly translated by most systems (e.g., polysemous words) should have a higher weight in the assessment, while easier-to-translate sub-units (e.g., the definite article) should receive less weight.

Difficulty-Aware BERTScore
In this part, we aim to answer two questions: 1) how to automatically collect the translation difficulty from BERTScore; and 2) how to integrate the difficulty into the score function. Figure 1 presents an overall illustration.
Pairwise Similarity Traditional n-gram overlap cannot capture semantic similarity, whereas word embeddings provide a means of quantifying the degree of overlap, which allows more accurate difficulty information to be obtained. Since BERT is a strong language model, it can be utilized as a contextual embedding $O_{\mathrm{BERT}}$ (i.e., the output of BERT) for obtaining the representations of the reference $\mathbf{t}$ and the hypothesis $\mathbf{h}$. Given a specific hypothesis token $h$ and reference token $t$, the similarity score $\mathrm{sim}(t, h)$ is computed as follows:
$$\mathrm{sim}(t, h) = \frac{O_{\mathrm{BERT}}(t)^{\top} O_{\mathrm{BERT}}(h)}{\lVert O_{\mathrm{BERT}}(t) \rVert \cdot \lVert O_{\mathrm{BERT}}(h) \rVert}$$
Subsequently, a similarity matrix is constructed by computing the token similarity pairwise. Then the token-level matching score is obtained by greedily searching for the maximal similarity in the matrix, which will be further taken into account in sentence-level score aggregation.
Table 1: Absolute correlations with system-level human judgments on WMT19 metrics shared task. For each metric, higher values are better. Difficulty-aware BERTScore consistently outperforms vanilla BERTScore across different evaluation metrics and translation directions, especially when the evaluated systems are very competitive (i.e., evaluating on the top 30% systems).
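The pairwise similarity and greedy matching steps can be illustrated with a minimal sketch. It assumes the contextual embeddings are already available as plain lists of floats; the toy 2-d vectors and the helper names `cosine`, `similarity_matrix` and `greedy_match` are hypothetical and only stand in for real BERT outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(ref_vecs, hyp_vecs):
    """sim[i][j] is the similarity of reference token i and hypothesis token j."""
    return [[cosine(t, h) for h in hyp_vecs] for t in ref_vecs]

def greedy_match(sim):
    """Best match for each reference token (row max) and each hypothesis token (column max)."""
    ref_best = [max(row) for row in sim]
    hyp_best = [max(col) for col in zip(*sim)]
    return ref_best, hyp_best

# Toy 2-d vectors standing in for contextual embeddings (illustrative values only).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
hyp_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

sim = similarity_matrix(ref_vecs, hyp_vecs)
ref_best, hyp_best = greedy_match(sim)
```

The row maxima feed the recall side of the score and the column maxima feed the precision side.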

Difficulty Calculation
The calculation of difficulty can be tailored for different metrics based on the overlap matching score. In this case, BERTScore evaluates the token-level overlap by pairwise semantic similarity, and thus the token-level similarity is viewed as the bedrock of the difficulty calculation. For instance, if one token (like "cat") in the reference finds identical or synonymous substitutions in only a few MT system outputs, then the corresponding translation difficulty weight ought to be larger than for other reference tokens, which further indicates that it is more valuable for evaluating translation capability. Combined with the BERTScore mechanism, this is implemented by averaging the token similarities across systems. Given $K$ systems and their corresponding generated hypotheses $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_K$, the difficulty of a specific token $t$ in the reference $\mathbf{t}$ is formulated as
$$d(t) = 1 - \frac{\sum_{k=1}^{K} \max_{h \in \mathbf{h}_k} \mathrm{sim}(t, h)}{K}$$
An example is shown in Figure 1: the entity "cat" is improperly translated to "monkey" and "puppy", resulting in a lower pairwise similarity for the token "cat", which indicates higher translation difficulty. Therefore, by incorporating the translation difficulty into the evaluation process, the token "cat" contributes more to the overall score, while other words like "cute" are less important.
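As a rough illustration of the difficulty calculation, the sketch below averages each reference token's best similarity over the systems and subtracts the result from one. The function name `token_difficulty`, the three-token reference and all similarity values are invented for illustration and are not taken from Figure 1.

```python
def token_difficulty(sims_per_system):
    """Difficulty of each reference token.

    sims_per_system[k][i][j] is the similarity between reference token i
    and token j of the hypothesis produced by system k.
    Returns 1 - (average over systems of the best similarity) per reference token.
    """
    num_systems = len(sims_per_system)
    num_ref_tokens = len(sims_per_system[0])
    difficulty = []
    for i in range(num_ref_tokens):
        best_per_system = [max(sim[i]) for sim in sims_per_system]
        difficulty.append(1.0 - sum(best_per_system) / num_systems)
    return difficulty

# Hypothetical reference and similarity matrices for three systems.
reference = ["the", "cute", "cat"]
sims_per_system = [
    [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.3, 1.0]],  # system 1 translates "cat" correctly
    [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.3, 0.4]],  # system 2 outputs a wrong entity
    [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.3, 0.4]],  # system 3 outputs a wrong entity
]

difficulty = dict(zip(reference, token_difficulty(sims_per_system)))
```

In this toy setting only "cat" receives a non-zero difficulty, because it is the only token that some systems fail to match.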
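Score Function Because the translations generated by current NMT models are fluent enough but not yet adequate, the F-score, which takes both Precision and Recall into account, is more appropriate for aggregating the matching scores than precision alone. We thus follow vanilla BERTScore in using the F-score as the final score. The proposed method directly assigns difficulty weights to the corresponding similarity scores without any hyperparameter:
$$\text{DA-}R_{\mathrm{BERT}} = \frac{1}{|\mathbf{t}|} \sum_{t \in \mathbf{t}} d(t) \max_{h \in \mathbf{h}} \mathrm{sim}(t, h)$$
$$\text{DA-}P_{\mathrm{BERT}} = \frac{1}{|\mathbf{h}|} \sum_{h \in \mathbf{h}} d(h) \max_{t \in \mathbf{t}} \mathrm{sim}(t, h)$$
$$\text{DA-}F_{\mathrm{BERT}} = 2 \cdot \frac{\text{DA-}R_{\mathrm{BERT}} \cdot \text{DA-}P_{\mathrm{BERT}}}{\text{DA-}R_{\mathrm{BERT}} + \text{DA-}P_{\mathrm{BERT}}}$$
For any $h \notin \mathbf{t}$, we simply let $d(h) = 1$, i.e., retaining the original calculation. The motivation is that a human assessor keeps their initial matching judgement if the test taker produces a unique but reasonable alternative answer. We refer to DA-F_BERT as DA-BERTScore in the following.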
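The weighted aggregation can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: membership of a hypothesis token in the reference is decided here by string identity, repeated reference tokens share one difficulty value, and the function name `da_bertscore` and all numbers are hypothetical.

```python
def da_bertscore(sim, ref_tokens, hyp_tokens, ref_difficulty):
    """Difficulty-weighted recall, precision and F-score for one sentence pair.

    sim[i][j] is the similarity of reference token i and hypothesis token j.
    """
    ref_best = [max(row) for row in sim]        # greedy match per reference token
    hyp_best = [max(col) for col in zip(*sim)]  # greedy match per hypothesis token

    # Assumption: a hypothesis token inherits d(t) when its string occurs in the
    # reference; otherwise its weight is 1, which keeps the original calculation.
    difficulty_of = dict(zip(ref_tokens, ref_difficulty))
    hyp_difficulty = [difficulty_of.get(tok, 1.0) for tok in hyp_tokens]

    recall = sum(d * s for d, s in zip(ref_difficulty, ref_best)) / len(ref_tokens)
    precision = sum(d * s for d, s in zip(hyp_difficulty, hyp_best)) / len(hyp_tokens)
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)

# Hypothetical difficulties and similarity matrices.
ref_tokens = ["the", "cute", "cat"]
ref_difficulty = [0.1, 0.2, 0.8]

wrong_hyp = ["the", "cute", "puppy"]
wrong_sim = [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.3, 0.5]]

right_hyp = ["the", "cute", "cat"]
right_sim = [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.3, 1.0]]

wrong_scores = da_bertscore(wrong_sim, ref_tokens, wrong_hyp, ref_difficulty)
right_scores = da_bertscore(right_sim, ref_tokens, right_hyp, ref_difficulty)
```

Because "cat" carries most of the weight in this toy example, the hypothesis that translates it correctly receives the clearly higher F-score.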
There are many variants of our proposed method: 1) designing a more elaborate difficulty function (Liu et al., 2020; Zhan et al., 2021); 2) applying a smoothing function to the difficulty distribution; and 3) using other kinds of F-score, e.g., the F_0.5-score. The aim of this paper is not to explore this whole space but simply to show that a straightforward implementation works well for MT evaluation.

Experiments
Data The WMT19 English↔German (En↔De) evaluation tasks are challenging due to the large discrepancy between human and automated assessments in terms of reporting the best system (Bojar et al., 2018; Barrault et al., 2019; Freitag et al., 2020). To sufficiently validate the effectiveness of our approach, we choose these tasks as our evaluation subjects. There are 22 systems for En→De and 16 for De→En. Each system has its corresponding human assessment results. The experiments were centered on the correlation with system-level human ratings.
Table 2: Agreement of system ranking with human judgement on the top 30% systems (k=6) of WMT19 En→De Metrics task. ⇑/⇓ denotes that the rank given by the evaluation metric is higher/lower than human judgement, while a third marker denotes that the given rank is equal to the human ranking. DA-BERTScore correctly identifies the best system, which the other metrics fail to do. It also shows the lowest rank difference.
Comparing Metrics To compare against metrics with different underlying evaluation mechanisms, four representative and popular metrics are included in the comparison experiments: BLEU (Papineni et al., 2002), TER (Snover et al., 2006), METEOR (Banerjee and Lavie, 2005; Denkowski and Lavie, 2014) and BERTScore (Zhang et al., 2020), which are driven by n-gram, edit distance, word alignment and embedding similarity, respectively. To ensure reproducibility, the original implementations (footnotes 1 and 2) and a widely used implementation (footnote 3) were used in the experiments.

Main Results
Following the correlation criterion adopted by the WMT official organization, Pearson's correlation r is used for validating the system-level correlation with human ratings. In addition, two rank correlations, Spearman's ρ and the original Kendall's τ, are also used to examine the agreement with human ranking, as has been done in recent research (Freitag et al., 2020). Table 1 lists the results. DA-BERTScore achieves competitive correlation results and further improves the correlation of BERTScore. In addition to the results on all systems, we also present the results on the top 30% systems, where the calculated difficulty is more reliable and our approach should be more effective. The results confirm our intuition that DA-BERTScore can significantly improve the correlations under the competitive scenario, e.g., improving the |r| score from 0.204 to 0.974 on En→De and from 0.271 to 0.693 on De→En.
Figure 2 compares the variation of Kendall's correlation over the top-K systems. Echoing previous research, the vast majority of metrics fail to correlate with human ranking and even exhibit negative correlation when K is lower than 6, meaning that the current metrics are ineffective when facing competitive systems. With the help of difficulty weights, the degradation in the correlation is alleviated, e.g., improving the τ score from 0.07 to 0.73 for BERTScore (K = 6). These results indicate the effectiveness of our approach and support the case for adding a difficulty dimension.
Table 2 presents a case study on the En→De task. Existing metrics consistently select MSRA's system as the best system, which shows a large divergence from human judgement. DA-BERTScore ranks it the same as the human judgement (4th) because most of its translations have low difficulty, and thus lower weights are applied in the scores. Encouragingly, DA-BERTScore ranks Facebook's system as the best one, which implies that it overcomes more challenging translation difficulties. This testifies to the importance and effectiveness of considering translation difficulty in MT evaluation.
Table 3 presents two cases, illustrating that our proposed difficulty-aware method successfully identifies the omission errors ignored by BERTScore. In the first case, Facebook's system correctly translates the token "right", and in the second case, it uses the substitute "Soldaten am Boden", which is lexically similar to the ground-truth token "Bodensoldaten". Although MSRA's system suffers from word omissions in the two cases, its hypotheses receive the higher ranking from BERTScore, which is inconsistent with human judgements. The reason might be that the semantics of the hypothesis are very close to those of the reference, so the slight lexical difference is hard to detect when calculating the similarity score. By distinguishing the difficulty of the reference tokens, DA-BERTScore successfully makes the evaluation focus on the difficult parts and eventually corrects the score of Facebook's system, thus giving the right rankings.
1 https://www.cs.cmu.edu/~alavie/METEOR/index.html
2 https://github.com/Tiiiger/bert_score
3 https://github.com/mjpost/sacrebleu
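The three system-level correlation measures, and the restriction to the top-K systems by human score, can be sketched in plain Python as below. The scores are invented toy values rather than WMT19 results, ties are assumed not to occur, and the helper names are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's r between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks in ascending order, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho as Pearson's r computed on ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Original Kendall's tau: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)

def top_k(human, metric, k):
    """Keep the k systems with the highest human scores."""
    keep = sorted(range(len(human)), key=lambda i: human[i], reverse=True)[:k]
    return [human[i] for i in keep], [metric[i] for i in keep]

# Invented system-level scores for five hypothetical systems.
human = [0.35, 0.31, 0.30, 0.10, -0.20]
metric = [0.80, 0.82, 0.79, 0.60, 0.40]

tau_all = kendall_tau(human, metric)
human_top, metric_top = top_k(human, metric, 3)
tau_top = kendall_tau(human_top, metric_top)
rho_top = spearman(human_top, metric_top)
```

In this toy data, the rank agreement drops once the clearly weaker systems are removed, which mirrors the kind of degradation discussed for competitive systems.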

Distribution of Difficulty Weights
The difficulty weights can reflect the translation ability of a group of MT systems. If the systems in a group are of higher translation ability, the calculated difficulty weights will be smaller. Starting from this intuition, we visualize the distribution of difficulty weights in Figure 3. The difficulty weights are concentrated at lower values, indicating that most of the tokens can be correctly translated by all the MT systems. For the difficulty weights calculated on the top 30% systems, the whole distribution skews towards zero, since these competitive systems have better translation ability and thus most of the translations are easy for them. This confirms that the difficulty weights produced by our approach are reasonable.
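A distribution of this kind can be summarized by bucketing the weights into equal-width bins, as in the minimal sketch below. It assumes the weights lie in [0, 1]; the function name `weight_histogram`, the bin count and the sample weights are hypothetical and do not reproduce Figure 3.

```python
def weight_histogram(weights, num_bins=5):
    """Count difficulty weights in equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for w in weights:
        # Clamp so that a weight of exactly 1.0 falls into the last bin.
        index = min(int(w * num_bins), num_bins - 1)
        counts[index] += 1
    return counts

# Invented difficulty weights for a handful of reference tokens.
weights = [0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.90]
counts = weight_histogram(weights)
```

Most of the toy weights fall into the lowest bin, which is the shape one would expect when most tokens are easy for the evaluated systems.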

Conclusion and Future Work
This paper introduces the concept of difficulty into machine translation evaluation, and verifies our assumption with a representative metric, BERTScore. Experimental results on the WMT19 English↔German metric tasks show that our approach achieves a remarkable correlation with human assessment, especially when evaluating competitive systems, revealing the importance of incorporating difficulty into machine translation evaluation. Further analyses show that our proposed difficulty-aware BERTScore can strengthen the evaluation of word omission problems and generate reasonable distributions of difficulty weights.
Future work includes: 1) optimizing the difficulty calculation; 2) applying the approach to other MT metrics; and 3) testing on other generation tasks, e.g., speech recognition and text summarization.