Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics

Chi-kiu Lo, Rebecca Knowles, Cyril Goutte


Abstract
While many new automatic metrics for machine translation evaluation have been proposed in recent years, BLEU scores are still used as the primary metric in the vast majority of MT research papers. There are many reasons that researchers may be reluctant to switch to new metrics, from external pressures (reviewers, prior work) to the ease of use of metric toolkits. Another reason is a lack of intuition about the meaning of novel metric scores. In this work, we examine “rules of thumb” about metric score differences and how they do (and do not) correspond to human judgments of statistically significant differences between systems. In particular, we show that common rules of thumb about BLEU score differences do not in fact guarantee that human annotators will find significant differences between systems. We also show ways in which these rules of thumb fail to generalize across translation directions or domains.
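The abstract contrasts raw metric score differences with statistically significant differences between systems. A common way to test the latter is paired bootstrap resampling (Koehn, 2004). The sketch below is an illustrative, simplified version over hypothetical segment-level scores; it is not the paper's exact procedure, and the function name and inputs are assumptions for the example.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Estimate how often system A beats system B in mean score
    under paired bootstrap resampling of segment-level scores.
    Illustrative sketch only, not the paper's exact method."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement, paired across systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    # Fraction of resamples in which A outscores B; values near 1.0
    # (or 0.0) suggest the observed difference is unlikely to be noise.
    return wins / n_resamples
```

A high win rate for one system is the kind of evidence a rule of thumb like "a BLEU difference of +1 is meaningful" tries to shortcut; the paper's point is that such shortcuts do not reliably track significant human preference judgments.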
Anthology ID:
2023.mtsummit-research.16
Volume:
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Month:
September
Year:
2023
Address:
Macau SAR, China
Editors:
Masao Utiyama, Rui Wang
Venue:
MTSummit
Publisher:
Asia-Pacific Association for Machine Translation
Pages:
186–199
URL:
https://aclanthology.org/2023.mtsummit-research.16
Cite (ACL):
Chi-kiu Lo, Rebecca Knowles, and Cyril Goutte. 2023. Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 186–199, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics (Lo et al., MTSummit 2023)
PDF:
https://aclanthology.org/2023.mtsummit-research.16.pdf