Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent

Markus Freitag; Nitika Mathur; Chi-kiu Lo; Eleftherios Avramidis; Ricardo Rei; Brian Thompson; Tom Kocmi; Frédéric Blain; Daniel Deutsch; Craig Stewart; Chrysoula Zerva; Sheila Castilho; Alon Lavie; George Foster

doi:10.18653/v1/2023.wmt-1.51

Abstract

This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. As last year, we acquired our own human ratings through expert-based human evaluation using Multidimensional Quality Metrics (MQM). Following last year’s success, we also included a challenge set subtask, in which participants created contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score as a weighted average across the various tasks. We present an extensive analysis of how well metrics perform on three language pairs: Chinese-English and Hebrew-English at the sentence level, and English-German at the paragraph level. The results strongly confirm last year’s finding that neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate the bad reference issues we observed this year for some language pairs. Finally, we study the connection between the magnitude of metric differences and their expected significance in human evaluation, which should help the community better understand and adopt new metrics.
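The global score mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which tasks were kept or how they were weighted, so the task names, scores, and weights below are hypothetical placeholders, and `global_score` is an illustrative helper rather than the organisers' implementation.

```python
from typing import Mapping


def global_score(task_scores: Mapping[str, float],
                 task_weights: Mapping[str, float]) -> float:
    """Weighted average of one metric's per-task meta-evaluation scores."""
    # Every weighted task needs a score, otherwise the average is undefined.
    missing = set(task_weights) - set(task_scores)
    if missing:
        raise ValueError(f"no score for tasks: {sorted(missing)}")
    total_weight = sum(task_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(task_scores[task] * weight
                       for task, weight in task_weights.items())
    return weighted_sum / total_weight


# Hypothetical per-task results for a single metric (not taken from the paper).
scores = {"zh-en/system": 0.9, "zh-en/segment": 0.5, "en-de/system": 0.8}
# Hypothetical weights; the actual weighting scheme is described in the paper.
weights = {"zh-en/system": 2.0, "zh-en/segment": 1.0, "en-de/system": 1.0}

print(global_score(scores, weights))
```

Dividing by the sum of the weights keeps the global score on the same scale as the per-task scores, so metrics can be ranked by a single number even when the weights are not normalised.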

Anthology ID: 2023.wmt-1.51
Volume: Proceedings of the Eighth Conference on Machine Translation
Month: December
Year: 2023
Address: Singapore
Editors: Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue: WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 578–628
URL: https://aclanthology.org/2023.wmt-1.51/
DOI: 10.18653/v1/2023.wmt-1.51
Cite (ACL): Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. 2023. Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent. In Proceedings of the Eighth Conference on Machine Translation, pages 578–628, Singapore. Association for Computational Linguistics.
Cite (Informal): Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent (Freitag et al., WMT 2023)
PDF: https://aclanthology.org/2023.wmt-1.51.pdf
