Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems

Eleftherios Avramidis, Shushen Manakhimova, Vivien Macketanz, Sebastian Möller


Abstract
This year’s MT metrics challenge set submission by DFKI expands on previous years’ linguistically motivated challenge sets. It includes 137,000 items extracted from 100 MT systems for the two language directions (English to German, English to Russian), covering more than 100 linguistically motivated phenomena organized into 14 linguistic categories. The metrics with the statistically significant best performance in our linguistically motivated analysis are MetricX-24-Hybrid and MetricX-24 for English to German, and MetricX-24 for English to Russian. Metametrics and XCOMET occupy the next positions in the ranking for both language pairs. Metrics are more accurate in detecting linguistic errors in translations produced by large language models (LLMs) than in translations produced by encoder-decoder neural machine translation (NMT) systems. Some of the most difficult phenomena for the metrics to score are the transitive past progressive, multiple connectors, and the ditransitive simple future I for English to German, and pseudogapping, contact clauses, and cleft sentences for English to Russian. Despite its overall low performance, the LLM-based metric Gemba performs best in scoring German negation errors.
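The abstract does not spell out the scoring protocol on this page, but challenge sets of this kind are commonly evaluated pairwise: a metric is counted as correct on an item if it assigns a higher score to the correct translation than to the translation containing the targeted linguistic error, and accuracy is then aggregated per phenomenon. The sketch below is only an illustration of that assumed protocol; the item fields, the `metric` callable, and the toy length-based scorer are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def challenge_set_accuracy(items, metric):
    """Illustrative per-phenomenon accuracy under an assumed pairwise protocol:
    the metric is correct on an item if it scores the correct translation
    above the incorrect one."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Each item (hypothetical schema): source sentence, a correct
        # translation ('good'), a translation exhibiting the linguistic
        # error ('bad'), and the phenomenon label.
        score_good = metric(item["source"], item["good"])
        score_bad = metric(item["source"], item["bad"])
        total[item["phenomenon"]] += 1
        if score_good > score_bad:
            correct[item["phenomenon"]] += 1
    return {ph: correct[ph] / total[ph] for ph in total}


if __name__ == "__main__":
    # Toy usage with a placeholder "metric" based on length difference only.
    items = [
        {"source": "She had been reading.", "good": "Sie hatte gelesen.",
         "bad": "Sie las gelesen.", "phenomenon": "past progressive"},
    ]
    dummy_metric = lambda src, hyp: -abs(len(src) - len(hyp))
    print(challenge_set_accuracy(items, dummy_metric))
```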
Anthology ID: 2024.wmt-1.37
Volume: Proceedings of the Ninth Conference on Machine Translation
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 517–528
URL: https://aclanthology.org/2024.wmt-1.37
Cite (ACL): Eleftherios Avramidis, Shushen Manakhimova, Vivien Macketanz, and Sebastian Möller. 2024. Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems. In Proceedings of the Ninth Conference on Machine Translation, pages 517–528, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems (Avramidis et al., WMT 2024)
PDF: https://aclanthology.org/2024.wmt-1.37.pdf