Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems

Dmitry Popov, Vladislav Negodin, Ekaterina Enikeeva, Iana Matrosova, Nikolay Karpachev, Max Ryabinin

Abstract
As machine translation systems approach human-level quality, traditional evaluation methodologies struggle to detect subtle translation errors. We critically examine limitations of the current gold-standard approaches (MQM and ESA), including inconsistencies caused by variable annotator expertise, excessive categorization complexity, coarse severity granularity, a bias toward accuracy over fluency, and time constraints. To address these issues, we introduce a high-quality dataset of human evaluations for English–Russian translations from WMT24, created by professional linguists. We show that expert assessments performed without time pressure yield substantially different results from standard evaluations. To enable consistent and rich annotation by these experts, we developed the RATE (Refined Assessment for Translation Evaluation) protocol. RATE provides a streamlined error taxonomy, expanded severity ratings, and multidimensional scoring that balances accuracy and fluency, facilitating deeper analysis of MT outputs. Our analysis, powered by this expert dataset, reveals that state-of-the-art MT systems may have surpassed human translations in accuracy while still lagging in fluency, a critical distinction obscured by existing accuracy-biased metrics. Our findings highlight that advancing MT evaluation requires not only better protocols but, crucially, high-quality annotations from skilled linguists.
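The abstract does not define RATE's schema, but as a reading aid, here is a minimal, hypothetical sketch of what an annotation record with span-level errors, an expanded severity scale, and separate accuracy and fluency scores might look like in Python. All field names, severity levels, and the weighting below are illustrative assumptions, not the paper's actual protocol:

from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    # Expanded severity scale (hypothetical levels; the paper's scale may differ).
    NEUTRAL = 0
    MINOR = 1
    MODERATE = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass
class ErrorSpan:
    start: int         # character offset of the error span in the translation
    end: int
    category: str      # entry from a streamlined error taxonomy, e.g. "mistranslation"
    severity: Severity


@dataclass
class RateAnnotation:
    source: str
    translation: str
    errors: list[ErrorSpan] = field(default_factory=list)
    accuracy_score: float = 0.0   # scored as its own dimension...
    fluency_score: float = 0.0    # ...so fluency gaps are not masked by accuracy

    def overall(self, w_accuracy: float = 0.5) -> float:
        """Combine the two dimensions; this weighting is an assumption."""
        return w_accuracy * self.accuracy_score + (1 - w_accuracy) * self.fluency_score

Keeping accuracy and fluency as separate fields mirrors the paper's central point: a single accuracy-biased score can hide exactly the fluency gap the authors report between MT output and human translation.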
Anthology ID:
2025.findings-emnlp.1203
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
22079–22095
URL:
https://aclanthology.org/2025.findings-emnlp.1203/
Cite (ACL):
Dmitry Popov, Vladislav Negodin, Ekaterina Enikeeva, Iana Matrosova, Nikolay Karpachev, and Max Ryabinin. 2025. Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22079–22095, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems (Popov et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1203.pdf
Checklist:
2025.findings-emnlp.1203.checklist.pdf