Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

Benjamin Marie, Atsushi Fujita, Raphael Rubino


Abstract
This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated the MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have changed dramatically during the past decade and follow concerning trends. An increasing number of MT evaluations rely exclusively on differences between BLEU scores to draw conclusions, without performing any kind of statistical significance testing or human evaluation, while at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy automatic metric scores from previous work and compare against them to claim the superiority of a method or an algorithm, without confirming that exactly the same training, validation, and test data were used or that the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation, along with a simple meta-evaluation scoring method to assess its credibility.
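The statistical significance testing the abstract calls for is often done with paired bootstrap resampling over the test set. The sketch below is a minimal, hypothetical illustration of that idea using toy sentence-level scores (it is not the paper's own method or tooling): it resamples the test set with replacement many times and measures how often system A outscores system B.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-sentence metric scores for two MT systems on
    the same test set (toy values here; in practice, sentence-level BLEU
    or another metric would be used).
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement (paired: same
        # indices are used for both systems).
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# Toy example: system A is slightly better on most sentences.
a = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.40, 0.58]
b = [0.40, 0.50, 0.39, 0.57, 0.45, 0.49, 0.41, 0.55]
p_win = paired_bootstrap(a, b)
```

If `p_win` is close to 1.0, the score difference is unlikely to be an artifact of the particular test set sample; a value near 0.5 suggests the BLEU gap alone is not trustworthy evidence.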
Anthology ID:
2021.acl-long.566
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Editors:
Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
7297–7306
URL:
https://aclanthology.org/2021.acl-long.566
DOI:
10.18653/v1/2021.acl-long.566
Award:
Outstanding Paper
Cite (ACL):
Benjamin Marie, Atsushi Fujita, and Raphael Rubino. 2021. Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7297–7306, Online. Association for Computational Linguistics.
Cite (Informal):
Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers (Marie et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-long.566.pdf
Optional supplementary material:
2021.acl-long.566.OptionalSupplementaryMaterial.zip
Video:
https://aclanthology.org/2021.acl-long.566.mp4
Code:
benjamin-marie/meta_evaluation_mt