The limits of n-gram translation evaluation metrics

Christopher Culy, Susanne Z. Riehemann


Abstract
N-gram measures of translation quality, such as BLEU and the related NIST metric, are becoming increasingly important in machine translation, yet their behaviors are not fully understood. In this paper we examine the performance of these metrics on professional human translations into German of two literary genres, the Bible and Tom Sawyer. The most surprising result is that some machine translations outscore some professional human translations. In addition, it can be difficult to distinguish some other human translations from machine translations with only two reference translations; with four reference translations it is much easier. Our results lead us to conclude that much care must be taken in using n-gram measures in formal evaluations of machine translation quality, though they are still valuable as part of the iterative development cycle.
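The abstract's central variable, the number of reference translations, matters because BLEU clips each candidate n-gram count against the maximum count seen in *any* reference, so more references can only raise modified precision. A minimal sketch of this scoring scheme (standard BLEU per Papineni et al., 2002, not the authors' exact evaluation setup; tokenization by whitespace is a simplifying assumption):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with multiple references.

    Uses clipped (modified) n-gram precision for n = 1..max_n,
    a geometric mean, and a brevity penalty against the reference
    whose length is closest to the candidate's.
    """
    cand = candidate.split()
    refs = [r.split() for r in references]

    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0  # candidate too short to have any n-grams
        # Clip: each candidate n-gram is credited at most its
        # maximum count in any single reference.
        max_ref = Counter()
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_prec_sum += math.log(clipped / sum(cand_counts.values()))

    # Brevity penalty: compare against the closest reference length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```

Because adding a reference can only increase the clipped counts, a candidate's score is monotonically non-decreasing in the reference set, which is consistent with the paper's observation that four references discriminate human from machine output more reliably than two.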
Anthology ID:
2003.mtsummit-papers.10
Volume:
Proceedings of Machine Translation Summit IX: Papers
Month:
September 23-27
Year:
2003
Address:
New Orleans, USA
Venue:
MTSummit
URL:
https://aclanthology.org/2003.mtsummit-papers.10
Cite (ACL):
Christopher Culy and Susanne Z. Riehemann. 2003. The limits of n-gram translation evaluation metrics. In Proceedings of Machine Translation Summit IX: Papers, New Orleans, USA.
Cite (Informal):
The limits of n-gram translation evaluation metrics (Culy & Riehemann, MTSummit 2003)
PDF:
https://aclanthology.org/2003.mtsummit-papers.10.pdf