A Call for Clarity in Reporting BLEU Scores

Matt Post


Abstract
The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to “the” BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SACREBLEU, to facilitate this.
Anthology ID:
W18-6319
Volume:
Proceedings of the Third Conference on Machine Translation: Research Papers
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Note:
Pages:
186–191
Language:
URL:
https://aclanthology.org/W18-6319
DOI:
10.18653/v1/W18-6319
Bibkey:
Cite (ACL):
Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
A Call for Clarity in Reporting BLEU Scores (Post, WMT 2018)
Copy Citation:
PDF:
https://aclanthology.org/W18-6319.pdf
Poster:
 W18-6319.Poster.pdf