Agree to Disagree: Analysis of Inter-Annotator Disagreements in Human Evaluation of Machine Translation Output

Maja Popović


Abstract
This work describes an analysis of inter-annotator disagreements in human evaluation of machine translation output. The errors in the analysed texts were marked by multiple annotators under the guidance of different quality criteria: adequacy, comprehension, and an unspecified generic mixture of adequacy and fluency. Our results show that different criteria result in different disagreements, and indicate that a clear definition of the quality criterion can improve inter-annotator agreement. Furthermore, our results show that for certain linguistic phenomena which are not limited to one or two words (such as word ambiguity or gender) but span several words or even entire phrases (such as negation or relative clause), disagreements do not necessarily represent “errors” or “noise” but are rather inherent to the evaluation process. These disagreements are caused by differences in error perception and/or the fact that there is no single correct translation of a text so that multiple solutions are possible. On the other hand, for some other phenomena (such as omission or verb forms) agreement can be easily improved by providing more precise and detailed instructions to the evaluators.
Anthology ID:
2021.conll-1.18
Volume:
Proceedings of the 25th Conference on Computational Natural Language Learning
Month:
November
Year:
2021
Address:
Online
Venues:
CoNLL | EMNLP
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
234–243
URL:
https://aclanthology.org/2021.conll-1.18
PDF:
https://aclanthology.org/2021.conll-1.18.pdf