A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs

Maja Popović, Anya Belz


Abstract
In this paper we report our reproduction study of the Croatian part of an annotation-based human evaluation of machine-translated user reviews (Popović, 2020). The work was carried out as part of the ReproGen Shared Task on Reproducibility of Human Evaluation in NLG. Our aim was to repeat the original study exactly, except for using a different set of evaluators. We describe the experimental design, characterise the differences between the original and reproduction studies, and present the results from each study, along with an analysis of the similarity between them. For the six main evaluation results of Major/Minor/All Comprehension error rates and Major/Minor/All Adequacy error rates, we find that (i) 4/6 system rankings are the same in both studies, (ii) the relative differences between systems are replicated well for Major Comprehension and Adequacy (Pearson’s r > 0.9), but not for the corresponding Minor error rates (Pearson’s r of 0.36 for Adequacy, 0.67 for Comprehension), and (iii) the individual system scores for both types of Minor error rates had a higher degree of reproducibility than the corresponding Major error rates. We also examine inter-annotator agreement and compare the annotations obtained in the original and reproduction studies.
Anthology ID:
2021.inlg-1.31
Volume:
Proceedings of the 14th International Conference on Natural Language Generation
Month:
August
Year:
2021
Address:
Aberdeen, Scotland, UK
Editors:
Anya Belz, Angela Fan, Ehud Reiter, Yaji Sripada
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
293–300
URL:
https://aclanthology.org/2021.inlg-1.31
DOI:
10.18653/v1/2021.inlg-1.31
Cite (ACL):
Maja Popović and Anya Belz. 2021. A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs. In Proceedings of the 14th International Conference on Natural Language Generation, pages 293–300, Aberdeen, Scotland, UK. Association for Computational Linguistics.
Cite (Informal):
A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs (Popović & Belz, INLG 2021)
PDF:
https://aclanthology.org/2021.inlg-1.31.pdf