On reporting scores and agreement for error annotation tasks

Maja Popović; Anya Belz

doi:10.18653/v1/2022.gem-1.26

Abstract

This work examines different ways of aggregating scores for error annotation in MT outputs: raw error counts, error counts normalised over the total number of words (‘word percentage’), and error counts normalised over the total number of errors (‘error percentage’). We use each of these three scores to calculate inter-annotator agreement in the form of Krippendorff’s alpha and Pearson’s r, and compare the obtained numbers, both overall and separately for different types of errors. While each score has its advantages depending on the goal of the evaluation, we argue that the best way of estimating inter-annotator agreement from such numbers is to use raw counts. If the annotation process ensures that the total number of words cannot differ among annotators (for example, because omission symbols are added), normalising over the number of words will lead to the same conclusions. In contrast, the total number of errors is very subjective, because different annotators often perceive different numbers of errors in the same text, so normalising over this number can indicate lower agreement.
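The three scores named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say whether counts are aggregated per segment, per text, or per corpus, so this sketch assumes one annotator's counts for a single text. The function names `aggregate_scores` and `pearson_r` and the error-type labels are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
from math import sqrt


def aggregate_scores(error_counts, total_words):
    """Return (raw, word_pct, error_pct) for one annotator on one text.

    error_counts: dict mapping a (hypothetical) error type to its raw count.
    total_words:  number of words in the annotated text (assumed > 0).
    """
    total_errors = sum(error_counts.values())
    # Raw counts are reported unchanged.
    raw = dict(error_counts)
    # 'Word percentage': counts normalised over the total number of words.
    word_pct = {t: 100.0 * c / total_words for t, c in error_counts.items()}
    # 'Error percentage': counts normalised over this annotator's own total
    # number of errors, which can differ from annotator to annotator.
    error_pct = {
        t: (100.0 * c / total_errors if total_errors else 0.0)
        for t, c in error_counts.items()
    }
    return raw, word_pct, error_pct


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative counts only (not data from the paper).
raw, word_pct, error_pct = aggregate_scores({"lexical": 3, "omission": 1}, 40)
```

In this sketch the word-percentage denominator is the same for every annotator of a given text, whereas the error-percentage denominator depends on how many errors each annotator marked. That difference is what the abstract identifies as the reason the two normalisations can lead to different agreement figures.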

Anthology ID: 2022.gem-1.26
Volume: Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates (Hybrid)
Editors: Antoine Bosselut, Khyathi Chandu, Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Yacine Jernite, Jekaterina Novikova, Laura Perez-Beltrachini
Venue: GEM
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 306–315
URL: https://aclanthology.org/2022.gem-1.26/
DOI: 10.18653/v1/2022.gem-1.26
Cite (ACL): Maja Popović and Anya Belz. 2022. On reporting scores and agreement for error annotation tasks. In Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 306–315, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal): On reporting scores and agreement for error annotation tasks (Popović & Belz, GEM 2022)
PDF: https://aclanthology.org/2022.gem-1.26.pdf
