Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System

Rudali Huidrom; Ondřej Dušek; Zdeněk Kasner; Thiago Castro Ferreira; Anya Belz

Abstract

In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are available and open-sourced.

Anthology ID: 2022.inlg-genchal.9
Volume: Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Month: July
Year: 2022
Address: Waterville, Maine, USA and virtual meeting
Editors: Samira Shaikh, Thiago Ferreira, Amanda Stent
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 52–61
URL: https://aclanthology.org/2022.inlg-genchal.9/
Cite (ACL): Rudali Huidrom, Ondřej Dušek, Zdeněk Kasner, Thiago Castro Ferreira, and Anya Belz. 2022. Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System. In Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges, pages 52–61, Waterville, Maine, USA and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System (Huidrom et al., INLG 2022)
PDF: https://aclanthology.org/2022.inlg-genchal.9.pdf
