The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter
Abstract
The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.
Anthology ID: 2021.inlg-1.24
Volume: Proceedings of the 14th International Conference on Natural Language Generation
Month: August
Year: 2021
Address: Aberdeen, Scotland, UK
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 249–258
URL: https://aclanthology.org/2021.inlg-1.24
PDF: https://aclanthology.org/2021.inlg-1.24.pdf