The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter
Abstract
The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.
Anthology ID: 2021.inlg-1.24
Volume: Proceedings of the 14th International Conference on Natural Language Generation
Month: August
Year: 2021
Address: Aberdeen, Scotland, UK
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 249–258
URL: https://aclanthology.org/2021.inlg-1.24
PDF: https://aclanthology.org/2021.inlg-1.24.pdf