(Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems

Craig Thomson, Anya Belz


Abstract
Human evaluation is widely considered the most reliable form of evaluation in NLP, but recent research has shown it to be riddled with mistakes, often as a result of manual execution of tasks. This paper argues that such mistakes could be avoided if we were to automate, as much as is practical, the process of performing experiments for human evaluation of NLP systems. We provide a simple methodology that can improve both the transparency and reproducibility of experiments. We show how the sequence of component processes of a human evaluation can be defined in advance, facilitating full or partial automation, detailed preregistration of the process, and research transparency and repeatability.
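The abstract's central idea is that the component processes of a human evaluation can be declared as an ordered sequence before any data is collected, so that each step can be automated where practical, logged for transparency, and exported for preregistration. The sketch below is purely illustrative of that idea: the Step and ExperimentPipeline classes, the step names, and all function signatures are assumptions made for this example, not an API or tool described in the paper.

```python
# Illustrative sketch only: Step, ExperimentPipeline and the step names are
# hypothetical, not the paper's implementation. The point shown is that the
# component processes of a human evaluation can be fixed in advance as an
# ordered pipeline, supporting partial automation, execution logging, and
# preregistration of the full procedure before it is run.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str                      # e.g. "sample_system_outputs"
    run: Callable[[Dict], Dict]    # transforms the shared experiment state
    automated: bool = True         # False marks steps needing manual work


@dataclass
class ExperimentPipeline:
    steps: List[Step] = field(default_factory=list)

    def preregistration(self) -> List[str]:
        """Return the fixed, ordered step list for a preregistration document."""
        return [f"{i + 1}. {s.name} ({'automatic' if s.automated else 'manual'})"
                for i, s in enumerate(self.steps)]

    def execute(self, state: Dict) -> Dict:
        """Run each step in the pre-declared order, recording what was executed."""
        for step in self.steps:
            state = step.run(state)
            state.setdefault("log", []).append(step.name)
        return state


# Toy usage: the design is declared (and can be preregistered) before execution.
pipeline = ExperimentPipeline(steps=[
    Step("sample_system_outputs", lambda s: {**s, "items": s["outputs"][:10]}),
    Step("build_rating_forms", lambda s: {**s, "forms": len(s["items"])}),
    Step("collect_human_ratings", lambda s: s, automated=False),
    Step("aggregate_and_analyse", lambda s: s),
])

print("\n".join(pipeline.preregistration()))
result = pipeline.execute({"outputs": [f"output {i}" for i in range(25)]})
print(result["log"])
```

Declaring the sequence up front is what makes the manual steps (here, collecting the ratings themselves) explicit and auditable alongside the automated ones.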
Anthology ID: 2024.inlg-main.22
Volume: Proceedings of the 17th International Natural Language Generation Conference
Month: September
Year: 2024
Address: Tokyo, Japan
Editors: Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 272–279
URL: https://aclanthology.org/2024.inlg-main.22
Cite (ACL): Craig Thomson and Anya Belz. 2024. (Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems. In Proceedings of the 17th International Natural Language Generation Conference, pages 272–279, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal): (Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems (Thomson & Belz, INLG 2024)
PDF: https://aclanthology.org/2024.inlg-main.22.pdf