@inproceedings{mondella-etal-2024-reprohum,
title = "{R}epro{H}um {\#}0892-01: The painful route to consistent results: A reproduction study of human evaluation in {NLG}",
author = "Mondella, Irene and
Lai, Huiyuan and
Nissim, Malvina",
editor = "Balloccu, Simone and
Belz, Anya and
Huidrom, Rudali and
Reiter, Ehud and
Sedoc, Joao and
Thomson, Craig",
booktitle = "Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.humeval-1.24",
pages = "261--268",
    abstract = "In spite of the core role human judgement plays in evaluating the performance of NLP systems, the way human assessments are elicited in NLP experiments, and to some extent the nature of human judgement itself, pose challenges to the reliability and validity of human evaluation. In the context of the larger ReproHum project, aimed at running large-scale multi-lab reproductions of human judgement, we replicated the understandability assessment by humans on several generated outputs of simplified text described in the paper {``}Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table{''} by Shardlow and Nawaz, which appeared in the Proceedings of ACL 2019. Although we had to implement a series of modifications compared to the original study, which were necessary to run our human evaluation on exactly the same data, we managed to collect assessments and compare results with the original study. We obtained results consistent with those of the reference study, confirming their findings. The paper is complete with as much information as possible to foster and facilitate future reproduction.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="mondella-etal-2024-reprohum">
<titleInfo>
<title>ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG</title>
</titleInfo>
<name type="personal">
<namePart type="given">Irene</namePart>
<namePart type="family">Mondella</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Huiyuan</namePart>
<namePart type="family">Lai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Malvina</namePart>
<namePart type="family">Nissim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-05</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024</title>
</titleInfo>
<name type="personal">
<namePart type="given">Simone</namePart>
<namePart type="family">Balloccu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Anya</namePart>
<namePart type="family">Belz</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rudali</namePart>
<namePart type="family">Huidrom</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ehud</namePart>
<namePart type="family">Reiter</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joao</namePart>
<namePart type="family">Sedoc</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Craig</namePart>
<namePart type="family">Thomson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>ELRA and ICCL</publisher>
<place>
<placeTerm type="text">Torino, Italia</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>In spite of the core role human judgement plays in evaluating the performance of NLP systems, the way human assessments are elicited in NLP experiments, and to some extent the nature of human judgement itself, pose challenges to the reliability and validity of human evaluation. In the context of the larger ReproHum project, aimed at running large-scale multi-lab reproductions of human judgement, we replicated the understandability assessment by humans on several generated outputs of simplified text described in the paper “Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table” by Shardlow and Nawaz, which appeared in the Proceedings of ACL 2019. Although we had to implement a series of modifications compared to the original study, which were necessary to run our human evaluation on exactly the same data, we managed to collect assessments and compare results with the original study. We obtained results consistent with those of the reference study, confirming their findings. The paper is complete with as much information as possible to foster and facilitate future reproduction.</abstract>
<identifier type="citekey">mondella-etal-2024-reprohum</identifier>
<location>
<url>https://aclanthology.org/2024.humeval-1.24</url>
</location>
<part>
<date>2024-05</date>
<extent unit="page">
<start>261</start>
<end>268</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG
%A Mondella, Irene
%A Lai, Huiyuan
%A Nissim, Malvina
%Y Balloccu, Simone
%Y Belz, Anya
%Y Huidrom, Rudali
%Y Reiter, Ehud
%Y Sedoc, Joao
%Y Thomson, Craig
%S Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
%D 2024
%8 May
%I ELRA and ICCL
%C Torino, Italia
%F mondella-etal-2024-reprohum
%X In spite of the core role human judgement plays in evaluating the performance of NLP systems, the way human assessments are elicited in NLP experiments, and to some extent the nature of human judgement itself, pose challenges to the reliability and validity of human evaluation. In the context of the larger ReproHum project, aimed at running large-scale multi-lab reproductions of human judgement, we replicated the understandability assessment by humans on several generated outputs of simplified text described in the paper “Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table” by Shardlow and Nawaz, which appeared in the Proceedings of ACL 2019. Although we had to implement a series of modifications compared to the original study, which were necessary to run our human evaluation on exactly the same data, we managed to collect assessments and compare results with the original study. We obtained results consistent with those of the reference study, confirming their findings. The paper is complete with as much information as possible to foster and facilitate future reproduction.
%U https://aclanthology.org/2024.humeval-1.24
%P 261-268
Markdown (Informal)
[ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG](https://aclanthology.org/2024.humeval-1.24) (Mondella et al., HumEval-WS 2024)
ACL
Irene Mondella, Huiyuan Lai, and Malvina Nissim. 2024. [ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG](https://aclanthology.org/2024.humeval-1.24). In *Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024*, pages 261–268, Torino, Italia. ELRA and ICCL.