ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text

Tanvi Dinkar, Gavin Abercrombie, Verena Rieser


Abstract
ReproHum is a large multi-institution project designed to examine the reproducibility of human evaluations of natural language processing. As part of the second phase of the project, we attempt to reproduce an evaluation of the fluency of continuations generated by a pre-trained language model compared to a range of baselines. Working within the constraints of the project, with limited information about the original study and without access to its participant pool or the responses of individual participants, we find that we are not able to reproduce the original results. Our participants display a greater tendency to prefer one of the system responses, avoiding a judgement of ‘equal fluency’ more often than in the original study. We also conduct further evaluations, eliciting ratings (1) from a broader range of participants; (2) from the same participants at different times; and (3) with an altered definition of fluency. The results of these experiments suggest that the original evaluation collected too few ratings and that the task formulation may be quite ambiguous. Overall, although we were able to conduct a re-evaluation study, we conclude that the original evaluation was not comprehensive enough to make truly meaningful comparisons.
Anthology ID:
2024.humeval-1.14
Volume:
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson
Venues:
HumEval | WS
Publisher:
ELRA and ICCL
Pages:
145–152
URL:
https://aclanthology.org/2024.humeval-1.14
Cite (ACL):
Tanvi Dinkar, Gavin Abercrombie, and Verena Rieser. 2024. ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text. In Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024, pages 145–152, Torino, Italia. ELRA and ICCL.
Cite (Informal):
ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text (Dinkar et al., HumEval-WS 2024)
PDF:
https://aclanthology.org/2024.humeval-1.14.pdf