Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead

Neslihan Iskender, Tim Polzehl, Sebastian Möller


Abstract
Only a small portion of research papers with human evaluation for text summarization provide information about the participant demographics, task design, and experiment protocol. Additionally, many researchers use human evaluation as a gold standard without questioning its reliability or investigating the factors that might affect it. As a result, there is a lack of best practices for reliable human summarization evaluation grounded in empirical evidence. To investigate human evaluation reliability, we conduct a series of human evaluation experiments, provide an overview of participant demographics, task design, and experimental set-up, and compare the results from different experiments. Based on our empirical analysis, we provide guidelines to ensure the reliability of expert and non-expert evaluations, and we determine the factors that might affect the reliability of human evaluation.
Anthology ID:
2021.humeval-1.10
Volume:
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Month:
April
Year:
2021
Address:
Online
Editors:
Anya Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, Anastasia Shimorina
Venue:
HumEval
Publisher:
Association for Computational Linguistics
Pages:
86–96
URL:
https://aclanthology.org/2021.humeval-1.10
Cite (ACL):
Neslihan Iskender, Tim Polzehl, and Sebastian Möller. 2021. Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 86–96, Online. Association for Computational Linguistics.
Cite (Informal):
Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead (Iskender et al., HumEval 2021)
PDF:
https://aclanthology.org/2021.humeval-1.10.pdf
Code:
nesliskender/reliability_humeval_summarization