What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think

David M. Howcroft, Verena Rieser


Abstract
Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that two common factors make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate in high-variance settings, while the differences they hope to detect are often subtle. We demonstrate through simulation that ordinal mixed-effects models are better able to detect small differences between models, especially in the high-variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analyses and test their assumptions, and we make recommendations for improving statistical power.
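
To make the abstract's contrast concrete, below is a minimal simulation sketch in Python. It is not the authors' released tooling: the function names, effect size, rating-scale cutpoints, and sample sizes are all illustrative, and a plain ordered-probit regression (statsmodels' OrderedModel) stands in for the paper's ordinal mixed-effects models, which would typically be fit in R (e.g., with ordinal::clmm) to include annotator and item random effects. The sketch estimates statistical power by repeatedly generating 5-point ratings for two systems from a latent Gaussian model and counting how often (a) a t-test on the raw ratings and (b) an ordinal model detect the difference.

    # Hypothetical power simulation: ordinal-as-interval t-test vs. an
    # ordinal (ordered-probit) regression. All settings are illustrative.
    import numpy as np
    from scipy import stats
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    rng = np.random.default_rng(0)

    def simulate_ratings(n_items, effect, sd, cutpoints):
        # Each system's latent quality is Gaussian; thresholding the latent
        # scores at fixed cutpoints yields ordinal labels 0..len(cutpoints).
        latent_a = rng.normal(0.0, sd, n_items)
        latent_b = rng.normal(effect, sd, n_items)
        return np.digitize(latent_a, cutpoints), np.digitize(latent_b, cutpoints)

    def estimate_power(n_sims=200, n_items=100, effect=0.3, sd=1.5, alpha=0.05):
        cutpoints = np.array([-1.5, -0.5, 0.5, 1.5])  # a 5-point scale
        hits_ttest = hits_ordinal = 0
        for _ in range(n_sims):
            a, b = simulate_ratings(n_items, effect, sd, cutpoints)
            # (a) Treat ordinal ratings as interval data: two-sample t-test.
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits_ttest += 1
            # (b) Respect the ordinal scale: ordered-probit regression with
            # a system indicator as the only predictor (no random effects).
            y = np.concatenate([a, b])
            x = np.concatenate([np.zeros(n_items), np.ones(n_items)])[:, None]
            res = OrderedModel(y, x, distr="probit").fit(method="bfgs", disp=False)
            if res.pvalues[0] < alpha:  # p-value of the system coefficient
                hits_ordinal += 1
        return hits_ttest / n_sims, hits_ordinal / n_sims

    if __name__ == "__main__":
        p_t, p_ord = estimate_power()
        print(f"power, t-test on raw ratings: {p_t:.2f}")
        print(f"power, ordered-probit model:  {p_ord:.2f}")

Under this data-generating process the ordinal model is the better-specified analysis: it estimates the cutpoints rather than assuming equal spacing between scale points, which is the source of the power gain the abstract describes. The actual numbers will vary with the assumed effect size and variance.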
Anthology ID:
2021.emnlp-main.703
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8932–8939
URL:
https://aclanthology.org/2021.emnlp-main.703
DOI:
10.18653/v1/2021.emnlp-main.703
Bibkey:
Cite (ACL):
David M. Howcroft and Verena Rieser. 2021. What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8932–8939, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think (Howcroft & Rieser, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.703.pdf
Video:
https://aclanthology.org/2021.emnlp-main.703.mp4