Item Response Theory for Efficient Human Evaluation of Chatbots

João Sedoc, Lyle Ungar


Abstract
Conversational agent quality is currently assessed using human evaluation, which often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired-comparison setup in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing to simultaneously assess the ability of test takers and the quality of test questions. It is similarly well suited to chatbot evaluation, since it allows assessment of both the models and the prompts used to evaluate them. We use IRT to assess chatbots efficiently, and show that different examples from the evaluation set are better suited for comparing high-quality systems (those nearer to human performance) than low-quality ones. Finally, we use IRT to reduce the number of evaluation examples human annotators must assess while retaining discriminative power.
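To make the core idea concrete, here is a minimal toy sketch (my own illustration, not the authors' implementation, and simplified to a one-parameter Rasch model): each chatbot i gets an ability theta_i, each evaluation prompt j a difficulty b_j, and the probability that system i's response on prompt j is judged favorably is sigmoid(theta_i - b_j), fit by gradient ascent on the Bernoulli log-likelihood.

```python
# Toy 1PL (Rasch) IRT fit for chatbot evaluation data.
# Assumed data shape: (system_index, item_index, y) triples, y = 1 if the
# system's response was judged better on that prompt.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(outcomes, n_systems, n_items, lr=0.5, epochs=300):
    """Estimate system abilities theta and prompt difficulties b."""
    theta = [0.0] * n_systems      # system abilities
    b = [0.0] * n_items            # item (prompt) difficulties
    # observation counts per parameter, used to take mean gradients
    n_sys = [0] * n_systems
    n_itm = [0] * n_items
    for i, j, _ in outcomes:
        n_sys[i] += 1
        n_itm[j] += 1
    for _ in range(epochs):
        g_theta = [0.0] * n_systems
        g_b = [0.0] * n_items
        for i, j, y in outcomes:
            resid = y - sigmoid(theta[i] - b[j])
            g_theta[i] += resid    # d logL / d theta_i
            g_b[j] -= resid        # d logL / d b_j
        theta = [t + lr * g / max(1, n)
                 for t, g, n in zip(theta, g_theta, n_sys)]
        b = [d + lr * g / max(1, n)
             for d, g, n in zip(b, g_b, n_itm)]
        # the model is identified only up to a shift, so anchor the scale
        # by keeping the mean difficulty at zero
        shift = sum(b) / n_items
        b = [d - shift for d in b]
        theta = [t - shift for t in theta]
    return theta, b
```

Simulating judgments from systems of known ability recovers their ordering, and the fitted difficulties indicate which prompts best separate strong from weak systems, which is the property the paper exploits to shrink the set of examples annotators must judge.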
Anthology ID:
2020.eval4nlp-1.3
Volume:
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Month:
November
Year:
2020
Address:
Online
Editors:
Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, Eduard Hovy
Venue:
Eval4NLP
Publisher:
Association for Computational Linguistics
Pages:
21–33
URL:
https://aclanthology.org/2020.eval4nlp-1.3
DOI:
10.18653/v1/2020.eval4nlp-1.3
Cite (ACL):
João Sedoc and Lyle Ungar. 2020. Item Response Theory for Efficient Human Evaluation of Chatbots. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 21–33, Online. Association for Computational Linguistics.
Cite (Informal):
Item Response Theory for Efficient Human Evaluation of Chatbots (Sedoc & Ungar, Eval4NLP 2020)
PDF:
https://aclanthology.org/2020.eval4nlp-1.3.pdf
Optional supplementary material:
2020.eval4nlp-1.3.OptionalSupplementaryMaterial.pdf
Video:
https://slideslive.com/38939718