Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, Kewei Tu


Abstract
Open-domain dialogue generation has gained increasing attention in Natural Language Processing, and its evaluation requires a holistic approach. Human ratings are regarded as the gold standard, but because human evaluation is inefficient and costly, an automated substitute is highly desirable. In this paper, we propose holistic evaluation metrics that capture different aspects of open-domain dialogues. Our metrics consist of (1) GPT-2 based context coherence between sentences in a dialogue, (2) GPT-2 based fluency in phrasing, (3) n-gram based diversity in responses to augmented queries, and (4) textual-entailment-inference based logical self-consistency. The empirical validity of our metrics is demonstrated by strong correlations with human judgments. We open-source the code and relevant materials.
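As a rough illustration of the first two metrics, here is a minimal sketch, not the authors' exact formulation, assuming GPT-2 is accessed through the HuggingFace transformers package and PyTorch: context coherence is approximated by the average GPT-2 log-likelihood of a response conditioned on its dialogue context, and fluency by the average log-likelihood of the response on its own.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_log_likelihood(text: str, prefix: str = "") -> float:
    """Average GPT-2 log-likelihood per token of `text`, optionally conditioned on `prefix`."""
    prefix_ids = tokenizer(prefix).input_ids if prefix else []
    text_ids = tokenizer(text).input_ids
    ids = torch.tensor([prefix_ids + text_ids])
    labels = ids.clone()
    labels[0, :len(prefix_ids)] = -100  # exclude prefix tokens from the loss
    with torch.no_grad():
        loss = model(ids, labels=labels).loss  # mean cross-entropy over the `text` tokens
    return -loss.item()  # higher is better

# Coherence: likelihood of the response given the dialogue context.
coherence = avg_log_likelihood("I love hiking on weekends.", prefix="What do you do for fun? ")
# Fluency: likelihood of the response on its own.
fluency = avg_log_likelihood("I love hiking on weekends.")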
Anthology ID: 2020.acl-main.333
Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2020
Address: Online
Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 3619–3629
URL: https://aclanthology.org/2020.acl-main.333
DOI: 10.18653/v1/2020.acl-main.333
Cite (ACL): Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629, Online. Association for Computational Linguistics.
Cite (Informal): Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation (Pang et al., ACL 2020)
PDF: https://aclanthology.org/2020.acl-main.333.pdf
Video: http://slideslive.com/38929412
Code: alexzhou907/dialogue_evaluation
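The linked repository contains the authors' implementation. For the n-gram based diversity metric, a minimal distinct-n sketch is shown below; it is illustrative only, and the whitespace tokenization and function name are assumptions rather than code taken from the repository.

def distinct_n(responses, n=2):
    """Ratio of unique n-grams to total n-grams across a set of responses."""
    ngrams = []
    for response in responses:
        tokens = response.split()  # assumption: simple whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# Example: diversity of responses generated for augmented variants of the same query.
print(distinct_n(["I like tea.", "I like coffee.", "Coffee, please."], n=2))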