A Comprehensive Assessment of Dialog Evaluation Metrics

Yi-Ting Yeh, Maxine Eskenazi, Shikib Mehri


Abstract
Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 23 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.
Anthology ID:
2021.eancs-1.3
Volume:
The First Workshop on Evaluations and Assessments of Neural Conversation Systems
Month:
November
Year:
2021
Address:
Online
Venues:
EANCS | EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15–33
URL:
https://aclanthology.org/2021.eancs-1.3
DOI:
10.18653/v1/2021.eancs-1.3
Cite (ACL):
Yi-Ting Yeh, Maxine Eskenazi, and Shikib Mehri. 2021. A Comprehensive Assessment of Dialog Evaluation Metrics. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 15–33, Online. Association for Computational Linguistics.
Cite (Informal):
A Comprehensive Assessment of Dialog Evaluation Metrics (Yeh et al., EANCS 2021)
PDF:
https://aclanthology.org/2021.eancs-1.3.pdf
Code
exe1023/DialEvalMetrics
Data
BookCorpus | DailyDialog | DailyDialog++ | FED | USR-PersonaChat | USR-TopicalChat