Unsupervised Evaluation of Interactive Dialog with DialoGPT

Shikib Mehri, Maxine Eskenazi


Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision. It also introduces the FED dataset, which was constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data, and (3) measures fine-grained dialog qualities at both the turn and whole-dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
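The following is a minimal, unofficial sketch of the FED idea, assuming the HuggingFace transformers library and the microsoft/DialoGPT-large checkpoint. The "interesting" quality and the follow-up utterances shown here are illustrative stand-ins: the paper uses curated positive and negative follow-ups for each of its eighteen qualities, and scores a response by how likely DialoGPT considers those follow-ups given the dialog context.

# A minimal sketch of FED-style scoring (not the authors' exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()

def follow_up_loglikelihood(context: str, follow_up: str) -> float:
    """Average log-likelihood DialoGPT assigns to `follow_up` given `context`."""
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    full_ids = tokenizer.encode(
        context + tokenizer.eos_token + follow_up + tokenizer.eos_token,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift by one: the logit at position i predicts the token at position i+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Score only the follow-up tokens, not the context.
    start = ctx_ids.shape[1] - 1
    token_lls = log_probs[torch.arange(start, targets.shape[0]), targets[start:]]
    return token_lls.mean().item()

# Hypothetical follow-ups probing the "interesting" quality: a context that
# makes the positive follow-up more likely than the negative one scores higher.
context = "Did you know that octopuses have three hearts?"
positive = "Wow, that is really interesting!"
negative = "That's really boring."
score = follow_up_loglikelihood(context, positive) - follow_up_loglikelihood(context, negative)
print(f"FED-style 'interesting' score: {score:.3f}")

In the paper, such likelihood comparisons are aggregated over several positive and negative follow-ups per quality, at both the turn and whole-dialog levels; the sketch above probes a single quality for a single turn.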
Anthology ID:
2020.sigdial-1.28
Volume:
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
July
Year:
2020
Address:
1st virtual meeting
Editors:
Olivier Pietquin, Smaranda Muresan, Vivian Chen, Casey Kennington, David Vandyke, Nina Dethlefs, Koji Inoue, Erik Ekstedt, Stefan Ultes
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
225–235
URL:
https://aclanthology.org/2020.sigdial-1.28
DOI:
10.18653/v1/2020.sigdial-1.28
Cite (ACL):
Shikib Mehri and Maxine Eskenazi. 2020. Unsupervised Evaluation of Interactive Dialog with DialoGPT. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–235, 1st virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Evaluation of Interactive Dialog with DialoGPT (Mehri & Eskenazi, SIGDIAL 2020)
PDF:
https://aclanthology.org/2020.sigdial-1.28.pdf
Video:
https://youtube.com/watch?v=lZVNe7XMQ8M
Code:
shikib/fed + additional community code
Data:
FED