Unsupervised Evaluation of Interactive Dialog with DialoGPT

Shikib Mehri, Maxine Eskenazi


Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision. It also introduces the FED dataset, which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data, and (3) measures fine-grained dialog qualities at both the turn and whole-dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
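To make the mechanism concrete, here is a minimal sketch of FED's core idea: score a dialog context by how likely DialoGPT finds hand-written positive versus negative follow-up utterances, with no fine-tuning. The follow-up strings and the function names (followup_log_likelihood, fed_quality_score) below are illustrative assumptions, not the exact utterances or code from the paper; it assumes PyTorch and HuggingFace Transformers with the public microsoft/DialoGPT-large checkpoint.

```python
# Sketch of FED-style scoring (illustrative, not the authors' code).
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()

def followup_log_likelihood(context: str, followup: str) -> float:
    """Mean per-token log-likelihood DialoGPT assigns to `followup`
    appended to `context`. DialoGPT separates turns with its
    end-of-text token."""
    ctx = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    fu = tokenizer.encode(followup + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([ctx, fu], dim=-1)
    labels = input_ids.clone()
    labels[:, : ctx.shape[-1]] = -100  # ignore context tokens in the loss
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return -out.loss.item()  # loss = mean NLL of the follow-up tokens

def fed_quality_score(context, positive_followups, negative_followups):
    """Higher when DialoGPT prefers the positive reactions to the
    negative ones for this context (follow-up sets are hypothetical)."""
    pos = sum(followup_log_likelihood(context, u) for u in positive_followups)
    neg = sum(followup_log_likelihood(context, u) for u in negative_followups)
    return pos - neg

# Turn-level "interesting" score for a two-turn context (made-up strings):
context = "Hi, how are you?" + tokenizer.eos_token + "I just got back from a trek in Nepal!"
score = fed_quality_score(
    context,
    positive_followups=["Wow, that is really interesting!"],
    negative_followups=["That's really boring."],
)
print(f"interesting-ness score: {score:.3f}")
```

Because the score is a difference of log-likelihoods under a frozen model, this style of metric needs neither a ground-truth response nor training data, which matches the three properties claimed in the abstract; the actual per-quality follow-up sets used by FED are given in the paper.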
Anthology ID:
2020.sigdial-1.28
Volume:
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
July
Year:
2020
Address:
1st virtual meeting
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
225–235
URL:
https://aclanthology.org/2020.sigdial-1.28
PDF:
https://aclanthology.org/2020.sigdial-1.28.pdf
Video:
https://youtube.com/watch?v=lZVNe7XMQ8M