Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

Jan Milan Deriu, Mark Cieliebak


Abstract
We present “AutoJudge”, an automated evaluation method for conversational dialogue systems. The method first generates dialogues through self-talk, i.e., a dialogue system talking to itself. It then uses human ratings of these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but note that this remains an open question for further research.
Anthology ID:
W19-8654
Volume:
Proceedings of the 12th International Conference on Natural Language Generation
Month:
October–November
Year:
2019
Address:
Tokyo, Japan
Editors:
Kees van Deemter, Chenghua Lin, Hiroya Takamura
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
432–437
URL:
https://aclanthology.org/W19-8654
DOI:
10.18653/v1/W19-8654
Cite (ACL):
Jan Milan Deriu and Mark Cieliebak. 2019. Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement. In Proceedings of the 12th International Conference on Natural Language Generation, pages 432–437, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement (Deriu & Cieliebak, INLG 2019)
PDF:
https://aclanthology.org/W19-8654.pdf
Supplementary attachment:
 W19-8654.Supplementary_Attachment.pdf