Dialogue Evaluation with Offline Reinforcement Learning

Nurul Lubis, Christian Geishauser, Hsien-chin Lin, Carel van Niekerk, Michael Heck, Shutong Feng, Milica Gasic


Abstract
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions. Ideally they are evaluated with human users, but this is not feasible at every iteration of the development phase. Simulated users could be an alternative, but their development is nontrivial. Researchers therefore resort to offline metrics on existing human-human corpora, which are more practical and easily reproducible. Unfortunately, these metrics are limited in how well they reflect the real performance of dialogue systems. BLEU, for instance, is poorly correlated with human judgement, and existing corpus-based metrics such as success rate overlook dialogue context mismatches. There is still a need for a reliable metric for task-oriented systems that generalizes well and correlates strongly with human judgements. In this paper, we propose the use of offline reinforcement learning for dialogue evaluation based on static data. Such an evaluator is typically called a critic and is used for policy optimization. We go one step further and show that offline RL critics can be trained for any dialogue system as external evaluators, allowing dialogue performance to be compared across various types of systems. This approach has the benefit of being corpus- and model-independent while attaining strong correlation with human judgements, which we confirm via an interactive user trial.
Anthology ID:
2022.sigdial-1.46
Volume:
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
September
Year:
2022
Address:
Edinburgh, UK
Editors:
Oliver Lemon, Dilek Hakkani-Tur, Junyi Jessy Li, Arash Ashrafzadeh, Daniel Hernández Garcia, Malihe Alikhani, David Vandyke, Ondřej Dušek
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
478–489
URL:
https://aclanthology.org/2022.sigdial-1.46
DOI:
10.18653/v1/2022.sigdial-1.46
Cite (ACL):
Nurul Lubis, Christian Geishauser, Hsien-chin Lin, Carel van Niekerk, Michael Heck, Shutong Feng, and Milica Gasic. 2022. Dialogue Evaluation with Offline Reinforcement Learning. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 478–489, Edinburgh, UK. Association for Computational Linguistics.
Cite (Informal):
Dialogue Evaluation with Offline Reinforcement Learning (Lubis et al., SIGDIAL 2022)
PDF:
https://aclanthology.org/2022.sigdial-1.46.pdf
Video:
https://youtu.be/PMrVVI4g5mM