USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Shikib Mehri, Maxine Eskenazi


Abstract
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog, without requiring a reference response. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
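The correlations quoted in the abstract are between automatic metric scores and human quality judgments, computed at two granularities: per response (turn level) and per dialog system after averaging (system level). Below is a minimal sketch of how such correlations are typically computed; the data is illustrative toy data, not from the paper, and `metric_score` stands in for any automatic score such as a USR-style quality estimate.

```python
# Sketch: turn-level vs. system-level Spearman correlation with human ratings.
# Toy data for illustration only; not taken from the USR paper.
import numpy as np
from scipy.stats import spearmanr

# Each row: (system_id, metric_score, human_rating).
rows = [
    ("sys_a", 0.91, 4.5), ("sys_a", 0.40, 2.0), ("sys_a", 0.75, 4.0),
    ("sys_b", 0.35, 1.5), ("sys_b", 0.55, 3.0), ("sys_b", 0.20, 1.0),
    ("sys_c", 0.80, 4.0), ("sys_c", 0.60, 3.5), ("sys_c", 0.70, 3.0),
]
systems = np.array([r[0] for r in rows])
metric = np.array([r[1] for r in rows])
human = np.array([r[2] for r in rows])

# Turn level: correlate scores across individual responses.
turn_rho, _ = spearmanr(metric, human)

# System level: average both scores per system, then correlate the means.
sys_ids = sorted(set(systems))
metric_means = [metric[systems == s].mean() for s in sys_ids]
human_means = [human[systems == s].mean() for s in sys_ids]
system_rho, _ = spearmanr(metric_means, human_means)

print(f"turn-level Spearman:   {turn_rho:.2f}")
print(f"system-level Spearman: {system_rho:.2f}")
```

With only a handful of systems, a system-level rank correlation of 1.0 (as reported for USR) simply means the metric orders the systems exactly as humans do.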
Anthology ID:
2020.acl-main.64
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
681–707
URL:
https://aclanthology.org/2020.acl-main.64
DOI:
10.18653/v1/2020.acl-main.64
PDF:
https://aclanthology.org/2020.acl-main.64.pdf
Video:
http://slideslive.com/38928829
Code:
shikib/usr
Data:
ConvAI2, PERSONA-CHAT, Topical-Chat