Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers

Kallirroi Georgila, Carla Gordon, Volodymyr Yanov, David Traum


Abstract
We collected a corpus of dialogues in a Wizard of Oz (WOz) setting in the Internet of Things (IoT) domain. We asked the users participating in these dialogues to rate the system on a number of aspects, namely, intelligence, naturalness, personality, friendliness, their enjoyment, overall quality, and whether they would recommend the system to others. We then asked dialogue observers, i.e., Amazon Mechanical Turkers (MTurkers), to rate these dialogues on the same aspects. We also generated simulated dialogues between dialogue policies and simulated users and asked MTurkers to rate them, again on the same aspects. Using linear regression, we developed dialogue evaluation functions trained on three combinations of features and ratings: features from the simulated dialogues with the MTurkers’ ratings, features from the WOz dialogues with the MTurkers’ ratings, and features from the WOz dialogues with the WOz participants’ ratings. We applied all of these dialogue evaluation functions to a held-out portion of our WOz dialogues, and we report results on the predictive power of these different types of dialogue evaluation functions. Our results suggest that for three conversational aspects (intelligence, naturalness, overall quality), training evaluation functions on simulated data alone could be sufficient.
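The regression pipeline the abstract describes is standard, even though the specific dialogue features are not listed here: extract per-dialogue features, fit ordinary least squares against one rated aspect at a time, and score held-out dialogues by how well predicted ratings track observed ones. The Python sketch below shows this general shape under stated assumptions; the feature names and the synthetic data are hypothetical placeholders, not the authors' actual features or ratings.

# Minimal sketch of a linear-regression dialogue evaluation function,
# assuming hypothetical per-dialogue features; NOT the authors' code.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def synthetic_data(n):
    """Hypothetical per-dialogue features and 1-5 ratings for one aspect."""
    # Columns: [num_turns (scaled), system_error_rate, task_success]
    X = rng.random((n, 3))
    # Synthetic "ratings" loosely driven by the features, plus noise.
    y = np.clip(3.0 + 1.5 * X[:, 2] - 2.0 * X[:, 1] + rng.normal(0, 0.4, n), 1, 5)
    return X, y

# Training data stands in for, e.g., simulated dialogues rated by MTurkers;
# test data stands in for held-out WOz dialogues rated by real participants.
X_train, y_train = synthetic_data(200)
X_test, y_test = synthetic_data(50)

# One evaluation function per rated aspect, fit by ordinary least squares.
model = LinearRegression().fit(X_train, y_train)
predicted = model.predict(X_test)

# Predictive power: correlation between predicted and observed ratings.
r, p = pearsonr(predicted, y_test)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")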
Anthology ID: 2020.lrec-1.91
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 726–734
Language: English
URL: https://aclanthology.org/2020.lrec-1.91
Cite (ACL):
Kallirroi Georgila, Carla Gordon, Volodymyr Yanov, and David Traum. 2020. Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 726–734, Marseille, France. European Language Resources Association.
Cite (Informal):
Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers (Georgila et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.91.pdf