GCDF1: A Goal- and Context- Driven F-Score for Evaluating User Models

Alexandru Coca, Bo-Hsiang Tseng, Bill Byrne


Abstract
Evaluating dialogue systems in interaction with simulated users has been proposed to improve on turn-level, corpus-based metrics, which can only evaluate test cases encountered in a corpus and cannot measure a system's ability to sustain multi-turn interactions. However, little emphasis has been placed on automatically assessing the quality of the user model itself, so unless correlations with human studies are measured, the reliability of user-model-based evaluation is unknown. We propose GCDF1, a simple but effective measure of the quality of semantic-level conversations between a goal-driven user agent and a system agent. In contrast with previous approaches, we measure the F-score at the dialogue level and take both user and system behaviours into account to improve recall and precision estimation. We facilitate score interpretation by providing a rich hierarchical structure with information about the conversational patterns present in the test data, together with tools to efficiently query the generated conversations. We apply our framework to assess the performance and weaknesses of a Convlab2 user model.
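The sketch below is not the paper's GCDF1 metric; it is only a minimal illustration of the general idea of scoring a whole dialogue against a user goal with precision, recall, and F1. The data layout (sets of hypothetical (domain, slot, value) triples) and the function name are assumptions for illustration.

# Minimal sketch of a dialogue-level F-score against a user goal.
# NOTE: this is NOT GCDF1; the (domain, slot, value) representation and
# the notion of "expressed" constraints are simplifying assumptions.

def dialogue_f1(goal, expressed):
    """Precision/recall/F1 of the constraints expressed in a dialogue vs. the user goal.

    goal, expressed: sets of (domain, slot, value) triples.
    """
    if not goal and not expressed:
        return 1.0, 1.0, 1.0
    true_pos = len(goal & expressed)
    precision = true_pos / len(expressed) if expressed else 0.0
    recall = true_pos / len(goal) if goal else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    goal = {("hotel", "area", "north"), ("hotel", "stars", "4"),
            ("restaurant", "food", "thai")}
    expressed = {("hotel", "area", "north"), ("restaurant", "food", "thai"),
                 ("restaurant", "pricerange", "cheap")}  # one off-goal constraint
    print(dialogue_f1(goal, expressed))  # ~ (0.667, 0.667, 0.667)

In contrast to such a corpus-agnostic toy score, GCDF1 additionally conditions on goal and dialogue context and on both user and system behaviours, as described in the paper.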
Anthology ID:
2021.eancs-1.2
Volume:
The First Workshop on Evaluations and Assessments of Neural Conversation Systems
Month:
November
Year:
2021
Address:
Online
Editors:
Wei Wei, Bo Dai, Tuo Zhao, Lihong Li, Diyi Yang, Yun-Nung Chen, Y-Lan Boureau, Asli Celikyilmaz, Alborz Geramifard, Aman Ahuja, Haoming Jiang
Venue:
EANCS
Publisher:
Association for Computational Linguistics
Pages:
7–14
URL:
https://aclanthology.org/2021.eancs-1.2
DOI:
10.18653/v1/2021.eancs-1.2
Cite (ACL):
Alexandru Coca, Bo-Hsiang Tseng, and Bill Byrne. 2021. GCDF1: A Goal- and Context- Driven F-Score for Evaluating User Models. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 7–14, Online. Association for Computational Linguistics.
Cite (Informal):
GCDF1: A Goal- and Context- Driven F-Score for Evaluating User Models (Coca et al., EANCS 2021)
PDF:
https://aclanthology.org/2021.eancs-1.2.pdf
Code:
alexcoca/gcdf1