Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

Sarik Ghazarian, Johnny Wei, Aram Galstyan, Nanyun Peng


Abstract
Despite advances in open-domain dialogue systems, automatic evaluation of such systems is still a challenging problem. Traditional reference-based metrics such as BLEU are ineffective because there could be many valid responses for a given context that share no common words with reference responses. A recent work proposed the Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER), which combines a learning-based metric that predicts the relatedness between a generated response and a given query with a reference-based metric; it showed high correlation with human judgments. In this paper, we explore using contextualized word embeddings to compute more accurate relatedness scores, and thus better evaluation metrics. Experiments show that our evaluation metrics outperform RUBER, which is trained on static embeddings.
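The sketch below illustrates the core idea of an unreferenced relatedness score built on contextualized embeddings. It is not the paper's trained model: the paper feeds pooled BERT features into an MLP scorer trained with negative sampling, whereas this simplified stand-in mean-pools BERT's final-layer token embeddings for the query and the response and uses their cosine similarity as a relatedness proxy. Model name, pooling choice, and the example inputs are illustrative assumptions.

```python
# Illustrative sketch, NOT the paper's exact scorer: relatedness between a
# query and a generated response from contextualized (BERT) embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def pooled_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final-layer contextualized token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def unreferenced_score(query: str, response: str) -> float:
    """Cosine similarity of pooled embeddings; the paper instead trains
    an MLP on such features with negative sampling."""
    q = pooled_embedding(query)
    r = pooled_embedding(response)
    return F.cosine_similarity(q, r).item()

# Hypothetical usage:
print(unreferenced_score("How was your weekend?",
                         "It was great, I went hiking with friends."))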
Anthology ID:
W19-2310
Volume:
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Antoine Bosselut, Asli Celikyilmaz, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin, Thomas Wolf
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
82–89
URL:
https://aclanthology.org/W19-2310
DOI:
10.18653/v1/W19-2310
Cite (ACL):
Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings (Ghazarian et al., NAACL 2019)
PDF:
https://aclanthology.org/W19-2310.pdf
Data
DailyDialog