FFAEval: Evaluating Dialogue System via Free-For-All Ranking

Zeyao Ma; Zijun Yao; Jing Zhang; Jifan Yu; Xiaohan Zhang; Juanzi Li; Jie Tang

doi:10.18653/v1/2023.findings-emnlp.1049

Abstract

Evaluating open-domain dialogue systems is currently an open question. Automatic evaluation metrics have shown poor correlation with human assessment in dialogue generation tasks. Human evaluation, which involves annotators for multi-dimension scoring, is trustworthy but time-consuming. In this work, we propose FFAEval, a reliable and efficient human evaluation framework using a Free-For-All ranking approach. By sharing the dialogue history, the framework enables annotators to converse with multiple dialogue systems simultaneously in a single-blind, multi-turn manner. The subsequent free-for-all allows annotators to select the most favourable model in each turn from among all the participating dialogue systems. The final performance of each model is represented by the TrueSkill score derived from the free-for-all competition. Our empirical study on English and Chinese dialogue systems demonstrates that FFAEval achieves a strong correlation with score-based human assessment compared to existing evaluation methods. We further prove the efficiency and stability of our framework in additional experiments. The source code and data are available on GitHub.
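
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each turn's pick can be approximated as the selected model beating every other participant in a two-player, no-draw TrueSkill update, whereas full TrueSkill handles multi-player matches and ties jointly. The names `update_pair`, `rate_turns`, and `leaderboard`, the model labels, and the default constants are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Tuple

# Commonly used TrueSkill defaults (an assumption, not taken from the paper).
MU0 = 25.0
SIGMA0 = MU0 / 3.0
BETA = SIGMA0 / 2.0

Rating = Tuple[float, float]  # (mu, sigma)


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_pair(winner: Rating, loser: Rating) -> Tuple[Rating, Rating]:
    """Two-player TrueSkill update for a decisive (no-draw) result."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c2 = 2.0 * BETA ** 2 + s_w ** 2 + s_l ** 2
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # mean shift factor
    w = v * (v + t)         # variance shrink factor, always in (0, 1)
    new_w = (mu_w + s_w ** 2 / c * v, s_w * math.sqrt(1.0 - s_w ** 2 / c2 * w))
    new_l = (mu_l - s_l ** 2 / c * v, s_l * math.sqrt(1.0 - s_l ** 2 / c2 * w))
    return new_w, new_l


def rate_turns(models: List[str], picks: List[str]) -> Dict[str, Rating]:
    """Simplification: the model picked in a turn beats every other model."""
    ratings: Dict[str, Rating] = {m: (MU0, SIGMA0) for m in models}
    for best in picks:
        for other in models:
            if other != best:
                ratings[best], ratings[other] = update_pair(
                    ratings[best], ratings[other]
                )
    return ratings


def leaderboard(ratings: Dict[str, Rating]) -> List[str]:
    """Order models by the conservative score mu - 3 * sigma."""
    return sorted(
        ratings, key=lambda m: ratings[m][0] - 3.0 * ratings[m][1], reverse=True
    )


# Hypothetical session: three systems, the annotator's pick in each turn.
ratings = rate_turns(["bot_a", "bot_b", "bot_c"], ["bot_a", "bot_a", "bot_a"])
print(leaderboard(ratings))
```

Because the pairwise updates are applied sequentially, the result depends on the order in which opponents are processed; a joint multi-player update avoids this, which is one reason to treat the sketch as illustrative only.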

Anthology ID: 2023.findings-emnlp.1049
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15672–15684
URL: https://aclanthology.org/2023.findings-emnlp.1049/
DOI: 10.18653/v1/2023.findings-emnlp.1049
Cite (ACL): Zeyao Ma, Zijun Yao, Jing Zhang, Jifan Yu, Xiaohan Zhang, Juanzi Li, and Jie Tang. 2023. FFAEval: Evaluating Dialogue System via Free-For-All Ranking. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15672–15684, Singapore. Association for Computational Linguistics.
Cite (Informal): FFAEval: Evaluating Dialogue System via Free-For-All Ranking (Ma et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.1049.pdf
