Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems

Emiel Van Miltenburg; Anouck Braggaar; Emmelyn Croes; Florian Kunneman; Christine Liebrecht; Gabriella Martijn

Abstract

Chatbots for customer service have been widely studied in many different fields, ranging from Natural Language Processing (NLP) to Communication Science. These fields have developed different evaluation practices to assess chatbot performance (e.g., fluency, task success) and to measure the impact of chatbot usage on the user’s perception of the organisation controlling the chatbot (e.g., brand attitude) as well as their willingness to enter a business transaction or to continue to use the chatbot in the future (i.e., purchase intention, reuse intention). While NLP researchers have developed many automatic measures of success, other fields mainly use questionnaires to compare different chatbots. This paper explores the extent to which we can bridge the gap between the two, and proposes a research agenda to further explore this question.

Anthology ID: 2025.gem-1.18
Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month: July
Year: 2025
Address: Vienna, Austria and virtual meeting
Editors: Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, Eliya Habba, Itay Itzhak, Simon Mille, Yotam Perlitz, Enrico Santus, João Sedoc, Michal Shmueli Scheuer, Gabriel Stanovsky, Oyvind Tafjord
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 231–238
URL: https://aclanthology.org/2025.gem-1.18/
Cite (ACL): Emiel Van Miltenburg, Anouck Braggaar, Emmelyn Croes, Florian Kunneman, Christine Liebrecht, and Gabriella Martijn. 2025. Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 231–238, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems (Van Miltenburg et al., GEM 2025)
PDF: https://aclanthology.org/2025.gem-1.18.pdf
