When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue

Tanya Shourya; Yingfan Wang; Zhaoyi Joey Hou; Shamik Roy; Vinayshekhar Bannihatti Kumar; Rashmi Gangadharaiah

Abstract

Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues—such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.

Anthology ID: 2026.gem-main.72
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 862–892
URL: https://aclanthology.org/2026.gem-main.72/
Cite (ACL): Tanya Shourya, Yingfan Wang, Zhaoyi Joey Hou, Shamik Roy, Vinayshekhar Bannihatti Kumar, and Rashmi Gangadharaiah. 2026. When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 862–892, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue (Shourya et al., GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.72.pdf
