RAIDEN Benchmark: Evaluating Role-playing Conversational Agents with Measurement-Driven Custom Dialogues

Bowen Wu, Kaili Sun, Ziwei Bai, Ying Li, Baoxun Wang


Abstract
As Large Language Models (LLMs) advance, the development of engaging Role-Playing Conversational Agents (RPCAs) has gained prominence. Despite this progress, there is a notable absence of benchmarks built around dialogues, rather than question-answering formats, for assessing the effectiveness of RPCA interactions. This paper introduces the RAIDEN benchmark, a comprehensive dataset developed specifically for RPCA evaluation, comprising over 40,000 multi-turn utterances across 135 characters. The benchmark assesses particular dimensions at different stages of a conversation, through dialogues conducted by annotators. This design lets the evaluation phase concentrate on specific response dimensions, thereby reducing the subjectivity of dialogue evaluation. To further enhance objectivity, evaluators compare responses from two different models rather than assessing a single response in isolation. In addition, we introduce RPCAJudger, a specialized judging LLM tailored for automatic RPCA evaluation. The judgments produced by RPCAJudger closely mirror human judgments, and its API-free methodology prevents potential data leakage. All the models and all non-private leaderboard data will be made publicly available.
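A minimal sketch of the pairwise, dimension-focused comparison protocol the abstract describes: a judge model sees the same dialogue context with two candidate responses and picks the better one along a single dimension. All names below (the pairwise_judge helper, the SimpleJudge stub, the prompt wording) are hypothetical illustrations for this sketch, not the paper's actual RPCAJudger interface.

# Hypothetical illustration of dimension-focused pairwise judging, as
# described in the abstract. The judge interface and prompt wording are
# assumptions for this sketch, not the paper's RPCAJudger implementation.

def pairwise_judge(judge, dialogue_history: str,
                   response_a: str, response_b: str, dimension: str) -> str:
    """Ask a judge LLM which of two candidate responses better satisfies
    a single evaluation dimension; returns 'A', 'B', or 'Tie'."""
    prompt = (
        f"Dimension under evaluation: {dimension}\n"
        f"Dialogue so far:\n{dialogue_history}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        "Which response better satisfies this dimension? "
        "Answer with exactly one of: A, B, Tie."
    )
    verdict = judge.generate(prompt).strip()
    # Fall back to a tie on any malformed verdict.
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"


class SimpleJudge:
    """Stand-in judge for demonstration; a real setup would run a local
    judging model here (the benchmark's judge is API-free)."""
    def generate(self, prompt: str) -> str:
        return "A"  # placeholder verdict


if __name__ == "__main__":
    print(pairwise_judge(SimpleJudge(),
                         "User: Who are you?\nCharacter: I am Raiden...",
                         "I guard the shrine with honor.",
                         "I don't know.",
                         "character consistency"))

Comparing two responses under one named dimension, rather than scoring a lone response, is what the abstract credits with reducing evaluator subjectivity.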
Anthology ID:
2025.coling-main.735
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
11086–11106
URL:
https://aclanthology.org/2025.coling-main.735/
Cite (ACL):
Bowen Wu, Kaili Sun, Ziwei Bai, Ying Li, and Baoxun Wang. 2025. RAIDEN Benchmark: Evaluating Role-playing Conversational Agents with Measurement-Driven Custom Dialogues. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11086–11106, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
RAIDEN Benchmark: Evaluating Role-playing Conversational Agents with Measurement-Driven Custom Dialogues (Wu et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.735.pdf