Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System

Ruoyu Liu; Kui Xue; Xiaofan Zhang; Shaoting Zhang

Abstract

This study evaluates the proactive communication and diagnostic capabilities of medical Large Language Models (LLMs), which directly affect their effectiveness in patient consultations. In typical medical scenarios, doctors ask a series of questions to gain a comprehensive understanding of a patient’s condition. We argue that single-turn question-answering tasks such as MultiMedQA are insufficient for evaluating LLMs’ medical consultation abilities. To address this limitation, we developed an evaluation benchmark called Multi-turn Medical Dialogue Evaluation (MMD-Eval), designed specifically to assess these capabilities during consultations. Considering the high cost and potential for hallucinations in LLMs, we trained a task-oriented dialogue system, using our structured medical records dataset, to simulate patients in dialogue with the medical LLMs. This approach enabled us to generate multi-turn dialogue data. We then evaluated the communication skills and medical expertise of the medical LLMs. All resources associated with this study will be made publicly available.

Anthology ID: 2025.coling-main.325
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 4871–4896
URL: https://aclanthology.org/2025.coling-main.325/
Cite (ACL): Ruoyu Liu, Kui Xue, Xiaofan Zhang, and Shaoting Zhang. 2025. Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4871–4896, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System (Liu et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.325.pdf
