Ruoyu Liu


2025

Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System
Ruoyu Liu | Kui Xue | Xiaofan Zhang | Shaoting Zhang
Proceedings of the 31st International Conference on Computational Linguistics

This study focuses on evaluating the proactive communication and diagnostic capabilities of medical Large Language Models (LLMs), which directly affect their effectiveness in patient consultations. In typical medical scenarios, doctors ask a series of questions to gain a comprehensive understanding of a patient's condition. We argue that single-turn question-answering benchmarks such as MultiMedQA are insufficient for evaluating LLMs' medical consultation abilities. To address this limitation, we developed Multi-turn Medical Dialogue Evaluation (MMD-Eval), a benchmark designed specifically to evaluate the proactive communication and diagnostic capabilities of medical LLMs during consultations. Given the high cost of LLM-based patient simulation and its potential for hallucination, we instead trained a task-oriented dialogue system on our structured medical records dataset to simulate patients conversing with the medical LLMs, which enabled us to generate multi-turn dialogue data. We then evaluated the communication skills and medical expertise of the medical LLMs based on these dialogues. All resources associated with this study will be made publicly available.
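
The abstract describes a simulation loop in which a patient agent, grounded in a structured medical record, converses turn by turn with a medical LLM acting as the doctor until a diagnosis is produced. The Python sketch below illustrates what such a loop could look like; it is a minimal illustration under stated assumptions, not the authors' released code. The names MedicalRecord, PatientSimulator, MedicalLLM, and run_consultation are hypothetical, and the stub agents merely stand in for the trained task-oriented dialogue system and the model under evaluation.

# Minimal sketch of a multi-turn consultation loop like the one the
# abstract describes. All class and method names are hypothetical.
from dataclasses import dataclass

@dataclass
class MedicalRecord:
    chief_complaint: str
    answers: dict[str, str]  # structured fields grounding the patient simulator

class PatientSimulator:
    """Stand-in for the task-oriented dialogue system: answers only
    from the structured record, never inventing new symptoms."""
    def respond(self, question: str, record: MedicalRecord) -> str:
        for topic, answer in record.answers.items():
            if topic in question.lower():
                return answer
        return "I'm not sure."

class MedicalLLM:
    """Stand-in for the medical LLM under evaluation: asks scripted
    follow-up questions, then issues a diagnosis."""
    questions = ["How long have you had the fever?", "Do you have a cough?"]
    def __init__(self) -> None:
        self._i = 0
    def respond(self, history: list[tuple[str, str]]) -> str:
        if self._i < len(self.questions):
            q = self.questions[self._i]
            self._i += 1
            return q
        return "Diagnosis: likely viral upper respiratory infection."

def run_consultation(doctor: MedicalLLM, patient: PatientSimulator,
                     record: MedicalRecord, max_turns: int = 10):
    """Generate one multi-turn dialogue between doctor and patient agents."""
    history = [("patient", record.chief_complaint)]  # dialogue opens with the complaint
    for _ in range(max_turns):
        msg = doctor.respond(history)
        history.append(("doctor", msg))
        if msg.lower().startswith("diagnosis"):      # stop once a diagnosis is given
            break
        history.append(("patient", patient.respond(msg, record)))
    return history

record = MedicalRecord(
    "I have had a fever for three days.",
    {"fever": "About three days now.", "cough": "Yes, a dry cough."},
)
for speaker, text in run_consultation(MedicalLLM(), PatientSimulator(), record):
    print(f"{speaker}: {text}")

Transcripts generated this way would then be scored along the two axes the abstract names: communication skill and medical expertise.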