AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Zhihao Fan; Lai Wei; Jialong Tang; Wei Chen; Wang Siyuan; Zhongyu Wei; Fei Huang

Abstract

Artificial intelligence has significantly revolutionized healthcare, particularly through large language models (LLMs) that demonstrate superior performance in static medical question answering benchmarks. However, evaluating the potential of LLMs for real-world clinical applications remains challenging due to the intricate nature of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework emulating dynamic medical interactions between Doctor as player and NPCs including Patient and Examiner. This setup allows for more practical assessments of LLMs in simulated clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and multiple evaluation strategies to quantify the performance of LLM-driven Doctor agents on symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance medical interaction capabilities through iterative discussions. Despite improvements, current LLMs (including GPT-4) still exhibit significant performance gaps in multi-turn interactive scenarios compared to non-interactive scenarios. Our findings highlight the need for further research to bridge these gaps and improve LLMs’ clinical decision-making capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

Anthology ID: 2025.coling-main.680
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 10183–10213
URL: https://aclanthology.org/2025.coling-main.680/
Cite (ACL): Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang. 2025. AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator (Fan et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.680.pdf
