LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Ming Zhang; Yujiong Shen; Zelin Li; Huayu Sha; Binze Hu; Yuhui Wang; Chenhao Huang; Shichun Liu; Jingqi Tong; Changhao Jiang; Mingxu Chai; Zhiheng Xi; Shihan Dou; Tao Gui; Qi Zhang; Xuan-Jing Huang (黄萱菁)

doi:10.18653/v1/2025.findings-emnlp.263

Abstract

Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.

Anthology ID: 2025.findings-emnlp.263
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4888–4914
URL: https://aclanthology.org/2025.findings-emnlp.263/
DOI: 10.18653/v1/2025.findings-emnlp.263
Cite (ACL): Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. 2025. LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4888–4914, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation (Zhang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.263.pdf
Checklist: 2025.findings-emnlp.263.checklist.pdf
