LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating large language models (LLMs) in medicine is crucial because medical applications demand high accuracy and leave little room for error. Existing medical benchmarks fall into three main types: medical-exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas and comprising 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline that incorporates expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
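To make the general idea of checklist-based LLM-as-Judge scoring with human-machine agreement validation concrete, the sketch below shows one minimal way such a pipeline could be wired up. It is an illustrative assumption, not the paper's actual implementation: the class names, prompt format, and weighting scheme (ChecklistItem, build_judge_prompt, checklist_score) are hypothetical, and Cohen's kappa stands in for whichever agreement statistic the authors use.

```python
# Hypothetical sketch of checklist-based LLM-as-Judge scoring and a
# human-machine agreement check. All names and formats are illustrative
# assumptions, not the benchmark's real API.
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    criterion: str       # e.g. "Mentions the relevant contraindication"
    weight: float = 1.0  # expert-assigned importance


def build_judge_prompt(question: str, answer: str,
                       checklist: List[ChecklistItem]) -> str:
    """Embed the expert-developed checklist into the judge model's prompt."""
    items = "\n".join(f"- {c.criterion}" for c in checklist)
    return (
        f"Question:\n{question}\n\nModel answer:\n{answer}\n\n"
        f"For each criterion, output 1 if satisfied, else 0:\n{items}"
    )


def checklist_score(item_scores: List[int],
                    checklist: List[ChecklistItem]) -> float:
    """Weighted fraction of checklist criteria the answer satisfies."""
    total = sum(c.weight for c in checklist)
    earned = sum(s * c.weight for s, c in zip(item_scores, checklist))
    return earned / total if total else 0.0


def cohens_kappa(machine: List[int], human: List[int]) -> float:
    """Chance-corrected agreement between machine and physician judgments."""
    n = len(machine)
    po = sum(m == h for m, h in zip(machine, human)) / n
    labels = set(machine) | set(human)
    pe = sum((machine.count(l) / n) * (human.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```

Under this reading, low agreement between machine and physician labels would trigger a revision of the checklist items or judge prompt, which matches the abstract's description of dynamically refining both based on expert feedback.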