Youxia Zhao
2026
MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills
Zonghai Yao | Zihao Zhang | Chaolong Tang | Xingyu Bian | Youxia Zhao | Zhichao Yang | Junda Wang | Huixue Zhou | Won Seok Jang | Feiyun Ouyang | Hong Yu
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education’s Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks (LLM-as-medical-student and LLM-as-CS-examiner) designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing a quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of the clinical capabilities of both open- and closed-source LLMs.