Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues

Jonathan Schiött; William Ivegren; Alexander Borg; Ioannis Parodis; Gabriel Skantze

Abstract

This paper evaluates large language models (LLMs) for grading clinical reasoning in rheumatology medical history virtual patient (VP) simulations. The study explores the feasibility of using state-of-the-art LLMs, covering both general-purpose models with various prompting strategies (zero-shot, analysis-first, and chain-of-thought) and reasoning models. The models graded transcribed dialogues from VP simulations conducted on a Furhat robot, and their grades were evaluated against human expert annotations. Human experts initially achieved 65% inter-rater agreement, which resulted in a pooled Cohen’s Kappa of 0.71 and 82.3% correctness. The best LLM, o3-mini, achieved a pooled Kappa of 0.68 and 81.5% correctness, with response times under 30 seconds, compared to approximately 6 minutes for human grading. These results suggest that automatic assessment can approach human reliability under controlled simulation conditions while delivering time and cost efficiencies.
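For readers unfamiliar with the agreement statistics quoted above, the following is a minimal sketch of how raw agreement and Cohen's Kappa can be computed for two graders. It is not the authors' evaluation code: the function name `cohens_kappa`, the binary labels, and the toy data are all hypothetical, and the abstract does not specify how the paper pools its Kappa, so no pooling is attempted here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over labels (labels missing from one rater contribute zero).
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 1 = criterion met, 0 = criterion not met.
human = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
model = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

agreement = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohens_kappa(human, model)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

On this toy data the raters agree on 8 of 10 items, but Kappa is lower than the raw agreement because it discounts the agreement expected by chance from each rater's label frequencies.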

Anthology ID: 2025.sigdial-1.56
Volume: Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month: August
Year: 2025
Address: Avignon, France
Editors: Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin
Venue: SIGDIAL
SIG: SIGDIAL
Publisher: Association for Computational Linguistics
Pages: 750–763
URL: https://aclanthology.org/2025.sigdial-1.56/
Cite (ACL): Jonathan Schiött, William Ivegren, Alexander Borg, Ioannis Parodis, and Gabriel Skantze. 2025. Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 750–763, Avignon, France. Association for Computational Linguistics.
Cite (Informal): Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues (Schiött et al., SIGDIAL 2025)
PDF: https://aclanthology.org/2025.sigdial-1.56.pdf
