Jonathan Schiött


2025

Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues
Jonathan Schiött | William Ivegren | Alexander Borg | Ioannis Parodis | Gabriel Skantze
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

This paper presents an evaluation of the use of large language models (LLMs) for grading clinical reasoning during rheumatology medical-history virtual patient (VP) simulations. The study explores the feasibility of using state-of-the-art LLMs, covering both general-purpose models, tested with prompting strategies such as zero-shot, analysis-first, and chain-of-thought prompting, and dedicated reasoning models. The performance of these models in grading transcribed dialogues from VP simulations conducted on a Furhat robot was evaluated against human expert annotations. Human experts initially achieved 65% inter-rater agreement, which yielded a pooled Cohen's Kappa of 0.71 and 82.3% correctness. The best LLM, o3-mini, achieved a pooled Kappa of 0.68 and 81.5% correctness, with response times under 30 seconds, compared to approximately 6 minutes for human grading. These results suggest that automatic assessment can approach human reliability under controlled simulation conditions while offering considerable time and cost savings.
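
To make the prompting setup concrete, here is a minimal sketch of how zero-shot and chain-of-thought grading requests might be issued through the OpenAI Python SDK. The paper's actual prompts, rubric criteria, and output format are not reproduced here; the template wording, the MET/NOT_MET label scheme, and the grade function are illustrative assumptions, with only the model name (o3-mini) taken from the abstract.

```python
# Illustrative sketch only: prompt wording, label scheme, and helper names
# are assumptions, not the paper's actual grading setup.
from openai import OpenAI

client = OpenAI()

ZERO_SHOT = (
    "You are grading a medical student's history-taking dialogue with a "
    "virtual patient. Criterion: {criterion}\n"
    "Transcript:\n{transcript}\n"
    "Answer with exactly one label: MET or NOT_MET."
)

CHAIN_OF_THOUGHT = (
    "You are grading a medical student's history-taking dialogue with a "
    "virtual patient. Criterion: {criterion}\n"
    "Transcript:\n{transcript}\n"
    "First reason step by step about the relevant evidence in the "
    "transcript, then end with one line of the form 'LABEL: MET' or "
    "'LABEL: NOT_MET'."
)

def grade(transcript: str, criterion: str, template: str,
          model: str = "o3-mini") -> str:
    """Send one grading request and return the raw model output."""
    prompt = template.format(criterion=criterion, transcript=transcript)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```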
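
The evaluation metrics reported above (pooled Cohen's Kappa and percent correctness) can be computed as sketched below, assuming "pooled" means that per-criterion grades are flattened across all dialogues before computing agreement; that interpretation, the binary grade encoding, and the toy data are assumptions for illustration.

```python
# Sketch of pooled Cohen's Kappa and raw agreement ("correctness"),
# assuming grades are pooled across all dialogues and criteria.
from sklearn.metrics import cohen_kappa_score

def pooled_kappa_and_agreement(grades_a, grades_b):
    """Cohen's Kappa and percent agreement over two flat lists of grades."""
    kappa = cohen_kappa_score(grades_a, grades_b)
    agreement = sum(a == b for a, b in zip(grades_a, grades_b)) / len(grades_a)
    return kappa, agreement

# Hypothetical toy grades: 1 = criterion met, 0 = criterion not met.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
kappa, agreement = pooled_kappa_and_agreement(human, model)
print(f"kappa={kappa:.2f}, agreement={agreement:.1%}")
```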