Sina Chole Benker


2025

We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively.