Dogus Darici


2025

How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung | Max Lu | Sina Chole Benker | Dogus Darici
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The findings reveal both the potential of scalable LLM raters and the risks of relying on them exclusively.