Dogus Darici
2025
How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung | Max Lu | Sina Chole Benker | Dogus Darici
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
We examined how model size, temperature, and prompt style affect the alignment of large language models (LLMs) with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively.