Dogus Darici


2025

How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung | Max Lu | Sina Chole Benker | Dogus Darici
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The findings reveal both the potential of scalable LLM raters and the risks of relying on them exclusively.