How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment

Julie Jung, Max Lu, Sina Chole Benker, Dogus Darici


Abstract
We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively.
Anthology ID:
2025.aimecon-main.28
Volume:
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Month:
October
Year:
2025
Address:
Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors:
Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue:
AIME-Con
Publisher:
National Council on Measurement in Education (NCME)
Pages:
265–273
URL:
https://aclanthology.org/2025.aimecon-main.28/
Cite (ACL):
Julie Jung, Max Lu, Sina Chole Benker, and Dogus Darici. 2025. How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers, pages 265–273, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment (Jung et al., AIME-Con 2025)
PDF:
https://aclanthology.org/2025.aimecon-main.28.pdf