When Humans Can’t Agree, Neither Can Machines: The Promise and Pitfalls of LLMs for Formative Literacy Assessment

Owen Henkel, Kirk Vanacore, Bill Roberts


Abstract
Story retell assessments provide valuable insights into reading comprehension but face implementation barriers due to time-intensive administration and scoring. This study examines whether Large Language Models (LLMs) can reliably replicate human judgment in grading story retells. Using a novel dataset, we conduct three complementary studies examining LLM performance across different rubric systems, agreement patterns, and reasoning alignment. We find that LLMs (a) achieve near-human reliability with appropriate rubric design, (b) perform well on easy-to-grade cases but poorly on ambiguous ones, (c) produce explanations for their grades that are plausible for straightforward cases but unreliable for complex ones, and (d) display consistent “grading personalities,” with different models systematically scoring harder or easier across all student responses. These findings support hybrid assessment architectures in which AI handles routine scoring, enabling more frequent formative assessment while directing teacher expertise toward students requiring nuanced support.
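
As an illustrative sketch (not taken from the paper), human–LLM agreement of the kind described above is often quantified with quadratic weighted kappa between rubric scores from a human rater and an LLM. The score scale and values below are hypothetical examples.

```python
# Illustrative sketch: comparing LLM rubric scores to human rubric scores
# with quadratic weighted kappa, a common metric for ordinal grading scales.
# The scores below are hypothetical, not data from the paper.
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (0-3 scale) for ten student retells.
human_scores = [3, 2, 0, 1, 3, 2, 2, 1, 0, 3]
llm_scores   = [3, 2, 1, 1, 3, 2, 1, 1, 0, 3]

# Quadratic weighting penalizes large disagreements more than near-misses,
# which suits ordinal rubric levels.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.2f}")
```
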
Anthology ID:
2025.aimecon-sessions.8
Volume:
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers
Month:
October
Year:
2025
Address:
Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors:
Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue:
AIME-Con
Publisher:
National Council on Measurement in Education (NCME)
Pages:
69–78
URL:
https://aclanthology.org/2025.aimecon-sessions.8/
Cite (ACL):
Owen Henkel, Kirk Vanacore, and Bill Roberts. 2025. When Humans Can’t Agree, Neither Can Machines: The Promise and Pitfalls of LLMs for Formative Literacy Assessment. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers, pages 69–78, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
When Humans Can’t Agree, Neither Can Machines: The Promise and Pitfalls of LLMs for Formative Literacy Assessment (Henkel et al., AIME-Con 2025)
PDF:
https://aclanthology.org/2025.aimecon-sessions.8.pdf