Comparing AI tools and Human Raters in Predicting Reading Item Difficulty

Hongli Li, Roula Aldib, Chad Marchong, Kevin Fan


Abstract
This study compares AI tools and human raters in predicting the difficulty of reading comprehension items without response data. Predictions from four AI models (ChatGPT, Gemini, Claude, and DeepSeek) and human raters are evaluated against empirical difficulty values derived from student responses. The findings will inform the potential of AI to support test development.
Anthology ID: 2025.aimecon-wip.10
Volume: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress
Month: October
Year: 2025
Address: Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors: Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue: AIME-Con
Publisher: National Council on Measurement in Education (NCME)
Pages: 84–89
URL: https://aclanthology.org/2025.aimecon-wip.10/
Cite (ACL):
Hongli Li, Roula Aldib, Chad Marchong, and Kevin Fan. 2025. Comparing AI tools and Human Raters in Predicting Reading Item Difficulty. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress, pages 84–89, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
Comparing AI tools and Human Raters in Predicting Reading Item Difficulty (Li et al., AIME-Con 2025)
PDF: https://aclanthology.org/2025.aimecon-wip.10.pdf