Evaluating LLM-Based Automated Essay Scoring: Accuracy, Fairness, and Validity

Yue Huang, Joshua Wilson


Abstract
This study evaluates large language models (LLMs) for automated essay scoring (AES), comparing prompt strategies and fairness across student groups. We found that well-designed prompting helps LLMs approach the performance of traditional AES models, but both diverge from human scores for English language learners (ELLs): the traditional model shows larger overall gaps, while LLMs show subtler disparities.
Anthology ID:
2025.aimecon-wip.9
Volume:
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress
Month:
October
Year:
2025
Address:
Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors:
Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue:
AIME-Con
Publisher:
National Council on Measurement in Education (NCME)
Pages:
71–83
URL:
https://aclanthology.org/2025.aimecon-wip.9/
Cite (ACL):
Yue Huang and Joshua Wilson. 2025. Evaluating LLM-Based Automated Essay Scoring: Accuracy, Fairness, and Validity. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress, pages 71–83, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
Evaluating LLM-Based Automated Essay Scoring: Accuracy, Fairness, and Validity (Huang & Wilson, AIME-Con 2025)
PDF:
https://aclanthology.org/2025.aimecon-wip.9.pdf