Simulating Rating Scale Responses with LLMs for Early-Stage Item Evaluation

Onur Demirkaya, Hsin-Ro Wei, Evelyn Johnson


Abstract
This study explores the use of large language models to simulate human responses to Likert-scale items. A DeBERTa-base model, fine-tuned on item text and examinee ability, is used to emulate a graded response model (GRM). Close alignment with GRM-implied response probabilities and reasonable recovery of category thresholds support LLMs as scalable tools for early-stage item evaluation.
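For readers unfamiliar with the psychometric baseline, the following is a minimal sketch of the standard (Samejima) graded response model that the abstract's "GRM probabilities" and "threshold recovery" refer to; the notation (discrimination a_j, thresholds b_jk, ability theta) is the conventional parameterization and is assumed here rather than taken from the paper itself.

% Graded response model (standard Samejima form; notation assumed, not from the paper).
% P*_jk(theta): probability of responding in category k or higher to item j.
\[
P^{*}_{jk}(\theta) = \frac{1}{1 + \exp\!\bigl[-a_j\,(\theta - b_{jk})\bigr]},
\qquad
P_{jk}(\theta) = P^{*}_{jk}(\theta) - P^{*}_{j,k+1}(\theta),
\]
% with boundary conditions P*_{j1}(theta) = 1 and P*_{j,K+1}(theta) = 0 for an
% item with K response categories. The b_jk are the category thresholds whose
% recovery the abstract reports; the P_jk(theta) are the category response
% probabilities against which the fine-tuned model is compared.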
Anthology ID:
2025.aimecon-main.41
Volume:
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Month:
October
Year:
2025
Address:
Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors:
Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue:
AIME-Con
Publisher:
National Council on Measurement in Education (NCME)
Pages:
385–392
URL:
https://aclanthology.org/2025.aimecon-main.41/
Cite (ACL):
Onur Demirkaya, Hsin-Ro Wei, and Evelyn Johnson. 2025. Simulating Rating Scale Responses with LLMs for Early-Stage Item Evaluation. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers, pages 385–392, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
Simulating Rating Scale Responses with LLMs for Early-Stage Item Evaluation (Demirkaya et al., AIME-Con 2025)
PDF:
https://aclanthology.org/2025.aimecon-main.41.pdf