Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment

Tazin Afrin, Le An Ha, Victoria Yaneva, Keelan Evanini, Steven Go, Kristine DeRuchie, Michael Heilig


Abstract
This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points for item development. Expert judgments were analyzed, and multiple classifiers were evaluated, including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations involved in evaluating AI-generated items in clinical test development.
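To make the few-shot prompting approach mentioned in the abstract concrete, the following is a minimal sketch assuming the OpenAI Python client. The prompt wording, few-shot examples, label set, and model configuration here are illustrative assumptions, not the paper's actual protocol.

# Hypothetical sketch of few-shot "helpful" vs. "non-helpful" classification
# of an AI-generated clinical MCQ draft with GPT-4. All prompts and example
# drafts below are invented for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Invented few-shot examples: (draft text, expert label)
FEW_SHOT_EXAMPLES = [
    ("A 45-year-old man presents with crushing substernal chest pain ... "
     "(complete vignette, plausible distractors, single best answer)",
     "helpful"),
    ("Which drug is best? A) Drug A B) Drug B ... "
     "(no clinical vignette, ambiguous options)",
     "non-helpful"),
]

def classify_draft(draft: str) -> str:
    """Return 'helpful' or 'non-helpful' for an AI-generated item draft."""
    messages = [{
        "role": "system",
        "content": ("You are an expert clinical item writer. Label each "
                    "multiple-choice question draft as 'helpful' or "
                    "'non-helpful' as a starting point for item development. "
                    "Reply with the label only."),
    }]
    # Interleave the few-shot examples as prior user/assistant turns.
    for text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": draft})
    response = client.chat.completions.create(
        model="gpt-4",   # model choice is an assumption
        messages=messages,
        temperature=0,   # deterministic labeling
    )
    return response.choices[0].message.content.strip()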
Anthology ID:
2025.aimecon-main.19
Volume:
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Month:
October
Year:
2025
Address:
Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors:
Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue:
AIME-Con
Publisher:
National Council on Measurement in Education (NCME)
Pages:
172–182
URL:
https://aclanthology.org/2025.aimecon-main.19/
Cite (ACL):
Tazin Afrin, Le An Ha, Victoria Yaneva, Keelan Evanini, Steven Go, Kristine DeRuchie, and Michael Heilig. 2025. Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers, pages 172–182, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment (Afrin et al., AIME-Con 2025)
PDF:
https://aclanthology.org/2025.aimecon-main.19.pdf