Steven Go


2025

Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment
Tazin Afrin | Le An Ha | Victoria Yaneva | Keelan Evanini | Steven Go | Kristine DeRuchie | Michael Heilig
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

This study examines the classification of AI-generated drafts of clinical multiple-choice questions as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated, including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations involved in evaluating AI-generated items in clinical test development.