Michael Heilig

2025

This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated, including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for methods of evaluating AI-generated items in clinical test development.

Co-authors

Tazin Afrin 1
Kristine DeRuchie 1
Keelan Evanini 1
Steven Go 1
Le An Ha 1

Victoria Yaneva 1

Venues

AIME-Con 1
