Peter Baldwin

2025

Towards Reliable Generation of Clinical Chart Items: A Counterfactual Reasoning Approach with Large Language Models
Jiaxuan Li | Saed Rezayi | Peter Baldwin | Polina Harik | Victoria Yaneva
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

This study explores GPT-4 for generating clinical chart items in medical education using three prompting strategies. Expert evaluations found many items usable or promising. The counterfactual approach enhanced novelty, and item quality improved with high-surprisal examples. This is the first investigation of LLMs for automated clinical chart item generation.

pdf bib abs

Using Generative AI to Develop a Common Metric in Item Response Theory
Peter Baldwin
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

We propose a method for linking independently calibrated item response theory (IRT) scales using large language models to generate shared parameter estimates across forms. Applied to medical licensure data, the approach reliably recovers slope values across all conditions and yields accurate intercepts when cross-form differences in item difficulty are small.

pdf bib abs

Implicit Biases in Large Vision–Language Models in Classroom Contexts
Peter Baldwin
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress

Using a counterfactual, adversarial, audit-style approach, we tested whether ChatGPT-4o evaluates classroom lectures differently based on teacher demographics. The model was told only to rate lecture excerpts embedded within classroom images—without reference to the images themselves. Despite this, ratings varied systematically by teacher race and sex, revealing implicit bias.

pdf bib abs

Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
Sahar Yarmohammadtoosky | Yiyun Zhou | Victoria Yaneva | Peter Baldwin | Saed Rezayi | Brian Clauser | Polina Harik
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system’s weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the system’s robustness. Our results indicate that these methods significantly reduce the susceptibility of grading systems to such manipulations, especially when combined with ensemble techniques like majority voting and Ridge regression, which further improve the system’s defense against sophisticated adversarial inputs. Additionally, employing large language models suchasGPT-4with varied prompting techniques has shown promise in recognizing and scoring gaming strategies effectively. The findings underscore the importance of continuous improvements in AI-driven educational tools to ensure their reliability and fairness in high-stakes settings.

pdf bib abs

This paper presents an automated scoring approach for a formative assessment tool aimed at helping learner physicians enhance their communication skills through simulated patient interactions. The system evaluates transcribed learner responses by detecting key communicative behaviors, such as acknowledgment, empathy, and clarity. Built on an adapted version of the ACTA scoring framework, the model achieves a mean binary F1 score of 0.94 across 8 clinical scenarios. A central contribution of this work is the investigation of how to balance scoring accuracy with scalability. We demonstrate that synthetic training data offers a promising path toward reducing reliance on large, annotated datasets—making automated scoring more accurate and scalable.

2024

pdf bib abs

This paper reports findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions. The task was organized as part of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA’24), held in conjunction with NAACL 2024, and called upon the research community to contribute solutions to the problem of modeling difficulty and response time for clinical multiple-choice questions (MCQs). A set of 667 previously used and now retired MCQs from the United States Medical Licensing Examination (USMLE®) and their corresponding difficulties and mean response times were made available for experimentation. A total of 17 teams submitted solutions and 12 teams submitted system report papers describing their approaches. This paper summarizes the findings from the shared task and analyzes the main approaches proposed by the participants.

2021

pdf bib abs

Using Linguistic Features to Predict the Response Process Complexity Associated with Answering Clinical MCQs
Victoria Yaneva | Daniel Jurich | Le An Ha | Peter Baldwin
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

This study examines the relationship between the linguistic characteristics of a test item and the complexity of the response process required to answer it correctly. Using data from a large-scale medical licensing exam, clustering methods identified items that were similar with respect to their relative difficulty and relative response-time intensiveness to create low response process complexity and high response process complexity item classes. Interpretable models were used to investigate the linguistic features that best differentiated between these classes from a descriptive and predictive framework. Results suggest that nuanced features such as the number of ambiguous medical terms help explain response process complexity beyond superficial item characteristics such as word count. Yet, although linguistic features carry signal relevant to response process complexity, the classification of individual items remains challenging.

2020

pdf bib abs

Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning
Kang Xue | Victoria Yaneva | Christopher Runyon | Peter Baldwin
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper investigates whether transfer learning can improve the prediction of the difficulty and response time parameters for 18,000 multiple-choice questions from a high-stakes medical exam. The type the signal that best predicts difficulty and response time is also explored, both in terms of representation abstraction and item component used as input (e.g., whole item, answer options only, etc.). The results indicate that, for our sample, transfer learning can improve the prediction of item difficulty when response time is used as an auxiliary task but not the other way around. In addition, difficulty was best predicted using signal from the item stem (the description of the clinical case), while all parts of the item were important for predicting the response time.

pdf bib abs

Predicting Item Survival for Multiple Choice Questions in a High-Stakes Medical Exam
Victoria Yaneva | Le An Ha | Peter Baldwin | Janet Mee
Proceedings of the Twelfth Language Resources and Evaluation Conference

One of the most resource-intensive problems in the educational testing industry relates to ensuring that newly-developed exam questions can adequately distinguish between students of high and low ability. The current practice for obtaining this information is the costly procedure of pretesting: new items are administered to test-takers and then the items that are too easy or too difficult are discarded. This paper presents the first study towards automatic prediction of an item’s probability to “survive” pretesting (item survival), focusing on human-produced MCQs for a medical exam. Survival is modeled through a number of linguistic features and embedding types, as well as features inspired by information retrieval. The approach shows promising first results for this challenging new application and for modeling the difficulty of expert-knowledge questions.

2019

pdf bib abs

Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam
Le An Ha | Victoria Yaneva | Peter Baldwin | Janet Mee
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize over other exams containing MCQs.