Sharmin Sultana
2026
Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty
Sravanthi Machcha | Sushrita Yerra | Sahil Gupta | Aishwarya Sahoo | Sharmin Sultana | Hong Yu | Zonghai Yao
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA), integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and leads to safer abstention, far more than input perturbations do, while scaling model size or applying advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.
2025
Do Large Language Models Know When Not to Answer in Medical QA?
Sravanthi Machcha | Sushrita Yerra | Sharmin Sultana | Hong Yu | Zonghai Yao
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)
Uncertainty awareness is essential for large language models (LLMs), particularly in safety-critical domains such as medicine where erroneous or hallucinatory outputs can cause harm. Yet most evaluations remain centered on accuracy, offering limited insight into model confidence and its relation to abstention. In this work, we present preliminary experiments that combine conformal prediction with abstention-augmented and perturbed variants of medical QA datasets. Our early results suggest a positive link between uncertainty estimates and abstention decisions, with this effect amplified under higher difficulty and adversarial perturbations. These findings highlight abstention as a practical handle for probing model reliability in medical QA.
Chatbot To Help Patients Understand Their Health
Won Seok Jang | Hieu Tran | Manav Shaileshkumar Mistry | Sai Kiran Gandluri | Yifan Zhang | Sharmin Sultana | Sunjae Kwon | Yuan Zhang | Zonghai Yao | Hong Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Patients must possess the knowledge necessary to actively participate in their care. To this end, we developed NoteAid-Chatbot, a conversational AI designed to help patients better understand their health through a novel framework of learning as conversation. We introduce a new learning paradigm that leverages a multi-agent large language model (LLM) and reinforcement learning (RL) framework, without relying on costly human-generated training data. Specifically, NoteAid-Chatbot was built on a lightweight 3-billion-parameter LLaMA 3.2 model using a two-stage training approach: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert humans. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.