Proceedings of the 7th Clinical Natural Language Processing Workshop

Asma Ben Abacha, Steven Bethard, Danielle Bitterman, Tristan Naumann, Kirk Roberts (Editors)


Anthology ID:
2025.clinicalnlp-1
Month:
October
Year:
2025
Address:
Virtual
Venues:
ClinicalNLP | WS
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2025.clinicalnlp-1/
PDF:
https://aclanthology.org/2025.clinicalnlp-1.pdf


Overview of the 2025 Shared Task on Chemotherapy Treatment Timeline Extraction
Jiarui Yao | Harry Hochheiser | WonJin Yoon | Eli T Goldner | Guergana K Savova

Extracting patient treatment timelines from clinical notes is a complex task that involves identifying relevant events, temporal expressions, and temporal relations in individual documents and developing cross-document summaries. The 2025 Shared Task on Chemotherapy Treatment Timeline Extraction builds upon the initial 2024 challenge, using data from 57,530 breast and ovarian cancer patients and 15,946 melanoma patients. Participants were provided with a subset annotated for treatment entities, temporal expressions, temporal relations, and timelines for each patient. This training data was used to address two subtasks. Subtask 1 focused on extracting temporal relations and creating timelines, given documents and gold-standard events and temporal expressions. Subtask 2 involved developing an end-to-end system that extracts entities, temporal expressions, and relations and constructs timelines, given only the Electronic Health Record notes. Five teams participated, submitting eight entries for Subtask 1 and twelve for Subtask 2. Supervised fine-tuning remains a productive approach, although, compared with the 2024 edition, it has shifted toward fine-tuning very large language models. Even under the much stricter evaluation metric, the best results are comparable to the best 2024 results under the more lenient relaxed-to-month metric.

Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations
Jean-Philippe Corbeil | Asma Ben Abacha | Jerome Tremblay | Phillip Swazinna | Akila Jeeson Daniel | Miguel Del-Agua | Francois Beaulieu

Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem could significantly reduce the documentation burden on clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task, experimenting with a broad range of approaches and with both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants’ solutions.

Overview of the MEDIQA-WV 2025 Shared Task on Woundcare Visual Question Answering
Wen-wai Yim | Asma Ben Abacha | Meliha Yetisgen | Fei Xia

Electronic messaging through patient portals facilitates remote care, connecting patients with doctors through asynchronous communication. While convenient, this new modality places an additional burden on physicians, requiring them to provide remote care in addition to seeing patients in the clinic. Technology that can automatically draft responses for physician review is a promising way to improve clinical efficiency. Here, building on the 2024 MEDIQA Multilingual Multi-modal Medical Answer Generation (MEDIQA-M3G) challenge on dermatology, we present the 2025 MEDIQA Woundcare Visual Question Answering (MEDIQA-WV) shared task, which focuses on generating clinical responses to patient text and image queries. Three teams participated and submitted a total of fourteen systems. In this paper, we describe the task and datasets, as well as the participating systems and their results. We hope that this work can inspire future research on wound care visual question answering.

Team NLP4Health at ChemoTimelines 2025: Finetuning Large Language Models for Temporal Relation Extractions from Clinical Notes
Zhe Zhao | V.G.Vinod Vydiswaran

Extracting chemotherapy timelines from clinical narratives is a challenging task, but critical for cancer research and practice. In this paper, we present our approach and the research investigation we conducted to participate in Subtask 1 of the ChemoTimelines 2025 shared task on predicting temporal relations between pre-identified events and time expressions. We evaluated multiple fine-tuned large language models for the task. We used supervised fine-tuning strategies for the Llama3-8B model to classify temporal relations. Further, we set up zero-shot prompting for Qwen3-14B to normalize time expressions. We also continued pre-training and fine-tuned a Llama3-3B model using unlabeled notes, achieving results comparable to the fine-tuned Llama3-8B model. Our results demonstrate the effectiveness of fine-tuning and continual pre-training strategies in adapting large language models to domain-specific tasks.

TEAM UAB at Chemotherapy Timelines 2025: Integrating Encoders and Large Language Models for Chemotherapy Timelines Generation
Vijay Raj Jain | Chris Coffee | Kaiwen He | Remy Cron | Micah D. Cochran | Luis Mansilla-Gonzalez | Akhil Nadimpalli | Danish Murad | John D Osborne

Reconstructing the timeline of Systemic Anticancer Therapy (SACT), or “chemotherapy,” from heterogeneous Electronic Health Record (EHR) notes is a challenging task. Rapid developments in Large Language Models (LLMs), including a range of architectural improvements and post-training refinements since the 2024 Chemotherapy Timelines Task, could make this task more tractable. We evaluated the performance of four recently released LLMs (GPT-4.1-mini, Phi4, and two Qwen3 models) on this task. Our results indicate that even with a variety of prompt optimization and synthetic data training, more work is still needed before this approach sees useful application.

UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction
Tianmai M. Zhang | Zhaoyi Sun | Sihang Zeng | Chenxi Li | Neil F. Abernethy | Barbara D. Lam | Fei Xia | Meliha Yetisgen

The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2—generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.

MasonNLP at MEDIQA-OE 2025: Assessing Large Language Models for Structured Medical Order Extraction
A H M Rezaul Karim | Ozlem Uzuner

Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor–patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.

EXL Health AI Lab at MEDIQA-OE 2025: Evaluating Prompting Strategies with MedGemma for Medical Order Extraction
Abhinand Balachandran | Bavana Durgapraveen | Gowsikkan Sikkan Sudhagar | Vidhya Varshany J S | Sriram Rajkumar

The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team’s submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to “overthinking” and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.

PNLP at MEDIQA-OE 2025: A Zero-Shot Prompting Strategy with Gemini for Medical Order Extraction
Parth Mehta

Medical order extraction from doctor-patient conversations presents a critical challenge in reducing clinical documentation burden and ensuring accurate capture of patient care instructions. This paper describes our system for the MEDIQA-OE 2025 shared task using the ACI-Bench and PriMock57 datasets, which achieved second place on the public leaderboard with an average score of 0.6014 across four metrics: description ROUGE-1 F1, reason ROUGE-1 F1, order-type strict F1, and provenance multi-label F1. Unlike traditional approaches that rely on fine-tuned biomedical language models, we demonstrate that a carefully engineered zero-shot prompting strategy using Gemini 2.5 Pro can achieve competitive performance without requiring model training or GPU resources. Our approach employs a deterministic state-machine prompt design incorporating chain-of-thought reasoning, self-verification protocols, and structured JSON output generation. The system particularly excels in reason extraction, achieving 0.4130 ROUGE-1 F1, the highest among the top performing teams. Our results suggest that advanced prompt engineering can effectively bridge the gap between general-purpose large language models and specialized clinical NLP tasks, offering a computationally efficient and immediately deployable alternative to traditional fine-tuning approaches with significant implications for resource-constrained healthcare settings.

MasonNLP at MEDIQA-WV 2025: Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
A H M Rezaul Karim | Ozlem Uzuner

Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs (a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking) provides a simple and effective baseline for multimodal clinical NLP tasks.

EXL Health AI Lab at MEDIQA-WV 2025: Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering
Bavana Durgapraveen | Sornaraj Sivasankaran | Abhinand Balachandran | Sriram Rajkumar

The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded, and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.