Julian Friedrich


2025

MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters
Amin Dada | Osman Koras | Marie Bauer | Amanda Butler | Kaleb Smith | Jens Kleesiek | Julian Friedrich
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

While increasing patients’ access to medical documents improves medical care, this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline combining LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models, while automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.

Does Biomedical Training Lead to Better Medical Performance?
Amin Dada | Osman Alperen Koraş | Marie Bauer | Jean-Philippe Corbeil | Amanda Butler Contreras | Constantin Marc Seibold | Kaleb E Smith | Julian Friedrich | Jens Kleesiek
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Large Language Models (LLMs) hold significant potential for improving healthcare applications, with biomedically adapted models promising enhanced performance on medical tasks. However, the effectiveness of biomedical domain adaptation for clinical tasks remains uncertain. In this study, we conduct a direct comparison of 12 biomedically adapted models and their general-domain base counterparts across six clinical tasks. Our results reveal that 11 out of 12 biomedical models exhibit performance declines, challenging prior findings that reported positive effects of biomedical adaptation. Notably, previous positive results primarily relied on multiple-choice evaluations, which may not reflect performance in real-world clinical applications. To promote reproducibility and further research, we open-source our evaluation pipeline, providing a resource for the development of models with practical benefits in healthcare settings.