2025
pdf
bib
abs
Automated Structured Radiology Report Generation
Jean-Benoit Delbrouck
|
Justin Xu
|
Johannes Moll
|
Alois Thomas
|
Zhihong Chen
|
Sophie Ostmeier
|
Asfandyar Azhar
|
Kelvin Zhenghao Li
|
Andrew Johnston
|
Christian Bluethgen
|
Eduardo Pontes Reis
|
Mohamed S Muneer
|
Maya Varma
|
Curtis Langlotz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists’ workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT’s hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.
pdf
bib
abs
CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
Dennis Hein
|
Zhihong Chen
|
Sophie Ostmeier
|
Justin Xu
|
Maya Varma
|
Eduardo Pontes Reis
|
Arne Edward Michalson Md
|
Christian Bluethgen
|
Hyun Joo Shin
|
Curtis Langlotz
|
Akshay S Chaudhari
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Radiologists play a crucial role in translating medical images into actionable reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning. Meanwhile, additional preference fine-tuning in the post-training pipeline has become standard practice in the general domain. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback at scale. To address this challenge, we propose an automated pipeline for preference feedback, focusing on chest X-ray radiology report generation (RRG). Specifically, our method leverages publicly available datasets containing pairs of images and radiologist-written reference reports with reference-based metrics, or Judges, eliminating the need for *additional radiologist feedback*. We investigate reward overoptimization via length exploitation in this setting and introduce a length-controlled version of the GREEN score. Our best-performing setup achieves state-of-the-art CheXbert scores on the MIMIC-CXR dataset for the RRG task while on average maintaining robust performance across six additional image perception and reasoning tasks.
2024
pdf
bib
abs
Overview of the First Shared Task on Clinical Text Generation: RRG24 and “Discharge Me!”
Justin Xu
|
Zhihong Chen
|
Andrew Johnston
|
Louis Blankemeier
|
Maya Varma
|
Jason Hom
|
William J. Collins
|
Ankit Modi
|
Robert Lloyd
|
Benjamin Hopkins
|
Curtis Langlotz
|
Jean-Benoit Delbrouck
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Recent developments in natural language generation have tremendous implications for healthcare. For instance, state-of-the-art systems could automate the generation of sections in clinical reports to alleviate physician workload and streamline hospital documentation. To explore these applications, we present a shared task consisting of two subtasks: (1) Radiology Report Generation (RRG24) and (2) Discharge Summary Generation (“Discharge Me!”). RRG24 involves generating the ‘Findings’ and ‘Impression’ sections of radiology reports given chest X-rays. “Discharge Me!” involves generating the ‘Brief Hospital Course’ and '‘Discharge Instructions’ sections of discharge summaries for patients admitted through the emergency department. “Discharge Me!” submissions were subsequently reviewed by a team of clinicians. Both tasks emphasize the goal of reducing clinician burnout and repetitive workloads by generating documentation. We received 201 submissions from across 8 teams for RRG24, and 211 submissions from across 16 teams for “Discharge Me!”.
pdf
bib
abs
RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports
Jean-Benoit Delbrouck
|
Pierre Chambon
|
Zhihong Chen
|
Maya Varma
|
Andrew Johnston
|
Louis Blankemeier
|
Dave Van Veen
|
Tan Bui
|
Steven Truong
|
Curtis Langlotz
Findings of the Association for Computational Linguistics: ACL 2024
In order to enable extraction of structured clinical data from unstructured radiology reports, we introduce RadGraph-XL, a large-scale, expert-annotated dataset for clinical entity and relation extraction. RadGraph-XL consists of 2,300 radiology reports, which are annotated with over 410,000 entities and relations by board-certified radiologists. Whereas previous approaches focus solely on chest X-rays, RadGraph-XL includes data from four anatomy-modality pairs - chest CT, abdomen/pelvis CT, brain MR, and chest X-rays. Then, in order to automate structured information extraction, we use RadGraph-XL to train transformer-based models for clinical entity and relation extraction. Our evaluations include comprehensive ablation studies as well as an expert reader study that evaluates trained models on out-of-domain data. Results demonstrate that our model surpasses the performance of previous methods by up to 52% and notably outperforms GPT-4 in this domain. We release RadGraph-XL as well as our trained model to foster further innovation and research in structured clinical information extraction.
pdf
bib
abs
GREEN: Generative Radiology Report Evaluation and Error Notation
Sophie Ostmeier
|
Justin Xu
|
Zhihong Chen
|
Maya Varma
|
Louis Blankemeier
|
Christian Bluethgen
|
Arne Edward Michalson Md
|
Michael Moseley
|
Curtis Langlotz
|
Akshay S Chaudhari
|
Jean-Benoit Delbrouck
Findings of the Association for Computational Linguistics: EMNLP 2024
Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to its medical nature. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches.
2023
pdf
bib
abs
Toward Expanding the Scope of Radiology Report Summarization to Multiple Anatomies and Modalities
Zhihong Chen
|
Maya Varma
|
Xiang Wan
|
Curtis Langlotz
|
Jean-Benoit Delbrouck
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Radiology report summarization (RRS) is a growing area of research. Given the Findings section of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. However, RRS currently faces essential limitations. First, many prior studies conduct experiments on private datasets, preventing reproduction of results and fair comparisons across different systems and solutions. Second, most prior approaches are evaluated solely on chest X-rays. To address these limitations, we propose a dataset (MIMIC-RRS) involving three new modalities and seven new anatomies based on the MIMIC-III and MIMIC-CXR datasets. We then conduct extensive experiments to evaluate the performance of models both within and across modality-anatomy pairs in MIMIC-RRS. In addition, we evaluate their clinical efficacy via RadGraph, a factual correctness metric.
pdf
bib
abs
Overview of the RadSum23 Shared Task on Multi-modal and Multi-anatomical Radiology Report Summarization
Jean-Benoit Delbrouck
|
Maya Varma
|
Pierre Chambon
|
Curtis Langlotz
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks
Radiology report summarization is a growing area of research. Given the Findings and/or Background sections of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. Recent efforts have released systems that achieve promising performance as measured by widely used summarization metrics such as BLEU and ROUGE. However, the research area of radiology report summarization currently faces two important limitations. First, most of the results are reported on private datasets. This limitation prevents the ability to reproduce results and fairly compare different systems and solutions. Secondly, to the best of our knowledge, most research is carried out on chest X-rays. To palliate these two limitations, we propose a radiology report summarization (RadSum) challenge on i) a new dataset of eleven different modalities and anatomies pairs based on the MIMIC-III database ii) a multimodal report summarization dataset based on MIMIC-CXR enhanced with a brand-new test-set from Stanford Hospital. In total, we received 112 submissions across 11 teams.
2022
pdf
bib
abs
ViLMedic: a framework for research at the intersection of vision and language in medical AI
Jean-benoit Delbrouck
|
Khaled Saab
|
Maya Varma
|
Sabri Eyuboglu
|
Pierre Chambon
|
Jared Dunnmon
|
Juan Zambrano
|
Akshay Chaudhari
|
Curtis Langlotz
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
There is a growing need to model interactions between data modalities (e.g., vision, language) — both to improve AI predictions on existing tasks and to enable new applications. In the recent field of multimodal medical AI, integrating multiple modalities has gained widespread popularity as multimodal models have proven to improve performance, robustness, require less training samples and add complementary information. To improve technical reproducibility and transparency for multimodal medical tasks as well as speed up progress across medical AI, we present ViLMedic, a Vision-and-Language medical library. As of 2022, the library contains a dozen reference implementations replicating the state-of-the-art results for problems that range from medical visual question answering and radiology report generation to multimodal representation learning on widely adopted medical datasets. In addition, ViLMedic hosts a model-zoo with more than twenty pretrained models for the above tasks designed to be extensible by researchers but also simple for practitioners. Ultimately, we hope our reproducible pipelines can enable clinical translation and create real impact. The library is available at 
https://github.com/jbdel/vilmedic.
2021
pdf
bib
abs
Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text
Maya Varma
|
Laurel Orr
|
Sen Wu
|
Megan Leszczynski
|
Xiao Ling
|
Christopher Ré
Findings of the Association for Computational Linguistics: EMNLP 2021
Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the presence of coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We utilize our integration scheme to augment structural resources and generate a large biomedical NED dataset for pretraining. Our pretrained model with injected structural knowledge achieves state-of-the-art performance on two benchmark medical NED datasets: MedMentions and BC5CDR. Furthermore, we improve disambiguation of rare entities by up to 57 accuracy points.
2020
pdf
bib
abs
Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning
Rachel Gardner
|
Maya Varma
|
Clare Zhu
|
Ranjay Krishna
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Datasets extracted from social networks and online forums are often prone to the pitfalls of natural language, namely the presence of unstructured and noisy data. In this work, we seek to enable the collection of high-quality question-answer datasets from social media by proposing a novel task for automated quality analysis and data cleaning: question-answer (QA) plausibility. Given a machine or user-generated question and a crowd-sourced response from a social media user, we determine if the question and response are valid; if so, we identify the answer within the free-form response. We design BERT-based models to perform the QA plausibility task, and we evaluate the ability of our models to generate a clean, usable question-answer dataset. Our highest-performing approach consists of a single-task model which determines the plausibility of the question, followed by a multi-task model which evaluates the plausibility of the response as well as extracts answers (Question Plausibility AUROC=0.75, Response Plausibility AUROC=0.78, Answer Extraction F1=0.665).