Christian Bluethgen


2024

GREEN: Generative Radiology Report Evaluation and Error Notation
Sophie Ostmeier | Justin Xu | Zhihong Chen | Maya Varma | Louis Blankemeier | Christian Bluethgen | Arne Edward Michalson Md | Michael Moseley | Curtis Langlotz | Akshay S Chaudhari | Jean-Benoit Delbrouck
Findings of the Association for Computational Linguistics: EMNLP 2024

Evaluating radiology reports is challenging because factual correctness is critical in a medical setting. Existing automatic evaluation metrics either fail to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human-interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Compared to previous approaches, our method demonstrates both higher correlation with expert error counts and higher alignment with expert preferences.

2023

RadAdapt: Radiology Report Summarization via Lightweight Domain Adaptation of Large Language Models
Dave Van Veen | Cara Van Uden | Maayane Attias | Anuj Pareek | Christian Bluethgen | Malgorzata Polacin | Wah Chiu | Jean-Benoit Delbrouck | Juan Zambrano Chaves | Curtis Langlotz | Akshay Chaudhari | John Pauly
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

We systematically investigate lightweight strategies to adapt large language models (LLMs) for the task of radiology report summarization (RRS). Specifically, we focus on domain adaptation via pretraining (on natural language, biomedical text, or clinical text) and via discrete prompting or parameter-efficient fine-tuning. We consistently obtain the best performance by maximally adapting to the task via pretraining on clinical text and fine-tuning on RRS examples. Importantly, this method fine-tunes a mere 0.32% of parameters throughout the model, in contrast to end-to-end fine-tuning (100% of parameters). Additionally, we study the effect of in-context examples and out-of-distribution (OOD) training before concluding with a radiologist reader study and qualitative analysis. Our findings highlight the importance of domain adaptation in RRS and provide valuable insights toward developing effective natural language processing solutions for clinical tasks.

2022

Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards
Jean-Benoit Delbrouck | Pierre Chambon | Christian Bluethgen | Emily Tsai | Omar Almusa | Curtis Langlotz
Findings of the Association for Computational Linguistics: EMNLP 2022

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. These systems have achieved promising performance as measured by widely used NLG metrics such as BLEU and CIDEr. However, the current systems face important limitations. First, they present an increased complexity in architecture that offers only marginal improvements on NLG metrics. Second, systems that achieve high performance on these metrics are not always factually complete or consistent, due to both inadequate training and evaluation. Recent studies have shown that the systems can be substantially improved by using new methods encouraging 1) the generation of domain entities consistent with the reference and 2) the description of these entities in inferentially consistent ways. So far, these methods rely on weakly supervised (rule-based) approaches and named entity recognition systems that are not specific to the chest X-ray domain. To overcome this limitation, we propose a new method, the RadGraph reward, to further improve the factual completeness and correctness of generated radiology reports. More precisely, we leverage the RadGraph dataset, which contains chest X-ray reports annotated with entities and relations between entities. On two open radiology report datasets, our system substantially improves scores by up to 14.2% and 25.3% on metrics evaluating the factual correctness and completeness of reports.