RadEval: A framework for radiology text evaluation

Justin Xu; Xi Zhang; Javid Abderezaei; Julie Bauml; Roger Boodoo; Fatemeh Haghighi; Ali Ganjizadeh; Eric Brattain; Dave Van Veen; Zaiqiao Meng; David W Eyre; Jean-Benoit Delbrouck

doi:10.18653/v1/2025.emnlp-demos.40

Abstract

We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder that demonstrates strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.

Anthology ID: 2025.emnlp-demos.40
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2025
Address: Suzhou, China
Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 546–557
URL: https://aclanthology.org/2025.emnlp-demos.40/
DOI: 10.18653/v1/2025.emnlp-demos.40
Cite (ACL): Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David W Eyre, and Jean-Benoit Delbrouck. 2025. RadEval: A framework for radiology text evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 546–557, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): RadEval: A framework for radiology text evaluation (Xu et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-demos.40.pdf
